2023-06-23 12:42:30

by Bagas Sanjaya

Subject: Fwd: kernel fault on hibernation: get_zeroed_page/swsusp_write

Hi,

I noticed a regression report on Bugzilla [1]. Quoting from it:

> page allocation error using kernel 6.3.7-desktop-1.mga9 #1 SMP PREEMPT_DYNAMIC, from Fr 09 Jun 2023 22:57:31, Key ID b742fa8b80420f66; see the backtrace in the dmesg
>> cat /proc/cpuinfo
> siblings : 4
> core id : 1
> cpu cores : 2
> ...
> type: regression, worked with the previous kernel, namely 6.3.6, Mo 05 Jun 2023 21:37:15, Key ID b742fa8b80420f66 before updating today

And then:

> The first hibernation attempt resulted in the backtrace you can see in the dmesg above, my second hibernation attempt from a text console (vt03 or so) has worked without errors and the third one I tried to do from the GUI/X11 again; see the debug options I had turned on). On the third attempt something strange did happen. It seemed to write to disk as it should, the screen turned black but the power led and button still stayed alighted. Waking up by pressing the power button did not yield any effect, nor the SysRq keys (alas forgot to write 511 to >/proc/sys/kernel/sysrq). After a hard power reset it booted as if not hibernated. On the first hibernation attempt I could see lengthy and intermittent disk access. On the third attempt I had waited for some considerable time.

See Bugzilla for the full thread and attached infos (dmesg, journalctl,
stack trace disassembly).

Unfortunately, the reporter can't provide /proc/kcore output
and hasn't performed bisection yet (he can't build a custom kernel).

Anyway, I'm adding it to regzbot (as stable-specific regression) for now:

#regzbot introduced: v6.3.6..v6.3.7 https://bugzilla.kernel.org/show_bug.cgi?id=217544
#regzbot title: page allocation error (kernel fault on hibernation involving get_zeroed_page/swsusp_write)
#regzbot link: https://bugs.mageia.org/show_bug.cgi?id=32044

Thanks.

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=217544

--
An old man doll... just what I always wanted! - Clara


2023-06-23 16:44:22

by Elmar Stellnberger

Subject: Re: Fwd: kernel fault on hibernation: get_zeroed_page/swsusp_write

Hi all, Hi Bagas S.

As the issue didn't reproduce the way I would have liked (it did not
reproduce at all here, not even with the same kernel version; no
further comment), I have now uploaded the /proc/kcore and the kernel
binaries and symbol files I still had on disk to
https://upload.elstel.info (this may move to something like
upload.elstel.info/bugs/kernpagealloc in the future).

Regards,
Elmar

On Fri, Jun 23, 2023 at 07:36:21PM +0700, Bagas Sanjaya wrote:
> Hi,
>
> I notice a regression report on Bugzilla [1]. Quoting from it:
>
> > page allocation error using kernel 6.3.7-desktop-1.mga9 #1 SMP PREEMPT_DYNAMIC, from Fr 09 Jun 2023 22:57:31, Key ID b742fa8b80420f66; see the backtrace in the dmesg
> >> cat /proc/cpuinfo
> > siblings : 4
> > core id : 1
> > cpu cores : 2
> > ...
> > type: regression, worked with the previous kernel, namely 6.3.6, Mo 05 Jun 2023 21:37:15, Key ID b742fa8b80420f66 before updating today
>
> And then:
>
> > The first hibernation attempt resulted in the backtrace you can see in the dmesg above, my second hibernation attempt from a text console (vt03 or so) has worked without errors and the third one I tried to do from the GUI/X11 again; see the debug options I had turned on). On the third attempt something strange did happen. It seemed to write to disk as it should, the screen turned black but the power led and button still stayed alighted. Waking up by pressing the power button did not yield any effect, nor the SysRq keys (alas forgot to write 511 to >/proc/sys/kernel/sysrq). After a hard power reset it booted as if not hibernated. On the first hibernation attempt I could see lengthy and intermittent disk access. On the third attempt I had waited for some considerable time.
>
> See Bugzilla for the full thread and attached infos (dmesg, journalctl,
> stack trace disassembly).
>
> Unfortunately, the reporter can't provide /proc/kcore output
> and haven't performed bisection yet (he can't build custom kernel).
>
> Anyway, I'm adding it to regzbot (as stable-specific regression) for now:
>
> #regzbot introduced: v6.3.6..v6.3.7 https://bugzilla.kernel.org/show_bug.cgi?id=217544
> #regzbot title: page allocation error (kernel fault on hibernation involving get_zeroed_page/swsusp_write)
> #regzbot link: https://bugs.mageia.org/show_bug.cgi?id=32044
>
> Thanks.
>
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=217544
>
> --
> An old man doll... just what I always wanted! - Clara

2023-06-24 01:44:49

by Bagas Sanjaya

Subject: Re: Fwd: kernel fault on hibernation: get_zeroed_page/swsusp_write

On Fri, Jun 23, 2023 at 06:17:05PM +0200, Elmar Stellnberger wrote:
> Hi all, Hi Bagas S.
>
> As the issue didn't reproduce the way I would have liked (did not
> reproduce at all here, not even with the same kernel version; no
> further comment) I have now uploaded the /proc/kcore and the kernel
> binaries and symbol files I still had on disk at
> https://upload.elstel.info (This may move to something like
> upload.elstel.info/bugs/kernpagealloc in the future)
>

First, tl;dr:

> A: http://en.wikipedia.org/wiki/Top_post
> Q: Where do I find info about this thing called top-posting?
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing in e-mail?
>
> A: No.
> Q: Should I include quotations after my reply?
>
> http://daringfireball.net/2007/07/on_top

Can you attach [1] to your Bugzilla report? Also, any progress on bisection?

Also, you don't need to upload full kernel images; people can grab
the /proc/config.gz you uploaded to Bugzilla and run `make olddefconfig`
with it.

Anyway, telling regzbot:

#regzbot link: https://upload.elstel.info/kcore.xz

Thanks.

[1]: https://upload.elstel.info/kcore.xz

--
An old man doll... just what I always wanted! - Clara



2023-06-24 11:33:22

by Elmar Stellnberger

Subject: Re: Fwd: kernel fault on hibernation: get_zeroed_page/swsusp_write

Hi Bagas S., Hi all

concerns: Bug 217544 - kernel fault on hibernation: get_zeroed_page/swsusp_write
https://bugzilla.kernel.org/show_bug.cgi?id=217544

Bisection does not make sense here, since I cannot reproduce the
issue. I packed the kernel binaries and symbol files so that gdb can
be invoked directly on the kcore:

> /usr/src/kernel-6.3.7-desktop586-1.mga9/scripts/extract-vmlinux vmlinuz-6.3.7-desktop-1.mga9 >vmlinux
> file vmlinuz-6.3.7-desktop-1.mga9
vmlinuz-6.3.7-desktop-1.mga9: Linux kernel x86 boot executable bzImage, version 6.3.7-desktop-1.mga9 ([email protected]) #1 SMP PREEMPT_DYNAMIC Fri Jun 9 17:47:53 UTC 2023, RO-rootFS, swap_dev 0X6, Normal VGA
> file vmlinux
vmlinux: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, BuildID[sha1]=942674511321671b33c739cceddb1e3a48a17895, stripped
> grep __alloc_pages /boot/System.map...
0xc03758... T __alloc_pages
> gdb vmlinux kcore
# x/5i 0xc03758..
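
The lookup done with grep above can also be scripted. A minimal sketch, assuming the usual three-column System.map layout (hex address, type letter, symbol name); the sample map and the names `example_alloc` and `example_helper` are made up for illustration and are not taken from the Mageia kernel:

```python
import bisect


def parse_system_map(text):
    """Parse 'address type name' lines into a list sorted by address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[1], parts[2]))
    symbols.sort(key=lambda entry: entry[0])
    return symbols


def address_of(symbols, name):
    """Return the address of the first symbol called `name`, or None."""
    for addr, _type, sym in symbols:
        if sym == name:
            return addr
    return None


def symbol_for(symbols, addr):
    """Return (name, offset) for the nearest symbol at or below `addr`."""
    addrs = [entry[0] for entry in symbols]
    index = bisect.bisect_right(addrs, addr) - 1
    if index < 0:
        return None  # address lies below the first symbol
    base, _type, name = symbols[index]
    return name, addr - base


# Hypothetical excerpt; a real System.map has tens of thousands of lines.
SAMPLE_MAP = """\
c0100000 T _stext
c0100400 T example_alloc
c0100900 t example_helper
"""

symbols = parse_system_map(SAMPLE_MAP)
# Build the gdb command for disassembling the start of a function.
print(f"x/5i {address_of(symbols, 'example_alloc'):#x}")
# Map a raw address (e.g. from a backtrace) back to symbol + offset.
print(symbol_for(symbols, 0xC0100410))
```

The printed `x/5i` line can be pasted into the session started with `gdb vmlinux kcore`; `symbol_for` goes the other way and names the function that a raw address falls into.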

On Sat, Jun 24, 2023 at 08:25:39AM +0700, Bagas Sanjaya wrote:
> Also, you don't need to upload full kernel images instead; people can
> grab /proc/config.gz you uploaded on Bugzilla and then `make olddefconfig`
> from it.
>
I strongly doubt that the symbols would end up at the same addresses
if you compiled from source, even if you applied all Mageia-specific
patches. We would need reproducible builds for that. Nonetheless you
are free to check whether the symbols reside at the same place in your
System.map afterwards.
I wonder whether there is a way to convert the System.map text file
(which looks to me like the output of 'nm -S') back into an ELF section
that could be added to the stripped vmlinux with objcopy. Shouldn't
there be a script/ for this?
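
Checking whether symbols stay at the same place after a rebuild, as suggested above, amounts to comparing two System.map files. A rough sketch, assuming the three-column layout (hex address, type letter, symbol name); both sample maps are invented for illustration, and duplicate symbol names (static functions can repeat in a real System.map) are simplified by keeping only the last occurrence:

```python
def load_map(text):
    """Map symbol name -> address from 'address type name' lines."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            # Later duplicates overwrite earlier ones (a simplification).
            table[parts[2]] = int(parts[0], 16)
    return table


def moved_symbols(distro, rebuilt):
    """List (name, distro_addr, rebuilt_addr) for shared symbols that moved."""
    return [
        (name, distro[name], rebuilt[name])
        for name in sorted(distro.keys() & rebuilt.keys())
        if distro[name] != rebuilt[name]
    ]


# Hypothetical excerpts standing in for the distribution's map and a local rebuild.
DISTRO_MAP = """\
c0100000 T _stext
c0100400 T example_alloc
c0100900 t example_helper
"""
REBUILT_MAP = """\
c0100000 T _stext
c0100440 T example_alloc
c0100900 t example_helper
"""

for name, old, new in moved_symbols(load_map(DISTRO_MAP), load_map(REBUILT_MAP)):
    print(f"{name}: {old:#x} -> {new:#x}")
```

An empty result would mean every shared symbol kept its address; any output means addresses from the original dump cannot be used unchanged against the rebuilt kernel.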

:: Sometimes you have only one chance to catch a bug.

Cheers,
Elmar

On Sat, Jun 24, 2023 at 08:25:39AM +0700, Bagas Sanjaya wrote:
> On Fri, Jun 23, 2023 at 06:17:05PM +0200, Elmar Stellnberger wrote:
> Can you attach [1] to your Bugzilla report? Also, any report on bisection?

Pardon, what is [1]?


2023-06-24 12:47:52

by Bagas Sanjaya

Subject: Re: Fwd: kernel fault on hibernation: get_zeroed_page/swsusp_write

On 6/24/23 17:21, Elmar Stellnberger wrote:
> Hi Bagas S., Hi all
>
> concerns: Bug 217544 - kernel fault on hibernation: get_zeroed_page/swsusp_write
> https://bugzilla.kernel.org/show_bug.cgi?id=217544
>
> Bisection does not make sense here, since I can not reproduce the
> issue. Packing the kernel binaries and symbol files was meant to invoke
> gdb directly on the kcore:
>

Thorsten: Should this be marked as invalid/inconclusive?

> On Sat, Jun 24, 2023 at 08:25:39AM +0700, Bagas Sanjaya wrote:
>> On Fri, Jun 23, 2023 at 06:17:05PM +0200, Elmar Stellnberger wrote:
>> Can you attach [1] to your Bugzilla report? Also, any report on bisection?
>
> Pardon, what is [1]?
>

Your kcore dump.

--
An old man doll... just what I always wanted! - Clara


2023-06-24 13:55:35

by Thorsten Leemhuis

Subject: Re: Fwd: kernel fault on hibernation: get_zeroed_page/swsusp_write

On 24.06.23 14:15, Bagas Sanjaya wrote:
> On 6/24/23 17:21, Elmar Stellnberger wrote:
>> Hi Bagas S., Hi all
>>
>> concerns: Bug 217544 - kernel fault on hibernation: get_zeroed_page/swsusp_write
>> https://bugzilla.kernel.org/show_bug.cgi?id=217544
>>
>> Bisection does not make sense here, since I can not reproduce the
>> issue. Packing the kernel binaries and symbol files was meant to invoke
>> gdb directly on the kcore:
>
> Thorsten: Should this be marked as invalid/inconclusive?

Not as invalid, as there might be a real issue here; but it's hard to
say, as among other things it's also quite possible that something else
went wrong (compiler? hardware?). Someone would have to investigate. But
given the fact that this happened with a stable kernel[1] and is
impossible to reproduce, I suspect no developer will be motivated enough
to do so. Then it's not worth tracking[2]:

#regzbot inconclusive: impossible to reproduce

Elmar, that's nothing bad. In case this turns out to be something you
can reproduce and bisect, just let us know and we'll add it back.

[1] see the sections about stable kernels
https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/

[2] side note: due to limited resources I'm considering no longer
tracking non-bisected issues in general (except those that started to
happen in mainline since the last mainline release), or putting them in
a special category that signals "those are collected here JFYI until
they are bisected, as the regression tracker due to limited resources
for now can't keep a close eye on these"

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.