2022-09-23 16:36:10

by Conor Dooley

[permalink] [raw]
Subject: [resend][bug] low-probability console lockups since 5.19

Hey all,

Been bisecting a bug that is causing a boot failure in my CI & have
ended up here.. The bug in question is a low(ish) probability lock up
of the serial console, I would estimate about 1-in-5 chance on the
boards I could actually trigger it on which it has taken me so long
to realise that this was an actual problem. Thinking back on it, there
were other failures that I would retroactively attribute to this
problem too, but I had earlycon disabled and I chalked the hangs at
"Starting kernel" down to pre-release hardware since the other boots
in a CI job would not fail (we boot the same config multiple times),
Silly me I guess!

In a boot failure (with earlycon & keep_bootcon enabled) I see the
likes of [0]:
[ 0.485063] NFS: Registering the id_resolver key type
[ 0.491051] Key type id_resolver registered
[ 0.495743] Key type id_legacy registered
[ 0.500528] nfs4filelayout_init: NFSv4 File Layout Driver Registering...
[ 0.508010] nfs4flexfilelayout_init: NFSv4 Flexfile Layout Driver Registering...
[ 0.516857] 9p: Installing v9fs 9p2000 file system support
[ 0.524032] NET: Registered PF_ALG protocol family
[ 0.529562] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 247)
[ 0.537817] io scheduler mq-deadline registered
[ 0.542889] io scheduler kyber registered
[ 0.559682] microchip-pcie 2000000000.pcie: host bridge /soc/pcie@2000000000 ranges:
[ 0.568431] microchip-pcie 2000000000.pcie: MEM 0x2008000000..0x2087ffffff -> 0x0008000000
[ 0.635633] microchip-pcie 2000000000.pcie: axi read request error
[ 0.642602] microchip-pcie 2000000000.pcie: axi read timeout
[ 0.776630] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
[ 0.784133] microchip-pcie 2000000000.pcie: axi write request error
[ 0.791100] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
[ 0.798543] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
[ 0.805981] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
[ 0.813421] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
[ 0.820859] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
[ 0.828295] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
[ 1.579668] Freeing initrd memory: 7316K

The sec error stuff is unrelated & happens on good boots too - although
there are far fewer lines of "sec error" there [1] (excuse the message
duplication from the bootcon).

I was able to bisect the behaviour to this set of reverts, but I have
low confidence as to which commit specifically caused it - it's hard
to know that doing 20+ reboots on every commmit I built is enough to
make sure I did not make a mistake. I had initially started the
bisection between v6.0-rc6 & v6.0-rc1, and that lead me down a sub-tree
of net which is why the bisect log starts at a "random" commit as that
merge commit was the one blamed by the inital bisection.

git bisect start
# status: waiting for both good and bad commits
# bad: [5e8379351dbde61ea383e514f0f9ecb2c047cf4e] Merge tag 'net-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect bad 5e8379351dbde61ea383e514f0f9ecb2c047cf4e
# good: [4b0986a3613c92f4ec1bdc7f60ec66fea135991f] Linux 5.18
git bisect good 4b0986a3613c92f4ec1bdc7f60ec66fea135991f
# good: [fbe86daca0ba878b04fa241b85e26e54d17d4229] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect good fbe86daca0ba878b04fa241b85e26e54d17d4229
# good: [2b8c612c6102f751e6e3e1bd425f64e9d3d3f638] kernel/reboot: Fix powering off using a non-syscall code paths
git bisect good 2b8c612c6102f751e6e3e1bd425f64e9d3d3f638
# good: [54c2cc79194c961a213c1d375fe3aa4165664cc4] Merge tag 'usb-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
git bisect good 54c2cc79194c961a213c1d375fe3aa4165664cc4
# skip: [8f7ac50c97d30ae5ae48da3b516510d05a67b9e5] Merge tag 'sound-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect skip 8f7ac50c97d30ae5ae48da3b516510d05a67b9e5
# good: [cdd85786f4b3b9273e4376e69aa95a2d71722764] x86/tdx: Clarify RIP adjustments in #VE handler
git bisect good cdd85786f4b3b9273e4376e69aa95a2d71722764
# good: [8cc683833726912498130a0130fc3bd0d169ef59] Merge branch 'net-dp83822-fix-interrupt-floods'
git bisect good 8cc683833726912498130a0130fc3bd0d169ef59
# bad: [0840a7914caa14315a3191178a9f72c742477860] Merge tag 'char-misc-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
git bisect bad 0840a7914caa14315a3191178a9f72c742477860
# bad: [38bc4ac431684498126f9baa3a530e5a132f0173] Merge tag 'drm-fixes-2022-06-24' of git://anongit.freedesktop.org/drm/drm
git bisect bad 38bc4ac431684498126f9baa3a530e5a132f0173
# good: [399bd66e219e331976fe6fa6ab81a023c0c97870] Merge tag 'net-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect good 399bd66e219e331976fe6fa6ab81a023c0c97870
# bad: [a237cfd6b7469d6f5eeaa45f30128ab78b5281a5] Merge tag 'block-5.19-2022-06-24' of git://git.kernel.dk/linux-block
git bisect bad a237cfd6b7469d6f5eeaa45f30128ab78b5281a5
# good: [92f20ff72066d8d7e2ffb655c2236259ac9d1c5d] Merge tag 'pm-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect good 92f20ff72066d8d7e2ffb655c2236259ac9d1c5d
# bad: [598f2404879336277a4320ac5000394b873e049a] Merge tag 'io_uring-5.19-2022-06-24' of git://git.kernel.dk/linux-block
git bisect bad 598f2404879336277a4320ac5000394b873e049a
# good: [386e4fb6962b9f248a80f8870aea0870ca603e89] io_uring: use original request task for inflight tracking
git bisect good 386e4fb6962b9f248a80f8870aea0870ca603e89
# good: [2d9ef940f89e0ab4fde7ba6f769d82f2a450c35a] Revert "printk: extend console_lock for per-console locking"
git bisect good 2d9ef940f89e0ab4fde7ba6f769d82f2a450c35a
# bad: [07a22b61946f0b80065b0ddcc703b715f84355f5] Revert "printk: add functions to prefer direct printing"
git bisect bad 07a22b61946f0b80065b0ddcc703b715f84355f5
# bad: [5831788afb17b89c5b531fb60cbd798613ccbb63] Revert "printk: add kthread console printers"
git bisect bad 5831788afb17b89c5b531fb60cbd798613ccbb63
# first bad commit: [5831788afb17b89c5b531fb60cbd798613ccbb63] Revert "printk: add kthread console printers"

On 5831788afb17 ("Revert "printk: add kthread console printers"") or on
07a22b61946f ("Revert "printk: add functions to prefer direct printing"")
I was not able to reproduce the exact error conditions that I got on
the current HEAD or any of the commits along my bisection, but instead I
got something different [2]:

[ 0.471992] 9p: Installing v9fs 9p2000 file system support
[ 0.479049] NET: Registered PF_ALG protocol family
[ 0.484585] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 247)
[ 0.492813] io scheduler mq-deadline registered
[ 0.497871] io scheduler kyber registered
[ 0.513472] microchip-pcie 2000000000.pcie: host bridge /soc/pcie@2000000000 ranges:
[ 0.522194] microchip-pcie 2000000000.pcie: MEM 0x2008000000..0x2087ffffff -> 0x0008000000
[ 0.588518] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
[ 0.596027] microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
[ 0.603467] microchip-pcie 2000000000.pcie: axi read request error
[ 0.610322] microchip-pcie 2000000000.pcie: axi read timeout
[ 0.616603] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
[ 0.624025] microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
[ 0.631450] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
[ 0.638877] microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
[ 0.646300] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
[ 0.653725] microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
[ 1.516378] Freeing initrd memory: 7336K
[ 5.591881] mc_event_handler: 667402 callbacks suppressed
[ 5.591909] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
[ 5.605290] microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
[ 5.612704] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer

<truncated>

[ 21.539878] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 21.546651] rcu: 0-...0: (1 GPs behind) idle=19f/1/0x4000000000000002 softirq=34/36 fqs=2626
[ 21.556177] (detected by 1, t=5256 jiffies, g=-1151, q=1143 ncpus=4)
[ 21.563293] Task dump for CPU 0:
[ 21.566860] task:swapper/0 state:R running task stack: 0 pid: 1 ppid: 0 flags:0x00000008
[ 21.577823] Call Trace:

Unfortunately I do not really have any more information than that for
you, I am hoping that you (plural) could be of some help as to where
I should start looking - although after spending every evening this
week bisecting/setting up so that I could automate my bisection it'll
be a few days before I could really focus on this again.

My system is an embedded RISC-V dev board (PolarFire SoC Icicle kit).

There is also a failing build in the ClangBuiltLinux CI [3] that is
seeing similar symptoms, but I was unable to reproduce this problem
doing 1000s of reboots of a known-bad commit in QEMU myself.

LMK if you need any more info,
Conor.

[0] https://gist.githubusercontent.com/ConchuOD/575f4bfe3d68b9b32227740365b04401/raw/c69dfffd21183898edaeb223a9455ebc28edcf65/gistfile1.txt
[1] https://gist.github.com/ConchuOD/575f4bfe3d68b9b32227740365b04401?permalink_comment_id=4313185#gistcomment-4313185
[2] https://gist.github.com/ConchuOD/575f4bfe3d68b9b32227740365b04401?permalink_comment_id=4313202#gistcomment-4313202
[3] https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/3090070308/jobs/4999131449#step:5:270


2022-09-26 13:10:49

by Petr Mladek

[permalink] [raw]
Subject: Re: [resend][bug] low-probability console lockups since 5.19

On Fri 2022-09-23 17:24:17, Conor Dooley wrote:
> Hey all,
>
> Been bisecting a bug that is causing a boot failure in my CI & have
> ended up here.. The bug in question is a low(ish) probability lock up
> of the serial console, I would estimate about 1-in-5 chance on the
> boards I could actually trigger it on which it has taken me so long
> to realise that this was an actual problem. Thinking back on it, there
> were other failures that I would retroactively attribute to this
> problem too, but I had earlycon disabled and I chalked the hangs at
> "Starting kernel" down to pre-release hardware since the other boots
> in a CI job would not fail (we boot the same config multiple times),
> Silly me I guess!
>
> In a boot failure (with earlycon & keep_bootcon enabled) I see the
> likes of [0]:
> [ 0.485063] NFS: Registering the id_resolver key type
> [ 0.491051] Key type id_resolver registered
> [ 0.495743] Key type id_legacy registered
> [ 0.500528] nfs4filelayout_init: NFSv4 File Layout Driver Registering...
> [ 0.508010] nfs4flexfilelayout_init: NFSv4 Flexfile Layout Driver Registering...
> [ 0.516857] 9p: Installing v9fs 9p2000 file system support
> [ 0.524032] NET: Registered PF_ALG protocol family
> [ 0.529562] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 247)
> [ 0.537817] io scheduler mq-deadline registered
> [ 0.542889] io scheduler kyber registered
> [ 0.559682] microchip-pcie 2000000000.pcie: host bridge /soc/pcie@2000000000 ranges:
> [ 0.568431] microchip-pcie 2000000000.pcie: MEM 0x2008000000..0x2087ffffff -> 0x0008000000
> [ 0.635633] microchip-pcie 2000000000.pcie: axi read request error
> [ 0.642602] microchip-pcie 2000000000.pcie: axi read timeout
> [ 0.776630] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> [ 0.784133] microchip-pcie 2000000000.pcie: axi write request error
> [ 0.791100] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> [ 0.798543] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> [ 0.805981] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> [ 0.813421] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> [ 0.820859] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> [ 0.828295] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> [ 1.579668] Freeing initrd memory: 7316K
>
> The sec error stuff is unrelated & happens on good boots too - although
> there are far fewer lines of "sec error" there [1] (excuse the message
> duplication from the bootcon).
>
> I was able to bisect the behaviour to this set of reverts, but I have
> low confidence as to which commit specifically caused it - it's hard
> to know that doing 20+ reboots on every commmit I built is enough to
> make sure I did not make a mistake. I had initially started the
> bisection between v6.0-rc6 & v6.0-rc1, and that lead me down a sub-tree
> of net which is why the bisect log starts at a "random" commit as that
> merge commit was the one blamed by the inital bisection.
>
> git bisect start
> # status: waiting for both good and bad commits
> # bad: [5e8379351dbde61ea383e514f0f9ecb2c047cf4e] Merge tag 'net-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> git bisect bad 5e8379351dbde61ea383e514f0f9ecb2c047cf4e
> # good: [4b0986a3613c92f4ec1bdc7f60ec66fea135991f] Linux 5.18
> git bisect good 4b0986a3613c92f4ec1bdc7f60ec66fea135991f
> # good: [fbe86daca0ba878b04fa241b85e26e54d17d4229] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> git bisect good fbe86daca0ba878b04fa241b85e26e54d17d4229
> # good: [2b8c612c6102f751e6e3e1bd425f64e9d3d3f638] kernel/reboot: Fix powering off using a non-syscall code paths
> git bisect good 2b8c612c6102f751e6e3e1bd425f64e9d3d3f638
> # good: [54c2cc79194c961a213c1d375fe3aa4165664cc4] Merge tag 'usb-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
> git bisect good 54c2cc79194c961a213c1d375fe3aa4165664cc4
> # skip: [8f7ac50c97d30ae5ae48da3b516510d05a67b9e5] Merge tag 'sound-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
> git bisect skip 8f7ac50c97d30ae5ae48da3b516510d05a67b9e5
> # good: [cdd85786f4b3b9273e4376e69aa95a2d71722764] x86/tdx: Clarify RIP adjustments in #VE handler
> git bisect good cdd85786f4b3b9273e4376e69aa95a2d71722764
> # good: [8cc683833726912498130a0130fc3bd0d169ef59] Merge branch 'net-dp83822-fix-interrupt-floods'
> git bisect good 8cc683833726912498130a0130fc3bd0d169ef59
> # bad: [0840a7914caa14315a3191178a9f72c742477860] Merge tag 'char-misc-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
> git bisect bad 0840a7914caa14315a3191178a9f72c742477860
> # bad: [38bc4ac431684498126f9baa3a530e5a132f0173] Merge tag 'drm-fixes-2022-06-24' of git://anongit.freedesktop.org/drm/drm
> git bisect bad 38bc4ac431684498126f9baa3a530e5a132f0173
> # good: [399bd66e219e331976fe6fa6ab81a023c0c97870] Merge tag 'net-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> git bisect good 399bd66e219e331976fe6fa6ab81a023c0c97870
> # bad: [a237cfd6b7469d6f5eeaa45f30128ab78b5281a5] Merge tag 'block-5.19-2022-06-24' of git://git.kernel.dk/linux-block
> git bisect bad a237cfd6b7469d6f5eeaa45f30128ab78b5281a5
> # good: [92f20ff72066d8d7e2ffb655c2236259ac9d1c5d] Merge tag 'pm-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
> git bisect good 92f20ff72066d8d7e2ffb655c2236259ac9d1c5d
> # bad: [598f2404879336277a4320ac5000394b873e049a] Merge tag 'io_uring-5.19-2022-06-24' of git://git.kernel.dk/linux-block
> git bisect bad 598f2404879336277a4320ac5000394b873e049a
> # good: [386e4fb6962b9f248a80f8870aea0870ca603e89] io_uring: use original request task for inflight tracking
> git bisect good 386e4fb6962b9f248a80f8870aea0870ca603e89
> # good: [2d9ef940f89e0ab4fde7ba6f769d82f2a450c35a] Revert "printk: extend console_lock for per-console locking"
> git bisect good 2d9ef940f89e0ab4fde7ba6f769d82f2a450c35a
> # bad: [07a22b61946f0b80065b0ddcc703b715f84355f5] Revert "printk: add functions to prefer direct printing"
> git bisect bad 07a22b61946f0b80065b0ddcc703b715f84355f5
> # bad: [5831788afb17b89c5b531fb60cbd798613ccbb63] Revert "printk: add kthread console printers"
> git bisect bad 5831788afb17b89c5b531fb60cbd798613ccbb63
> # first bad commit: [5831788afb17b89c5b531fb60cbd798613ccbb63] Revert "printk: add kthread console printers"
>
> On 5831788afb17 ("Revert "printk: add kthread console printers"") or on
> 07a22b61946f ("Revert "printk: add functions to prefer direct printing"")
> I was not able to reproduce the exact error conditions that I got on
> the current HEAD or any of the commits along my bisection, but instead I
> got something different [2]:

This kind of problems became visible with introducing printk console
kthreads in 5.19-rc1. And it was actually the main reason why they
were reverted in 5.19-rc4.

The mainline merge commits when they were introduced and reverted:

+ introduced: commit 537e62c865dcb9b91 ("Merge tag 'printk-for-5.19'
of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux")

+ reverted: commit 9d882352bac8f2ff375 ("Merge tag 'printk-for-5.19-rc4'
of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux


The problem might happen when console write() callback is not properly
serialized against other operations on the same port. It might usually
happen even without the kthreads. But the kthreads significantly
increased the chance to hit the problem.

The best solution is to fix the console driver. So far, we
noticed two typical problems:

1. Early console did not take port->lock. For example, see the commit
589f892ac8ef244e47c5a0 ("serial: meson: acquire port->lock in startup()").

2. Console driver expected that the write() callback would be called
with IRQs disabled. For example, see the commit aabdbb1b7a5819e18
("serial: msm_serial: disable interrupts in __msm_console_write().

Best Regards,
Petr

2022-09-26 15:55:59

by Conor Dooley

[permalink] [raw]
Subject: Re: [resend][bug] low-probability console lockups since 5.19

On Mon, Sep 26, 2022 at 12:32:55PM +0200, Petr Mladek wrote:
> On Fri 2022-09-23 17:24:17, Conor Dooley wrote:
> > Hey all,
> >
> > Been bisecting a bug that is causing a boot failure in my CI & have
> > ended up here.. The bug in question is a low(ish) probability lock up
> > of the serial console, I would estimate about 1-in-5 chance on the
> > boards I could actually trigger it on which it has taken me so long
> > to realise that this was an actual problem. Thinking back on it, there
> > were other failures that I would retroactively attribute to this
> > problem too, but I had earlycon disabled and I chalked the hangs at
> > "Starting kernel" down to pre-release hardware since the other boots
> > in a CI job would not fail (we boot the same config multiple times),
> > Silly me I guess!
> >
> > In a boot failure (with earlycon & keep_bootcon enabled) I see the
> > likes of [0]:
> > [ 0.485063] NFS: Registering the id_resolver key type
> > [ 0.491051] Key type id_resolver registered
> > [ 0.495743] Key type id_legacy registered
> > [ 0.500528] nfs4filelayout_init: NFSv4 File Layout Driver Registering...
> > [ 0.508010] nfs4flexfilelayout_init: NFSv4 Flexfile Layout Driver Registering...
> > [ 0.516857] 9p: Installing v9fs 9p2000 file system support
> > [ 0.524032] NET: Registered PF_ALG protocol family
> > [ 0.529562] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 247)
> > [ 0.537817] io scheduler mq-deadline registered
> > [ 0.542889] io scheduler kyber registered
> > [ 0.559682] microchip-pcie 2000000000.pcie: host bridge /soc/pcie@2000000000 ranges:
> > [ 0.568431] microchip-pcie 2000000000.pcie: MEM 0x2008000000..0x2087ffffff -> 0x0008000000
> > [ 0.635633] microchip-pcie 2000000000.pcie: axi read request error
> > [ 0.642602] microchip-pcie 2000000000.pcie: axi read timeout
> > [ 0.776630] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> > [ 0.784133] microchip-pcie 2000000000.pcie: axi write request error
> > [ 0.791100] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> > [ 0.798543] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> > [ 0.805981] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> > [ 0.813421] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> > [ 0.820859] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> > [ 0.828295] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> > [ 1.579668] Freeing initrd memory: 7316K
> >
> > The sec error stuff is unrelated & happens on good boots too - although
> > there are far fewer lines of "sec error" there [1] (excuse the message
> > duplication from the bootcon).
> >
> > I was able to bisect the behaviour to this set of reverts, but I have
> > low confidence as to which commit specifically caused it - it's hard
> > to know that doing 20+ reboots on every commmit I built is enough to
> > make sure I did not make a mistake. I had initially started the
> > bisection between v6.0-rc6 & v6.0-rc1, and that lead me down a sub-tree
> > of net which is why the bisect log starts at a "random" commit as that
> > merge commit was the one blamed by the inital bisection.
> >
> > git bisect start
> > # status: waiting for both good and bad commits
> > # bad: [5e8379351dbde61ea383e514f0f9ecb2c047cf4e] Merge tag 'net-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> > git bisect bad 5e8379351dbde61ea383e514f0f9ecb2c047cf4e
> > # good: [4b0986a3613c92f4ec1bdc7f60ec66fea135991f] Linux 5.18
> > git bisect good 4b0986a3613c92f4ec1bdc7f60ec66fea135991f
> > # good: [fbe86daca0ba878b04fa241b85e26e54d17d4229] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> > git bisect good fbe86daca0ba878b04fa241b85e26e54d17d4229
> > # good: [2b8c612c6102f751e6e3e1bd425f64e9d3d3f638] kernel/reboot: Fix powering off using a non-syscall code paths
> > git bisect good 2b8c612c6102f751e6e3e1bd425f64e9d3d3f638
> > # good: [54c2cc79194c961a213c1d375fe3aa4165664cc4] Merge tag 'usb-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
> > git bisect good 54c2cc79194c961a213c1d375fe3aa4165664cc4
> > # skip: [8f7ac50c97d30ae5ae48da3b516510d05a67b9e5] Merge tag 'sound-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
> > git bisect skip 8f7ac50c97d30ae5ae48da3b516510d05a67b9e5
> > # good: [cdd85786f4b3b9273e4376e69aa95a2d71722764] x86/tdx: Clarify RIP adjustments in #VE handler
> > git bisect good cdd85786f4b3b9273e4376e69aa95a2d71722764
> > # good: [8cc683833726912498130a0130fc3bd0d169ef59] Merge branch 'net-dp83822-fix-interrupt-floods'
> > git bisect good 8cc683833726912498130a0130fc3bd0d169ef59
> > # bad: [0840a7914caa14315a3191178a9f72c742477860] Merge tag 'char-misc-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
> > git bisect bad 0840a7914caa14315a3191178a9f72c742477860
> > # bad: [38bc4ac431684498126f9baa3a530e5a132f0173] Merge tag 'drm-fixes-2022-06-24' of git://anongit.freedesktop.org/drm/drm
> > git bisect bad 38bc4ac431684498126f9baa3a530e5a132f0173
> > # good: [399bd66e219e331976fe6fa6ab81a023c0c97870] Merge tag 'net-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> > git bisect good 399bd66e219e331976fe6fa6ab81a023c0c97870
> > # bad: [a237cfd6b7469d6f5eeaa45f30128ab78b5281a5] Merge tag 'block-5.19-2022-06-24' of git://git.kernel.dk/linux-block
> > git bisect bad a237cfd6b7469d6f5eeaa45f30128ab78b5281a5
> > # good: [92f20ff72066d8d7e2ffb655c2236259ac9d1c5d] Merge tag 'pm-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
> > git bisect good 92f20ff72066d8d7e2ffb655c2236259ac9d1c5d
> > # bad: [598f2404879336277a4320ac5000394b873e049a] Merge tag 'io_uring-5.19-2022-06-24' of git://git.kernel.dk/linux-block
> > git bisect bad 598f2404879336277a4320ac5000394b873e049a
> > # good: [386e4fb6962b9f248a80f8870aea0870ca603e89] io_uring: use original request task for inflight tracking
> > git bisect good 386e4fb6962b9f248a80f8870aea0870ca603e89
> > # good: [2d9ef940f89e0ab4fde7ba6f769d82f2a450c35a] Revert "printk: extend console_lock for per-console locking"
> > git bisect good 2d9ef940f89e0ab4fde7ba6f769d82f2a450c35a
> > # bad: [07a22b61946f0b80065b0ddcc703b715f84355f5] Revert "printk: add functions to prefer direct printing"
> > git bisect bad 07a22b61946f0b80065b0ddcc703b715f84355f5
> > # bad: [5831788afb17b89c5b531fb60cbd798613ccbb63] Revert "printk: add kthread console printers"
> > git bisect bad 5831788afb17b89c5b531fb60cbd798613ccbb63
> > # first bad commit: [5831788afb17b89c5b531fb60cbd798613ccbb63] Revert "printk: add kthread console printers"
> >
> > On 5831788afb17 ("Revert "printk: add kthread console printers"") or on
> > 07a22b61946f ("Revert "printk: add functions to prefer direct printing"")
> > I was not able to reproduce the exact error conditions that I got on
> > the current HEAD or any of the commits along my bisection, but instead I
> > got something different [2]:
>
> This kind of problems became visible with introducing printk console
> kthreads in 5.19-rc1. And it was actually the main reason why they
> were reverted in 5.19-rc4.

Right. I did notice that between rc1 & rc4 a lot of garbage messages
as two printers "competed".

> The mainline merge commits when they were introduced and reverted:
>
> + introduced: commit 537e62c865dcb9b91 ("Merge tag 'printk-for-5.19'
> of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux")

Prior to this my console behaved normally. After it, it started having
the "competition" I mentioned above. Unfortunately I did not save any
logs from this.

> + reverted: commit 9d882352bac8f2ff375 ("Merge tag 'printk-for-5.19-rc4'
> of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

After this commit is where the lockups started, I cannot reproduce it
several dozens of reboots prior to v5.19-rc1. Maybe the behaviour was
already there, but I cannot trigger any misbehaviour prior to v5.19-rc1.

> The problem might happen when console write() callback is not properly
> serialized against other operations on the same port. It might usually
> happen even without the kthreads. But the kthreads significantly
> increased the chance to hit the problem.
>
> The best solution is to fix the console driver. So far, we
> noticed two typical problems:
>
> 1. Early console did not take port->lock. For example, see the commit
> 589f892ac8ef244e47c5a0 ("serial: meson: acquire port->lock in startup()").

If I don't have an early console, I lock up at "Starting kernel" & never
get any output. Does that rule out the early console?
>
> 2. Console driver expected that the write() callback would be called
> with IRQs disabled. For example, see the commit aabdbb1b7a5819e18
> ("serial: msm_serial: disable interrupts in __msm_console_write().

I'll take a look at these I suppose & the new threaded printer stuff,
and see if either makes a difference for me. Unfortunately, ~zero chance
that I get anothing done prior to the merge window, but this is broken
for me in v5.19 so I guess that makes no odds.

I guess we should also sic regzbot on this thread so it does not get
lost?

Thanks,
Conor.

2022-09-28 17:03:19

by Conor Dooley

[permalink] [raw]
Subject: Re: [resend][bug] low-probability console lockups since 5.19

On Fri, Sep 23, 2022 at 05:24:17PM +0100, Conor Dooley wrote:
> Hey all,
>
> Been bisecting a bug that is causing a boot failure in my CI & have
> ended up here.. The bug in question is a low(ish) probability lock up
> of the serial console, I would estimate about 1-in-5 chance on the
> boards I could actually trigger it on which it has taken me so long
> to realise that this was an actual problem. Thinking back on it, there
> were other failures that I would retroactively attribute to this
> problem too, but I had earlycon disabled and I chalked the hangs at
> "Starting kernel" down to pre-release hardware since the other boots
> in a CI job would not fail (we boot the same config multiple times),
> Silly me I guess!
>
> In a boot failure (with earlycon & keep_bootcon enabled) I see the
> likes of [0]:
> [ 0.485063] NFS: Registering the id_resolver key type
> [ 0.491051] Key type id_resolver registered
> [ 0.495743] Key type id_legacy registered
> [ 0.500528] nfs4filelayout_init: NFSv4 File Layout Driver Registering...
> [ 0.508010] nfs4flexfilelayout_init: NFSv4 Flexfile Layout Driver Registering...
> [ 0.516857] 9p: Installing v9fs 9p2000 file system support
> [ 0.524032] NET: Registered PF_ALG protocol family
> [ 0.529562] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 247)
> [ 0.537817] io scheduler mq-deadline registered
> [ 0.542889] io scheduler kyber registered
> [ 0.559682] microchip-pcie 2000000000.pcie: host bridge /soc/pcie@2000000000 ranges:
> [ 0.568431] microchip-pcie 2000000000.pcie: MEM 0x2008000000..0x2087ffffff -> 0x0008000000
> [ 0.635633] microchip-pcie 2000000000.pcie: axi read request error
> [ 0.642602] microchip-pcie 2000000000.pcie: axi read timeout
> [ 0.776630] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> [ 0.784133] microchip-pcie 2000000000.pcie: axi write request error
> [ 0.791100] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> [ 0.798543] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> [ 0.805981] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> [ 0.813421] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> [ 0.820859] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> [ 0.828295] microchip-pcie 2000000000.pcie: sec error in axi2pcie buffer
> [ 1.579668] Freeing initrd memory: 7316K
>
> The sec error stuff is unrelated & happens on good boots too - although
> there are far fewer lines of "sec error" there [1] (excuse the message
> duplication from the bootcon).
>
> I was able to bisect the behaviour to this set of reverts, but I have
> low confidence as to which commit specifically caused it - it's hard
> to know that doing 20+ reboots on every commmit I built is enough to
> make sure I did not make a mistake. I had initially started the
> bisection between v6.0-rc6 & v6.0-rc1, and that lead me down a sub-tree
> of net which is why the bisect log starts at a "random" commit as that
> merge commit was the one blamed by the inital bisection.
>
> git bisect start
> # status: waiting for both good and bad commits
> # bad: [5e8379351dbde61ea383e514f0f9ecb2c047cf4e] Merge tag 'net-5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> git bisect bad 5e8379351dbde61ea383e514f0f9ecb2c047cf4e
> # good: [4b0986a3613c92f4ec1bdc7f60ec66fea135991f] Linux 5.18
> git bisect good 4b0986a3613c92f4ec1bdc7f60ec66fea135991f
> # good: [fbe86daca0ba878b04fa241b85e26e54d17d4229] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> git bisect good fbe86daca0ba878b04fa241b85e26e54d17d4229
> # good: [2b8c612c6102f751e6e3e1bd425f64e9d3d3f638] kernel/reboot: Fix powering off using a non-syscall code paths
> git bisect good 2b8c612c6102f751e6e3e1bd425f64e9d3d3f638
> # good: [54c2cc79194c961a213c1d375fe3aa4165664cc4] Merge tag 'usb-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
> git bisect good 54c2cc79194c961a213c1d375fe3aa4165664cc4
> # skip: [8f7ac50c97d30ae5ae48da3b516510d05a67b9e5] Merge tag 'sound-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
> git bisect skip 8f7ac50c97d30ae5ae48da3b516510d05a67b9e5
> # good: [cdd85786f4b3b9273e4376e69aa95a2d71722764] x86/tdx: Clarify RIP adjustments in #VE handler
> git bisect good cdd85786f4b3b9273e4376e69aa95a2d71722764
> # good: [8cc683833726912498130a0130fc3bd0d169ef59] Merge branch 'net-dp83822-fix-interrupt-floods'
> git bisect good 8cc683833726912498130a0130fc3bd0d169ef59
> # bad: [0840a7914caa14315a3191178a9f72c742477860] Merge tag 'char-misc-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
> git bisect bad 0840a7914caa14315a3191178a9f72c742477860
> # bad: [38bc4ac431684498126f9baa3a530e5a132f0173] Merge tag 'drm-fixes-2022-06-24' of git://anongit.freedesktop.org/drm/drm
> git bisect bad 38bc4ac431684498126f9baa3a530e5a132f0173
> # good: [399bd66e219e331976fe6fa6ab81a023c0c97870] Merge tag 'net-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> git bisect good 399bd66e219e331976fe6fa6ab81a023c0c97870
> # bad: [a237cfd6b7469d6f5eeaa45f30128ab78b5281a5] Merge tag 'block-5.19-2022-06-24' of git://git.kernel.dk/linux-block
> git bisect bad a237cfd6b7469d6f5eeaa45f30128ab78b5281a5
> # good: [92f20ff72066d8d7e2ffb655c2236259ac9d1c5d] Merge tag 'pm-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
> git bisect good 92f20ff72066d8d7e2ffb655c2236259ac9d1c5d
> # bad: [598f2404879336277a4320ac5000394b873e049a] Merge tag 'io_uring-5.19-2022-06-24' of git://git.kernel.dk/linux-block
> git bisect bad 598f2404879336277a4320ac5000394b873e049a
> # good: [386e4fb6962b9f248a80f8870aea0870ca603e89] io_uring: use original request task for inflight tracking
> git bisect good 386e4fb6962b9f248a80f8870aea0870ca603e89
> # good: [2d9ef940f89e0ab4fde7ba6f769d82f2a450c35a] Revert "printk: extend console_lock for per-console locking"
> git bisect good 2d9ef940f89e0ab4fde7ba6f769d82f2a450c35a
> # bad: [07a22b61946f0b80065b0ddcc703b715f84355f5] Revert "printk: add functions to prefer direct printing"
> git bisect bad 07a22b61946f0b80065b0ddcc703b715f84355f5
> # bad: [5831788afb17b89c5b531fb60cbd798613ccbb63] Revert "printk: add kthread console printers"
> git bisect bad 5831788afb17b89c5b531fb60cbd798613ccbb63
> # first bad commit: [5831788afb17b89c5b531fb60cbd798613ccbb63] Revert "printk: add kthread console printers"
>
> On 5831788afb17 ("Revert "printk: add kthread console printers"") or on
> 07a22b61946f ("Revert "printk: add functions to prefer direct printing"")
> I was not able to reproduce the exact error conditions that I got on
> the current HEAD or any of the commits along my bisection, but instead I
> got something different [2]:
>
> [ 0.471992] 9p: Installing v9fs 9p2000 file system support
> [ 0.479049] NET: Registered PF_ALG protocol family
> [ 0.484585] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 247)
> [ 0.492813] io scheduler mq-deadline registered
> [ 0.497871] io scheduler kyber registered
> [ 0.513472] microchip-pcie 2000000000.pcie: host bridge /soc/pcie@2000000000 ranges:
> [ 0.522194] microchip-pcie 2000000000.pcie: MEM 0x2008000000..0x2087ffffff -> 0x0008000000
> [ 0.588518] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
> [ 0.596027] microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
> [ 0.603467] microchip-pcie 2000000000.pcie: axi read request error
> [ 0.610322] microchip-pcie 2000000000.pcie: axi read timeout
> [ 0.616603] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
> [ 0.624025] microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
> [ 0.631450] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
> [ 0.638877] microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
> [ 0.646300] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
> [ 0.653725] microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
> [ 1.516378] Freeing initrd memory: 7336K
> [ 5.591881] mc_event_handler: 667402 callbacks suppressed
> [ 5.591909] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
> [ 5.605290] microchip-pcie 2000000000.pcie: ded error in pcie2axi buffer
> [ 5.612704] microchip-pcie 2000000000.pcie: sec error in pcie2axi buffer
>
> <truncated>
>
> [ 21.539878] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 21.546651] rcu: 0-...0: (1 GPs behind) idle=19f/1/0x4000000000000002 softirq=34/36 fqs=2626
> [ 21.556177] (detected by 1, t=5256 jiffies, g=-1151, q=1143 ncpus=4)
> [ 21.563293] Task dump for CPU 0:
> [ 21.566860] task:swapper/0 state:R running task stack: 0 pid: 1 ppid: 0 flags:0x00000008
> [ 21.577823] Call Trace:
>
> Unfortunately I do not really have any more information than that for
> you, I am hoping that you (plural) could be of some help as to where
> I should start looking - although after spending every evening this
> week bisecting/setting up so that I could automate my bisection it'll
> be a few days before I could really focus on this again.
>
> My system is an embedded RISC-V dev board (PolarFire SoC Icicle kit).
>
> There is also a failing build in the ClangBuiltLinux CI [3] that is
> seeing similar symptoms, but I was unable to reproduce this problem
> doing 1000s of reboots of a known-bad commit in QEMU myself.
>
> LMK if you need any more info,
> Conor.
>
> [0] https://gist.githubusercontent.com/ConchuOD/575f4bfe3d68b9b32227740365b04401/raw/c69dfffd21183898edaeb223a9455ebc28edcf65/gistfile1.txt
> [1] https://gist.github.com/ConchuOD/575f4bfe3d68b9b32227740365b04401?permalink_comment_id=4313185#gistcomment-4313185
> [2] https://gist.github.com/ConchuOD/575f4bfe3d68b9b32227740365b04401?permalink_comment_id=4313202#gistcomment-4313202
> [3] https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/3090070308/jobs/4999131449#step:5:270
>

#regzbot introduced: 5831788afb17b89c5b531fb60cbd798613ccbb63 ^
Hopefully I did this correctly...

I picked that commit as that's where things start going haywire.

Thanks,
Conor.

2022-09-29 09:23:07

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [resend][bug] low-probability console lockups since 5.19

Hi Conor

On 28.09.22 18:55, Conor Dooley wrote:
> On Fri, Sep 23, 2022 at 05:24:17PM +0100, Conor Dooley wrote:
>>
>> Been bisecting a bug that is causing a boot failure in my CI & have
>> ended up here.. The bug in question is a low(ish) probability lock up
>> of the serial console, I would estimate about 1-in-5 chance on the
>> boards I could actually trigger it on which it has taken me so long
>> to realise that this was an actual problem. Thinking back on it, there
>> were other failures that I would retroactively attribute to this
>> problem too, but I had earlycon disabled
> [...]
> #regzbot introduced: 5831788afb17b89c5b531fb60cbd798613ccbb63 ^
> Hopefully I did this correctly...

Yes, you did, thx for this. I already had been watching this thread
manually and was a bit unsure what to do with it.

> I picked that commit as that's where things start going haywire.

There is one thing I wonder when skimming this thread: was there maybe
some other change somewhere in the kernel between the introduction and
the revert of the printk console kthreads patches that is the real
culprit here that makes existing, older races easier to hit? But I guess
in the end that would be very hard to find and it's easier to fix the
problem in the console driver... :-/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

#regzbot backburner: tricky situation that might take some time to get
resolved

2022-09-29 10:13:00

by Conor Dooley

[permalink] [raw]
Subject: Re: [resend][bug] low-probability console lockups since 5.19

On Thu, Sep 29, 2022 at 11:06:01AM +0200, Thorsten Leemhuis wrote:
> Hi Conor
>
> On 28.09.22 18:55, Conor Dooley wrote:
> > On Fri, Sep 23, 2022 at 05:24:17PM +0100, Conor Dooley wrote:
> >>
> >> Been bisecting a bug that is causing a boot failure in my CI & have
> >> ended up here.. The bug in question is a low(ish) probability lock up
> >> of the serial console, I would estimate about 1-in-5 chance on the
> >> boards I could actually trigger it on which it has taken me so long
> >> to realise that this was an actual problem. Thinking back on it, there
> >> were other failures that I would retroactively attribute to this
> >> problem too, but I had earlycon disabled
> > [...]
> > #regzbot introduced: 5831788afb17b89c5b531fb60cbd798613ccbb63 ^
> > Hopefully I did this correctly...
>
> Yes, you did, thx for this. I already had been watching this thread
> manually and was a bit unsure what to do with it.

Great, thanks.

>
> > I picked that commit as that's where things start going haywire.
>
> There is one thing I wonder when skimming this thread: was there maybe
> some other change somewhere in the kernel between the introduction and
> the revert of the printk console kthreads patches that is the real
> culprit here that makes existing, older races easier to hit? But I guess
> in the end that would be very hard to find and it's easier to fix the
> problem in the console driver... :-/

Entirely possible that something arrived in the middle, yeah. I've done
100s of reboots on that interim section, albeit with the threaded
printers enabled, as I restarted the bisection several times & never hit
this failure then.

I don't know anything about console/printk/serial drivers unfortunately
so I will almost certainly not be able to find the problem by
inspection. I'd rather submit patches than send reports, but I really
really need some help here. I looked at the two patterns Petr suggested,
but the former I am not sure applies since the issue is present even
when earlycon is disabled & the latter appears (to my untrained eye) to
be accounted for in the 8250 driver.

Thanks,
Conor.

2022-09-29 10:43:19

by Petr Mladek

[permalink] [raw]
Subject: Re: [resend][bug] low-probability console lockups since 5.19

On Thu 2022-09-29 10:29:05, Conor Dooley wrote:
> On Thu, Sep 29, 2022 at 11:06:01AM +0200, Thorsten Leemhuis wrote:
> > Hi Conor
> >
> > On 28.09.22 18:55, Conor Dooley wrote:
> > > On Fri, Sep 23, 2022 at 05:24:17PM +0100, Conor Dooley wrote:
> > >>
> > >> Been bisecting a bug that is causing a boot failure in my CI & have
> > >> ended up here.. The bug in question is a low(ish) probability lock up
> > >> of the serial console, I would estimate about 1-in-5 chance on the
> > >> boards I could actually trigger it on which it has taken me so long
> > >> to realise that this was an actual problem. Thinking back on it, there
> > >> were other failures that I would retroactively attribute to this
> > >> problem too, but I had earlycon disabled
> >
> > There is one thing I wonder when skimming this thread: was there maybe
> > some other change somewhere in the kernel between the introduction and
> > the revert of the printk console kthreads patches that is the real
> > culprit here that makes existing, older races easier to hit? But I guess
> > in the end that would be very hard to find and it's easier to fix the
> > problem in the console driver... :-/
>
> Entirely possible that something arrived in the middle, yeah. I've done
> 100s of reboots on that interim section, albeit with the threaded
> printers enabled, as I restarted the bisection several times & never hit
> this failure then.

Interesting. I wonder if the used console was fixed during the window
when the kthreads were enabled.

> I don't know anything about console/printk/serial drivers unfortunately
> so I will almost certainly not be able to find the problem by
> inspection. I'd rather submit patches than send reports, but I really
> really need some help here. I looked at the two patterns Petr suggested,
> but the former I am not sure applies since the issue is present even
> when earlycon is disabled & the latter appears (to my untrained eye) to
> be accounted for in the 8250 driver.

The problem with the missing port->lock is visible only when the
early console is enabled. But It is really hard to hit without
the kthreads.

The problem with enabled IRQs was visible only with kthreads. The
original code called console->write() callback already with IRQs
disabled.

The kthreads called console->write() callback with IRQs enabled.
It made sense. They need to be disabled only when really needed
and the tested drivers did this correctly.

Best Reagrds,
Petr

2022-09-29 11:45:45

by Conor Dooley

[permalink] [raw]
Subject: Re: [resend][bug] low-probability console lockups since 5.19

On Thu, Sep 29, 2022 at 12:12:39PM +0200, Petr Mladek wrote:
> On Thu 2022-09-29 10:29:05, Conor Dooley wrote:
> > On Thu, Sep 29, 2022 at 11:06:01AM +0200, Thorsten Leemhuis wrote:
> > > Hi Conor
> > >
> > > On 28.09.22 18:55, Conor Dooley wrote:
> > > > On Fri, Sep 23, 2022 at 05:24:17PM +0100, Conor Dooley wrote:
> > > >>
> > > >> Been bisecting a bug that is causing a boot failure in my CI & have
> > > >> ended up here.. The bug in question is a low(ish) probability lock up
> > > >> of the serial console, I would estimate about 1-in-5 chance on the
> > > >> boards I could actually trigger it on which it has taken me so long
> > > >> to realise that this was an actual problem. Thinking back on it, there
> > > >> were other failures that I would retroactively attribute to this
> > > >> problem too, but I had earlycon disabled
> > >
> > > There is one thing I wonder when skimming this thread: was there maybe
> > > some other change somewhere in the kernel between the introduction and
> > > the revert of the printk console kthreads patches that is the real
> > > culprit here that makes existing, older races easier to hit? But I guess
> > > in the end that would be very hard to find and it's easier to fix the
> > > problem in the console driver... :-/
> >
> > Entirely possible that something arrived in the middle, yeah. I've done
> > 100s of reboots on that interim section, albeit with the threaded
> > printers enabled, as I restarted the bisection several times & never hit
> > this failure then.
>
> Interesting. I wonder if the used console was fixed during the window
> when the kthreads were enabled.

I will, possibly tonight but probably not, run the bisection again with
the threaded printer merge reverted. Hopefully it is not filled with
conflicts if I go that way...

>
> > I don't know anything about console/printk/serial drivers unfortunately
> > so I will almost certainly not be able to find the problem by
> > inspection. I'd rather submit patches than send reports, but I really
> > really need some help here. I looked at the two patterns Petr suggested,
> > but the former I am not sure applies since the issue is present even
> > when earlycon is disabled & the latter appears (to my untrained eye) to
> > be accounted for in the 8250 driver.
>
> The problem with the missing port->lock is visible only when the
> early console is enabled. But It is really hard to hit without
> the kthreads.

Right, so sounds like that can be excluded since my CI was hitting it
with earlycon disabled. I'll triple check this, possibly later today.

>
> The problem with enabled IRQs was visible only with kthreads. The
> original code called console->write() callback already with IRQs
> disabled.
>
> The kthreads called console->write() callback with IRQs enabled.
> It made sense. They need to be disabled only when really needed
> and the tested drivers did this correctly.

And that sounds like it can be also excluded, since my issue started
post-revert. Unless there's still some kthreads code in there that was
not reverted?

Thanks,
Conor.


2022-09-29 15:00:01

by John Ogness

[permalink] [raw]
Subject: Re: [resend][bug] low-probability console lockups since 5.19

On 2022-09-29, Conor Dooley <[email protected]> wrote:
> I will, possibly tonight but probably not, run the bisection again
> with the threaded printer merge reverted. Hopefully it is not filled
> with conflicts if I go that way...

It might be easier for you to just apply the following patch to run your
tests . This simple patch disables kthread usage.

John Ogness


diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index b095fb5f5f61..9679887c0805 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3605,6 +3605,10 @@ static int __init printk_activate_kthreads(void)
{
struct console *con;

+ /* hack: never activate kthreads */
+ if (!printk_kthreads_available)
+ return 0;
+
console_lock();
printk_kthreads_available = true;
for_each_console(con)

2022-09-29 22:30:34

by Conor Dooley

[permalink] [raw]
Subject: Re: [resend][bug] low-probability console lockups since 5.19

On Thu, Sep 29, 2022 at 04:19:03PM +0206, John Ogness wrote:
> On 2022-09-29, Conor Dooley <[email protected]> wrote:
> > I will, possibly tonight but probably not, run the bisection again
> > with the threaded printer merge reverted. Hopefully it is not filled
> > with conflicts if I go that way...
>
> It might be easier for you to just apply the following patch to run your
> tests . This simple patch disables kthread usage.

Thanks a million John, restarted bisection from master prior to the
merge of the reverts & have ended up finding the actual issue thanks
to your hack - a RISC-V specific clocksource change. I'll close the
regzbot issue & open a new one against the actual source of my issues.

#regzbot invalid: bisection lead me up the garden path

It looks like I have an issue that's caused by the PCI controller driver
on this SoC* which triggers RCU stalls - but thanks to the clocksource
issue the RCU stalls never get reported & the system just locks up. The
system locked up in the same place on the kthreads branch so I guess I
just got unlucky with my bisection because of that.

Thanks again for the hack to bisect with,
Conor.

* likely due to running the wrong FPGA bitstream, the root port's
connection to the cpus is configured by the FPGA fabric.

>
> John Ogness
>
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index b095fb5f5f61..9679887c0805 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -3605,6 +3605,10 @@ static int __init printk_activate_kthreads(void)
> {
> struct console *con;
>
> + /* hack: never activate kthreads */
> + if (!printk_kthreads_available)
> + return 0;
> +
> console_lock();
> printk_kthreads_available = true;
> for_each_console(con)