Hello Linus, Petr, John,
the series
https://lore.kernel.org/linux-kernel/YouKQw72H7y9EJQK@alley/
merged by commit 537e62c865dc ("Merge tag 'printk-for-5.19' of
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux")
and in particular the commit 09c5ba0aa2fc ("printk: add kthread console
printers")
causes a regression on arm64 (Marvell CN9130-CRB board) where the
system boot freezes in most cases (and is unusable until restarted by
watchdog), or, in some cases boots, but the console output gets mangled
for a while (the serial console spits garbage characters).
The stall example:
...
[ 1.108141] io scheduler mq-deadline registered
[ 1.108143] io scheduler kyber registered
[ 1.163484] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
[
Or another:
...
[ 1.108155] io scheduler mq-deadline registered
[ 1.108156] io scheduler kyber registered
[ 1.162991] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
[ 1.163361] armada-cp110-pinctrl f2440000.system-controller:pinctrl: registered pinctrl driver
[ 1.180827] mv_xor_v2 f0400000.xor: Marvell Version 2 XOR driver
[
The garbage example:
...
[ 0.920951] raid6: using neon recovery algorithm
[ 0.921228] iommu: Default domain type: Translated
%
gb@k+cFL/[ 4.954974] DSA: tree 0 setup
[ 4.955286] cfg80211: Loading compiled-in X.509 certificates for regulatory database
I did not have time to look at the contents of the commit yet to try to
fix it myself, and won't have time for another two weeks. But I thought
I should let you know now.
Marek
Hi Marek,
On 2022-06-19, Marek Behún <[email protected]> wrote:
> the series
> https://lore.kernel.org/linux-kernel/YouKQw72H7y9EJQK@alley/
> merged by commit 537e62c865dc ("Merge tag 'printk-for-5.19' of
> git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux")
>
> and in particular the commit 09c5ba0aa2fc ("printk: add kthread console
> printers")
>
> causes a regression on arm64 (Marvell CN9130-CRB board) where the
> system boot freezes in most cases (and is unusable until restarted by
> watchdog), or, in some cases boots, but the console output gets mangled
> for a while (the serial console spits garbage characters).
Could you provide your kernel config and a complete kernel boot log (at
least as much as you get)? I would like to investigate the serial
driver. What you are describing sounds more like a serial bug. The
kthread console series has already uncovered serial driver bugs in
multiple drivers.
Thanks for reporting this.
John Ogness
On 2022-06-20, John Ogness <[email protected]> wrote:
> On 2022-06-19, Marek Behún <[email protected]> wrote:
>> the series
>> https://lore.kernel.org/linux-kernel/YouKQw72H7y9EJQK@alley/
>> merged by commit 537e62c865dc ("Merge tag 'printk-for-5.19' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux")
>>
>> and in particular the commit 09c5ba0aa2fc ("printk: add kthread console
>> printers")
>>
>> causes a regression on arm64 (Marvell CN9130-CRB board) where the
>> system boot freezes in most cases (and is unusable until restarted by
>> watchdog), or, in some cases boots, but the console output gets mangled
>> for a while (the serial console spits garbage characters).
Also, have you tried 5.19-rc3? There were several fixes that came in
after rc2.
John Ogness
On (22/06/19 20:49), Marek Beh?n wrote:
[..]
> causes a regression on arm64 (Marvell CN9130-CRB board) where the
> system boot freezes in most cases (and is unusable until restarted by
> watchdog), or, in some cases boots, but the console output gets mangled
> for a while (the serial console spits garbage characters).
Can you please try disabling console kthreads and see how the boot
process goes? Just `return 0` from printk_activate_kthreads() (I think
this should do the trick).
> The stall example:
>
> ...
> [ 1.108141] io scheduler mq-deadline registered
> [ 1.108143] io scheduler kyber registered
> [ 1.163484] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> [
>
> Or another:
>
> ...
> [ 1.108155] io scheduler mq-deadline registered
> [ 1.108156] io scheduler kyber registered
> [ 1.162991] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> [ 1.163361] armada-cp110-pinctrl f2440000.system-controller:pinctrl: registered pinctrl driver
> [ 1.180827] mv_xor_v2 f0400000.xor: Marvell Version 2 XOR driver
> [
>
> The garbage example:
>
> ...
> [ 0.920951] raid6: using neon recovery algorithm
> [ 0.921228] iommu: Default domain type: Translated
> %
>
> gb@k+cFL/[ 4.954974] DSA: tree 0 setup
> [ 4.955286] cfg80211: Loading compiled-in X.509 certificates for regulatory database
This is pretty suspicious. I don't see how console kthreads would
corrupt the output. I suspect that something else is going on, some
memory corruption, etc.
On Mon, 20 Jun 2022 00:29:16 +0206
John Ogness <[email protected]> wrote:
> Hi Marek,
>
> On 2022-06-19, Marek Behún <[email protected]> wrote:
> > the series
> > https://lore.kernel.org/linux-kernel/YouKQw72H7y9EJQK@alley/
> > merged by commit 537e62c865dc ("Merge tag 'printk-for-5.19' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux")
> >
> > and in particular the commit 09c5ba0aa2fc ("printk: add kthread console
> > printers")
> >
> > causes a regression on arm64 (Marvell CN9130-CRB board) where the
> > system boot freezes in most cases (and is unusable until restarted by
> > watchdog), or, in some cases boots, but the console output gets mangled
> > for a while (the serial console spits garbage characters).
>
> Could you provide your kernel config and a complete kernel boot log (at
> least as much as you get)? I would like to investigate the serial
> driver. What you are describing sounds more like a serial bug. The
> kthread console series has already uncovered serial driver bugs in
> multiple drivers.
>
> Thanks for reporting this.
>
> John Ogness
Hi John,
attaching bootlogs and config.
Marek
On Mon, 20 Jun 2022 14:23:28 +0900
Sergey Senozhatsky <[email protected]> wrote:
> On (22/06/19 20:49), Marek Behún wrote:
> [..]
> > causes a regression on arm64 (Marvell CN9130-CRB board) where the
> > system boot freezes in most cases (and is unusable until restarted by
> > watchdog), or, in some cases boots, but the console output gets mangled
> > for a while (the serial console spits garbage characters).
>
> Can you please try disabling console kthreads and see how the boot
> process goes? Just `return 0` from printk_activate_kthreads() (I think
> this should do the trick).
This indeed makes the problem go away...
>
> > The stall example:
> >
> > ...
> > [ 1.108141] io scheduler mq-deadline registered
> > [ 1.108143] io scheduler kyber registered
> > [ 1.163484] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> > [
> >
> > Or another:
> >
> > ...
> > [ 1.108155] io scheduler mq-deadline registered
> > [ 1.108156] io scheduler kyber registered
> > [ 1.162991] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> > [ 1.163361] armada-cp110-pinctrl f2440000.system-controller:pinctrl: registered pinctrl driver
> > [ 1.180827] mv_xor_v2 f0400000.xor: Marvell Version 2 XOR driver
> > [
> >
> > The garbage example:
> >
> > ...
> > [ 0.920951] raid6: using neon recovery algorithm
> > [ 0.921228] iommu: Default domain type: Translated
> > %
> >
> > gb@k+cFL/[ 4.954974] DSA: tree 0 setup
> > [ 4.955286] cfg80211: Loading compiled-in X.509 certificates for regulatory database
>
> This is pretty suspicious. I don't see how console kthreads would
> corrupt the output. I suspect that something else is going on, some
> memory corruption, etc.
Maybe multiple threads are writing to serial registers, or something...
Marek
On Mon, 20 Jun 2022 00:37:34 +0206
John Ogness <[email protected]> wrote:
> On 2022-06-20, John Ogness <[email protected]> wrote:
> > On 2022-06-19, Marek Behún <[email protected]> wrote:
> >> the series
> >> https://lore.kernel.org/linux-kernel/YouKQw72H7y9EJQK@alley/
> >> merged by commit 537e62c865dc ("Merge tag 'printk-for-5.19' of
> >> git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux")
> >>
> >> and in particular the commit 09c5ba0aa2fc ("printk: add kthread console
> >> printers")
> >>
> >> causes a regression on arm64 (Marvell CN9130-CRB board) where the
> >> system boot freezes in most cases (and is unusable until restarted by
> >> watchdog), or, in some cases boots, but the console output gets mangled
> >> for a while (the serial console spits garbage characters).
>
> Also, have you tried 5.19-rc3? There were several fixes that came in
> after rc2.
>
> John Ogness
rc3 still has the issue.
Marek
On (22/06/20 12:02), Marek Beh?n wrote:
> > On (22/06/19 20:49), Marek Beh?n wrote:
> > [..]
> > > causes a regression on arm64 (Marvell CN9130-CRB board) where the
> > > system boot freezes in most cases (and is unusable until restarted by
> > > watchdog), or, in some cases boots, but the console output gets mangled
> > > for a while (the serial console spits garbage characters).
> >
> > Can you please try disabling console kthreads and see how the boot
> > process goes? Just `return 0` from printk_activate_kthreads() (I think
> > this should do the trick).
>
> This indeed makes the problem go away...
Oh... OK. Didn't expect that :)
> > > The garbage example:
> > >
> > > ...
> > > [ 0.920951] raid6: using neon recovery algorithm
> > > [ 0.921228] iommu: Default domain type: Translated
> > > %
> > >
> > > gb@k+cFL/[ 4.954974] DSA: tree 0 setup
> > > [ 4.955286] cfg80211: Loading compiled-in X.509 certificates for regulatory database
> >
> > This is pretty suspicious. I don't see how console kthreads would
> > corrupt the output. I suspect that something else is going on, some
> > memory corruption, etc.
>
> Maybe multiple threads are writing to serial registers, or something...
That's possible. Console drivers usually should grab port->lock for
write(), but maybe something is missing in the driver you use. What
console driver are you using?
On Mon, 20 Jun 2022 19:13:37 +0900
Sergey Senozhatsky <[email protected]> wrote:
> On (22/06/20 12:02), Marek Behún wrote:
> > > On (22/06/19 20:49), Marek Behún wrote:
> > > [..]
> > > > causes a regression on arm64 (Marvell CN9130-CRB board) where the
> > > > system boot freezes in most cases (and is unusable until restarted by
> > > > watchdog), or, in some cases boots, but the console output gets mangled
> > > > for a while (the serial console spits garbage characters).
> > >
> > > Can you please try disabling console kthreads and see how the boot
> > > process goes? Just `return 0` from printk_activate_kthreads() (I think
> > > this should do the trick).
> >
> > This indeed makes the problem go away...
>
> Oh... OK. Didn't expect that :)
>
> > > > The garbage example:
> > > >
> > > > ...
> > > > [ 0.920951] raid6: using neon recovery algorithm
> > > > [ 0.921228] iommu: Default domain type: Translated
> > > > %
> > > >
> > > > gb@k+cFL/[ 4.954974] DSA: tree 0 setup
> > > > [ 4.955286] cfg80211: Loading compiled-in X.509 certificates for regulatory database
> > >
> > > This is pretty suspicious. I don't see how console kthreads would
> > > corrupt the output. I suspect that something else is going on, some
> > > memory corruption, etc.
> >
> > Maybe multiple threads are writing to serial registers, or something...
>
> That's possible. Console drivers usually should grab port->lock for
> write(), but maybe something is missing in the driver you use. What
> console driver are you using?
compatible = "snps,dw-apb-uart", so drivers/tty/serial/8250/8250_dw.c
seems that the function dw8250_serial_out() does not use the spinlock...
Marek
Hi Marek,
On Mon, 20 Jun 2022 at 03:51, Marek Behún <[email protected]> wrote:
>
> Hello Linus, Petr, John,
>
> the series
> https://lore.kernel.org/linux-kernel/YouKQw72H7y9EJQK@alley/
> merged by commit 537e62c865dc ("Merge tag 'printk-for-5.19' of
> git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux")
>
> and in particular the commit 09c5ba0aa2fc ("printk: add kthread console
> printers")
>
> causes a regression on arm64 (Marvell CN9130-CRB board) where the
> system boot freezes in most cases (and is unusable until restarted by
> watchdog), or, in some cases boots, but the console output gets mangled
> for a while (the serial console spits garbage characters).
FWIW I have seen a similar issue to this on a totally different
machine/board on arm (Sigmastar SSD202D, "mstarv7" in the kernel..)
since rebasing my tree on 5.19.
I don't see lock ups all the time but every few boots the machine will
deadlock. Usually with garbage as the last message on serial.
With both serial and a display as consoles on the command line I could
sometimes see more output on the display but I didn't manage to debug
it.
Adding quiet to the command line made it go away so I've been doing
that as workaround.
Serial driver in question is 8250_dw.c
Cheers,
Daniel
+Cc: Ilpo, a new maintainer for the driver
On Mon, Jun 20, 2022 at 12:46 PM Marek Behún <[email protected]> wrote:
>
> On Mon, 20 Jun 2022 19:13:37 +0900
> Sergey Senozhatsky <[email protected]> wrote:
>
> > On (22/06/20 12:02), Marek Behún wrote:
> > > > On (22/06/19 20:49), Marek Behún wrote:
> > > > [..]
> > > > > causes a regression on arm64 (Marvell CN9130-CRB board) where the
> > > > > system boot freezes in most cases (and is unusable until restarted by
> > > > > watchdog), or, in some cases boots, but the console output gets mangled
> > > > > for a while (the serial console spits garbage characters).
> > > >
> > > > Can you please try disabling console kthreads and see how the boot
> > > > process goes? Just `return 0` from printk_activate_kthreads() (I think
> > > > this should do the trick).
> > >
> > > This indeed makes the problem go away...
> >
> > Oh... OK. Didn't expect that :)
> >
> > > > > The garbage example:
> > > > >
> > > > > ...
> > > > > [ 0.920951] raid6: using neon recovery algorithm
> > > > > [ 0.921228] iommu: Default domain type: Translated
> > > > > %
> > > > >
> > > > > gb@k+cFL/[ 4.954974] DSA: tree 0 setup
> > > > > [ 4.955286] cfg80211: Loading compiled-in X.509 certificates for regulatory database
> > > >
> > > > This is pretty suspicious. I don't see how console kthreads would
> > > > corrupt the output. I suspect that something else is going on, some
> > > > memory corruption, etc.
> > >
> > > Maybe multiple threads are writing to serial registers, or something...
> >
> > That's possible. Console drivers usually should grab port->lock for
> > write(), but maybe something is missing in the driver you use. What
> > console driver are you using?
>
> compatible = "snps,dw-apb-uart", so drivers/tty/serial/8250/8250_dw.c
>
> seems that the function dw8250_serial_out() does not use the spinlock...
>
> Marek
--
With Best Regards,
Andy Shevchenko
On Mon 2022-06-20 11:29:36, Marek Beh?n wrote:
> On Mon, 20 Jun 2022 00:29:16 +0206
> John Ogness <[email protected]> wrote:
> > On 2022-06-19, Marek Beh?n <[email protected]> wrote:
> > > causes a regression on arm64 (Marvell CN9130-CRB board) where the
> > > system boot freezes in most cases (and is unusable until restarted by
> > > watchdog), or, in some cases boots, but the console output gets mangled
> > > for a while (the serial console spits garbage characters).
>
> attaching bootlogs and config.
This is the log when the system booted:
> [ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd083]
> [ 0.000000] Linux version 5.19.0-rc2-00410-g9776fe0f424b (kabel@dellmb) (aarch64-unknown-linux-gnu-gcc (Gentoo Hardened 10.3.1_p20211126 p0) 10.3.1 20211126, GNU ld (Gentoo 2.37_p1 p2) 2.37) #491 SMP Mon Jun 20 11:00:54 CEST 2022
> [ 0.000000] Machine model: Marvell Armada CN9130-CRB-B
> [ 0.000000] earlycon: uart8250 at MMIO32 0x00000000f0512000 (options '')
> [ 0.000000] printk: bootconsole [uart8250] enabled
Early console enabled.
> [ 0.000000] NUMA: No NUMA configuration found
[...]
> [ 0.062565] rcu: Hierarchical SRCU implementation.
> [ 0.062589] printk: bootconsole [uart8250] printing thread started
The early console started being handled by the kthread.
> [ 0.073843] smp: Bringing up secondary CPUs ...
> [ 0.074238] Detected PIPT I-cache on CPU1
[...]
> [ 1.067359] io scheduler kyber registered
> [ 1.120214] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> [ 1.120577] armada-cp110-pinctrl f2440000.system-controller:pinctrl: registered pinctrl driver
> [ 1.137980] mv_xor_v2 f0400000.xor: Marvell Version 2 XOR driver
> [ 1.166562] printk:[ console [ttyS0] printing thread started
> [ 1.166564] printk: console [ttyS0] enabled
2nd console was added using the properly initialized serial port.
It should use the same physical port as the early console.
Both early console and proper console driver has its own kthread.
> 1.166486] f0512000.serial: ttyS0 at MMIO 0xf0512000 (irq = 22, base_baud = 12500000) is a 16550A
The line is malformed. I wonder if both early console and proper
console used the same port in parallel.
> [ 1.166567] printk: bootconsole [uart8250] disabled
> [ 1.185422] printk: bootconsole [uart8250] printing thread stopped
The early console was disabled. Only the properly initialized serial
console is used. All should be fine now.
> [ 1.188773] brd: module loaded
> [ 1.190567] loop: module loaded
[...]
> [ 5.316958] Freeing unused kernel memory: 2752K
> [ 5.364349] Run /sbin/init as init process
And I did not catch any further problem.
So, it looks like that con->write() code is not correctly serialized
between the early and normal console.
Now, let's see the last lines of failing logs:
> [ 1.071214] io scheduler kyber registered
> [ 1.124272] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> [
> [ 1.067314] io scheduler kyber registered
> [ 1.120226] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> [ 1.120603] armada-cp110-pinctrl f2440000.system-controller:pinctrl: registered pinctrl driver
> [ 1.137975] mv_xor_v2 f0400000.xor: Marvell Version 2 XOR driver
> [ 1.138248] mv_xor_v2 f0420000.xor: Marvell Version 2 XOR driver
> [ 1.
> [ 1.067214] io scheduler kyber registered
> [ 1.120098] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> [ 1.120466] armada-cp110-pinctrl f2440000.system-controller:pinctrl: registered pinctrl driver
> [ 1.137871] mv_xor_v2 f0400000.xor: Marvell Version 2 XOR driver
> [ 1.138160] mv_xor_v2 f0420000.xor: Marvell Version 2 XOR driver
> [
All three logs end in the middle of a line. If you compare it with the
"working" log then the end 1-3 lines before the normal console was added.
The console output might is delayed because of the threads. Most
likely, the output ended when both early and normal console driver
started to use the same port.
I am going to check the driver...
Best Regards,
Petr
On Mon, 20 Jun 2022, Petr Mladek wrote:
> Resending with Ilpo, driver maintainer, in CC.
>
> On Mon 2022-06-20 13:44:17, Petr Mladek wrote:
> > On Mon 2022-06-20 11:29:36, Marek Beh?n wrote:
> > > On Mon, 20 Jun 2022 00:29:16 +0206
> > > John Ogness <[email protected]> wrote:
> > > > On 2022-06-19, Marek Beh?n <[email protected]> wrote:
> > > > > causes a regression on arm64 (Marvell CN9130-CRB board) where the
> > > > > system boot freezes in most cases (and is unusable until restarted by
> > > > > watchdog), or, in some cases boots, but the console output gets mangled
> > > > > for a while (the serial console spits garbage characters).
> > >
> > > attaching bootlogs and config.
> >
> > This is the log when the system booted:
> >
> > > [ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd083]
> > > [ 0.000000] Linux version 5.19.0-rc2-00410-g9776fe0f424b (kabel@dellmb) (aarch64-unknown-linux-gnu-gcc (Gentoo Hardened 10.3.1_p20211126 p0) 10.3.1 20211126, GNU ld (Gentoo 2.37_p1 p2) 2.37) #491 SMP Mon Jun 20 11:00:54 CEST 2022
> > > [ 0.000000] Machine model: Marvell Armada CN9130-CRB-B
> > > [ 0.000000] earlycon: uart8250 at MMIO32 0x00000000f0512000 (options '')
> > > [ 0.000000] printk: bootconsole [uart8250] enabled
> >
> > Early console enabled.
> >
> > > [ 0.000000] NUMA: No NUMA configuration found
> > [...]
> > > [ 0.062565] rcu: Hierarchical SRCU implementation.
> > > [ 0.062589] printk: bootconsole [uart8250] printing thread started
> >
> > The early console started being handled by the kthread.
> >
> > > [ 0.073843] smp: Bringing up secondary CPUs ...
> > > [ 0.074238] Detected PIPT I-cache on CPU1
> > [...]
> > > [ 1.067359] io scheduler kyber registered
> > > [ 1.120214] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> > > [ 1.120577] armada-cp110-pinctrl f2440000.system-controller:pinctrl: registered pinctrl driver
> > > [ 1.137980] mv_xor_v2 f0400000.xor: Marvell Version 2 XOR driver
> > > [ 1.166562] printk:[ console [ttyS0] printing thread started
> > > [ 1.166564] printk: console [ttyS0] enabled
> >
> > 2nd console was added using the properly initialized serial port.
> > It should use the same physical port as the early console.
> >
> > Both early console and proper console driver has its own kthread.
> >
> > > 1.166486] f0512000.serial: ttyS0 at MMIO 0xf0512000 (irq = 22, base_baud = 12500000) is a 16550A
> >
> > The line is malformed. I wonder if both early console and proper
> > console used the same port in parallel.
> >
> > > [ 1.166567] printk: bootconsole [uart8250] disabled
> > > [ 1.185422] printk: bootconsole [uart8250] printing thread stopped
> >
> > The early console was disabled. Only the properly initialized serial
> > console is used. All should be fine now.
> >
> >
> > > [ 1.188773] brd: module loaded
> > > [ 1.190567] loop: module loaded
> > [...]
> > > [ 5.316958] Freeing unused kernel memory: 2752K
> > > [ 5.364349] Run /sbin/init as init process
> >
> > And I did not catch any further problem.
> >
> > So, it looks like that con->write() code is not correctly serialized
> > between the early and normal console.
> >
> >
> > Now, let's see the last lines of failing logs:
> >
> >
> > > [ 1.071214] io scheduler kyber registered
> > > [ 1.124272] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> > > [
> >
> > > [ 1.067314] io scheduler kyber registered
> > > [ 1.120226] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> > > [ 1.120603] armada-cp110-pinctrl f2440000.system-controller:pinctrl: registered pinctrl driver
> > > [ 1.137975] mv_xor_v2 f0400000.xor: Marvell Version 2 XOR driver
> > > [ 1.138248] mv_xor_v2 f0420000.xor: Marvell Version 2 XOR driver
> > > [ 1.
> >
> > > [ 1.067214] io scheduler kyber registered
> > > [ 1.120098] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> > > [ 1.120466] armada-cp110-pinctrl f2440000.system-controller:pinctrl: registered pinctrl driver
> > > [ 1.137871] mv_xor_v2 f0400000.xor: Marvell Version 2 XOR driver
> > > [ 1.138160] mv_xor_v2 f0420000.xor: Marvell Version 2 XOR driver
> > > [
> >
> > All three logs end in the middle of a line. If you compare it with the
> > "working" log then the end 1-3 lines before the normal console was added.
> >
> > The console output might is delayed because of the threads. Most
> > likely, the output ended when both early and normal console driver
> > started to use the same port.
> >
> > I am going to check the driver...
Perhaps try reverting 8f3631f0f6eb ("serial/8250: Use fifo in 8250
console driver") if there's yet another unexplored corner hit.
--
i.
Resending with Ilpo, driver maintainer, in CC.
On Mon 2022-06-20 13:44:17, Petr Mladek wrote:
> On Mon 2022-06-20 11:29:36, Marek Beh?n wrote:
> > On Mon, 20 Jun 2022 00:29:16 +0206
> > John Ogness <[email protected]> wrote:
> > > On 2022-06-19, Marek Beh?n <[email protected]> wrote:
> > > > causes a regression on arm64 (Marvell CN9130-CRB board) where the
> > > > system boot freezes in most cases (and is unusable until restarted by
> > > > watchdog), or, in some cases boots, but the console output gets mangled
> > > > for a while (the serial console spits garbage characters).
> >
> > attaching bootlogs and config.
>
> This is the log when the system booted:
>
> > [ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd083]
> > [ 0.000000] Linux version 5.19.0-rc2-00410-g9776fe0f424b (kabel@dellmb) (aarch64-unknown-linux-gnu-gcc (Gentoo Hardened 10.3.1_p20211126 p0) 10.3.1 20211126, GNU ld (Gentoo 2.37_p1 p2) 2.37) #491 SMP Mon Jun 20 11:00:54 CEST 2022
> > [ 0.000000] Machine model: Marvell Armada CN9130-CRB-B
> > [ 0.000000] earlycon: uart8250 at MMIO32 0x00000000f0512000 (options '')
> > [ 0.000000] printk: bootconsole [uart8250] enabled
>
> Early console enabled.
>
> > [ 0.000000] NUMA: No NUMA configuration found
> [...]
> > [ 0.062565] rcu: Hierarchical SRCU implementation.
> > [ 0.062589] printk: bootconsole [uart8250] printing thread started
>
> The early console started being handled by the kthread.
>
> > [ 0.073843] smp: Bringing up secondary CPUs ...
> > [ 0.074238] Detected PIPT I-cache on CPU1
> [...]
> > [ 1.067359] io scheduler kyber registered
> > [ 1.120214] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> > [ 1.120577] armada-cp110-pinctrl f2440000.system-controller:pinctrl: registered pinctrl driver
> > [ 1.137980] mv_xor_v2 f0400000.xor: Marvell Version 2 XOR driver
> > [ 1.166562] printk:[ console [ttyS0] printing thread started
> > [ 1.166564] printk: console [ttyS0] enabled
>
> 2nd console was added using the properly initialized serial port.
> It should use the same physical port as the early console.
>
> Both early console and proper console driver has its own kthread.
>
> > 1.166486] f0512000.serial: ttyS0 at MMIO 0xf0512000 (irq = 22, base_baud = 12500000) is a 16550A
>
> The line is malformed. I wonder if both early console and proper
> console used the same port in parallel.
>
> > [ 1.166567] printk: bootconsole [uart8250] disabled
> > [ 1.185422] printk: bootconsole [uart8250] printing thread stopped
>
> The early console was disabled. Only the properly initialized serial
> console is used. All should be fine now.
>
>
> > [ 1.188773] brd: module loaded
> > [ 1.190567] loop: module loaded
> [...]
> > [ 5.316958] Freeing unused kernel memory: 2752K
> > [ 5.364349] Run /sbin/init as init process
>
> And I did not catch any further problem.
>
> So, it looks like that con->write() code is not correctly serialized
> between the early and normal console.
>
>
> Now, let's see the last lines of failing logs:
>
>
> > [ 1.071214] io scheduler kyber registered
> > [ 1.124272] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> > [
>
> > [ 1.067314] io scheduler kyber registered
> > [ 1.120226] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> > [ 1.120603] armada-cp110-pinctrl f2440000.system-controller:pinctrl: registered pinctrl driver
> > [ 1.137975] mv_xor_v2 f0400000.xor: Marvell Version 2 XOR driver
> > [ 1.138248] mv_xor_v2 f0420000.xor: Marvell Version 2 XOR driver
> > [ 1.
>
> > [ 1.067214] io scheduler kyber registered
> > [ 1.120098] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> > [ 1.120466] armada-cp110-pinctrl f2440000.system-controller:pinctrl: registered pinctrl driver
> > [ 1.137871] mv_xor_v2 f0400000.xor: Marvell Version 2 XOR driver
> > [ 1.138160] mv_xor_v2 f0420000.xor: Marvell Version 2 XOR driver
> > [
>
> All three logs end in the middle of a line. If you compare it with the
> "working" log then the end 1-3 lines before the normal console was added.
>
> The console output might is delayed because of the threads. Most
> likely, the output ended when both early and normal console driver
> started to use the same port.
>
> I am going to check the driver...
>
> Best Regards,
> Petr
On Mon, Jun 20, 2022 at 6:44 AM Petr Mladek <[email protected]> wrote:
>
> Both early console and proper console driver has its own kthread.
>
> > 1.166486] f0512000.serial: ttyS0 at MMIO 0xf0512000 (irq = 22, base_baud = 12500000) is a 16550A
>
> The line is malformed. I wonder if both early console and proper
> console used the same port in parallel.
Honestly, I get the feeling that we need to just revert the whole
"console from thread" thing.
Because:
> So, it looks like that con->write() code is not correctly serialized
> between the early and normal console.
> [ ... ]
> I am going to check the driver...
We really cannot be in the situation that some random driver that used
to work no longer does, and causes oopses and/or memory corruption
just because it's now entered differently from how it traditionally
has been.
The traditional console write code has always been very careful to get
exclusive access, and it sounds like that is just plain broken now.
So I don't think this is a "driver is broken". Even if you find a fix
for it that makes that particular configuration happy, it sounds like
there is a much more fundamental issue at hand: the new printk code
just doesn't work well, and does things that are against the way
console printing has always worked in the past.
I realize that people have wanted to do printing from proper thread
context for ages and ages.
But maybe people should also just admit that "I wanted this" was not
"this was actually a good idea".
Linus
On Mon 2022-06-20 08:48:29, Linus Torvalds wrote:
> On Mon, Jun 20, 2022 at 6:44 AM Petr Mladek <[email protected]> wrote:
> >
> > Both early console and proper console driver has its own kthread.
> >
> > > 1.166486] f0512000.serial: ttyS0 at MMIO 0xf0512000 (irq = 22, base_baud = 12500000) is a 16550A
> >
> > The line is malformed. I wonder if both early console and proper
> > console used the same port in parallel.
>
> Honestly, I get the feeling that we need to just revert the whole
> "console from thread" thing.
>
> Because:
>
> > So, it looks like that con->write() code is not correctly serialized
> > between the early and normal console.
> > [ ... ]
> > I am going to check the driver...
>
> We really cannot be in the situation that some random driver that used
> to work no longer does, and causes oopses and/or memory corruption
> just because it's now entered differently from how it traditionally
> has been.
>
> The traditional console write code has always been very careful to get
> exclusive access, and it sounds like that is just plain broken now.
>
> So I don't think this is a "driver is broken".
I see what you think. There might be so many problems with the drivers
because they were never used this way. It looks like we opened a can
of worms. It is even more problematic because it causes silent boot
crashes and it is hard to debug.
I kind of agree with this and I have started looking at some more
generic solution.
All these boot crashes were in exactly the same situation when the
proper console was initialized and registered while there was
the early console used at the same time. It is a problem because
they use the same port.
The parallel use of different consoles should be much more
safe because they are much more independent.
There are the following possibilities:
1. Enable the kthreads later after the early consoles are gone.
This is easy and should fix all known boot problems.
2. Temporary stop the kthreads and use direct printing when
the proper consoles are registered. Well, this might be
more complicated because the port might be accessed
also before register_console() is called.
3. Another solution would be to use the global conosle_lock()
also to synchronize the kthreads against each other. But
it would be unfortunate.
I am going to prepare 1st solution.
Best Regards,
Petr
The console kthreads uncovered several races in console drivers.
All problems were in situation when a console was being properly
initialized and registered while an early console, using the same
port, was being used.
These problems are pretty hard to debug because they often result
into silent boot crashes. It would be nice to fix them but it
looks like a can of worms.
Prevent these problems by delaying the use of console kthreads
after all early consoles are gone. It might later be optimized.
But let's close this can of worms with a big hammer for now
so that they do not break first impression on the kthreads
that solve other real problems.
Link: https://lore.kernel.org/r/20220619204949.50d9154d@thinkpad
Link: https://lore.kernel.org/r/[email protected]
Reported=by: Marek Beh?n <[email protected]>
Signed-off-by: Petr Mladek <[email protected]>
---
kernel/printk/printk.c | 38 +++++++++++++++++++++++++-------------
1 file changed, 25 insertions(+), 13 deletions(-)
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index b095fb5f5f61..c0c5e2b6b91d 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3551,6 +3551,19 @@ void __init console_init(void)
}
}
+static int __init printk_activate_kthreads(void)
+{
+ struct console *con;
+
+ console_lock();
+ printk_kthreads_available = true;
+ for_each_console(con)
+ printk_start_kthread(con);
+ console_unlock();
+
+ return 0;
+}
+
/*
* Some boot consoles access data that is in the init section and which will
* be discarded after the initcalls have been run. To make sure that no code
@@ -3567,6 +3580,7 @@ void __init console_init(void)
*/
static int __init printk_late_init(void)
{
+ bool no_bootcon = true;
struct console *con;
int ret;
@@ -3588,7 +3602,10 @@ static int __init printk_late_init(void)
pr_warn("bootconsole [%s%d] uses init memory and must be disabled even before the real one is ready\n",
con->name, con->index);
unregister_console(con);
+ continue;
}
+
+ no_bootcon = false;
}
ret = cpuhp_setup_state_nocalls(CPUHP_PRINTK_DEAD, "printk:dead", NULL,
console_cpu_notify);
@@ -3597,23 +3614,18 @@ static int __init printk_late_init(void)
console_cpu_notify, NULL);
WARN_ON(ret < 0);
printk_sysctl_init();
- return 0;
-}
-late_initcall(printk_late_init);
-
-static int __init printk_activate_kthreads(void)
-{
- struct console *con;
- console_lock();
- printk_kthreads_available = true;
- for_each_console(con)
- printk_start_kthread(con);
- console_unlock();
+ /*
+ * Some console drivers are not ready to use the same port with
+ * boot (early) and normal console in parallel. Stay on the safe
+ * side and enable kthreads only when there is no boot console.
+ */
+ if (no_bootcon)
+ printk_activate_kthreads();
return 0;
}
-early_initcall(printk_activate_kthreads);
+late_initcall(printk_late_init);
#if defined CONFIG_PRINTK
/* If @con is specified, only wait for that console. Otherwise wait for all. */
--
2.35.3
On Mon 2022-06-20 17:14:30, Petr Mladek wrote:
> The console kthreads uncovered several races in console drivers.
> All problems were in situation when a console was being properly
> initialized and registered while an early console, using the same
> port, was being used.
>
> These problems are pretty hard to debug because they often result
> into silent boot crashes. It would be nice to fix them but it
> looks like a can of worms.
>
> Prevent these problems by delaying the use of console kthreads
> after all early consoles are gone. It might later be optimized.
> But let's close this can of worms with a big hammer for now
> so that they do not break first impression on the kthreads
> that solve other real problems.
Marek, could you please try if this patch solves the problem
for you?
Best Regards,
Petr
On Mon, 20 Jun 2022 17:14:27 +0200
Petr Mladek <[email protected]> wrote:
> The console kthreads uncovered several races in console drivers.
> All problems were in situation when a console was being properly
> initialized and registered while an early console, using the same
> port, was being used.
>
> These problems are pretty hard to debug because they often result
> into silent boot crashes. It would be nice to fix them but it
> looks like a can of worms.
>
> Prevent these problems by delaying the use of console kthreads
> after all early consoles are gone. It might later be optimized.
> But let's close this can of worms with a big hammer for now
> so that they do not break first impression on the kthreads
> that solve other real problems.
>
> Link: https://lore.kernel.org/r/20220619204949.50d9154d@thinkpad
> Link: https://lore.kernel.org/r/[email protected]
> Reported=by: Marek Behún <[email protected]>
> Signed-off-by: Petr Mladek <[email protected]>
This solves the issue. You have a typo in the reported-by tag
('=' instead of '-').
Tested-by: Marek Behún <[email protected]>
On Mon, Jun 20, 2022 at 10:14 AM Petr Mladek <[email protected]> wrote:
>
> The console kthreads uncovered several races in console drivers.
I really want to make it clear that this was NOT some kind of "races
in drivers".
Console drivers may very well have intentionally avoided taking locks
for console output, since the printk output was supposed to be
serialized by printk.
Don't try to make this some kind of "buggy drivers" thing. This is on
printk, not on anything else.
Assuming this solves all issues, I'm ok with this approach, but I
really want this to be clearly about printk being buggy, no "blame the
drivers" garbage.
And if there are other issues coming up, we revert the whole thing entirely.
Because printk is too important to play games with, and too important
to try to blame drivers.
Linus
On Mon 2022-06-20 14:10:20, Linus Torvalds wrote:
> On Mon, Jun 20, 2022 at 10:14 AM Petr Mladek <[email protected]> wrote:
> >
> > The console kthreads uncovered several races in console drivers.
>
> I really want to make it clear that this was NOT some kind of "races
> in drivers".
>
> Console drivers may very well have intentionally avoided taking locks
> for console output, since the printk output was supposed to be
> serialized by printk.
>
> Don't try to make this some kind of "buggy drivers" thing. This is on
> printk, not on anything else.
OK, I see that uart_console_write() is used by
early_serial8250_write() without port->lock. It means that it is
racy against serial8250_console_write(). It might
cause problems reported by this thread. And you are
right that it has never been used in parallel before the kthreads.
But I believe that it might cause real problems. serial8250_console_write()
takes port->lock to get serialized against other operations on the
port. And there might be some when the same port is added as
a proper serial console.
Today I found that probe_baud() is called from
serial8250_console_setup() without port->lock. It does reads and
writes. I believe that it might break with the earlycon.
Also the commit 589f892ac8ef244e47c5a ("serial: meson:
acquire port->lock in startup()") fixes a race between
meson_serial_port_write() and meson_uart_startup(), where
meson_serial_port_write() is used by both early and proper
console driver. The problem was there even without kthreads.
They just made it more visible.
My colleagues familiar with ARM told me that they heard about
boot freezes with early consoles before threads. The kthreads
allow to reproduce and fix them. In the end, they make the early
consoles more reliable.
> Assuming this solves all issues, I'm ok with this approach, but I
> really want this to be clearly about printk being buggy, no "blame the
> drivers" garbage.
>
> And if there are other issues coming up, we revert the whole thing entirely.
>
> Because printk is too important to play games with, and too important
> to try to blame drivers.
I take printk() really seriously. And I definitely do not want
to wave out problems as others problem.
I do not want to release 5.19 with broken printk(). But the kthreads
solve real bugs where printk() put the system into knees. I want
to invest much more time on improving them and fixing related
problems. Unfortunately, linux-next was not able to catch
the recently reported problems and we were not able to fix them
in advance.
All the recent fixes were generic and should make printk() with
kthreads much more reliable. I can't be sure if it will be enough.
I could only say that I am going to fix any new ones.
Of course, if people continue reporting problems, we would need
to revert it for 5.19. But I would really like to give it another
chance later.
Best Regards,
Petr
On 2022-06-20, Petr Mladek <[email protected]> wrote:
> The console kthreads uncovered several races in console drivers.
> All problems were in situation when a console was being properly
> initialized and registered while an early console, using the same
> port, was being used.
>
> These problems are pretty hard to debug because they often result
> into silent boot crashes. It would be nice to fix them but it
> looks like a can of worms.
>
> Prevent these problems by delaying the use of console kthreads
> after all early consoles are gone. It might later be optimized.
> But let's close this can of worms with a big hammer for now
> so that they do not break first impression on the kthreads
> that solve other real problems.
The problem is that there are multiple consoles accessing the same
hardware. The threaded console printing model did not consider this. The
commit message should clearly state the actual problem.
Since early printing does not use the kernel driver model, I do not see
a clear solution at this point. But since using early printing is not a
typical use, I am OK with this patch.
For the commit message I would suggest something like:
Threaded console printing does not take into consideration that boot
consoles may be accessing the same hardware as normal consoles and thus
must not be called in parallel.
Since it is currently not possible to identify which consoles are
accessing the same hardware, delay threaded console printing activation
until it is known that there are no boot consoles registered.
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index b095fb5f5f61..c0c5e2b6b91d 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -3597,23 +3614,18 @@ static int __init printk_late_init(void)
> console_cpu_notify, NULL);
> WARN_ON(ret < 0);
> printk_sysctl_init();
> - return 0;
> -}
> -late_initcall(printk_late_init);
> -
> -static int __init printk_activate_kthreads(void)
> -{
> - struct console *con;
>
> - console_lock();
> - printk_kthreads_available = true;
> - for_each_console(con)
> - printk_start_kthread(con);
> - console_unlock();
> + /*
> + * Some console drivers are not ready to use the same port with
> + * boot (early) and normal console in parallel. Stay on the safe
> + * side and enable kthreads only when there is no boot console.
> + */
As with the commit message, this comment is not pointing out the real
issue. I suggest:
/*
* Boot consoles may be accessing the same hardware as normal
* consoles and thus must not be called in parallel. Therefore
* only activate threaded console printing if it is known that
* there are no boot consoles registered.
*/
> + if (no_bootcon)
> + printk_activate_kthreads();
>
> return 0;
> }
> -early_initcall(printk_activate_kthreads);
> +late_initcall(printk_late_init);
>
> #if defined CONFIG_PRINTK
> /* If @con is specified, only wait for that console. Otherwise wait for all. */
John Ogness
[TLDR: I'm adding this regression report to the list of tracked
regressions; all text from me you find below is based on a few templates
paragraphs you might have encountered already already in similar form.]
Hi, this is your Linux kernel regression tracker.
On 19.06.22 20:49, Marek Behún wrote:
> Hello Linus, Petr, John,
>
> the series
> https://lore.kernel.org/linux-kernel/YouKQw72H7y9EJQK@alley/
> merged by commit 537e62c865dc ("Merge tag 'printk-for-5.19' of
> git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux")
>
> and in particular the commit 09c5ba0aa2fc ("printk: add kthread console
> printers")
>
> causes a regression on arm64 (Marvell CN9130-CRB board) where the
> system boot freezes in most cases (and is unusable until restarted by
> watchdog), or, in some cases boots, but the console output gets mangled
> for a while (the serial console spits garbage characters).
>
> The stall example:
>
> ...
> [ 1.108141] io scheduler mq-deadline registered
> [ 1.108143] io scheduler kyber registered
> [ 1.163484] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> [
>
> Or another:
>
> ...
> [ 1.108155] io scheduler mq-deadline registered
> [ 1.108156] io scheduler kyber registered
> [ 1.162991] armada-ap806-pinctrl f06f4000.system-controller:pinctrl: registered pinctrl driver
> [ 1.163361] armada-cp110-pinctrl f2440000.system-controller:pinctrl: registered pinctrl driver
> [ 1.180827] mv_xor_v2 f0400000.xor: Marvell Version 2 XOR driver
> [
>
> The garbage example:
>
> ...
> [ 0.920951] raid6: using neon recovery algorithm
> [ 0.921228] iommu: Default domain type: Translated
> %
>
> gb@k+cFL/[ 4.954974] DSA: tree 0 setup
> [ 4.955286] cfg80211: Loading compiled-in X.509 certificates for regulatory database
>
> I did not have time to look at the contents of the commit yet to try to
> fix it myself, and won't have time for another two weeks. But I thought
> I should let you know now.
CCing the regression mailing list, as it should be in the loop for all
regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
Thanks for the report. To be sure below issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression
tracking bot:
#regzbot ^introduced v5.18..v5.19-rc3
#regzbot ignore-activity
This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply -- ideally with also
telling regzbot about it, as explained here:
https://linux-regtracking.leemhuis.info/tracked-regression/
Reminder for developers: When fixing the issue, add 'Link:' tags
pointing to the report (the mail this one replies to), as explained for
in the Linux kernel's documentation; above webpage explains why this is
important for tracked regressions.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.
On Tue, 21 Jun 2022 00:57:09 +0206
John Ogness <[email protected]> wrote:
> Since early printing does not use the kernel driver model, I do not see
> a clear solution at this point. But since using early printing is not a
> typical use, I am OK with this patch.
Does this patch still work if you add the "keep" flag to the
earlyprintk kernel parameter?
-- Steve
On Fri 2022-06-24 18:41:36, Steven Rostedt wrote:
> On Tue, 21 Jun 2022 00:57:09 +0206
> John Ogness <[email protected]> wrote:
>
> > Since early printing does not use the kernel driver model, I do not see
> > a clear solution at this point. But since using early printing is not a
> > typical use, I am OK with this patch.
>
> Does this patch still work if you add the "keep" flag to the
> earlyprintk kernel parameter?
It works. If the boot console stays then the console kthreads will
never get enabled.
Well, I hope that we will be able to add the synchronization and remove
this constraint. I think that the port->lock should be usable in
most cases.
Best Regards,
Petr
TWIMC: this mail is primarily send for documentation purposes and for
regzbot, my Linux kernel regression tracking bot. These mails usually
contain '#forregzbot' in the subject, to make them easy to spot and filter.
On 21.06.22 11:39, Thorsten Leemhuis wrote:
> [TLDR: I'm adding this regression report to the list of tracked
> regressions; all text from me you find below is based on a few templates
> paragraphs you might have encountered already already in similar form.]
>
> Hi, this is your Linux kernel regression tracker.
>
> On 19.06.22 20:49, Marek Behún wrote:
>> Hello Linus, Petr, John,
>>
>> the series
>> https://lore.kernel.org/linux-kernel/YouKQw72H7y9EJQK@alley/
>> merged by commit 537e62c865dc ("Merge tag 'printk-for-5.19' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux")
>>
>> and in particular the commit 09c5ba0aa2fc ("printk: add kthread console
>> printers")
>>
>> causes a regression on arm64 (Marvell CN9130-CRB board) where the
>> system boot freezes in most cases (and is unusable until restarted by
>> watchdog), or, in some cases boots, but the console output gets mangled
>> for a while (the serial console spits garbage characters).
> [...]
> #regzbot ^introduced v5.18..v5.19-rc3
#regzbot fixed-by: 07a22b61946f0b80065b