2022-06-13 19:36:40

by Alexandru Elisei

[permalink] [raw]
Subject: [BUG] rockpro64 board hangs in console_init() after commit 10e14073107d

Hi,

(Apologies for the long To: list, I've added everyone that
scripts/get_maintainer.pl listed for fs/{fs-writeback,inode}.c, for the
rockchip boards, for printk.c and for the tty layer)

When booting a kernel built from tag v5.19-rc2, my rockpro64-v2 hangs at
boot with this log:

[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
[ 0.000000] Linux version 5.19.0-rc2 (alex@redacted) (aarch64-linux-gnu-gcc (GCC) 12.1.0, GNU ld (GNU Binutils) 2.38) #90 SMP PREEMPT Mon Jun 13 17:13:12 BST 2022
[ 0.000000] Machine model: Pine64 RockPro64 v2.0
[ 0.000000] efi: UEFI not found.
[ 0.000000] earlycon: uart0 at MMIO32 0x00000000ff1a0000 (options '1500000n8')
[ 0.000000] printk: bootconsole [uart0] enabled
[ 0.000000] NUMA: No NUMA configuration found
[ 0.000000] NUMA: Faking a node at [mem 0x0000000000200000-0x00000000f7ffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0xf77dc140-0xf77ddfff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000000200000-0x00000000f7ffffff]
[ 0.000000] DMA32 empty
[ 0.000000] Normal empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000200000-0x00000000f7ffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000000200000-0x00000000f7ffffff]
[ 0.000000] On node 0, zone DMA: 512 pages in unavailable ranges
[ 0.000000] cma: Reserved 32 MiB at 0x00000000f0000000
[ 0.000000] psci: probing for conduit method from DT.
[ 0.000000] psci: PSCIv1.1 detected in firmware.
[ 0.000000] psci: Using standard PSCI v0.2 function IDs
[ 0.000000] psci: MIGRATE_INFO_TYPE not supported.
[ 0.000000] psci: SMC Calling Convention v1.2
[ 0.000000] percpu: Embedded 20 pages/cpu s41768 r8192 d31960 u81920
[ 0.000000] Detected VIPT I-cache on CPU0
[ 0.000000] CPU features: detected: GIC system register CPU interface
[ 0.000000] CPU features: detected: ARM erratum 845719
[ 0.000000] Fallback order for Node 0: 0
[ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 999432
[ 0.000000] Policy zone: DMA
[ 0.000000] Kernel command line: root=PARTUUID=7f4aab92-69d8-47f3-be10-624da40a71f9 rw earlycon rootwait
[ 0.000000] Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes, linear)
[ 0.000000] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes, linear)
[ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[ 0.000000] Memory: 3915796K/4061184K available (15552K kernel code, 2758K rwdata, 8668K rodata, 6336K init, 564K bss, 112620K reserved, 32768K cma-reserved)
[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=6, Nodes=1
[ 0.000000] rcu: Preemptible hierarchical RCU implementation.
[ 0.000000] rcu: RCU event tracing is enabled.
[ 0.000000] rcu: RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=6.
[ 0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[ 0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=6
[ 0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[ 0.000000] GICv3: GIC: Using split EOI/Deactivate mode
[ 0.000000] GICv3: 256 SPIs implemented
[ 0.000000] GICv3: 0 Extended SPIs implemented
[ 0.000000] Root IRQ handler: gic_handle_irq
[ 0.000000] GICv3: GICv3 features: 16 PPIs
[ 0.000000] GICv3: CPU0: found redistributor 0 region 0:0x00000000fef00000
[ 0.000000] ITS [mem 0xfee20000-0xfee3ffff]
[ 0.000000] ITS@0x00000000fee20000: allocated 65536 Devices @480000 (flat, esz 8, psz 64K, shr 0)
[ 0.000000] ITS: using cache flushing for cmd queue
[ 0.000000] GICv3: using LPI property table @0x0000000000440000
[ 0.000000] GIC: using cache flushing for LPI property table
[ 0.000000] GICv3: CPU0: using allocated LPI pending table @0x0000000000450000
[ 0.000000] GICv3: GIC: PPI partition interrupt-partition-0[0] { /cpus/cpu@0[0] /cpus/cpu@1[1] /cpus/cpu@2[2] /cpus/cpu@3[3] }
[ 0.000000] GICv3: GIC: PPI partition interrupt-partition-1[1] { /cpus/cpu@100[4] /cpus/cpu@101[5] }
[ 0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[ 0.000000] arch_timer: cp15 timer(s) running at 24.00MHz (phys).
[ 0.000000] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0x588fe9dc0, max_idle_ns: 440795202592 ns
[ 0.000001] sched_clock: 56 bits at 24MHz, resolution 41ns, wraps every 4398046511097ns
[ 0.005107] Console: colour dummy device 80x25
[ 0.005549] printk: console [tty0] enabled
[ 0.005956] printk: bootconsole [uart0] disabled

Config can be found at [1] (expires after 6 months). I've also built the
kernel with gcc 10.3.1 [2] (aarch64-none-linux-gnu), same issue.

I've bisected the build failure to commit 10e14073107d ("writeback: Fix
inode->i_io_list not be protected by inode->i_lock error"); I've confirmed
that that commit is responsible by successfully booting the board with a
kernel built from v5.19-rc2 + the above commit reverted.

I tried to do some investigating, it seems that the kernel is stuck at
printk.c::console_init() -> drivers/tty/vt/vt.c::con_init() ->
printk.c::register_console() -> unregister_console() -> console_lock().
This has been determined by adding pr_info statements.

I've booted a kernel compiled with CONFIG_PROVE_LOCKING=y, as the offending
commit fiddles with locks, but no splat was produced that would explain the
hang. I've also tried to boot a v5,19-rc2 kernel on my odroid-c4, the board
is booting just fine, so I'm guessing it only affects of subset of arm64
boards.

[1] https://pastebin.com/MfDrKyKd
[2] https://developer.arm.com/downloads/-/gnu-a

Thanks,
Alex


2022-06-14 03:07:35

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [BUG] rockpro64 board hangs in console_init() after commit 10e14073107d

On (22/06/13 17:54), Alexandru Elisei wrote:
[..]
> [ 0.000000] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0x588fe9dc0, max_idle_ns: 440795202592 ns
> [ 0.000001] sched_clock: 56 bits at 24MHz, resolution 41ns, wraps every 4398046511097ns
> [ 0.005107] Console: colour dummy device 80x25
> [ 0.005549] printk: console [tty0] enabled
> [ 0.005956] printk: bootconsole [uart0] disabled
>
> Config can be found at [1] (expires after 6 months). I've also built the
> kernel with gcc 10.3.1 [2] (aarch64-none-linux-gnu), same issue.
>
> I've bisected the build failure to commit 10e14073107d ("writeback: Fix
> inode->i_io_list not be protected by inode->i_lock error"); I've confirmed
> that that commit is responsible by successfully booting the board with a
> kernel built from v5.19-rc2 + the above commit reverted.

Hmm, interesting and puzzling.

If you disable console kthreads (__printk_fallback_preferred_direct) and
keep 10e14073107d, are you able to successfully boot the board or does it
boot to a panic?

> I tried to do some investigating, it seems that the kernel is stuck at
> printk.c::console_init() -> drivers/tty/vt/vt.c::con_init() ->
> printk.c::register_console() -> unregister_console() -> console_lock().

Is it console lock (console_kthreads_block() -> mutex_lock(&con->lock))
that we cannot grab or is it console semaphore?

2022-06-14 05:28:43

by Jiri Slaby

[permalink] [raw]
Subject: Re: [BUG] rockpro64 board hangs in console_init() after commit 10e14073107d

On 13. 06. 22, 18:54, Alexandru Elisei wrote:
> I've booted a kernel compiled with CONFIG_PROVE_LOCKING=y, as the offending
> commit fiddles with locks, but no splat was produced that would explain the
> hang.

It's too early for lockdep. Could you try to move lockdep_init() before
console_init() in start_kernel()?

You'd need to use early console (which you already do).

thanks,
--
js
suse labs

2022-06-14 07:06:08

by Jiri Slaby

[permalink] [raw]
Subject: Re: [BUG] rockpro64 board hangs in console_init() after commit 10e14073107d

On 14. 06. 22, 7:15, Jiri Slaby wrote:
> On 13. 06. 22, 18:54, Alexandru Elisei wrote:
>> I've booted a kernel compiled with CONFIG_PROVE_LOCKING=y, as the
>> offending
>> commit fiddles with locks, but no splat was produced that would
>> explain the
>> hang.
>
> It's too early for lockdep. Could you try to move lockdep_init() before
> console_init() in start_kernel()?

Sorry, ignore this, lockdep_init() only dumps some info.

--
js
suse labs

2022-06-14 12:30:55

by Petr Mladek

[permalink] [raw]
Subject: Re: [BUG] rockpro64 board hangs in console_init() after commit 10e14073107d

On Mon 2022-06-13 17:54:35, Alexandru Elisei wrote:
> Hi,
>
> (Apologies for the long To: list, I've added everyone that
> scripts/get_maintainer.pl listed for fs/{fs-writeback,inode}.c, for the
> rockchip boards, for printk.c and for the tty layer)
>
> When booting a kernel built from tag v5.19-rc2, my rockpro64-v2 hangs at
> boot with this log:
>
> [ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> [ 0.000000] Linux version 5.19.0-rc2 (alex@redacted) (aarch64-linux-gnu-gcc (GCC) 12.1.0, GNU ld (GNU Binutils) 2.38) #90 SMP PREEMPT Mon Jun 13 17:13:12 BST 2022
> [ 0.000000] Machine model: Pine64 RockPro64 v2.0
> [ 0.000000] efi: UEFI not found.
> [ 0.000000] earlycon: uart0 at MMIO32 0x00000000ff1a0000 (options '1500000n8')
> [ 0.000000] printk: bootconsole [uart0] enabled
> [ 0.000000] NUMA: No NUMA configuration found
> [ 0.000000] NUMA: Faking a node at [mem 0x0000000000200000-0x00000000f7ffffff]
> [ 0.000000] NUMA: NODE_DATA [mem 0xf77dc140-0xf77ddfff]
> [ 0.000000] Zone ranges:
> [ 0.000000] DMA [mem 0x0000000000200000-0x00000000f7ffffff]
> [ 0.000000] DMA32 empty
> [ 0.000000] Normal empty
> [ 0.000000] Movable zone start for each node
> [ 0.000000] Early memory node ranges
> [ 0.000000] node 0: [mem 0x0000000000200000-0x00000000f7ffffff]
> [ 0.000000] Initmem setup node 0 [mem 0x0000000000200000-0x00000000f7ffffff]
> [ 0.000000] On node 0, zone DMA: 512 pages in unavailable ranges
> [ 0.000000] cma: Reserved 32 MiB at 0x00000000f0000000
> [ 0.000000] psci: probing for conduit method from DT.
> [ 0.000000] psci: PSCIv1.1 detected in firmware.
> [ 0.000000] psci: Using standard PSCI v0.2 function IDs
> [ 0.000000] psci: MIGRATE_INFO_TYPE not supported.
> [ 0.000000] psci: SMC Calling Convention v1.2
> [ 0.000000] percpu: Embedded 20 pages/cpu s41768 r8192 d31960 u81920
> [ 0.000000] Detected VIPT I-cache on CPU0
> [ 0.000000] CPU features: detected: GIC system register CPU interface
> [ 0.000000] CPU features: detected: ARM erratum 845719
> [ 0.000000] Fallback order for Node 0: 0
> [ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 999432
> [ 0.000000] Policy zone: DMA
> [ 0.000000] Kernel command line: root=PARTUUID=7f4aab92-69d8-47f3-be10-624da40a71f9 rw earlycon rootwait
> [ 0.000000] Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes, linear)
> [ 0.000000] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes, linear)
> [ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
> [ 0.000000] Memory: 3915796K/4061184K available (15552K kernel code, 2758K rwdata, 8668K rodata, 6336K init, 564K bss, 112620K reserved, 32768K cma-reserved)
> [ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=6, Nodes=1
> [ 0.000000] rcu: Preemptible hierarchical RCU implementation.
> [ 0.000000] rcu: RCU event tracing is enabled.
> [ 0.000000] rcu: RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=6.
> [ 0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
> [ 0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=6
> [ 0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
> [ 0.000000] GICv3: GIC: Using split EOI/Deactivate mode
> [ 0.000000] GICv3: 256 SPIs implemented
> [ 0.000000] GICv3: 0 Extended SPIs implemented
> [ 0.000000] Root IRQ handler: gic_handle_irq
> [ 0.000000] GICv3: GICv3 features: 16 PPIs
> [ 0.000000] GICv3: CPU0: found redistributor 0 region 0:0x00000000fef00000
> [ 0.000000] ITS [mem 0xfee20000-0xfee3ffff]
> [ 0.000000] ITS@0x00000000fee20000: allocated 65536 Devices @480000 (flat, esz 8, psz 64K, shr 0)
> [ 0.000000] ITS: using cache flushing for cmd queue
> [ 0.000000] GICv3: using LPI property table @0x0000000000440000
> [ 0.000000] GIC: using cache flushing for LPI property table
> [ 0.000000] GICv3: CPU0: using allocated LPI pending table @0x0000000000450000
> [ 0.000000] GICv3: GIC: PPI partition interrupt-partition-0[0] { /cpus/cpu@0[0] /cpus/cpu@1[1] /cpus/cpu@2[2] /cpus/cpu@3[3] }
> [ 0.000000] GICv3: GIC: PPI partition interrupt-partition-1[1] { /cpus/cpu@100[4] /cpus/cpu@101[5] }
> [ 0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
> [ 0.000000] arch_timer: cp15 timer(s) running at 24.00MHz (phys).
> [ 0.000000] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0x588fe9dc0, max_idle_ns: 440795202592 ns
> [ 0.000001] sched_clock: 56 bits at 24MHz, resolution 41ns, wraps every 4398046511097ns
> [ 0.005107] Console: colour dummy device 80x25
> [ 0.005549] printk: console [tty0] enabled
> [ 0.005956] printk: bootconsole [uart0] disabled
>
> Config can be found at [1] (expires after 6 months). I've also built the
> kernel with gcc 10.3.1 [2] (aarch64-none-linux-gnu), same issue.
>
> I've bisected the build failure to commit 10e14073107d ("writeback: Fix
> inode->i_io_list not be protected by inode->i_lock error"); I've confirmed
> that that commit is responsible by successfully booting the board with a
> kernel built from v5.19-rc2 + the above commit reverted.

It is strange. I can't see how consoles are related to filesystem
writeback.

Anyway, the commit 10e14073107d ("writeback: Fix inode->i_io_list not
be protected by inode->i_lock error") modifies some locking and
might be source of possible deadlocks.

I am not familiar with the fs code. But I noticed the following.
The patch adds:

+ if (!was_dirty) {
+ wb = locked_inode_to_wb_and_lock_list(inode);
+ spin_lock(&inode->i_lock);

And locked_inode_to_wb_and_lock_list() is defined this way:

/**
* locked_inode_to_wb_and_lock_list - determine a locked inode's wb and lock it
* @inode: inode of interest with i_lock held
*
* Returns @inode's wb with its list_lock held. @inode->i_lock must be
* held on entry and is released on return. The returned wb is guaranteed
* to stay @inode's associated wb until its list_lock is released.
*/
static struct bdi_writeback *
locked_inode_to_wb_and_lock_list(struct inode *inode)
__releases(&inode->i_lock)
__acquires(&wb->list_lock)
{
while (true) {
struct bdi_writeback *wb = inode_to_wb(inode);

/*
* inode_to_wb() association is protected by both
* @inode->i_lock and @wb->list_lock but list_lock nests
* outside i_lock. Drop i_lock and verify that the
* association hasn't changed after acquiring list_lock.
*/
wb_get(wb);
spin_unlock(&inode->i_lock);

It expects that inode->i_lock is taken before. But the problematic
commit takes it later. It might mess the lock and cause a deadlock.

Jack?


> I tried to do some investigating, it seems that the kernel is stuck at
> printk.c::console_init() -> drivers/tty/vt/vt.c::con_init() ->
> printk.c::register_console() -> unregister_console() -> console_lock().
> This has been determined by adding pr_info statements.

So, you tried something like:

int unregister_console(struct console *console)
{
struct task_struct *thd;
struct console *con;
int res;

con_printk(KERN_INFO, console, "disabled\n");
[...]

+ pr_info("Stage 1\n");
console_lock();
+ pr_info("Stage 2\n");

[...]

console_unlock();
+ pr_info("Stage 3\n");
[...]
}

And "Stage 1" was the last message on the console ?

Does the system boot when you avoid "earlycon" parameter?


> I've booted a kernel compiled with CONFIG_PROVE_LOCKING=y, as the offending
> commit fiddles with locks, but no splat was produced that would explain the
> hang. I've also tried to boot a v5,19-rc2 kernel on my odroid-c4, the board
> is booting just fine, so I'm guessing it only affects of subset of arm64
> boards.

You might try to switch the order of console_init() and lockdep_init()
in start_kernel() in init/main.c

Best Regards,
Petr

2022-06-14 13:39:27

by Alexandru Elisei

[permalink] [raw]
Subject: Re: [BUG] rockpro64 board hangs in console_init() after commit 10e14073107d

Hi Petr,

Thank you for having a look.

On Tue, Jun 14, 2022 at 02:23:32PM +0200, Petr Mladek wrote:
> On Mon 2022-06-13 17:54:35, Alexandru Elisei wrote:
> > [..]
>
> It is strange. I can't see how consoles are related to filesystem
> writeback.

I am in the same boat.

>
> Anyway, the commit 10e14073107d ("writeback: Fix inode->i_io_list not
> be protected by inode->i_lock error") modifies some locking and
> might be source of possible deadlocks.
>
> I am not familiar with the fs code. But I noticed the following.
> The patch adds:
>
> + if (!was_dirty) {
> + wb = locked_inode_to_wb_and_lock_list(inode);
> + spin_lock(&inode->i_lock);
>
> And locked_inode_to_wb_and_lock_list() is defined this way:
>
> /**
> * locked_inode_to_wb_and_lock_list - determine a locked inode's wb and lock it
> * @inode: inode of interest with i_lock held
> *
> * Returns @inode's wb with its list_lock held. @inode->i_lock must be
> * held on entry and is released on return. The returned wb is guaranteed
> * to stay @inode's associated wb until its list_lock is released.
> */
> static struct bdi_writeback *
> locked_inode_to_wb_and_lock_list(struct inode *inode)
> __releases(&inode->i_lock)
> __acquires(&wb->list_lock)
> {
> while (true) {
> struct bdi_writeback *wb = inode_to_wb(inode);
>
> /*
> * inode_to_wb() association is protected by both
> * @inode->i_lock and @wb->list_lock but list_lock nests
> * outside i_lock. Drop i_lock and verify that the
> * association hasn't changed after acquiring list_lock.
> */
> wb_get(wb);
> spin_unlock(&inode->i_lock);
>
> It expects that inode->i_lock is taken before. But the problematic
> commit takes it later. It might mess the lock and cause a deadlock.
>
> Jack?
>
>
> > I tried to do some investigating, it seems that the kernel is stuck at
> > printk.c::console_init() -> drivers/tty/vt/vt.c::con_init() ->
> > printk.c::register_console() -> unregister_console() -> console_lock().
> > This has been determined by adding pr_info statements.
>
> So, you tried something like:
>
> int unregister_console(struct console *console)
> {
> struct task_struct *thd;
> struct console *con;
> int res;
>
> con_printk(KERN_INFO, console, "disabled\n");
> [...]
>
> + pr_info("Stage 1\n");
> console_lock();
> + pr_info("Stage 2\n");
>
> [...]
>
> console_unlock();
> + pr_info("Stage 3\n");
> [...]
> }
>
> And "Stage 1" was the last message on the console ?

With this change:

diff --git a/init/main.c b/init/main.c
index 0ee39cdcfcac..a245982eb8a2 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1057,6 +1057,8 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)

kmem_cache_init_late();

+ lockdep_init();
+
/*
* HACK ALERT! This is early. We're enabling the console before
* we've done PCI setups etc, and console_init() must be aware of
@@ -1067,8 +1069,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
panic("Too many boot %s vars at `%s'", panic_later,
panic_param);

- lockdep_init();
-
/*
* Need to run this when irqs are enabled, because it wants
* to self-test [hard/soft]-irqs on/off lock inversion bugs
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index ea3dd55709e7..aa7684c6745d 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2705,7 +2705,10 @@ static int console_cpu_notify(unsigned int cpu)
*/
void console_lock(void)
{
+ pr_info("before might_sleep()");
might_sleep();
+ pr_info("before down_console_sem()");
+ pr_info("before down_console_sem()");

down_console_sem();
if (console_suspended)
@@ -3508,12 +3511,18 @@ int unregister_console(struct console *console)
if (console->exit)
res = console->exit(console);

+ pr_info("Exiting from unregister_console(), res = %d", res);
+ pr_info("Exiting from unregister_console(), res = %d", res);
+
return res;

out_disable_unlock:
console->flags &= ~CON_ENABLED;
console_unlock();

+ pr_info("Exiting from unregister_console(), res = %d", res);
+ pr_info("Exiting from unregister_console(), res = %d", res);
+
return res;
}
EXPORT_SYMBOL(unregister_console);

Some of the pr_info statements are duplicated to see the output just before
the console hangs (I assume they're needed to force a buffer flush).

This is what I got:

[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
[ 0.000000] Linux version 5.19.0-rc2-dirty (alex@monolith) (aarch64-linux-gnu-gcc (GCC) 12.1.0, GNU ld (GNU Binutils) 2.38) #106 SMP PREEMPT Tue Jun 14 14:03:30 BST 2022
[ 0.000000] Machine model: Pine64 RockPro64 v2.0
[ 0.000000] efi: UEFI not found.
[ 0.000000] earlycon: uart0 at MMIO32 0x00000000ff1a0000 (options '1500000n8')
[ 0.000000] printk: before might_sleep()
[ 0.000000] printk: before down_console_sem()
[ 0.000000] printk: before down_console_sem()
[ 0.000000] printk: bootconsole [uart0] enabled
[ 0.000000] NUMA: No NUMA configuration found
[ 0.000000] NUMA: Faking a node at [mem 0x0000000000200000-0x00000000f7ffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0xf77cef40-0xf77d0fff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000000200000-0x00000000f7ffffff]
[ 0.000000] DMA32 empty
[ 0.000000] Normal empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000200000-0x00000000f7ffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000000200000-0x00000000f7ffffff]
[ 0.000000] On node 0, zone DMA: 512 pages in unavailable ranges
[ 0.000000] cma: Reserved 32 MiB at 0x00000000f0000000
[ 0.000000] psci: probing for conduit method from DT.
[ 0.000000] psci: PSCIv1.1 detected in firmware.
[ 0.000000] psci: Using standard PSCI v0.2 function IDs
[ 0.000000] psci: MIGRATE_INFO_TYPE not supported.
[ 0.000000] psci: SMC Calling Convention v1.2
[ 0.000000] percpu: Embedded 30 pages/cpu s82272 r8192 d32416 u122880
[ 0.000000] Detected VIPT I-cache on CPU0
[ 0.000000] CPU features: detected: GIC system register CPU interface
[ 0.000000] CPU features: detected: ARM erratum 845719
[ 0.000000] Fallback order for Node 0: 0
[ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 999432
[ 0.000000] Policy zone: DMA
[ 0.000000] Kernel command line: root=PARTUUID=7f4aab92-69d8-47f3-be10-624da40a71f9 rw earlycon rootwait
[ 0.000000] Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes, linear)
[ 0.000000] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes, linear)
[ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[ 0.000000] Memory: 3899372K/4061184K available (17600K kernel code, 3994K rwdata, 9648K rodata, 7488K init, 11297K bss, 129044K reserved, 32768K cma-reserved)
[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=6, Nodes=1
[ 0.000000] trace event string verifier disabled
[ 0.000000] Running RCU self tests
[ 0.000000] rcu: Preemptible hierarchical RCU implementation.
[ 0.000000] rcu: RCU event tracing is enabled.
[ 0.000000] rcu: RCU lockdep checking is enabled.
[ 0.000000] rcu: RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=6.
[ 0.000000] Trampoline variant of Tasks RCU enabled.
[ 0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[ 0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=6
[ 0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[ 0.000000] GICv3: GIC: Using split EOI/Deactivate mode
[ 0.000000] GICv3: 256 SPIs implemented
[ 0.000000] GICv3: 0 Extended SPIs implemented
[ 0.000000] Root IRQ handler: gic_handle_irq
[ 0.000000] GICv3: GICv3 features: 16 PPIs
[ 0.000000] GICv3: CPU0: found redistributor 0 region 0:0x00000000fef00000
[ 0.000000] ITS [mem 0xfee20000-0xfee3ffff]
[ 0.000000] ITS@0x00000000fee20000: allocated 65536 Devices @500000 (flat, esz 8, psz 64K, shr 0)
[ 0.000000] ITS: using cache flushing for cmd queue
[ 0.000000] GICv3: using LPI property table @0x00000000004e0000
[ 0.000000] GIC: using cache flushing for LPI property table
[ 0.000000] GICv3: CPU0: using allocated LPI pending table @0x00000000004f0000
[ 0.000000] GICv3: GIC: PPI partition interrupt-partition-0[0] { /cpus/cpu@0[0] /cpus/cpu@1[1] /cpus/cpu@2[2] /cpus/cpu@3[3] }
[ 0.000000] GICv3: GIC: PPI partition interrupt-partition-1[1] { /cpus/cpu@100[4] /cpus/cpu@101[5] }
[ 0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[ 0.000000] arch_timer: cp15 timer(s) running at 24.00MHz (phys).
[ 0.000000] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0x588fe9dc0, max_idle_ns: 440795202592 ns
[ 0.000001] sched_clock: 56 bits at 24MHz, resolution 41ns, wraps every 4398046511097ns
[ 0.005602] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
[ 0.006373] ... MAX_LOCKDEP_SUBCLASSES: 8
[ 0.006789] ... MAX_LOCK_DEPTH: 48
[ 0.007212] ... MAX_LOCKDEP_KEYS: 8192
[ 0.007651] ... CLASSHASH_SIZE: 4096
[ 0.008088] ... MAX_LOCKDEP_ENTRIES: 32768
[ 0.008535] ... MAX_LOCKDEP_CHAINS: 65536
[ 0.008981] ... CHAINHASH_SIZE: 32768
[ 0.009428] memory used by lock dependency info: 6365 kB
[ 0.010018] memory used for stack traces: 4224 kB
[ 0.010500] per task-struct memory footprint: 1920 bytes
[ 0.011059] printk: before might_sleep()
[ 0.011079] printk: before down_console_sem()
[ 0.011477] printk: before down_console_sem()
[ 0.012112] Console: colour dummy device 80x25
[ 0.012984] printk: before might_sleep()
[ 0.013003] printk: before down_console_sem()
[ 0.013399] printk: before down_console_sem()
[ 0.013860] printk: console [tty0] enabled
[ 0.014986] printk: bootconsole [uart0] disabled
[ 0.015564] printk: before might_sleep()
[ 0.015582] printk: before down_console_sem()

>
> Does the system boot when you avoid "earlycon" parameter?

It doesn't boot, it hangs and I don't get any output.

>
>
> > I've booted a kernel compiled with CONFIG_PROVE_LOCKING=y, as the offending
> > commit fiddles with locks, but no splat was produced that would explain the
> > hang. I've also tried to boot a v5,19-rc2 kernel on my odroid-c4, the board
> > is booting just fine, so I'm guessing it only affects of subset of arm64
> > boards.
>
> You might try to switch the order of console_init() and lockdep_init()
> in start_kernel() in init/main.c

Did so above.

Thanks,
Alex

2022-06-14 14:03:26

by Jan Kara

[permalink] [raw]
Subject: Re: [BUG] rockpro64 board hangs in console_init() after commit 10e14073107d

On Tue 14-06-22 14:23:32, Petr Mladek wrote:
> On Mon 2022-06-13 17:54:35, Alexandru Elisei wrote:

<snip>

> > Config can be found at [1] (expires after 6 months). I've also built the
> > kernel with gcc 10.3.1 [2] (aarch64-none-linux-gnu), same issue.
> >
> > I've bisected the build failure to commit 10e14073107d ("writeback: Fix
> > inode->i_io_list not be protected by inode->i_lock error"); I've confirmed
> > that that commit is responsible by successfully booting the board with a
> > kernel built from v5.19-rc2 + the above commit reverted.
>
> It is strange. I can't see how consoles are related to filesystem
> writeback.
>
> Anyway, the commit 10e14073107d ("writeback: Fix inode->i_io_list not
> be protected by inode->i_lock error") modifies some locking and
> might be source of possible deadlocks.

Yes, I've got other reports from ARM people that this commit causes issues
for them (kernel oops or so) so the locking changes are likely at fault...

> I am not familiar with the fs code. But I noticed the following.
> The patch adds:
>
> + if (!was_dirty) {
> + wb = locked_inode_to_wb_and_lock_list(inode);
> + spin_lock(&inode->i_lock);
>
> And locked_inode_to_wb_and_lock_list() is defined this way:
>
> /**
> * locked_inode_to_wb_and_lock_list - determine a locked inode's wb and lock it
> * @inode: inode of interest with i_lock held
> *
> * Returns @inode's wb with its list_lock held. @inode->i_lock must be
> * held on entry and is released on return. The returned wb is guaranteed
> * to stay @inode's associated wb until its list_lock is released.
> */
> static struct bdi_writeback *
> locked_inode_to_wb_and_lock_list(struct inode *inode)
> __releases(&inode->i_lock)
> __acquires(&wb->list_lock)
> {
> while (true) {
> struct bdi_writeback *wb = inode_to_wb(inode);
>
> /*
> * inode_to_wb() association is protected by both
> * @inode->i_lock and @wb->list_lock but list_lock nests
> * outside i_lock. Drop i_lock and verify that the
> * association hasn't changed after acquiring list_lock.
> */
> wb_get(wb);
> spin_unlock(&inode->i_lock);
>
> It expects that inode->i_lock is taken before. But the problematic
> commit takes it later. It might mess the lock and cause a deadlock.

No. AFAICS inode->i_lock is held on entry to
locked_inode_to_wb_and_lock_list(). The function releases it so we have to
grab it again. The locking is ugly here but correct in this regard.

It rather likely has to do something with reordering the checks and running
locked_inode_to_wb_and_lock_list() on inodes for which we previously didn't
do it but I have to yet fully understand why things crash...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-06-14 15:12:21

by Petr Mladek

[permalink] [raw]
Subject: Re: [BUG] rockpro64 board hangs in console_init() after commit 10e14073107d

On Tue 2022-06-14 14:23:42, Alexandru Elisei wrote:
> With this change:
>
> diff --git a/init/main.c b/init/main.c
> index 0ee39cdcfcac..a245982eb8a2 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1057,6 +1057,8 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
>
> kmem_cache_init_late();
>
> + lockdep_init();
> +
> /*
> * HACK ALERT! This is early. We're enabling the console before
> * we've done PCI setups etc, and console_init() must be aware of
> @@ -1067,8 +1069,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
> panic("Too many boot %s vars at `%s'", panic_later,
> panic_param);
>
> - lockdep_init();
> -
> /*
> * Need to run this when irqs are enabled, because it wants
> * to self-test [hard/soft]-irqs on/off lock inversion bugs
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index ea3dd55709e7..aa7684c6745d 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2705,7 +2705,10 @@ static int console_cpu_notify(unsigned int cpu)
> */
> void console_lock(void)
> {
> + pr_info("before might_sleep()");
> might_sleep();
> + pr_info("before down_console_sem()");
> + pr_info("before down_console_sem()");
>
> down_console_sem();
> if (console_suspended)
> @@ -3508,12 +3511,18 @@ int unregister_console(struct console *console)
> if (console->exit)
> res = console->exit(console);
>
> + pr_info("Exiting from unregister_console(), res = %d", res);
> + pr_info("Exiting from unregister_console(), res = %d", res);
> +
> return res;
>
> out_disable_unlock:
> console->flags &= ~CON_ENABLED;
> console_unlock();
>
> + pr_info("Exiting from unregister_console(), res = %d", res);
> + pr_info("Exiting from unregister_console(), res = %d", res);
> +
> return res;
> }
> EXPORT_SYMBOL(unregister_console);
>
> Some of the pr_info statements are duplicated to see the output just before
> the console hangs (I assume they're needed to force a buffer flush).
>
> This is what I got:
>
> [ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> [ 0.000000] Linux version 5.19.0-rc2-dirty (alex@monolith) (aarch64-linux-gnu-gcc (GCC) 12.1.0, GNU ld (GNU Binutils) 2.38) #106 SMP PREEMPT Tue Jun 14 14:03:30 BST 2022
> [ 0.000000] Machine model: Pine64 RockPro64 v2.0
> [ 0.000000] efi: UEFI not found.
> [ 0.000000] earlycon: uart0 at MMIO32 0x00000000ff1a0000 (options '1500000n8')
> [ 0.000000] printk: before might_sleep()
> [ 0.000000] printk: before down_console_sem()
> [ 0.000000] printk: before down_console_sem()
> [ 0.000000] printk: bootconsole [uart0] enabled
> [ 0.000000] NUMA: No NUMA configuration found
> [ 0.000000] NUMA: Faking a node at [mem 0x0000000000200000-0x00000000f7ffffff]
> [ 0.000000] NUMA: NODE_DATA [mem 0xf77cef40-0xf77d0fff]
[...]
> [ 0.000001] sched_clock: 56 bits at 24MHz, resolution 41ns, wraps every 4398046511097ns
> [ 0.005602] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
> [ 0.006373] ... MAX_LOCKDEP_SUBCLASSES: 8
> [ 0.006789] ... MAX_LOCK_DEPTH: 48
> [ 0.007212] ... MAX_LOCKDEP_KEYS: 8192
> [ 0.007651] ... CLASSHASH_SIZE: 4096
> [ 0.008088] ... MAX_LOCKDEP_ENTRIES: 32768
> [ 0.008535] ... MAX_LOCKDEP_CHAINS: 65536
> [ 0.008981] ... CHAINHASH_SIZE: 32768
> [ 0.009428] memory used by lock dependency info: 6365 kB
> [ 0.010018] memory used for stack traces: 4224 kB
> [ 0.010500] per task-struct memory footprint: 1920 bytes
> [ 0.011059] printk: before might_sleep()
> [ 0.011079] printk: before down_console_sem()
> [ 0.011477] printk: before down_console_sem()
> [ 0.012112] Console: colour dummy device 80x25
> [ 0.012984] printk: before might_sleep()
> [ 0.013003] printk: before down_console_sem()
> [ 0.013399] printk: before down_console_sem()
> [ 0.013860] printk: console [tty0] enabled
> [ 0.014986] printk: bootconsole [uart0] disabled
> [ 0.015564] printk: before might_sleep()
> [ 0.015582] printk: before down_console_sem()

I think that it did not print the 2nd "printk: before
down_console_sem()" because there was missing newline "\n".

printk() keeps such a line open because pr_cont() might append
to it. The message will get printed to the console only when
pr_cont("bla bla \n") is called or when another non-continuous
printk() is called.

> > Does the system boot when you avoid "earlycon" parameter?
>
> It doesn't boot, it hangs and I don't get any output.

The difference might be that earlycon uses the serial port.
While the normal console is terminal "tty0".

Does it help to configure also the normal serial console.
I mean booting with something like:

earlycon console=uart,mmio32,0x00000000ff1a0000,1500000n8 console=tty0

I am not completely sure about the console=uart parameter. It is a
shame but I have never used it. I took the format from
Documentation/admin-guide/kernel-parameters.txt and the values
from your boot log:

[ 0.000000] earlycon: uart0 at MMIO32 0x00000000ff1a0000 (options '1500000n8')

> > > I've booted a kernel compiled with CONFIG_PROVE_LOCKING=y, as the offending
> > > commit fiddles with locks, but no splat was produced that would explain the
> > > hang. I've also tried to boot a v5,19-rc2 kernel on my odroid-c4, the board
> > > is booting just fine, so I'm guessing it only affects of subset of arm64
> > > boards.
> >
> > You might try to switch the order of console_init() and lockdep_init()
> > in start_kernel() in init/main.c
>
> Did so above.

Unfortunately, it did not print anything :-(

Best Regards,
Petr

2022-06-14 15:26:21

by Alexandru Elisei

[permalink] [raw]
Subject: Re: [BUG] rockpro64 board hangs in console_init() after commit 10e14073107d

(+Suzuki)

Hi,

I was able to boot the board after applying this patch from Suzuki [1].

[1] https://lore.kernel.org/all/[email protected]/

Thanks,
Alex

On Mon, Jun 13, 2022 at 05:54:35PM +0100, Alexandru Elisei wrote:
> Hi,
>
> (Apologies for the long To: list, I've added everyone that
> scripts/get_maintainer.pl listed for fs/{fs-writeback,inode}.c, for the
> rockchip boards, for printk.c and for the tty layer)
>
> When booting a kernel built from tag v5.19-rc2, my rockpro64-v2 hangs at
> boot with this log:
>
> [ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> [ 0.000000] Linux version 5.19.0-rc2 (alex@redacted) (aarch64-linux-gnu-gcc (GCC) 12.1.0, GNU ld (GNU Binutils) 2.38) #90 SMP PREEMPT Mon Jun 13 17:13:12 BST 2022
> [ 0.000000] Machine model: Pine64 RockPro64 v2.0
> [ 0.000000] efi: UEFI not found.
> [ 0.000000] earlycon: uart0 at MMIO32 0x00000000ff1a0000 (options '1500000n8')
> [ 0.000000] printk: bootconsole [uart0] enabled
> [ 0.000000] NUMA: No NUMA configuration found
> [ 0.000000] NUMA: Faking a node at [mem 0x0000000000200000-0x00000000f7ffffff]
> [ 0.000000] NUMA: NODE_DATA [mem 0xf77dc140-0xf77ddfff]
> [ 0.000000] Zone ranges:
> [ 0.000000] DMA [mem 0x0000000000200000-0x00000000f7ffffff]
> [ 0.000000] DMA32 empty
> [ 0.000000] Normal empty
> [ 0.000000] Movable zone start for each node
> [ 0.000000] Early memory node ranges
> [ 0.000000] node 0: [mem 0x0000000000200000-0x00000000f7ffffff]
> [ 0.000000] Initmem setup node 0 [mem 0x0000000000200000-0x00000000f7ffffff]
> [ 0.000000] On node 0, zone DMA: 512 pages in unavailable ranges
> [ 0.000000] cma: Reserved 32 MiB at 0x00000000f0000000
> [ 0.000000] psci: probing for conduit method from DT.
> [ 0.000000] psci: PSCIv1.1 detected in firmware.
> [ 0.000000] psci: Using standard PSCI v0.2 function IDs
> [ 0.000000] psci: MIGRATE_INFO_TYPE not supported.
> [ 0.000000] psci: SMC Calling Convention v1.2
> [ 0.000000] percpu: Embedded 20 pages/cpu s41768 r8192 d31960 u81920
> [ 0.000000] Detected VIPT I-cache on CPU0
> [ 0.000000] CPU features: detected: GIC system register CPU interface
> [ 0.000000] CPU features: detected: ARM erratum 845719
> [ 0.000000] Fallback order for Node 0: 0
> [ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 999432
> [ 0.000000] Policy zone: DMA
> [ 0.000000] Kernel command line: root=PARTUUID=7f4aab92-69d8-47f3-be10-624da40a71f9 rw earlycon rootwait
> [ 0.000000] Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes, linear)
> [ 0.000000] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes, linear)
> [ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
> [ 0.000000] Memory: 3915796K/4061184K available (15552K kernel code, 2758K rwdata, 8668K rodata, 6336K init, 564K bss, 112620K reserved, 32768K cma-reserved)
> [ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=6, Nodes=1
> [ 0.000000] rcu: Preemptible hierarchical RCU implementation.
> [ 0.000000] rcu: RCU event tracing is enabled.
> [ 0.000000] rcu: RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=6.
> [ 0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
> [ 0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=6
> [ 0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
> [ 0.000000] GICv3: GIC: Using split EOI/Deactivate mode
> [ 0.000000] GICv3: 256 SPIs implemented
> [ 0.000000] GICv3: 0 Extended SPIs implemented
> [ 0.000000] Root IRQ handler: gic_handle_irq
> [ 0.000000] GICv3: GICv3 features: 16 PPIs
> [ 0.000000] GICv3: CPU0: found redistributor 0 region 0:0x00000000fef00000
> [ 0.000000] ITS [mem 0xfee20000-0xfee3ffff]
> [ 0.000000] ITS@0x00000000fee20000: allocated 65536 Devices @480000 (flat, esz 8, psz 64K, shr 0)
> [ 0.000000] ITS: using cache flushing for cmd queue
> [ 0.000000] GICv3: using LPI property table @0x0000000000440000
> [ 0.000000] GIC: using cache flushing for LPI property table
> [ 0.000000] GICv3: CPU0: using allocated LPI pending table @0x0000000000450000
> [ 0.000000] GICv3: GIC: PPI partition interrupt-partition-0[0] { /cpus/cpu@0[0] /cpus/cpu@1[1] /cpus/cpu@2[2] /cpus/cpu@3[3] }
> [ 0.000000] GICv3: GIC: PPI partition interrupt-partition-1[1] { /cpus/cpu@100[4] /cpus/cpu@101[5] }
> [ 0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
> [ 0.000000] arch_timer: cp15 timer(s) running at 24.00MHz (phys).
> [ 0.000000] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0x588fe9dc0, max_idle_ns: 440795202592 ns
> [ 0.000001] sched_clock: 56 bits at 24MHz, resolution 41ns, wraps every 4398046511097ns
> [ 0.005107] Console: colour dummy device 80x25
> [ 0.005549] printk: console [tty0] enabled
> [ 0.005956] printk: bootconsole [uart0] disabled
>
> Config can be found at [1] (expires after 6 months). I've also built the
> kernel with gcc 10.3.1 [2] (aarch64-none-linux-gnu), same issue.
>
> I've bisected the build failure to commit 10e14073107d ("writeback: Fix
> inode->i_io_list not be protected by inode->i_lock error"); I've confirmed
> that that commit is responsible by successfully booting the board with a
> kernel built from v5.19-rc2 + the above commit reverted.
>
> I tried to do some investigating, it seems that the kernel is stuck at
> printk.c::console_init() -> drivers/tty/vt/vt.c::con_init() ->
> printk.c::register_console() -> unregister_console() -> console_lock().
> This has been determined by adding pr_info statements.
>
> I've booted a kernel compiled with CONFIG_PROVE_LOCKING=y, as the offending
> commit fiddles with locks, but no splat was produced that would explain the
> hang. I've also tried to boot a v5,19-rc2 kernel on my odroid-c4, the board
> is booting just fine, so I'm guessing it only affects of subset of arm64
> boards.
>
> [1] https://pastebin.com/MfDrKyKd
> [2] https://developer.arm.com/downloads/-/gnu-a
>
> Thanks,
> Alex
>
> _______________________________________________
> linux-arm-kernel mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

2022-06-14 15:26:52

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [BUG] rockpro64 board hangs in console_init() after commit 10e14073107d

On (22/06/14 15:53), Alexandru Elisei wrote:
> (+Suzuki)
>
> I was able to boot the board after applying this patch from Suzuki [1].

OK, so the board actually boots to panic and the issue is that printk
does not flush panic messages. I think Petr has a patch for it.

2022-06-15 13:47:51

by Alexandru Elisei

[permalink] [raw]
Subject: Re: [BUG] rockpro64 board hangs in console_init() after commit 10e14073107d

Hi Petr,

On Tue, Jun 14, 2022 at 05:00:04PM +0200, Petr Mladek wrote:
> On Tue 2022-06-14 14:23:42, Alexandru Elisei wrote:
> > With this change:
> >
> > diff --git a/init/main.c b/init/main.c
> > index 0ee39cdcfcac..a245982eb8a2 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -1057,6 +1057,8 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
> >
> > kmem_cache_init_late();
> >
> > + lockdep_init();
> > +
> > /*
> > * HACK ALERT! This is early. We're enabling the console before
> > * we've done PCI setups etc, and console_init() must be aware of
> > @@ -1067,8 +1069,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
> > panic("Too many boot %s vars at `%s'", panic_later,
> > panic_param);
> >
> > - lockdep_init();
> > -
> > /*
> > * Need to run this when irqs are enabled, because it wants
> > * to self-test [hard/soft]-irqs on/off lock inversion bugs
> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index ea3dd55709e7..aa7684c6745d 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -2705,7 +2705,10 @@ static int console_cpu_notify(unsigned int cpu)
> > */
> > void console_lock(void)
> > {
> > + pr_info("before might_sleep()");
> > might_sleep();
> > + pr_info("before down_console_sem()");
> > + pr_info("before down_console_sem()");
> >
> > down_console_sem();
> > if (console_suspended)
> > @@ -3508,12 +3511,18 @@ int unregister_console(struct console *console)
> > if (console->exit)
> > res = console->exit(console);
> >
> > + pr_info("Exiting from unregister_console(), res = %d", res);
> > + pr_info("Exiting from unregister_console(), res = %d", res);
> > +
> > return res;
> >
> > out_disable_unlock:
> > console->flags &= ~CON_ENABLED;
> > console_unlock();
> >
> > + pr_info("Exiting from unregister_console(), res = %d", res);
> > + pr_info("Exiting from unregister_console(), res = %d", res);
> > +
> > return res;
> > }
> > EXPORT_SYMBOL(unregister_console);
> >
> > Some of the pr_info statements are duplicated to see the output just before
> > the console hangs (I assume they're needed to force a buffer flush).
> >
> > This is what I got:
> >
> > [ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> > [ 0.000000] Linux version 5.19.0-rc2-dirty (alex@monolith) (aarch64-linux-gnu-gcc (GCC) 12.1.0, GNU ld (GNU Binutils) 2.38) #106 SMP PREEMPT Tue Jun 14 14:03:30 BST 2022
> > [ 0.000000] Machine model: Pine64 RockPro64 v2.0
> > [ 0.000000] efi: UEFI not found.
> > [ 0.000000] earlycon: uart0 at MMIO32 0x00000000ff1a0000 (options '1500000n8')
> > [ 0.000000] printk: before might_sleep()
> > [ 0.000000] printk: before down_console_sem()
> > [ 0.000000] printk: before down_console_sem()
> > [ 0.000000] printk: bootconsole [uart0] enabled
> > [ 0.000000] NUMA: No NUMA configuration found
> > [ 0.000000] NUMA: Faking a node at [mem 0x0000000000200000-0x00000000f7ffffff]
> > [ 0.000000] NUMA: NODE_DATA [mem 0xf77cef40-0xf77d0fff]
> [...]
> > [ 0.000001] sched_clock: 56 bits at 24MHz, resolution 41ns, wraps every 4398046511097ns
> > [ 0.005602] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
> > [ 0.006373] ... MAX_LOCKDEP_SUBCLASSES: 8
> > [ 0.006789] ... MAX_LOCK_DEPTH: 48
> > [ 0.007212] ... MAX_LOCKDEP_KEYS: 8192
> > [ 0.007651] ... CLASSHASH_SIZE: 4096
> > [ 0.008088] ... MAX_LOCKDEP_ENTRIES: 32768
> > [ 0.008535] ... MAX_LOCKDEP_CHAINS: 65536
> > [ 0.008981] ... CHAINHASH_SIZE: 32768
> > [ 0.009428] memory used by lock dependency info: 6365 kB
> > [ 0.010018] memory used for stack traces: 4224 kB
> > [ 0.010500] per task-struct memory footprint: 1920 bytes
> > [ 0.011059] printk: before might_sleep()
> > [ 0.011079] printk: before down_console_sem()
> > [ 0.011477] printk: before down_console_sem()
> > [ 0.012112] Console: colour dummy device 80x25
> > [ 0.012984] printk: before might_sleep()
> > [ 0.013003] printk: before down_console_sem()
> > [ 0.013399] printk: before down_console_sem()
> > [ 0.013860] printk: console [tty0] enabled
> > [ 0.014986] printk: bootconsole [uart0] disabled
> > [ 0.015564] printk: before might_sleep()
> > [ 0.015582] printk: before down_console_sem()
>
> I think that it did not print the 2nd "printk: before
> down_console_sem()" because there was missing newline "\n".

You're right, tried it with appending the newline character and it worked.

>
> printk() keeps such a line open because pr_cont() might append
> to it. The message will get printed to the console only when
> pr_cont("bla bla \n") is called or when another non-continuous
> printk() is called.

Thank you for the explanation!

>
> > > Does the system boot when you avoid "earlycon" parameter?
> >
> > It doesn't boot, it hangs and I don't get any output.
>
> The difference might be that earlycon uses the serial port.
> While the normal console is terminal "tty0".
>
> Does it help to configure also the normal serial console.
> I mean booting with something like:
>
> earlycon console=uart,mmio32,0x00000000ff1a0000,1500000n8 console=tty0
>
> I am not completely sure about the console=uart parameter. It is a
> shame but I have never used it. I took the format from
> Documentation/admin-guide/kernel-parameters.txt and the values
> from your boot log:
>
> [ 0.000000] earlycon: uart0 at MMIO32 0x00000000ff1a0000 (options '1500000n8')

According to the devicetree, earlycon should be serial2:1500000n8 (the
"stdout-path" property of the "chosen" node), which is an alias for the
node /serial@ff1a0000. That serial console has the compatible property
"rockchip,rk3399-uart", "snps,dw-apb-uart". The uart8250 early console
driver binds to that device.

Putting the kernel output and the information from the devicetree together,
the earlycon parameter should be:

earlycon=uart8250,mmio32,0xff1a0000,1500000n8

But when I use that, I don't get any output and the kernel hangs on v5.18
and v5.17 too.

It turns out that leaving the baudrate out of the earlycon parameter:

earlycon=uart8250,mmio32,0xff1a0000

or specifying a baudrate of 115200:

earlycon=uart8250,mmio32,0xff1a0000,115200n8

makes it work again, and the board can boot. I assume that because a bogus
baud rate makes earlycon work, there is something off with how the 8250
earlycon driver gets the UART clock frequency (credits to Andre Przywara
for helping me with the UART debugging and discovering this).

>
> > > > I've booted a kernel compiled with CONFIG_PROVE_LOCKING=y, as the offending
> > > > commit fiddles with locks, but no splat was produced that would explain the
> > > > hang. I've also tried to boot a v5,19-rc2 kernel on my odroid-c4, the board
> > > > is booting just fine, so I'm guessing it only affects of subset of arm64
> > > > boards.
> > >
> > > You might try to switch the order of console_init() and lockdep_init()
> > > in start_kernel() in init/main.c
> >
> > Did so above.
>
> Unfortunately, it did not print anything :-(

With this patch [1] I was able to succefully boot the board. So I guess
problem should be fixed.

[1] https://lore.kernel.org/all/[email protected]/

Thanks,
Alex

2022-06-15 17:10:54

by Jan Kara

[permalink] [raw]
Subject: Re: [BUG] rockpro64 board hangs in console_init() after commit 10e14073107d

Hi Alexandru!

On Wed 15-06-22 13:50:24, Alexandru Elisei wrote:
> On Tue, Jun 14, 2022 at 05:00:04PM +0200, Petr Mladek wrote:
> > > > > I've booted a kernel compiled with CONFIG_PROVE_LOCKING=y, as the offending
> > > > > commit fiddles with locks, but no splat was produced that would explain the
> > > > > hang. I've also tried to boot a v5,19-rc2 kernel on my odroid-c4, the board
> > > > > is booting just fine, so I'm guessing it only affects of subset of arm64
> > > > > boards.
> > > >
> > > > You might try to switch the order of console_init() and lockdep_init()
> > > > in start_kernel() in init/main.c
> > >
> > > Did so above.
> >
> > Unfortunately, it did not print anything :-(
>
> With this patch [1] I was able to succefully boot the board. So I guess
> problem should be fixed.
>
> [1] https://lore.kernel.org/all/[email protected]/

Can you please try booting with the patch I've posted at [2]? After some
discussion with Linus that would be a preferable fix... Thanks!

[2] https://lore.kernel.org/all/[email protected]

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-06-15 17:17:35

by Alexandru Elisei

[permalink] [raw]
Subject: Re: [BUG] rockpro64 board hangs in console_init() after commit 10e14073107d

Hi,

On Wed, Jun 15, 2022 at 06:39:49PM +0200, Jan Kara wrote:
> Hi Alexandru!
>
> On Wed 15-06-22 13:50:24, Alexandru Elisei wrote:
> > On Tue, Jun 14, 2022 at 05:00:04PM +0200, Petr Mladek wrote:
> > > > > > I've booted a kernel compiled with CONFIG_PROVE_LOCKING=y, as the offending
> > > > > > commit fiddles with locks, but no splat was produced that would explain the
> > > > > > hang. I've also tried to boot a v5,19-rc2 kernel on my odroid-c4, the board
> > > > > > is booting just fine, so I'm guessing it only affects of subset of arm64
> > > > > > boards.
> > > > >
> > > > > You might try to switch the order of console_init() and lockdep_init()
> > > > > in start_kernel() in init/main.c
> > > >
> > > > Did so above.
> > >
> > > Unfortunately, it did not print anything :-(
> >
> > With this patch [1] I was able to succefully boot the board. So I guess
> > problem should be fixed.
> >
> > [1] https://lore.kernel.org/all/[email protected]/
>
> Can you please try booting with the patch I've posted at [2]? After some
> discussion with Linus that would be a preferable fix... Thanks!
>
> [2] https://lore.kernel.org/all/[email protected]

The patch works for me, thank you!

Do you also need me to reply to the discussion?

Alex

>
> Honza
>
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2022-06-15 21:27:08

by Jan Kara

[permalink] [raw]
Subject: Re: [BUG] rockpro64 board hangs in console_init() after commit 10e14073107d

On Wed 15-06-22 17:49:58, Alexandru Elisei wrote:
> Hi,
>
> On Wed, Jun 15, 2022 at 06:39:49PM +0200, Jan Kara wrote:
> > Hi Alexandru!
> >
> > On Wed 15-06-22 13:50:24, Alexandru Elisei wrote:
> > > On Tue, Jun 14, 2022 at 05:00:04PM +0200, Petr Mladek wrote:
> > > > > > > I've booted a kernel compiled with CONFIG_PROVE_LOCKING=y, as the offending
> > > > > > > commit fiddles with locks, but no splat was produced that would explain the
> > > > > > > hang. I've also tried to boot a v5,19-rc2 kernel on my odroid-c4, the board
> > > > > > > is booting just fine, so I'm guessing it only affects of subset of arm64
> > > > > > > boards.
> > > > > >
> > > > > > You might try to switch the order of console_init() and lockdep_init()
> > > > > > in start_kernel() in init/main.c
> > > > >
> > > > > Did so above.
> > > >
> > > > Unfortunately, it did not print anything :-(
> > >
> > > With this patch [1] I was able to succefully boot the board. So I guess
> > > problem should be fixed.
> > >
> > > [1] https://lore.kernel.org/all/[email protected]/
> >
> > Can you please try booting with the patch I've posted at [2]? After some
> > discussion with Linus that would be a preferable fix... Thanks!
> >
> > [2] https://lore.kernel.org/all/[email protected]
>
> The patch works for me, thank you!

Thanks for testing!

> Do you also need me to reply to the discussion?

No, I don't need that.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR