2024-04-22 07:46:17

by Oliver Sang

Subject: [linux-next:master] [init] b8de39bd1b: BUG:kernel_failed_in_early-boot_stage,last_printk:early_console_in_setup_code



Hello,

kernel test robot noticed "BUG:kernel_failed_in_early-boot_stage,last_printk:early_console_in_setup_code" on:

commit: b8de39bd1b76faffe7cd91e148a6d7d9bf4e38f7 ("init: fix allocated page overlapping with PTR_ERR")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

[test failed on linux-next/master a35e92ef04c07bd473404b9b73d489aea19a60a8]

in testcase: boot

compiler: gcc-13
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+-------------------------------------------------------------------------------+------------+------------+
|                                                                               | fdb74eb6c7 | b8de39bd1b |
+-------------------------------------------------------------------------------+------------+------------+
| boot_successes                                                                | 12         | 0          |
| boot_failures                                                                 | 0          | 12         |
| BUG:kernel_failed_in_early-boot_stage,last_printk:early_console_in_setup_code | 0          | 12         |
+-------------------------------------------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]


early console in setup code
convert early boot stage from hang to failed
BUG: kernel failed in early-boot stage, last printk: early console in setup code
Linux version 6.9.0-rc4-00031-gb8de39bd1b76 #1
Command line: ip=::::vm-meta-21::dhcp root=/dev/ram0 RESULT_ROOT=/result/boot/1/vm-snb/yocto-i386-minimal-20190520.cgz/x86_64-randconfig-003-20240419/gcc-13/b8de39bd1b76faffe7cd91e148a6d7d9bf4e38f7/3 BOOT_IMAGE=/pkg/linux/x86_64-randconfig-003-20240419/gcc-13/b8de39bd1b76faffe7cd91e148a6d7d9bf4e38f7/vmlinuz-6.9.0-rc4-00031-gb8de39bd1b76 branch=linux-next/master job=/lkp/jobs/scheduled/vm-meta-21/boot-1-yocto-i386-minimal-20190520.cgz-x86_64-randconfig-003-20240419-b8de39bd1b76-20240420-48196-3fymo-3.yaml user=lkp ARCH=x86_64 kconfig=x86_64-randconfig-003-20240419 commit=b8de39bd1b76faffe7cd91e148a6d7d9bf4e38f7 nmi_watchdog=0 intremap=posted_msi vmalloc=256M initramfs_async=0 page_owner=on max_uptime=600 LKP_SERVER=internal-lkp-server selinux=0 debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 net.ifnames=0 printk.devkmsg=on panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 prompt_ramdisk=0 drbd.minor_count=8 systemd.log_level=err ignore_loglevel console=tty0 earlyprintk=ttyS0,115200 console=ttyS0,115200 vga=normal rw rcuperf.shutdown=0 watchdog_thresh=240 audit=0



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240422/[email protected]



--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



2024-04-22 08:30:04

by Nam Cao

Subject: Re: [linux-next:master] [init] b8de39bd1b: BUG:kernel_failed_in_early-boot_stage,last_printk:early_console_in_setup_code

On Mon, Apr 22, 2024 at 03:45:00PM +0800, kernel test robot wrote:
> kernel test robot noticed "BUG:kernel_failed_in_early-boot_stage,last_printk:early_console_in_setup_code" on:
>
> commit: b8de39bd1b76faffe7cd91e148a6d7d9bf4e38f7 ("init: fix allocated page overlapping with PTR_ERR")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

I can reproduce the problem. After rebasing this commit onto v6.8.7, I can
still observe it.

No immediate idea what the problem is. The backtrace from gdb goes crazy:

(gdb) bt
#0 0xffffffffb2074ded in ?? ()
#1 0x00000000000000a1 in ?? ()
#2 0x00000000000000a1 in ?? ()
#3 0x000000007ffff000 in ?? ()
#4 0x00000000543ff000 in ?? ()
#5 0x0000000000000000 in ?? ()

@akpm: drop this commit until this is figured out?

Best regards,
Nam

2024-04-22 09:20:56

by Mike Rapoport

Subject: Re: [linux-next:master] [init] b8de39bd1b: BUG:kernel_failed_in_early-boot_stage,last_printk:early_console_in_setup_code

On Mon, Apr 22, 2024 at 10:29:42AM +0200, Nam Cao wrote:
> On Mon, Apr 22, 2024 at 03:45:00PM +0800, kernel test robot wrote:
> > kernel test robot noticed "BUG:kernel_failed_in_early-boot_stage,last_printk:early_console_in_setup_code" on:
> >
> > commit: b8de39bd1b76faffe7cd91e148a6d7d9bf4e38f7 ("init: fix allocated page overlapping with PTR_ERR")
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>
> I can reproduce the problem. After rebasing this commit onto v6.8.7, I can
> still observe it.
>
> No immediate idea what the problem is. The backtrace from gdb goes crazy:
>
> (gdb) bt
> #0 0xffffffffb2074ded in ?? ()
> #1 0x00000000000000a1 in ?? ()
> #2 0x00000000000000a1 in ?? ()
> #3 0x000000007ffff000 in ?? ()
> #4 0x00000000543ff000 in ?? ()
> #5 0x0000000000000000 in ?? ()

The kernel config here has CONFIG_DEBUG_VIRTUAL=y, so __pa translates to
__phys_addr() in arch/x86/mm/physaddr.c and __pa(-PAGE_SIZE) triggers

VIRTUAL_BUG_ON(y >= KERNEL_IMAGE_SIZE);

x86 has __pa_nodebug() that does not do the bounds check, but it cannot be
used in generic code because no other arch except s390 defines it.

For now I don't have ideas how to make this work in the general case, so
probably we should only fix riscv for now.

> @akpm: drop this commit until this is figured out?
>
> Best regards,
> Nam
>

--
Sincerely yours,
Mike.
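
To make the failing check in the message above concrete, here is a minimal
userspace sketch of the kernel-image branch of the x86_64 __phys_addr() debug
check. It is a model, not the kernel source: the MODEL_* constants and the
helper name are assumptions for illustration. In particular,
KERNEL_IMAGE_SIZE is taken as 1 GiB here; it is config-dependent, and a
512 MiB value fails the same way.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86_64 values, copied here for illustration only. */
#define MODEL_PAGE_SIZE         0x1000ULL
#define MODEL_START_KERNEL_MAP  0xffffffff80000000ULL
/* Assumption: 1 GiB. A 512 MiB image size trips the check as well. */
#define MODEL_KERNEL_IMAGE_SIZE (1024ULL * 1024 * 1024)

/*
 * Hypothetical helper: returns true when the kernel-image branch of the
 * debug check would fire for virtual address x.
 */
static bool model_kernel_image_check_fires(uint64_t x)
{
	uint64_t y = x - MODEL_START_KERNEL_MAP;

	/* x > y means the subtraction did not wrap: x >= START_KERNEL_map. */
	if (x > y)
		return y >= MODEL_KERNEL_IMAGE_SIZE;

	/* Direct-map addresses take a different branch, not modeled here. */
	return false;
}
```

With x = -PAGE_SIZE (0xfffffffffffff000), y works out to 0x7ffff000, just
under 2 GiB, so the check fires for either image size. The same value shows
up in frame #3 of the gdb output quoted above, though that may be a
coincidence.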

2024-04-22 10:18:21

by Nam Cao

Subject: Re: [linux-next:master] [init] b8de39bd1b: BUG:kernel_failed_in_early-boot_stage,last_printk:early_console_in_setup_code

On Mon, Apr 22, 2024 at 12:18:46PM +0300, Mike Rapoport wrote:
> The kernel config here has CONFIG_DEBUG_VIRTUAL=y, so __pa translates to
> __phys_addr() in arch/x86/mm/physaddr.c and __pa(-PAGE_SIZE) triggers
>
> VIRTUAL_BUG_ON(y >= KERNEL_IMAGE_SIZE);

RISC-V also has a similar check when CONFIG_DEBUG_VIRTUAL=y.

>
> x86 has __pa_nodebug() that does not do the bounds check, but it cannot be
> used in generic code because no other arch except s390 defines it.
>
> For now I don't have ideas how to make this work in the general case, so
> probably we should only fix riscv for now.

Agreed, let's just fix riscv for now. This time I will cook up something
safer, with no more __pa() on a potentially invalid address.

Best regards,
Nam
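
For context on why the last page of the address space matters at all (the
commit title refers to an allocated page overlapping with PTR_ERR), the
following userspace sketch models the idea behind the kernel's
IS_ERR_VALUE(): the top MAX_ERRNO values of the address space are treated as
error codes. The MODEL_* names and both helpers are hypothetical; only the
value 4095 for MAX_ERRNO is taken from include/linux/err.h, and a 4 KiB page
size is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095ULL
#define MODEL_PAGE_SIZE 0x1000ULL  /* assumption: 4 KiB pages */

/* Models IS_ERR_VALUE(): the top MAX_ERRNO values encode errors. */
static bool model_is_err_value(uint64_t x)
{
	return x >= (uint64_t)0 - MODEL_MAX_ERRNO;
}

/*
 * Hypothetical helper: does any address in [start, start + len) look like
 * an error value? Assumes len > 0 and that the range does not wrap, so it
 * is enough to test the highest address in the range.
 */
static bool model_range_overlaps_err(uint64_t start, uint64_t len)
{
	uint64_t last = start + len - 1;

	return model_is_err_value(last);
}
```

Under these assumptions, only the very last page (starting at -PAGE_SIZE)
contains addresses that are indistinguishable from error pointers; the page
just below it does not.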