2024-02-29 16:15:47

by Breno Leitao

Subject: general protection fault, probably for non-canonical address in pick_next_task_fair()

I've been running stress tests using stress-ng on a kernel with some
debug options enabled, such as KASAN and friends (see the config below).

I first saw this in rc4, and the decoded instructions are a bit off (as they
are here too: search for movabs in the dmesg below and you will find something
like `(bad)`), so I thought it was a machine issue. But now I see it again, and
I am sharing it for awareness.

This is happening on an upstream kernel at the following commit:
d206a76d7d2726 ("Linux 6.8-rc6")

This is the excerpt that shows up before the crash:

general protection fault, probably for non-canonical address 0xdffffc0000000014: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
KASAN: null-ptr-deref in range [0x00000000000000a0-0x00000000000000a7]
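
The two addresses in this report are related by the KASAN shadow mapping. A minimal sketch of that arithmetic follows, assuming generic KASAN on x86-64 with the default shadow offset 0xdffffc0000000000 and a scale shift of 3 (one shadow byte per 8 bytes of memory). The function names are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: x86-64 generic KASAN with the default shadow offset. */
#define KASAN_SHADOW_SCALE_SHIFT 3
#define KASAN_SHADOW_OFFSET      0xdffffc0000000000ULL

/* Map a shadow address back to the first byte of memory it describes. */
uint64_t shadow_to_mem(uint64_t shadow)
{
	/* This sketch only handles shadow addresses at or above the offset. */
	assert(shadow >= KASAN_SHADOW_OFFSET);
	return (shadow - KASAN_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT;
}

/* Last byte of memory described by the same shadow byte. */
uint64_t shadow_to_mem_end(uint64_t shadow)
{
	return shadow_to_mem(shadow) + (1ULL << KASAN_SHADOW_SCALE_SHIFT) - 1;
}
```

Under those assumptions, the faulting address 0xdffffc0000000014 is a shadow address, and it maps back to 0xa0 through 0xa7, which is the range KASAN prints. In other words, the instrumented code was checking the shadow for a small offset from a NULL pointer.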

This is the stack trace for the fault:

? __die_body (arch/x86/kernel/dumpstack.c:421)
? die_addr (arch/x86/kernel/dumpstack.c:460)
? exc_general_protection (arch/x86/kernel/traps.c:? arch/x86/kernel/traps.c:643)
? asm_exc_general_protection (arch/x86/include/asm/idtentry.h:564)
? pick_next_task_fair (kernel/sched/sched.h:1453 kernel/sched/fair.c:8435)
? pick_next_task_fair (kernel/sched/fair.c:5463 kernel/sched/fair.c:8434)
? update_rq_clock_task (kernel/sched/core.c:?)
__schedule (kernel/sched/core.c:6022 kernel/sched/core.c:6545 kernel/sched/core.c:6691)
schedule (kernel/sched/core.c:6803 kernel/sched/core.c:6817)
syscall_exit_to_user_mode (kernel/entry/common.c:98 include/linux/entry-common.h:328 kernel/entry/common.c:201 kernel/entry/common.c:212)
do_syscall_64 (arch/x86/entry/common.c:102)
? irqentry_exit_to_user_mode (kernel/entry/common.c:228)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)

Full dmesg: https://paste.mozilla.org/RiLnt4QO#
Configs: https://paste.mozilla.org/XJ9wbdRp


2024-02-29 22:58:54

by Thomas Gleixner

Subject: Re: general protection fault, probably for non-canonical address in pick_next_task_fair()

On Thu, Feb 29 2024 at 07:55, Breno Leitao wrote:
> I've been running stress tests using stress-ng on a kernel with some
> debug options enabled, such as KASAN and friends (see the config below).
>
> I first saw this in rc4, and the decoded instructions are a bit off (as they
> are here too: search for movabs in the dmesg below and you will find something
> like `(bad)`), so I thought it was a machine issue. But now I see it again, and
> I am sharing it for awareness.

The (bad) comes after the faulting instruction, but it gives a hint:

  2e:   0f 84 67 ff ff ff       je     0xffffffffffffff9b
  34:   48 89 ef                mov    %rbp,%rdi
  37:   e8 cf 70 76 00          call   0x76710b
  3c:   e9                      .byte 0xe9

That's an invalid opcode, which means that memory is corrupted.

Thanks,

tglx

2024-03-01 03:44:07

by Abel Wu

Subject: Re: general protection fault, probably for non-canonical address in pick_next_task_fair()

Hi Breno, this seems to be a known issue under discussion.

https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/


2024-03-01 03:47:24

by Abel Wu

Subject: Re: general protection fault, probably for non-canonical address in pick_next_task_fair()

(+ Chen Yu, Oliver Sang)


2024-03-01 07:15:17

by Chen Yu

Subject: Re: general protection fault, probably for non-canonical address in pick_next_task_fair()

On 2024-03-01 at 11:47:05 +0800, Abel Wu wrote:
> (+ Chen Yu, Oliver Sang)
>
> On 2/29/24 11:55 PM, Breno Leitao wrote:
> > I've been running stress tests using stress-ng on a kernel with some
> > debug options enabled, such as KASAN and friends (see the config below).
> >
> > I first saw this in rc4, and the decoded instructions are a bit off (as they
> > are here too: search for movabs in the dmesg below and you will find something
> > like `(bad)`), so I thought it was a machine issue. But now I see it again, and
> > I am sharing it for awareness.
> >
> > This is happening on an upstream kernel at the following commit:
> > d206a76d7d2726 ("Linux 6.8-rc6")
> >
> > This is the excerpt that shows up before the crash:
> >
> > general protection fault, probably for non-canonical address 0xdffffc0000000014: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
> > KASAN: null-ptr-deref in range [0x00000000000000a0-0x00000000000000a7]
> >
> > This is the stack trace for the fault:
> >
> > ? __die_body (arch/x86/kernel/dumpstack.c:421)
> > ? die_addr (arch/x86/kernel/dumpstack.c:460)
> > ? exc_general_protection (arch/x86/kernel/traps.c:? arch/x86/kernel/traps.c:643)
> > ? asm_exc_general_protection (arch/x86/include/asm/idtentry.h:564)
> > ? pick_next_task_fair (kernel/sched/sched.h:1453 kernel/sched/fair.c:8435)

This seems to be the same cause: pick_eevdf() returns NULL, so the kernel
panics here:

	cfs_rq = group_cfs_rq(se);
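
To illustrate why a NULL se would fault at a small address rather than at 0, here is a minimal userspace sketch. It assumes group_cfs_rq() simply reads se->my_q. struct mock_sched_entity is a hypothetical stand-in whose padding is chosen only so that my_q lands at offset 0xa0, to match the KASAN range in the report. The real struct sched_entity layout depends on the kernel config, and both helper functions are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cfs_rq; /* opaque here; only pointers to it are used */

/* Hypothetical stand-in, NOT the real struct sched_entity layout. */
struct mock_sched_entity {
	unsigned char pad[0xa0]; /* assumed padding to place my_q at 0xa0 */
	struct cfs_rq *my_q;
};

static_assert(offsetof(struct mock_sched_entity, my_q) == 0xa0,
	      "my_q is placed at 0xa0 for this illustration");

/* Address that a load of se->my_q would touch, without performing the load. */
uintptr_t my_q_access_addr(const struct mock_sched_entity *se)
{
	return (uintptr_t)se + offsetof(struct mock_sched_entity, my_q);
}

/* Guarded variant: returns NULL instead of dereferencing a NULL entity. */
struct cfs_rq *group_cfs_rq_checked(struct mock_sched_entity *se)
{
	return se ? se->my_q : NULL;
}
```

With se == NULL, my_q_access_addr() yields 0xa0, which is the kind of address KASAN classifies as a null-ptr-deref. The guarded variant only shows where a check would sit; whether pick_eevdf() returning NULL should be tolerated at all is a separate question for the linked discussions.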

I recall that lkp runs stress-ng regularly as a regression test, but it has
not detected this yet.

thanks,
Chenyu