LinuxLists.cc - [4.9-rc3] BUG: unable to handle kernel paging request at ffffc900144dfc60

2016-11-01 14:36:34

Subject: [4.9-rc3] BUG: unable to handle kernel paging request at ffffc900144dfc60

Hello.

Andy Lutomirski wrote:
> Reporting these fields on a non-current task is dangerous. If the
> task is in any state other than normal kernel code, they may contain
> garbage or even kernel addresses on some architectures. (x86_64
> used to do this. I bet lots of architectures still do.) With
> CONFIG_THREAD_INFO_IN_TASK, it can OOPS, too.
>
> As far as I know, there are no user programs that make any material
> use of these fields, so just get rid of them.
>
> Cc: Tetsuo Handa <[email protected]>
> Cc: Tycho Andersen <[email protected]>
> Cc: Kees Cook <[email protected]>
> Reported-by: Jann Horn <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> fs/proc/array.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 88c7de12197b..1bb1097e73b7 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -417,10 +417,11 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
>  	mm = get_task_mm(task);
>  	if (mm) {
>  		vsize = task_vsize(mm);
> -		if (permitted) {
> -			eip = KSTK_EIP(task);
> -			esp = KSTK_ESP(task);
> -		}
> +		/*
> +		 * esp and eip are intentionally zeroed out. There is no
> +		 * non-racy way to read them without freezing the task.
> +		 * Programs that need reliable values can use ptrace(2).
> +		 */
>  	}
>  
>  	get_task_comm(tcomm, task);
> --
> 2.7.4

I got an Oops with khungtaskd. This kernel was built with CONFIG_THREAD_INFO_IN_TASK=y.
Is this the same reason?

[ 580.778495] Out of memory: Kill process 10206 (a.out) score 998 or sacrifice child
[ 580.778499] Killed process 10206 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[ 580.797408] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 580.802963] a.out x[ 580.803660] BUG: unable to handle kernel
paging request at ffffc900144dfc60
[ 580.807153] IP: [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
[ 580.809313] PGD 7f4c0067 [ 580.809875] PUD 7f4c1067
PMD 47df1067 [ 580.811690] PTE 0
[ 580.812998]
[ 580.814155] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 580.816139] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc[ 580.821830] oom_reaper: reaped process 10206 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 580.822492] Out of memory: Kill process 10208 (a.out) score 998 or sacrifice child
[ 580.822496] Killed process 10208 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[ 580.824895] oom_reaper: reaped process 10208 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 580.833682] ebtable_filter ebtables[ 580.834453] Out of memory: Kill process 10210 (a.out) score 998 or sacrifice child
[ 580.834458] Killed process 10210 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[ 580.839762] ip6table_mangle ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_raw iptable_filter coretemp pcspkr sg i2c_piix4 vmw_vmci shpchp ip_tables sd_mod ata_generic pata_acpi serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci e1000 mptspi libahci drm scsi_transport_spi mptscsih mptbase i2c_core ata_piix libata
[ 580.850620] CPU: 2 PID: 45 Comm: khungtaskd Tainted: G W 4.9.0-rc3+ #83
[ 580.853526] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 580.856842] task: ffff88007b54b7c0 task.stack: ffffc900004c0000
[ 580.859169] RIP: 0010:[<ffffffff81026feb>] [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
[ 580.862264] RSP: 0018:ffffc900004c3db8 EFLAGS: 00010202
[ 580.864343] RAX: ffffc900144dfc30 RBX: ffff8800438e1c00 RCX: 0000000000000000
[ 580.867439] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8800438e1c00
[ 580.869910] RBP: ffffc900004c3db8 R08: 0000000000000001 R09: 0000000000000001
[ 580.872963] R10: 0000000000000000 R11: 0000000000aaaaaa R12: 0000000000000007
[ 580.875522] R13: 000000000000028a R14: 00000000003ffa8a R15: ffff8800438e1eb8
[ 580.877387] oom_reaper: reaped process 10210 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 580.878738] Out of memory: Kill process 10212 (a.out) score 998 or sacrifice child
[ 580.878743] Killed process 10212 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[ 580.887239] FS: 0000000000000000(0000) GS:ffff88007c600000(0000) knlGS:0000000000000000
[ 580.890017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 580.892628] CR2: ffffc900144dfc60 CR3: 0000000001c0c000 CR4: 00000000001406e0
[ 580.895101] Stack:
[ 580.896443] ffffc900004c3de0 ffffffff810974c0 0000000000000000 ffff8800438e1c00
[ 580.899033] ffff8800438e1c00 ffffc900004c3e40 ffffffff8112a500 ffffffff8112a32d
[ 580.904306] 000000000000003c ffff8800438e1c00 0000000000000003 000000010003e000
[ 580.907040] Call Trace:
[ 580.908547] [<ffffffff810974c0>] sched_show_task+0x50/0x240
[ 580.911435] oom_reaper: reaped process 10212 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 580.912449] Out of memory: Kill process 10214 (a.out) score 998 or sacrifice child
[ 580.912453] Killed process 10214 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[ 580.919432] oom_reaper: reaped process 10214 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 580.920256] Out of memory: Kill process 10216 (a.out) score 998 or sacrifice child
[ 580.920259] Killed process 10216 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[ 580.928793] [<ffffffff8112a500>] watchdog+0x3d0/0x4f0
[ 580.930774] [<ffffffff8112a32d>] ? watchdog+0x1fd/0x4f0
[ 580.932785] [<ffffffff8112a130>] ? check_memalloc_stalling_tasks+0x820/0x820
[ 580.935649] [<ffffffff81089b4d>] kthread+0xfd/0x120
[ 580.937594] [<ffffffff81089a50>] ? kthread_park+0x60/0x60
[ 580.939693] [<ffffffff81089a50>] ? kthread_park+0x60/0x60
[ 580.941743] [<ffffffff816a4c57>] ret_from_fork+0x27/0x40
[ 580.944608] Code: 55 48 8b bf d0 01 00 00 be 00 00 00 02 48 89 e5 e8 6b 58 3f 00 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 8b 87 e0 15 00 00 48 89 e5 <48> 8b 40 30 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
[ 580.952519] RIP [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
[ 580.954654] RSP <ffffc900004c3db8>
[ 580.956272] CR2: ffffc900144dfc60
[ 580.957861] ---[ end trace cd024114d281cfa4 ]---
[ 580.959662] BUG: sleeping function called from invalid context at ./include/linux/sched.h:3138
[ 580.962350] in_atomic(): 0, irqs_disabled(): 1, pid: 45, name: khungtaskd
[ 580.964610] INFO: lockdep is turned off.
[ 580.966236] irq event stamp: 88
[ 580.967682] hardirqs last enabled at (87): [ 580.968588] [<ffffffff816a4075>] _raw_spin_unlock_irqrestore+0x55/0x70
[ 580.970766] hardirqs last disabled at (88): [ 580.971654] [<ffffffff8169ddb1>] __schedule+0x91/0x730
[ 580.973574] softirqs last enabled at (66): [ 580.974607] [<ffffffff8106d422>] __do_softirq+0x192/0x220
[ 580.976628] softirqs last disabled at (59): [ 580.977528] [<ffffffff8106d754>] irq_exit+0xc4/0x100
[ 580.979345] Preemption disabled at:[ 580.980073] [<ffffffff810d1a7f>] wake_up_klogd+0xf/0x70
[ 580.981951] CPU: 2 PID: 45 Comm: khungtaskd Tainted: G D W 4.9.0-rc3+ #83
[ 580.984297] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 580.987279] ffffc900004c3e50 ffffffff813372bf 0000000000000000 ffff88007b54b7c0
[ 580.989759] ffffc900004c3e88 ffffffff8108fa2c ffffffff819799f2 0000000000000c42
[ 580.992259] 0000000000000000 ffff88007b54b7c0 0000000000000000 ffffc900004c3eb0
[ 580.994701] Call Trace:
[ 580.995988] [<ffffffff813372bf>] dump_stack+0x67/0x98
[ 580.997835] [<ffffffff8108fa2c>] ___might_sleep+0x16c/0x260
[ 581.000291] [<ffffffff8108fb65>] __might_sleep+0x45/0x80
[ 581.002552] [<ffffffff8107823e>] exit_signals+0x2e/0x2f0
[ 581.004411] [<ffffffff8108b991>] ? blocking_notifier_call_chain+0x11/0x20
[ 581.006760] [<ffffffff8106bbe6>] do_exit+0xb6/0xb10
[ 581.008646] [<ffffffff816a6627>] rewind_stack_do_exit+0x17/0x20
[ 608.732005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [vmtoolsd:2075]

2016-11-01 23:47:22

by Linus Torvalds

Subject: Re: [4.9-rc3] BUG: unable to handle kernel paging request at ffffc900144dfc60

On Tue, Nov 1, 2016 at 8:36 AM, Tetsuo Handa
<[email protected]> wrote:
>
> I got an Oops with khungtaskd. This kernel was built with CONFIG_THREAD_INFO_IN_TASK=y.
> Is this the same reason?

CONFIG_THREAD_INFO_IN_TASK is always set on x86, but I assume you also
did VMAP_STACK

And yes, it looks like it's the same "touching another process' stack"
issue, just in sched_show_task() called from check_hung_task(), which
seems to have been due to a watchdog triggering. I'm not sure what the
relationship is with the oom killer happening at the same time, but it
makes the whole thing fairly hard to read.

The cleaned-up oops looks like this:

> [ 580.803660] BUG: unable to handle kernel paging request at ffffc900144dfc60
> [ 580.807153] IP: thread_saved_pc+0xb/0x20
> [ 580.907040] Call Trace:
> [ 580.908547] sched_show_task+0x50/0x240
> [ 580.928793] watchdog+0x3d0/0x4f0
> [ 580.930774] ? watchdog+0x1fd/0x4f0
> [ 580.932785] ? check_memalloc_stalling_tasks+0x820/0x820
> [ 580.935649] kthread+0xfd/0x120
> [ 580.937594] ? kthread_park+0x60/0x60
> [ 580.939693] ? kthread_park+0x60/0x60
> [ 580.941743] ret_from_fork+0x27/0x40
> [ 580.944608] Code: 55 48 8b bf d0 01 00 00 be 00 00 00 02 48 89 e5 e8 6b 58 3f 00 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 8b 87 e0 15 00 00 48 89 e5 <48> 8b 40 30 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
> [ 580.952519] RIP [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
> [ 580.954654] RSP <ffffc900004c3db8>
> [ 580.956272] CR2: ffffc900144dfc60

So we have

watchdog -> check_hung_uninterruptible_task -> check_hung_task ->
sched_show_task -> thread_saved_pc(), which oopses.

We just checked that task was TASK_UNINTERRUPTIBLE in that chain, but
clearly it races with it dying (due to oom), and so by the time we get
to thread_saved_pc() it's dead and the stack is gone.

Considering that we just print out a useless hex number, not even a
symbol, and there's a big question mark whether this even makes sense
anyway, I suspect we should just remove it all. The real information
would have come later as part of "show_stack()", which seems to be
doing the proper try_get_task_stack().

So I _think_ the fix is to just remove this. Perhaps something like
the attached? Adding scheduler people since this is in their code..

Linus

Attachments:

patch.diff (776.00 B)

2016-11-02 10:50:38

by Tetsuo Handa

Subject: Re: [4.9-rc3] BUG: unable to handle kernel paging request at ffffc900144dfc60

Linus Torvalds wrote:
> On Tue, Nov 1, 2016 at 8:36 AM, Tetsuo Handa
> <[email protected]> wrote:
> >
> > I got an Oops with khungtaskd. This kernel was built with CONFIG_THREAD_INFO_IN_TASK=y.
> > Is this the same reason?
>
> CONFIG_THREAD_INFO_IN_TASK is always set on x86, but I assume you also
> did VMAP_STACK

Yes. And I wrote a reproducer.

---------- Reproducer start ----------
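#include <unistd.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	/* The child exits at once and the parent never waits for it, so
	 * it lingers in the task list after its stack has been released. */
	if (fork() == 0)
		_exit(0);
	sleep(1);
	/* SysRq-t dumps every task via sched_show_task(); needs root. */
	system("echo t > /proc/sysrq-trigger");
	return 0;
}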
---------- Reproducer end ----------

---------- Serial console log start ----------
[ 328.528734] a.out x
[ 328.529293] BUG: unable to handle kernel
[ 328.530655] paging request at ffffc90001f43e18
[ 328.531837] IP: [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
[ 328.533512] PGD 7f4c0067
[ 328.533972] PUD 7f4c1067
[ 328.535065] PMD 74cba067
[ 328.535296] PTE 0

[ 328.537173] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 328.538698] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_raw iptable_filter coretemp pcspkr sg i2c_piix4 shpchp vmw_vmci ip_tables sd_mod ata_generic pata_acpi serio_raw mptspi vmwgfx scsi_transport_spi drm_kms_helper ahci syscopyarea sysfillrect sysimgblt mptscsih e1000 fb_sys_fops libahci ttm drm mptbase ata_piix i2c_core libata
[ 328.552465] CPU: 0 PID: 4299 Comm: sh Tainted: G W 4.9.0-rc3+ #83
[ 328.554403] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 328.556939] task: ffff8800792b5380 task.stack: ffffc90001f58000
[ 328.558686] RIP: 0010:[<ffffffff81026feb>] [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
[ 328.560926] RSP: 0018:ffffc90001f5bd28 EFLAGS: 00010202
[ 328.562603] RAX: ffffc90001f43de8 RBX: ffff88007826d380 RCX: 0000000000000006
[ 328.564507] RDX: 0000000000000000 RSI: ffffffff8197f2d1 RDI: ffff88007826d380
[ 328.566437] RBP: ffffc90001f5bd28 R08: 0000000000000001 R09: 0000000000000001
[ 328.568354] R10: 0000000000000001 R11: 0000000000000004 R12: 0000000000000007
[ 328.570266] R13: ffff88007826d638 R14: ffff88007826d380 R15: 0000000000000002
[ 328.572197] FS: 00007ff7b501e740(0000) GS:ffff88007c200000(0000) knlGS:0000000000000000
[ 328.574303] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 328.576006] CR2: ffffc90001f43e18 CR3: 000000007894c000 CR4: 00000000001406f0
[ 328.577995] Stack:
[ 328.579024] ffffc90001f5bd50 ffffffff810974c0 ffffc90001f5bd50 ffff88007826d380
[ 328.581219] 0000000000000000 ffffc90001f5bd88 ffffffff81097767 ffffffff810976b0
[ 328.583300] ffffffff81c74e60 0000000000000074 0000000000000000 0000000000000007
[ 328.585404] Call Trace:
[ 328.586531] [<ffffffff810974c0>] sched_show_task+0x50/0x240
[ 328.588184] [<ffffffff81097767>] show_state_filter+0xb7/0x190
[ 328.589860] [<ffffffff810976b0>] ? sched_show_task+0x240/0x240
[ 328.591553] [<ffffffff813fd4fb>] sysrq_handle_showstate+0xb/0x20
[ 328.593304] [<ffffffff813fdce6>] __handle_sysrq+0x136/0x220
[ 328.594992] [<ffffffff813fdbb0>] ? __sysrq_get_key_op+0x30/0x30
[ 328.596678] [<ffffffff813fe1f1>] write_sysrq_trigger+0x41/0x50
[ 328.598386] [<ffffffff81249c88>] proc_reg_write+0x38/0x70
[ 328.600038] [<ffffffff811dc802>] __vfs_write+0x32/0x140
[ 328.601604] [<ffffffff810dc797>] ? rcu_read_lock_sched_held+0x87/0x90
[ 328.603365] [<ffffffff810dcb2a>] ? rcu_sync_lockdep_assert+0x2a/0x50
[ 328.605111] [<ffffffff811e0279>] ? __sb_start_write+0x189/0x240
[ 328.606735] [<ffffffff811dd642>] ? vfs_write+0x182/0x1b0
[ 328.608278] [<ffffffff811dd570>] vfs_write+0xb0/0x1b0
[ 328.609777] [<ffffffff81002240>] ? syscall_trace_enter+0x1b0/0x240
[ 328.611513] [<ffffffff811dea13>] SyS_write+0x53/0xc0
[ 328.612989] [<ffffffff81353b63>] ? __this_cpu_preempt_check+0x13/0x20
[ 328.614757] [<ffffffff81002511>] do_syscall_64+0x61/0x1d0
[ 328.616329] [<ffffffff816a4aa4>] entry_SYSCALL64_slow_path+0x25/0x25
[ 328.618057] Code: 55 48 8b bf d0 01 00 00 be 00 00 00 02 48 89 e5 e8 6b 58 3f 00 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 8b 87 e0 15 00 00 48 89 e5 <48> 8b 40 30 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
[ 328.624402] RIP [<ffffffff81026feb>] thread_saved_pc+0xb/0x20
[ 328.626124] RSP <ffffc90001f5bd28>
[ 328.627375] CR2: ffffc90001f43e18
[ 328.628646] ---[ end trace 70b31f25a2ce0c0c ]---
---------- Serial console log end ----------

> Considering that we just print out a useless hex number, not even a
> symbol, and there's a big question mark whether this even makes sense
> anyway, I suspect we should just remove it all. The real information
> would have come later as part of "show_stack()", which seems to be
> doing the proper try_get_task_stack().
>
> So I _think_ the fix is to just remove this. Perhaps something like
> the attached? Adding scheduler people since this is in their code..

That is not sufficient, for another Oops occurs inside stack_not_used().
Since I don't want to break stack_not_used(), can we tolerate nested
try_get_task_stack() usage and protect the whole sched_show_task()?

----------------------------------------
From 9cf83a0a8c48d281434b040694835743940a88b2 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <[email protected]>
Date: Wed, 2 Nov 2016 19:31:07 +0900
Subject: [PATCH] sched: Fix oops in sched_show_task()

When CONFIG_VMAP_STACK=y, it is possible that an exited thread remains in
the task list after its stack pointer was already set to NULL. Therefore,
thread_saved_pc() and stack_not_used() in sched_show_task() will trigger
NULL pointer dereference if an attempt to dump such thread's traces
(e.g. SysRq-t, khungtaskd) is made.

Since show_stack() in sched_show_task() calls try_get_task_stack() and
sched_show_task() is called from interrupt context, calling
try_get_task_stack() from sched_show_task() will be safe as well.

Signed-off-by: Tetsuo Handa <[email protected]>
---
kernel/sched/core.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42d4027..9abf66b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5192,6 +5192,8 @@ void sched_show_task(struct task_struct *p)
 	int ppid;
 	unsigned long state = p->state;
 
+	if (!try_get_task_stack(p))
+		return;
 	if (state)
 		state = __ffs(state) + 1;
 	printk(KERN_INFO "%-15.15s %c", p->comm,
@@ -5221,6 +5223,7 @@ void sched_show_task(struct task_struct *p)
 
 	print_worker_info(KERN_INFO, p);
 	show_stack(p, NULL);
+	put_task_stack(p);
 }
 
 void show_state_filter(unsigned long state_filter)
--
1.8.3.1

2016-11-02 14:06:00

by Andy Lutomirski

Subject: Re: [4.9-rc3] BUG: unable to handle kernel paging request at ffffc900144dfc60

On Wed, Nov 2, 2016 at 3:50 AM, Tetsuo Handa
<[email protected]> wrote:
> Linus Torvalds wrote:
>> On Tue, Nov 1, 2016 at 8:36 AM, Tetsuo Handa
>> <[email protected]> wrote:
>> >
>> > I got an Oops with khungtaskd. This kernel was built with CONFIG_THREAD_INFO_IN_TASK=y.
>> > Is this the same reason?
>>
>> CONFIG_THREAD_INFO_IN_TASK is always set on x86, but I assume you also
>> did VMAP_STACK
>
> Yes. And I wrote a reproducer.
>
>> Considering that we just print out a useless hex number, not even a
>> symbol, and there's a big question mark whether this even makes sense
>> anyway, I suspect we should just remove it all. The real information
>> would have come later as part of "show_stack()", which seems to be
>> doing the proper try_get_task_stack().
>>
>> So I _think_ the fix is to just remove this. Perhaps something like
>> the attached? Adding scheduler people since this is in their code..
>
> That is not sufficient, for another Oops occurs inside stack_not_used().
> Since I don't want to break stack_not_used(), can we tolerate nested
> try_get_task_stack() usage and protect the whole sched_show_task()?
>
> ----------------------------------------
> From 9cf83a0a8c48d281434b040694835743940a88b2 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <[email protected]>
> Date: Wed, 2 Nov 2016 19:31:07 +0900
> Subject: [PATCH] sched: Fix oops in sched_show_task()
>
> When CONFIG_VMAP_STACK=y, it is possible that an exited thread remains in

Nit: It's CONFIG_THREAD_INFO_IN_TASK=y that does this.

This patch looks fine to me. Linus, your patch also looks almost good
(I think the lines you deleted were spaced like that to preserve
output alignment, which may or may not matter), and maybe it would
make sense to apply both.

2016-11-02 14:54:48

by Linus Torvalds

Subject: Re: [4.9-rc3] BUG: unable to handle kernel paging request at ffffc900144dfc60

On Wed, Nov 2, 2016 at 4:50 AM, Tetsuo Handa
<[email protected]> wrote:
>>
>> So I _think_ the fix is to just remove this. Perhaps something like
>> the attached? Adding scheduler people since this is in their code..
>
> That is not sufficient, for another Oops occurs inside stack_not_used().
> Since I don't want to break stack_not_used(), can we tolerate nested
> try_get_task_stack() usage and protect the whole sched_show_task()?

Sure, looks good to me.

But as Andy says, we should probably apply my patch too, just to get
rid of the useless hex printk and the ugly ifdeffery.

PeterZ/Ingo?

Linus

2016-11-03 06:32:29

by Ingo Molnar

Subject: Re: [4.9-rc3] BUG: unable to handle kernel paging request at ffffc900144dfc60

* Linus Torvalds <[email protected]> wrote:

> On Wed, Nov 2, 2016 at 4:50 AM, Tetsuo Handa
> <[email protected]> wrote:
> >>
> >> So I _think_ the fix is to just remove this. Perhaps something like
> >> the attached? Adding scheduler people since this is in their code..
> >
> > That is not sufficient, for another Oops occurs inside stack_not_used().
> > Since I don't want to break stack_not_used(), can we tolerate nested
> > try_get_task_stack() usage and protect the whole sched_show_task()?
>
> Sure, looks good to me.
>
> But as Andy says, we should probably apply my patch too, just to get
> rid of the useless hex printk and the ugly ifdeffery.

Agreed, and I have applied both fixes to sched/urgent.

Thanks,

Ingo

2016-11-03 07:10:54

by tip-bot for Linus Torvalds

Subject: [tip:sched/urgent] sched/core: Remove pointless printout in sched_show_task()

Commit-ID: 8243d5597793b5e85143c9a935e1b971c59740a9
Gitweb: http://git.kernel.org/tip/8243d5597793b5e85143c9a935e1b971c59740a9
Author: Linus Torvalds <[email protected]>
AuthorDate: Tue, 1 Nov 2016 17:47:18 -0600
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 3 Nov 2016 07:31:34 +0100

sched/core: Remove pointless printout in sched_show_task()

In sched_show_task() we print out a useless hex number, not even a
symbol, and there's a big question mark whether this even makes sense
anyway, I suspect we should just remove it all.

Signed-off-by: Linus Torvalds <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/CA+55aFzphURPFzAvU4z6Moy7ZmimcwPuUdYU8bj9z0J+S8X1rw@mail.gmail.com
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 9 ---------
1 file changed, 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9abf66b..154fd68 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5198,17 +5198,8 @@ void sched_show_task(struct task_struct *p)
 		state = __ffs(state) + 1;
 	printk(KERN_INFO "%-15.15s %c", p->comm,
 		state < sizeof(stat_nam) - 1 ? stat_nam[state] : '?');
-#if BITS_PER_LONG == 32
-	if (state == TASK_RUNNING)
-		printk(KERN_CONT " running ");
-	else
-		printk(KERN_CONT " %08lx ", thread_saved_pc(p));
-#else
 	if (state == TASK_RUNNING)
 		printk(KERN_CONT " running task ");
-	else
-		printk(KERN_CONT " %016lx ", thread_saved_pc(p));
-#endif
 #ifdef CONFIG_DEBUG_STACK_USAGE
 	free = stack_not_used(p);
 #endif

2016-11-03 07:10:28

by tip-bot for Tetsuo Handa

Subject: [tip:sched/urgent] sched/core: Fix oops in sched_show_task()

Commit-ID: 382005027fedc50b28d40ae64ef1461cca38953e
Gitweb: http://git.kernel.org/tip/382005027fedc50b28d40ae64ef1461cca38953e
Author: Tetsuo Handa <[email protected]>
AuthorDate: Wed, 2 Nov 2016 19:50:29 +0900
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 3 Nov 2016 07:27:34 +0100

sched/core: Fix oops in sched_show_task()

When CONFIG_THREAD_INFO_IN_TASK=y, it is possible that an exited thread
remains in the task list after its stack pointer was already set to NULL.

Therefore, thread_saved_pc() and stack_not_used() in sched_show_task()
will trigger NULL pointer dereference if an attempt to dump such thread's
traces (e.g. SysRq-t, khungtaskd) is made.

Since show_stack() in sched_show_task() calls try_get_task_stack() and
sched_show_task() is called from interrupt context, calling
try_get_task_stack() from sched_show_task() will be safe as well.

Signed-off-by: Tetsuo Handa <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42d4027..9abf66b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5192,6 +5192,8 @@ void sched_show_task(struct task_struct *p)
 	int ppid;
 	unsigned long state = p->state;
 
+	if (!try_get_task_stack(p))
+		return;
 	if (state)
 		state = __ffs(state) + 1;
 	printk(KERN_INFO "%-15.15s %c", p->comm,
@@ -5221,6 +5223,7 @@ void sched_show_task(struct task_struct *p)
 
 	print_worker_info(KERN_INFO, p);
 	show_stack(p, NULL);
+	put_task_stack(p);
 }
 
 void show_state_filter(unsigned long state_filter)