2020-07-17 10:48:24

by Jiri Slaby

Subject: 5.8-rc*: kernel BUG at kernel/signal.c:1917

Hi,

the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64
and i586:
> kernel BUG at kernel/signal.c:1917!
> invalid opcode: 0000 [#1] SMP NOPTI
> CPU: 7 PID: 18367 Comm: filter-unavaila Not tainted 5.8.0-rc4-3.g2cd7849-default #1 openSUSE Tumbleweed (unreleased)
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
> RIP: 0010:do_notify_parent+0x239/0x270
> Code: d1 e8 41 89 c5 41 83 e5 01 48 8b 95 28 05 00 00 45 31 c0 4c 89 f6 44 89 e7 b9 01 00 00 00 e8 ce f2 ff ff e9 5b ff ff ff 0f 0b <0f> 0b 48 39 bf 50 05 00 00 75 17 48 8b 97 c0 05 00 00 48 8d 87 c0
> RSP: 0018:ffffa05400f93d00 EFLAGS: 00010002
> RAX: 0000000000000008 RBX: ffff9024ebb6dd00 RCX: 00000000000102f9
> RDX: 0000000000000000 RSI: 0000000000000011 RDI: ffff9024ebb6dd00
> RBP: ffff9024ebb6dd00 R08: 0000000000000000 R09: 000000000000000d
> R10: 00007f7813000000 R11: ffffc2690721ae80 R12: 0000000000000000
> R13: 0000000000000000 R14: 0000000000000009 R15: 0000000000000000
> FS: 00007f7812ffc700(0000) GS:ffff902537dc0000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f9c246a4a10 CR3: 00000001eba7c000 CR4: 00000000000406e0
> Call Trace:
> exit_notify+0x1c0/0x1f0
> do_exit+0x234/0x430
> do_group_exit+0x33/0xa0
> get_signal+0x171/0x640
> do_signal+0x2a/0x1c0
> exit_to_usermode_loop+0x89/0x110
> __prepare_exit_to_usermode+0x92/0xd0
> do_syscall_64+0x59/0xd0
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7f78149db992
> Code: Bad RIP value.
> RSP: 002b:00007f7812ffbed0 EFLAGS: 00000293 ORIG_RAX: 0000000000000022
> RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f78149db992
> RDX: 0000000000000001 RSI: 0000558a4e07e033 RDI: 0000000000000000
> RBP: 0000000000000000 R08: 0000000000000000 R09: 00007f7812ffc700
> R10: 00007f7812ffc9d0 R11: 0000000000000293 R12: 0000000000000000
> R13: 00007ffe4853b4bf R14: 00007ffe4853b4c0 R15: 00007f7812ffc700
> Modules linked in: ata_generic ata_piix qemu_fw_cfg e1000 nls_iso8859_1 nls_cp437 vfat fat virtio_blk virtio_mmio xfs btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_intel reiserfs squashfs fuse dm_snapshot dm_bufio dm_crypt dm_mod binfmt_misc loop sg virtio_rng
> ---[ end trace 364c3385f192fee3 ]---

Any ideas?

thanks,
--
js
suse labs
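
(Aside: to turn offsets like do_notify_parent+0x239/0x270 into file:line, a report like the one above can be piped through the kernel's scripts/decode_stacktrace.sh, roughly:

	./scripts/decode_stacktrace.sh vmlinux < oops.txt > oops-decoded.txt

where vmlinux is the unstripped image the crashing kernel was built from; a decoded trace for this bug appears later in the thread.)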


2020-07-17 11:08:55

by Jiri Slaby

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On 17. 07. 20, 12:45, Jiri Slaby wrote:
> Hi,
>
> the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64
> and i586:

make check needs -j<something>; running it sequentially (-j1) doesn't
trigger it. After the error, I cannot run anything, not even ps to find
out which test caused the crash...

5.7 was fine.

>> kernel BUG at kernel/signal.c:1917!
>> invalid opcode: 0000 [#1] SMP NOPTI
>> CPU: 7 PID: 18367 Comm: filter-unavaila Not tainted 5.8.0-rc4-3.g2cd7849-default #1 openSUSE Tumbleweed (unreleased)
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
>> RIP: 0010:do_notify_parent+0x239/0x270
>> Code: d1 e8 41 89 c5 41 83 e5 01 48 8b 95 28 05 00 00 45 31 c0 4c 89 f6 44 89 e7 b9 01 00 00 00 e8 ce f2 ff ff e9 5b ff ff ff 0f 0b <0f> 0b 48 39 bf 50 05 00 00 75 17 48 8b 97 c0 05 00 00 48 8d 87 c0
>> RSP: 0018:ffffa05400f93d00 EFLAGS: 00010002
>> RAX: 0000000000000008 RBX: ffff9024ebb6dd00 RCX: 00000000000102f9
>> RDX: 0000000000000000 RSI: 0000000000000011 RDI: ffff9024ebb6dd00
>> RBP: ffff9024ebb6dd00 R08: 0000000000000000 R09: 000000000000000d
>> R10: 00007f7813000000 R11: ffffc2690721ae80 R12: 0000000000000000
>> R13: 0000000000000000 R14: 0000000000000009 R15: 0000000000000000
>> FS: 00007f7812ffc700(0000) GS:ffff902537dc0000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007f9c246a4a10 CR3: 00000001eba7c000 CR4: 00000000000406e0
>> Call Trace:
>> exit_notify+0x1c0/0x1f0
>> do_exit+0x234/0x430
>> do_group_exit+0x33/0xa0
>> get_signal+0x171/0x640
>> do_signal+0x2a/0x1c0
>> exit_to_usermode_loop+0x89/0x110
>> __prepare_exit_to_usermode+0x92/0xd0
>> do_syscall_64+0x59/0xd0
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> RIP: 0033:0x7f78149db992
>> Code: Bad RIP value.
>> RSP: 002b:00007f7812ffbed0 EFLAGS: 00000293 ORIG_RAX: 0000000000000022
>> RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007f78149db992
>> RDX: 0000000000000001 RSI: 0000558a4e07e033 RDI: 0000000000000000
>> RBP: 0000000000000000 R08: 0000000000000000 R09: 00007f7812ffc700
>> R10: 00007f7812ffc9d0 R11: 0000000000000293 R12: 0000000000000000
>> R13: 00007ffe4853b4bf R14: 00007ffe4853b4c0 R15: 00007f7812ffc700
>> Modules linked in: ata_generic ata_piix qemu_fw_cfg e1000 nls_iso8859_1 nls_cp437 vfat fat virtio_blk virtio_mmio xfs btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_intel reiserfs squashfs fuse dm_snapshot dm_bufio dm_crypt dm_mod binfmt_misc loop sg virtio_rng
>> ---[ end trace 364c3385f192fee3 ]---
>
> Any ideas?
>
> thanks,
>


--
js

2020-07-17 11:14:49

by Christian Brauner

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On Fri, Jul 17, 2020 at 01:04:38PM +0200, Jiri Slaby wrote:
> On 17. 07. 20, 12:45, Jiri Slaby wrote:
> > Hi,
> >
> > the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64
> > and i586:
>
> make check needs -j<something>; running it sequentially (-j1) doesn't
> trigger it. After the error, I cannot run anything, not even ps to find
> out which test caused the crash...
>
> 5.7 was fine.

Can you try and reproduce with

commit e91b48162332480f5840902268108bb7fb7a44c7
Author: Oleg Nesterov <[email protected]>
Date: Tue Jun 30 17:32:54 2020 +0200

task_work: teach task_work_add() to do signal_wake_up()

reverted, please?

Thanks!
Christian
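
(For reference, testing with that commit reverted would look roughly like this; the branch and tag names are illustrative:

	git checkout -b test-revert v5.8-rc5
	git revert e91b48162332480f5840902268108bb7fb7a44c7
	# rebuild, boot, re-run the strace testsuite with make check -jN

As it turns out below, Jiri ends up reverting three commits in total, since two follow-up commits depend on this one.)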

2020-07-17 12:28:03

by Oleg Nesterov

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On 07/17, Jiri Slaby wrote:
>
> On 17. 07. 20, 12:45, Jiri Slaby wrote:
> > Hi,
> >
> > the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64
> > and i586:
>
> make check needs -j<something>; running it sequentially (-j1) doesn't
> trigger it. After the error, I cannot run anything, not even ps to find
> out which test caused the crash...

Strange... I'll try to reproduce but I can't do this till Monday.

Meanwhile, could you try the patch below? It needs CONFIG_DEBUG_ATOMIC_SLEEP.
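
(Context: tsk->task_state_change, which the WARN below prints, only exists when CONFIG_DEBUG_ATOMIC_SLEEP is set; with that option the state setters in include/linux/sched.h record their caller, roughly like this, quoted from memory and so only a sketch:

	#define __set_current_state(state_value)				\
		do {								\
			WARN_ON_ONCE(is_special_task_state(state_value));	\
			current->task_state_change = _THIS_IP_;			\
			current->state = (state_value);				\
		} while (0)

so the warning can report where the offending ->state was last set.)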

Oleg.

diff --git a/kernel/signal.c b/kernel/signal.c
index ee22ec78fd6d..84782843eb1c 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1914,7 +1914,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	BUG_ON(sig == -1);
 
 	/* do_notify_parent_cldstop should have been called instead. */
-	BUG_ON(task_is_stopped_or_traced(tsk));
+	WARN(task_is_stopped_or_traced(tsk),
+	     "exits with state=%ld set at %pS\n",
+	     tsk->state, (void *)tsk->task_state_change);
 
 	BUG_ON(!tsk->ptrace &&
 	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
@@ -2214,6 +2216,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
 		preempt_enable_no_resched();
 		freezable_schedule();
 		cgroup_leave_frozen(true);
+		WARN_ON(current->state);
 	} else {
 		/*
 		 * By the time we got the lock, our tracer went away.
@@ -2393,6 +2396,7 @@ static bool do_signal_stop(int signr)
 		/* Now we don't run again until woken by SIGCONT or SIGKILL */
 		cgroup_enter_frozen();
 		freezable_schedule();
+		WARN_ON(current->state);
 		return true;
 	} else {
 		/*

2020-07-17 12:43:04

by Oleg Nesterov

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On 07/17, Oleg Nesterov wrote:
>
> On 07/17, Jiri Slaby wrote:
> >
> > On 17. 07. 20, 12:45, Jiri Slaby wrote:
> > > Hi,
> > >
> > > the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64
> > > and i586:
> >
> > make check needs -j<something>; running it sequentially (-j1) doesn't
> > trigger it. After the error, I cannot run anything, not even ps to find
> > out which test caused the crash...
>
> Strange... I'll try to reproduce but I can't do this till Monday.
>
> Meanwhile, could you try the patch below? It needs CONFIG_DEBUG_ATOMIC_SLEEP.

Please see the updated patch below; let's check ptrace_unfreeze() too.

Oleg.

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179508d6..71c76bc7dec6 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -180,6 +180,7 @@ static bool ptrace_freeze_traced(struct task_struct *task)
 
 	spin_lock_irq(&task->sighand->siglock);
 	if (task_is_traced(task) && !__fatal_signal_pending(task)) {
+		task->task_state_change = _THIS_IP_;
 		task->state = __TASK_TRACED;
 		ret = true;
 	}
@@ -203,8 +204,10 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
 	if (task->state == __TASK_TRACED) {
 		if (__fatal_signal_pending(task))
 			wake_up_state(task, __TASK_TRACED);
-		else
+		else {
+			task->task_state_change = _THIS_IP_;
 			task->state = TASK_TRACED;
+		}
 	}
 	spin_unlock_irq(&task->sighand->siglock);
 }
diff --git a/kernel/signal.c b/kernel/signal.c
index ee22ec78fd6d..ba4c4c9ed9b5 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1914,7 +1914,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	BUG_ON(sig == -1);
 
 	/* do_notify_parent_cldstop should have been called instead. */
-	BUG_ON(task_is_stopped_or_traced(tsk));
+	WARN(task_is_stopped_or_traced(tsk),
+	     "exits with state=%ld set at %pS\n",
+	     tsk->state, (void *)tsk->task_state_change);
 
 	BUG_ON(!tsk->ptrace &&
 	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
@@ -2214,6 +2216,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
 		preempt_enable_no_resched();
 		freezable_schedule();
 		cgroup_leave_frozen(true);
+		WARN_ON(current->state);
 	} else {
 		/*
 		 * By the time we got the lock, our tracer went away.
@@ -2393,6 +2396,7 @@ static bool do_signal_stop(int signr)
 		/* Now we don't run again until woken by SIGCONT or SIGKILL */
 		cgroup_enter_frozen();
 		freezable_schedule();
+		WARN_ON(current->state);
 		return true;
 	} else {
 		/*

2020-07-18 12:33:15

by Jiri Slaby

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On 17. 07. 20, 14:40, Oleg Nesterov wrote:
> On 07/17, Oleg Nesterov wrote:
>>
>> On 07/17, Jiri Slaby wrote:
>>>
>>> On 17. 07. 20, 12:45, Jiri Slaby wrote:
>>>> Hi,
>>>>
>>>> the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64
>>>> and i586:
>>>
>>> make check needs -j<something>; running it sequentially (-j1) doesn't
>>> trigger it. After the error, I cannot run anything, not even ps to find
>>> out which test caused the crash...
>>
>> Strange... I'll try to reproduce but I can't do this till Monday.
>>
>> Meanwhile, could you try the patch below? It needs CONFIG_DEBUG_ATOMIC_SLEEP.
>
> Please see the updated patch below; let's check ptrace_unfreeze() too.

Sure, dmesg attached.

thanks,
--
js


Attachments:
dmesg.txt (117.09 kB)

2020-07-18 13:08:31

by Jiri Slaby

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On 17. 07. 20, 13:12, Christian Brauner wrote:
> On Fri, Jul 17, 2020 at 01:04:38PM +0200, Jiri Slaby wrote:
>> On 17. 07. 20, 12:45, Jiri Slaby wrote:
>>> Hi,
>>>
>>> the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64
>>> and i586:
>>
>> make check needs -j<something>; running it sequentially (-j1) doesn't
>> trigger it. After the error, I cannot run anything, not even ps to find
>> out which test caused the crash...
>>
>> 5.7 was fine.
>
> Can you try and reproduce with
>
> commit e91b48162332480f5840902268108bb7fb7a44c7
> Author: Oleg Nesterov <[email protected]>
> Date: Tue Jun 30 17:32:54 2020 +0200
>
> task_work: teach task_work_add() to do signal_wake_up()
>
> reverted, please?

I reverted these 3:
+This reverts commit b7db41c9e03b5189bc94993bd50e4506ac9e34c1.
+This reverts commit ce593a6c480a22acba08795be313c0c6d49dd35d.
+This reverts commit e91b48162332480f5840902268108bb7fb7a44c7.

But it still crashes.

thanks,
--
js

2020-07-18 17:36:09

by Oleg Nesterov

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On 07/18, Jiri Slaby wrote:
>
> On 17. 07. 20, 14:40, Oleg Nesterov wrote:
> >
> > Please see the updated patch below; let's check ptrace_unfreeze() too.
>
> Sure, dmesg attached.

Thanks a lot!

But I am totally confused...

> [ 94.513944] ------------[ cut here ]------------
> [ 94.513985] do not call blocking ops when !TASK_RUNNING; state=8 set at [<000000002fe279e9>] ptrace_check_attach+0xbf/0x110

OK, so the ptracer has already done the TASK_TRACED -> __TASK_TRACED change
in ptrace_freeze_traced().

> [ 94.514019] WARNING: CPU: 16 PID: 34171 at kernel/sched/core.c:6881 __might_sleep+0x6c/0x70
> [ 94.514020] Modules linked in: ata_generic(E) pata_acpi(E) crc32_pclmul(E) qemu_fw_cfg(E) ata_piix(E) e1000(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) xfs(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_intel(E) reiserfs(E) squashfs(E) fuse(E) dm_snapshot(E) dm_bufio(E) dm_crypt(E) dm_mod(E) binfmt_misc(E) loop(E) sg(E) virtio_rng(E)
> [ 94.514082] CPU: 16 PID: 34171 Comm: strace Tainted: G E 5.8.0-rc5-100.g55927f9-default #1 openSUSE Tumbleweed (unreleased)
> [ 94.514084] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014
> [ 94.514087] RIP: 0010:__might_sleep+0x6c/0x70
> [ 94.514090] Code: 41 5c 41 5d e9 25 ff ff ff 48 8b 90 68 1a 00 00 48 8b 70 10 48 c7 c7 10 45 70 8f c6 05 4f a9 68 01 01 48 89 d1 e8 7a bb fc ff <0f> 0b eb c8 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 fd
> [ 94.514092] RSP: 0018:ffff9ffa4ba1be00 EFLAGS: 00010286
> [ 94.514093] RAX: 0000000000000000 RBX: ffff8dc82b503e00 RCX: 0000000000000489
> [ 94.514094] RDX: 0000000000000001 RSI: 0000000000000096 RDI: 0000000000000247
> [ 94.514095] RBP: ffffffff8f6ffa6b R08: 0000000000000004 R09: 0000000000000489
> [ 94.514095] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000039
> [ 94.514096] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8dc82b5045e4
> [ 94.514098] FS: 00007fa00f1f9240(0000) GS:ffff8dcb0c000000(0000) knlGS:0000000000000000
> [ 94.514099] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 94.514100] CR2: 0000557b53d25877 CR3: 00000004ca490005 CR4: 0000000000360ee0
> [ 94.514103] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 94.514104] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 94.514105] Call Trace:
> [ 94.514821] ptrace_stop+0x1a9/0x300

This is already wrong. But...

Where does this __might_sleep() come from??? I see no blocking calls
in ptrace_stop(). Not to mention it is called with ->siglock held and
right after this lock is dropped we take tasklist_lock.

How does this connect to the debugging patch I sent? Did you see this warning
without that patch?


> [ 94.514888] WARNING: CPU: 16 PID: 34171 at kernel/signal.c:2219 ptrace_stop+0x1d8/0x300
...
> [ 94.514899] RIP: 0010:ptrace_stop+0x1d8/0x300

This is the WARN_ON(current->state) added to ptrace_stop(); this can explain
BUG_ON() in do_notify_parent() you reported.

So, the tracee returns from schedule() with ->state != TASK_RUNNING ???
This must not be possible.

OK, perhaps task->state was changed by ptrace_unfreeze_traced()? This can
only happen if it races with ttwu(__TASK_TRACED) without ->siglock held;
nobody should do this.

Strange.

Oleg.

2020-07-18 17:47:39

by Christian Brauner

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On Sat, Jul 18, 2020 at 07:14:07PM +0200, Oleg Nesterov wrote:
> On 07/18, Jiri Slaby wrote:
> >
> > On 17. 07. 20, 14:40, Oleg Nesterov wrote:
> > >
> > > Please see the updated patch below; let's check ptrace_unfreeze() too.
> >
> > Sure, dmesg attached.
>
> Thanks a lot!
>
> But I am totally confused...
>
> > [ 94.513944] ------------[ cut here ]------------
> > [ 94.513985] do not call blocking ops when !TASK_RUNNING; state=8 set at [<000000002fe279e9>] ptrace_check_attach+0xbf/0x110
>
> OK, so the ptracer has already done the TASK_TRACED -> __TASK_TRACED change
> in ptrace_freeze_traced().
>
> > [ 94.514019] WARNING: CPU: 16 PID: 34171 at kernel/sched/core.c:6881 __might_sleep+0x6c/0x70
> > [ 94.514020] Modules linked in: ata_generic(E) pata_acpi(E) crc32_pclmul(E) qemu_fw_cfg(E) ata_piix(E) e1000(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) xfs(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_intel(E) reiserfs(E) squashfs(E) fuse(E) dm_snapshot(E) dm_bufio(E) dm_crypt(E) dm_mod(E) binfmt_misc(E) loop(E) sg(E) virtio_rng(E)
> > [ 94.514082] CPU: 16 PID: 34171 Comm: strace Tainted: G E 5.8.0-rc5-100.g55927f9-default #1 openSUSE Tumbleweed (unreleased)
> > [ 94.514084] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014
> > [ 94.514087] RIP: 0010:__might_sleep+0x6c/0x70
> > [ 94.514090] Code: 41 5c 41 5d e9 25 ff ff ff 48 8b 90 68 1a 00 00 48 8b 70 10 48 c7 c7 10 45 70 8f c6 05 4f a9 68 01 01 48 89 d1 e8 7a bb fc ff <0f> 0b eb c8 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 fd
> > [ 94.514092] RSP: 0018:ffff9ffa4ba1be00 EFLAGS: 00010286
> > [ 94.514093] RAX: 0000000000000000 RBX: ffff8dc82b503e00 RCX: 0000000000000489
> > [ 94.514094] RDX: 0000000000000001 RSI: 0000000000000096 RDI: 0000000000000247
> > [ 94.514095] RBP: ffffffff8f6ffa6b R08: 0000000000000004 R09: 0000000000000489
> > [ 94.514095] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000039
> > [ 94.514096] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8dc82b5045e4
> > [ 94.514098] FS: 00007fa00f1f9240(0000) GS:ffff8dcb0c000000(0000) knlGS:0000000000000000
> > [ 94.514099] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 94.514100] CR2: 0000557b53d25877 CR3: 00000004ca490005 CR4: 0000000000360ee0
> > [ 94.514103] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 94.514104] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [ 94.514105] Call Trace:
> > [ 94.514821] ptrace_stop+0x1a9/0x300
>
> This is already wrong. But...
>
> Where does this __might_sleep() come from??? I see no blocking calls
> in ptrace_stop(). Not to mention it is called with ->siglock held and
> right after this lock is dropped we take tasklist_lock.
>
> How does this connect to the debugging patch I sent? Did you see this warning
> without that patch?
>
>
> > [ 94.514888] WARNING: CPU: 16 PID: 34171 at kernel/signal.c:2219 ptrace_stop+0x1d8/0x300
> ...
> > [ 94.514899] RIP: 0010:ptrace_stop+0x1d8/0x300
>
> This is the WARN_ON(current->state) added to ptrace_stop(); this can explain
> BUG_ON() in do_notify_parent() you reported.
>
> So, the tracee returns from schedule() with ->state != TASK_RUNNING ???
> This must not be possible.
>
> OK, perhaps task->state was changed by ptrace_unfreeze_traced()? This can
> only happen if it races with ttwu(__TASK_TRACED) without ->siglock held;
> nobody should do this.
>
> Strange.

I have tried to reproduce this with a vanilla upstream 5.8-rc4 and the
strace test-suite with

make check -j4

and I wasn't able to after multiple retries. Jiri, just to make sure:
is this upstream 5.8-rc4 without any additional patches?
Anything special required to reproduce this in the way you run strace
and so on?

Christian

2020-07-19 18:24:31

by Oleg Nesterov

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

Hi Hillf,

On 07/19, Hillf Danton wrote:
>
> Dunno if the wheel prior to JOBCTL_TASK_WORK helps debug the warnings.
>
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2541,7 +2541,7 @@ bool get_signal(struct ksignal *ksig)
>
> relock:
> 	spin_lock_irq(&sighand->siglock);
> -	current->jobctl &= ~JOBCTL_TASK_WORK;
> +	task_clear_jobctl_pending(current, JOBCTL_TASK_WORK);
> 	if (unlikely(current->task_works)) {
> 		spin_unlock_irq(&sighand->siglock);
> 		task_work_run();
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -43,8 +43,8 @@ task_work_add(struct task_struct *task,
> 		break;
> 	case TWA_SIGNAL:
> 		if (lock_task_sighand(task, &flags)) {
> -			task->jobctl |= JOBCTL_TASK_WORK;
> -			signal_wake_up(task, 0);
> +			if (task_set_jobctl_pending(task, JOBCTL_TASK_WORK))
> +				recalc_sigpending_and_wake(task);
> 			unlock_task_sighand(task, &flags);

To be honest, I don't understand why you think this makes any sense...

But this doesn't matter; please note that Jiri tested the kernel with
e91b48162332480f5 (which added JOBCTL_TASK_WORK) reverted.

Oleg.

2020-07-20 05:45:55

by Jiri Slaby

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On 18. 07. 20, 19:44, Christian Brauner wrote:
> On Sat, Jul 18, 2020 at 07:14:07PM +0200, Oleg Nesterov wrote:
>> On 07/18, Jiri Slaby wrote:
>>>
>>> On 17. 07. 20, 14:40, Oleg Nesterov wrote:
>>>>
>>>> Please see the updated patch below; let's check ptrace_unfreeze() too.
>>>
>>> Sure, dmesg attached.
>>
>> Thanks a lot!
>>
>> But I am totally confused...
>>
>>> [ 94.513944] ------------[ cut here ]------------
>>> [ 94.513985] do not call blocking ops when !TASK_RUNNING; state=8 set at [<000000002fe279e9>] ptrace_check_attach+0xbf/0x110
>>
>> OK, so the ptracer has already done the TASK_TRACED -> __TASK_TRACED change
>> in ptrace_freeze_traced().
>>
>>> [ 94.514019] WARNING: CPU: 16 PID: 34171 at kernel/sched/core.c:6881 __might_sleep+0x6c/0x70
>>> [ 94.514020] Modules linked in: ata_generic(E) pata_acpi(E) crc32_pclmul(E) qemu_fw_cfg(E) ata_piix(E) e1000(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) xfs(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_intel(E) reiserfs(E) squashfs(E) fuse(E) dm_snapshot(E) dm_bufio(E) dm_crypt(E) dm_mod(E) binfmt_misc(E) loop(E) sg(E) virtio_rng(E)
>>> [ 94.514082] CPU: 16 PID: 34171 Comm: strace Tainted: G E 5.8.0-rc5-100.g55927f9-default #1 openSUSE Tumbleweed (unreleased)
>>> [ 94.514084] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014
>>> [ 94.514087] RIP: 0010:__might_sleep+0x6c/0x70
>>> [ 94.514090] Code: 41 5c 41 5d e9 25 ff ff ff 48 8b 90 68 1a 00 00 48 8b 70 10 48 c7 c7 10 45 70 8f c6 05 4f a9 68 01 01 48 89 d1 e8 7a bb fc ff <0f> 0b eb c8 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 fd
>>> [ 94.514092] RSP: 0018:ffff9ffa4ba1be00 EFLAGS: 00010286
>>> [ 94.514093] RAX: 0000000000000000 RBX: ffff8dc82b503e00 RCX: 0000000000000489
>>> [ 94.514094] RDX: 0000000000000001 RSI: 0000000000000096 RDI: 0000000000000247
>>> [ 94.514095] RBP: ffffffff8f6ffa6b R08: 0000000000000004 R09: 0000000000000489
>>> [ 94.514095] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000039
>>> [ 94.514096] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8dc82b5045e4
>>> [ 94.514098] FS: 00007fa00f1f9240(0000) GS:ffff8dcb0c000000(0000) knlGS:0000000000000000
>>> [ 94.514099] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 94.514100] CR2: 0000557b53d25877 CR3: 00000004ca490005 CR4: 0000000000360ee0
>>> [ 94.514103] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [ 94.514104] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [ 94.514105] Call Trace:
>>> [ 94.514821] ptrace_stop+0x1a9/0x300
>>
>> This is already wrong. But...
>>
>> Where does this __might_sleep() come from??? I see no blocking calls
>> in ptrace_stop(). Not to mention it is called with ->siglock held and
>> right after this lock is dropped we take tasklist_lock.
>>
>> How does this connect to the debugging patch I sent? Did you see this warning
>> without that patch?
>>
>>
>>> [ 94.514888] WARNING: CPU: 16 PID: 34171 at kernel/signal.c:2219 ptrace_stop+0x1d8/0x300
>> ...
>>> [ 94.514899] RIP: 0010:ptrace_stop+0x1d8/0x300
>>
>> This is the WARN_ON(current->state) added to ptrace_stop(); this can explain
>> BUG_ON() in do_notify_parent() you reported.
>>
>> So, the tracee returns from schedule() with ->state != TASK_RUNNING ???
>> This must not be possible.
>>
>> OK, perhaps task->state was changed by ptrace_unfreeze_traced()? This can
>> only happen if it races with ttwu(__TASK_TRACED) without ->siglock held;
>> nobody should do this.
>>
>> Strange.
>
> I have tried to reproduce this with a vanilla upstream 5.8-rc4 and the
> strace test-suite with
>
> make check -j4
>
> and I wasn't able to after multiple retries. Jiri, just to make sure:
> is this upstream 5.8-rc4 without any additional patches?

You tackled it: we cherry-picked dbfb089d360 into our kernels. Ccing more
people.

So if you try with pure vanilla 5.8-rc6 (it contains the commit), you
should be able to reproduce it. I can.

OTOH 5.8-rc6 with that commit reverted -- I cannot reproduce there. So
it must be it.

> Anything special required to reproduce this in the way you run strace
> and so on?

Nothing special there, except the HW: when I run a VM with 16+
processors, it's much more likely to be hit (usually on the first run
of make check).
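
(Illustratively, a guest in that ballpark could be started with something like the following; the exact QEMU flags are an assumption, the point is the high vCPU count:

	qemu-system-x86_64 -enable-kvm -smp 24 -m 8G ...

with make check run at a comparable -j inside it.)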

thanks,
--
js

2020-07-20 06:01:47

by Jiri Slaby

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On 18. 07. 20, 19:14, Oleg Nesterov wrote:
> On 07/18, Jiri Slaby wrote:
>>
>> On 17. 07. 20, 14:40, Oleg Nesterov wrote:
>>>
>>> Please see the updated patch below; let's check ptrace_unfreeze() too.
>>
>> Sure, dmesg attached.
>
> Thanks a lot!
>
> But I am totally confused...
>
>> [ 94.513944] ------------[ cut here ]------------
>> [ 94.513985] do not call blocking ops when !TASK_RUNNING; state=8 set at [<000000002fe279e9>] ptrace_check_attach+0xbf/0x110
>
> OK, so the ptracer has already done the TASK_TRACED -> __TASK_TRACED change
> in ptrace_freeze_traced().
>
>> [ 94.514019] WARNING: CPU: 16 PID: 34171 at kernel/sched/core.c:6881 __might_sleep+0x6c/0x70
>> [ 94.514020] Modules linked in: ata_generic(E) pata_acpi(E) crc32_pclmul(E) qemu_fw_cfg(E) ata_piix(E) e1000(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) xfs(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_intel(E) reiserfs(E) squashfs(E) fuse(E) dm_snapshot(E) dm_bufio(E) dm_crypt(E) dm_mod(E) binfmt_misc(E) loop(E) sg(E) virtio_rng(E)
>> [ 94.514082] CPU: 16 PID: 34171 Comm: strace Tainted: G E 5.8.0-rc5-100.g55927f9-default #1 openSUSE Tumbleweed (unreleased)
>> [ 94.514084] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014
>> [ 94.514087] RIP: 0010:__might_sleep+0x6c/0x70
>> [ 94.514090] Code: 41 5c 41 5d e9 25 ff ff ff 48 8b 90 68 1a 00 00 48 8b 70 10 48 c7 c7 10 45 70 8f c6 05 4f a9 68 01 01 48 89 d1 e8 7a bb fc ff <0f> 0b eb c8 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 fd
>> [ 94.514092] RSP: 0018:ffff9ffa4ba1be00 EFLAGS: 00010286
>> [ 94.514093] RAX: 0000000000000000 RBX: ffff8dc82b503e00 RCX: 0000000000000489
>> [ 94.514094] RDX: 0000000000000001 RSI: 0000000000000096 RDI: 0000000000000247
>> [ 94.514095] RBP: ffffffff8f6ffa6b R08: 0000000000000004 R09: 0000000000000489
>> [ 94.514095] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000039
>> [ 94.514096] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8dc82b5045e4
>> [ 94.514098] FS: 00007fa00f1f9240(0000) GS:ffff8dcb0c000000(0000) knlGS:0000000000000000
>> [ 94.514099] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 94.514100] CR2: 0000557b53d25877 CR3: 00000004ca490005 CR4: 0000000000360ee0
>> [ 94.514103] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 94.514104] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 94.514105] Call Trace:
>> [ 94.514821] ptrace_stop+0x1a9/0x300
>
> This is already wrong. But...
>
> Where does this __might_sleep() come from??? I see no blocking calls
> in ptrace_stop(). Not to mention it is called with ->siglock held and
> right after this lock is dropped we take tasklist_lock.

Decoded stacktrace:

> ptrace_stop (include/linux/freezer.h:57 include/linux/freezer.h:67 include/linux/freezer.h:128 include/linux/freezer.h:173 kernel/signal.c:2217)
> ptrace_do_notify (kernel/signal.c:2272)
> ptrace_notify (arch/x86/include/asm/paravirt.h:656 arch/x86/include/asm/qspinlock.h:55 include/linux/spinlock.h:211 include/linux/spinlock_api_smp.h:167 include/linux/spinlock.h:403 kernel/signal.c:2282)
> syscall_trace_enter (include/linux/tracehook.h:73 include/linux/tracehook.h:104 arch/x86/entry/common.c:159)
> do_syscall_64 (arch/x86/entry/common.c:380)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:124)

It is try_to_freeze_unsafe() in try_to_freeze() in freezable_schedule() in
ptrace_stop().
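
(For reference, the chain in 5.8's include/linux/freezer.h looks roughly like this; a sketch from memory, not a verbatim copy:

	static inline void freezable_schedule(void)
	{
		freezer_do_not_count();
		schedule();
		freezer_count();	/* ends up calling try_to_freeze() */
	}

	static inline bool try_to_freeze_unsafe(void)
	{
		might_sleep();		/* <-- what fires above */
		if (likely(!freezing(current)))
			return false;
		return __refrigerator(false);
	}

i.e. the splat means schedule() returned while current->state was still not TASK_RUNNING.)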

>
> How does this connect to the debugging patch I sent? Did you see this warning
> without that patch?

I suppose this made it appear:
+CONFIG_PREEMPT_COUNT=y
-# CONFIG_DEBUG_ATOMIC_SLEEP is not set
+CONFIG_DEBUG_ATOMIC_SLEEP=y

>> [ 94.514888] WARNING: CPU: 16 PID: 34171 at kernel/signal.c:2219 ptrace_stop+0x1d8/0x300
> ...
>> [ 94.514899] RIP: 0010:ptrace_stop+0x1d8/0x300
>
> This is the WARN_ON(current->state) added to ptrace_stop(); this can explain
> BUG_ON() in do_notify_parent() you reported.
>
> So, the tracee returns from schedule() with ->state != TASK_RUNNING ???
> This must not be possible.
>
> OK, perhaps task->state was changed by ptrace_unfreeze_traced()? This can
> only happen if it races with ttwu(__TASK_TRACED) without ->siglock held;
> nobody should do this.
>
> Strange.

Please see my other e-mail: all this is with dbfb089d360b applied. Maybe
it makes more sense now?

thanks,
--
js

2020-07-20 06:47:04

by Oleg Nesterov

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On 07/20, Jiri Slaby wrote:
>
> You tackled it: we cherry-picked dbfb089d360 into our kernels. Ccing more
> people.

Thanks... so with this patch __schedule() does

	prev_state = prev->state;

	...

	if (!preempt && prev_state && prev_state == prev->state) {
		if (signal_pending_state(prev_state, prev)) {
			prev->state = TASK_RUNNING;
		} else {

and ptrace_freeze_traced() can change ->state in between. This means
that this task can return from __schedule() with ->state != RUNNING,
which can explain the BUG_ON(task_is_stopped_or_traced) in do_notify_parent()
you reported.

Oleg.

2020-07-20 06:57:40

by Oleg Nesterov

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On 07/20, Jiri Slaby wrote:
>
> On 18. 07. 20, 19:14, Oleg Nesterov wrote:
> >
> > This is already wrong. But...
> >
> > Where does this __might_sleep() come from??? I see no blocking calls
> > in ptrace_stop(). Not to mention it is called with ->siglock held and
> > right after this lock is dropped we take tasklist_lock.
>
> Decoded stacktrace:
>
> > ptrace_stop (include/linux/freezer.h:57 include/linux/freezer.h:67 include/linux/freezer.h:128 include/linux/freezer.h:173 kernel/signal.c:2217)
> > ptrace_do_notify (kernel/signal.c:2272)
> > ptrace_notify (arch/x86/include/asm/paravirt.h:656 arch/x86/include/asm/qspinlock.h:55 include/linux/spinlock.h:211 include/linux/spinlock_api_smp.h:167 include/linux/spinlock.h:403 kernel/signal.c:2282)
> > syscall_trace_enter (include/linux/tracehook.h:73 include/linux/tracehook.h:104 arch/x86/entry/common.c:159)
> > do_syscall_64 (arch/x86/entry/common.c:380)
> > entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:124)
>
> It is try_to_freeze_unsafe() in try_to_freeze() in freezable_schedule() in
> ptrace_stop().

Aha, thanks a lot!

try_to_freeze_unsafe() is called after schedule(), which must return with
->state == TASK_RUNNING, so this matches the other WARN_ON(current->state)
added by the debugging patch after freezable_schedule().

Somehow I decided __might_sleep() was called before read_unlock/schedule.

> >
> > How does this connect to the debugging patch I sent? Did you see this warning
> > without that patch?
>
> I suppose this made it appear:
> +CONFIG_PREEMPT_COUNT=y
> -# CONFIG_DEBUG_ATOMIC_SLEEP is not set
> +CONFIG_DEBUG_ATOMIC_SLEEP=y

I see,

> Please see my other e-mail: all this is with dbfb089d360b applied. Maybe
> it makes more sense now?

Yes. Thanks Jiri!

Oleg.

2020-07-20 08:27:48

by Oleg Nesterov

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

Peter,

Let me add another note. TASK_TRACED/TASK_STOPPED was always protected by
->siglock. In particular, ttwu(__TASK_TRACED) must always be called with
->siglock held. That is why ptrace_freeze_traced() assumes it can safely
do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock).
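
(For reference, the relevant part of 5.8's kernel/ptrace.c, lightly abridged; a sketch from memory, not a verbatim copy:

	static bool ptrace_freeze_traced(struct task_struct *task)
	{
		bool ret = false;

		/* Lockless, only the tracer can set this flag */
		if (task->jobctl & JOBCTL_LISTENING)
			return ret;

		spin_lock_irq(&task->sighand->siglock);
		if (task_is_traced(task) && !__fatal_signal_pending(task)) {
			/* TASK_TRACED -> __TASK_TRACED drops TASK_WAKEKILL */
			task->state = __TASK_TRACED;
			ret = true;
		}
		spin_unlock_irq(&task->sighand->siglock);

		return ret;
	}

ptrace_unfreeze_traced() restores TASK_TRACED, again under ->siglock, unless a fatal signal is pending, in which case it wakes the tracee instead.)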

Can this change race with

	if (signal_pending_state(prev->state, prev)) {
		prev->state = TASK_RUNNING;
	}

in __schedule()? Hopefully not; signal-state is protected by siglock too.
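
(signal_pending_state() only lets a signal interrupt a sleep whose state includes TASK_INTERRUPTIBLE or TASK_WAKEKILL; roughly, from 5.8's include/linux/sched/signal.h, quoted from memory:

	static inline int signal_pending_state(long state, struct task_struct *p)
	{
		if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
			return 0;
		if (!signal_pending(p))
			return 0;

		return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
	}

__TASK_TRACED contains neither bit, so once the tracee is frozen this test can no longer flip its state back to TASK_RUNNING, not even for SIGKILL.)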

So I think this logic was correct even if it doesn't look nice. But "doesn't
look nice" is true for the whole ptrace code ;)

On 07/20, Oleg Nesterov wrote:
>
> On 07/20, Jiri Slaby wrote:
> >
> > You tackled it: we cherry-picked dbfb089d360 into our kernels. Ccing more
> > people.
>
> Thanks... so with this patch __schedule() does
>
> prev_state = prev->state;
>
> ...
>
> if (!preempt && prev_state && prev_state == prev->state) {
> if (signal_pending_state(prev_state, prev)) {
> prev->state = TASK_RUNNING;
> } else {
>
> and ptrace_freeze_traced() can change ->state in between. This means
> that this task can return from __schedule() with ->state != RUNNING,
> which can explain the BUG_ON(task_is_stopped_or_traced) in do_notify_parent()
> you reported.
>
> Oleg.

2020-07-20 08:44:43

by Peter Zijlstra

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On Mon, Jul 20, 2020 at 10:26:58AM +0200, Oleg Nesterov wrote:
> Peter,
>
> Let me add another note. TASK_TRACED/TASK_STOPPED was always protected by
> ->siglock. In particular, ttwu(__TASK_TRACED) must always be called with
> ->siglock held. That is why ptrace_freeze_traced() assumes it can safely
> do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock).
>
> Can this change race with
>
> if (signal_pending_state(prev->state, prev)) {
> prev->state = TASK_RUNNING;
> }
>
> in __schedule() ? Hopefully not, signal-state is protected by siglock too.
>
> So I think this logic was correct even if it doesn't look nice. But "doesn't
> look nice" is true for the whole ptrace code ;)

*groan*... another bit of obscure magic :-(

let me go try and wake up and figure out how best to deal with this.

2020-07-20 11:03:08

by Peter Zijlstra

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On Mon, Jul 20, 2020 at 10:41:06AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 20, 2020 at 10:26:58AM +0200, Oleg Nesterov wrote:
> > Peter,
> >
> > Let me add another note. TASK_TRACED/TASK_STOPPED was always protected by
> > ->siglock. In particular, ttwu(__TASK_TRACED) must always be called with
> > ->siglock held. That is why ptrace_freeze_traced() assumes it can safely
> > do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock).
> >
> > Can this change race with
> >
> > if (signal_pending_state(prev->state, prev)) {
> > prev->state = TASK_RUNNING;
> > }
> >
> > in __schedule() ? Hopefully not, signal-state is protected by siglock too.
> >
> > So I think this logic was correct even if it doesn't look nice. But "doesn't
> > look nice" is true for the whole ptrace code ;)
>
> *groan*... another bit of obscure magic :-(
>
> let me go try and wake up and figure out how best to deal with this.

So clearly I'm still missing something; the below results in:

[ 63.760863] ------------[ cut here ]------------
[ 63.766019] !(tmp_state & __TASK_TRACED)
[ 63.766030] WARNING: CPU: 33 PID: 33801 at kernel/sched/core.c:4158 __schedule+0x6bd/0x8e0

Also, is there any way to not have ptrace do this? How performance
critical is this ptrace path? Because I really hate having to add code
to __schedule() to deal with this horrible thing.


---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e15543cb84812..f65a801d268b6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4100,9 +4100,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
  */
 static void __sched notrace __schedule(bool preempt)
 {
+	unsigned long prev_state, tmp_state;
 	struct task_struct *prev, *next;
 	unsigned long *switch_count;
-	unsigned long prev_state;
 	struct rq_flags rf;
 	struct rq *rq;
 	int cpu;
@@ -4140,16 +4140,38 @@ static void __sched notrace __schedule(bool preempt)
 	rq_lock(rq, &rf);
 	smp_mb__after_spinlock();
 
+	/*
+	 * We must re-load prev->state in case ttwu_remote() changed it
+	 * before we acquired rq->lock.
+	 */
+	tmp_state = prev->state;
+	if (unlikely(prev_state != tmp_state)) {
+		if (prev_state & __TASK_TRACED) {
+			/*
+			 * ptrace_{,un}freeze_traced() think it is cool
+			 * to change ->state around behind our backs
+			 * between TASK_TRACED and __TASK_TRACED.
+			 *
+			 * Luckily this transformation doesn't affect
+			 * sched_contributes_to_load.
+			 */
+			SCHED_WARN_ON(!(tmp_state & __TASK_TRACED));
+		} else {
+			/*
+			 * For any other case, a changed prev_state
+			 * must be to TASK_RUNNING, such that...
+			 */
+			SCHED_WARN_ON(tmp_state != TASK_RUNNING);
+		}
+		prev_state = tmp_state;
+	}
+
 	/* Promote REQ to ACT */
 	rq->clock_update_flags <<= 1;
 	update_rq_clock(rq);
 
 	switch_count = &prev->nivcsw;
-	/*
-	 * We must re-load prev->state in case ttwu_remote() changed it
-	 * before we acquired rq->lock.
-	 */
-	if (!preempt && prev_state && prev_state == prev->state) {
+	if (!preempt && prev_state) {
 		if (signal_pending_state(prev_state, prev)) {
 			prev->state = TASK_RUNNING;
 		} else {

2020-07-20 11:27:12

by Peter Zijlstra

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On Mon, Jul 20, 2020 at 12:59:24PM +0200, [email protected] wrote:
> On Mon, Jul 20, 2020 at 10:41:06AM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 20, 2020 at 10:26:58AM +0200, Oleg Nesterov wrote:
> > > Peter,
> > >
> > > Let me add another note. TASK_TRACED/TASK_STOPPED was always protected by
> > > ->siglock. In particular, ttwu(__TASK_TRACED) must always be called with
> > > ->siglock held. That is why ptrace_freeze_traced() assumes it can safely
> > > do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock).
> > >
> > > Can this change race with
> > >
> > > if (signal_pending_state(prev->state, prev)) {
> > > prev->state = TASK_RUNNING;
> > > }
> > >
> > > in __schedule() ? Hopefully not, signal-state is protected by siglock too.
> > >
> > > So I think this logic was correct even if it doesn't look nice. But "doesn't
> > > look nice" is true for the whole ptrace code ;)
> >
> > *groan*... another bit of obscure magic :-(
> >
> > let me go try and wake up and figure out how best to deal with this.

This then? That seems to survive the strace thing.

---
kernel/sched/core.c | 34 ++++++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e15543cb84812..b5973d7fa521c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4100,9 +4100,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
  */
 static void __sched notrace __schedule(bool preempt)
 {
+	unsigned long prev_state, tmp_state;
 	struct task_struct *prev, *next;
 	unsigned long *switch_count;
-	unsigned long prev_state;
 	struct rq_flags rf;
 	struct rq *rq;
 	int cpu;
@@ -4140,16 +4140,38 @@ static void __sched notrace __schedule(bool preempt)
 	rq_lock(rq, &rf);
 	smp_mb__after_spinlock();
 
+	/*
+	 * We must re-load prev->state in case ttwu_remote() changed it
+	 * before we acquired rq->lock.
+	 */
+	tmp_state = prev->state;
+	if (unlikely(prev_state != tmp_state)) {
+		/*
+		 * ptrace_{,un}freeze_traced() think it is cool to change
+		 * ->state around behind our backs between TASK_TRACED and
+		 * __TASK_TRACED.
+		 *
+		 * This is safe because this, as well as any __TASK_TRACED
+		 * wakeups are under siglock.
+		 *
+		 * For any other case, a changed prev_state must be to
+		 * TASK_RUNNING, such that when it blocks, the load has
+		 * happened before the smp_mb().
+		 *
+		 * Also see the comment with deactivate_task().
+		 */
+		SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED &&
+					    !(tmp_state & __TASK_TRACED)));
+
+		prev_state = tmp_state;
+	}
+
 	/* Promote REQ to ACT */
 	rq->clock_update_flags <<= 1;
 	update_rq_clock(rq);
 
 	switch_count = &prev->nivcsw;
-	/*
-	 * We must re-load prev->state in case ttwu_remote() changed it
-	 * before we acquired rq->lock.
-	 */
-	if (!preempt && prev_state && prev_state == prev->state) {
+	if (!preempt && prev_state) {
 		if (signal_pending_state(prev_state, prev)) {
 			prev->state = TASK_RUNNING;
 		} else {

2020-07-20 11:41:01

by Jiri Slaby

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On 20. 07. 20, 13:26, [email protected] wrote:
> On Mon, Jul 20, 2020 at 12:59:24PM +0200, [email protected] wrote:
>> On Mon, Jul 20, 2020 at 10:41:06AM +0200, Peter Zijlstra wrote:
>>> On Mon, Jul 20, 2020 at 10:26:58AM +0200, Oleg Nesterov wrote:
>>>> Peter,
>>>>
>>>> Let me add another note. TASK_TRACED/TASK_STOPPED was always protected by
>>>> ->siglock. In particular, ttwu(__TASK_TRACED) must always be called with
>>>> ->siglock held. That is why ptrace_freeze_traced() assumes it can safely
>>>> do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock).
>>>>
>>>> Can this change race with
>>>>
>>>> if (signal_pending_state(prev->state, prev)) {
>>>> prev->state = TASK_RUNNING;
>>>> }
>>>>
>>>> in __schedule() ? Hopefully not, signal-state is protected by siglock too.
>>>>
>>>> So I think this logic was correct even if it doesn't look nice. But "doesn't
>>>> look nice" is true for the whole ptrace code ;)
>>>
>>> *groan*... another bit of obscure magic :-(
>>>
>>> let me go try and wake up and figure out how best to deal with this.
>
> This then? That seems to survive the strace thing.

FWIW for me too.

thanks,
--
js

2020-07-20 12:21:10

by Valentin Schneider

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917


On 20/07/20 12:26, [email protected] wrote:
> ---
> kernel/sched/core.c | 34 ++++++++++++++++++++++++++++------
> 1 file changed, 28 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e15543cb84812..b5973d7fa521c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4100,9 +4100,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> */
> static void __sched notrace __schedule(bool preempt)
> {
> + unsigned long prev_state, tmp_state;
> struct task_struct *prev, *next;
> unsigned long *switch_count;
> - unsigned long prev_state;
> struct rq_flags rf;
> struct rq *rq;
> int cpu;
> @@ -4140,16 +4140,38 @@ static void __sched notrace __schedule(bool preempt)
> rq_lock(rq, &rf);
> smp_mb__after_spinlock();
>
> + /*
> + * We must re-load prev->state in case ttwu_remote() changed it
> + * before we acquired rq->lock.
> + */
> + tmp_state = prev->state;
> + if (unlikely(prev_state != tmp_state)) {
> + /*
> + * ptrace_{,un}freeze_traced() think it is cool to change
> + * ->state around behind our backs between TASK_TRACED and
> + * __TASK_TRACED.
> + *
> + * This is safe because this, as well as any __TASK_TRACED
> + * wakeups are under siglock.
> + *
> + * For any other case, a changed prev_state must be to
> + * TASK_RUNNING, such that when it blocks, the load has
> + * happened before the smp_mb().
> + *
> + * Also see the comment with deactivate_task().
> + */
> + SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED &&
> + !(tmp_state & __TASK_TRACED)));
> +

IIUC if the state changed and isn't TASK_RUNNING it *has* to have
__TASK_TRACED, so can't that be

SCHED_WARN_ON(tmp_state && !(tmp_state & __TASK_TRACED));

Other than that, LGTM.

> + prev_state = tmp_state;
> + }
> +
> /* Promote REQ to ACT */
> rq->clock_update_flags <<= 1;
> update_rq_clock(rq);
>
> switch_count = &prev->nivcsw;
> - /*
> - * We must re-load prev->state in case ttwu_remote() changed it
> - * before we acquired rq->lock.
> - */
> - if (!preempt && prev_state && prev_state == prev->state) {
> + if (!preempt && prev_state) {
> if (signal_pending_state(prev_state, prev)) {
> prev->state = TASK_RUNNING;
> } else {

2020-07-20 12:58:08

by Christian Brauner

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On Mon, Jul 20, 2020 at 01:26:23PM +0200, [email protected] wrote:
> On Mon, Jul 20, 2020 at 12:59:24PM +0200, [email protected] wrote:
> > On Mon, Jul 20, 2020 at 10:41:06AM +0200, Peter Zijlstra wrote:
> > > On Mon, Jul 20, 2020 at 10:26:58AM +0200, Oleg Nesterov wrote:
> > > > Peter,
> > > >
> > > > Let me add another note. TASK_TRACED/TASK_STOPPED was always protected by
> > > > ->siglock. In particular, ttwu(__TASK_TRACED) must always be called with
> > > > ->siglock held. That is why ptrace_freeze_traced() assumes it can safely
> > > > do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock).
> > > >
> > > > Can this change race with
> > > >
> > > > if (signal_pending_state(prev->state, prev)) {
> > > > prev->state = TASK_RUNNING;
> > > > }
> > > >
> > > > in __schedule() ? Hopefully not, signal-state is protected by siglock too.
> > > >
> > > > So I think this logic was correct even if it doesn't look nice. But "doesn't
> > > > look nice" is true for the whole ptrace code ;)
> > >
> > > *groan*... another bit of obscure magic :-(
> > >
> > > let me go try and wake up and figure out how best to deal with this.
>
> This then? That seems to survive the strace thing.

Ran a mainline v5.8-rc6 -> reproduced crash
Ran a mainline v5.8-rc6 + patch below -> testsuite runs fine/not able to reproduce crash

So fwiw
Tested-by: Christian Brauner <[email protected]>
Acked-by: Christian Brauner <[email protected]>

>
> ---
> kernel/sched/core.c | 34 ++++++++++++++++++++++++++++------
> 1 file changed, 28 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e15543cb84812..b5973d7fa521c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4100,9 +4100,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> */
> static void __sched notrace __schedule(bool preempt)
> {
> + unsigned long prev_state, tmp_state;
> struct task_struct *prev, *next;
> unsigned long *switch_count;
> - unsigned long prev_state;
> struct rq_flags rf;
> struct rq *rq;
> int cpu;
> @@ -4140,16 +4140,38 @@ static void __sched notrace __schedule(bool preempt)
> rq_lock(rq, &rf);
> smp_mb__after_spinlock();
>
> + /*
> + * We must re-load prev->state in case ttwu_remote() changed it
> + * before we acquired rq->lock.
> + */
> + tmp_state = prev->state;
> + if (unlikely(prev_state != tmp_state)) {
> + /*
> + * ptrace_{,un}freeze_traced() think it is cool to change
> + * ->state around behind our backs between TASK_TRACED and
> + * __TASK_TRACED.
> + *
> + * This is safe because this, as well as any __TASK_TRACED
> + * wakeups are under siglock.
> + *
> + * For any other case, a changed prev_state must be to
> + * TASK_RUNNING, such that when it blocks, the load has
> + * happened before the smp_mb().
> + *
> + * Also see the comment with deactivate_task().
> + */
> + SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED &&
> + !(tmp_state & __TASK_TRACED)));
> +
> + prev_state = tmp_state;
> + }
> +
> /* Promote REQ to ACT */
> rq->clock_update_flags <<= 1;
> update_rq_clock(rq);
>
> switch_count = &prev->nivcsw;
> - /*
> - * We must re-load prev->state in case ttwu_remote() changed it
> - * before we acquired rq->lock.
> - */
> - if (!preempt && prev_state && prev_state == prev->state) {
> + if (!preempt && prev_state) {
> if (signal_pending_state(prev_state, prev)) {
> prev->state = TASK_RUNNING;
> } else {

2020-07-20 13:19:58

by Peter Zijlstra

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On Mon, Jul 20, 2020 at 01:20:26PM +0100, Valentin Schneider wrote:
> On 20/07/20 12:26, [email protected] wrote:

> > + /*
> > + * We must re-load prev->state in case ttwu_remote() changed it
> > + * before we acquired rq->lock.
> > + */
> > + tmp_state = prev->state;
> > + if (unlikely(prev_state != tmp_state)) {
> > + /*
> > + * ptrace_{,un}freeze_traced() think it is cool to change
> > + * ->state around behind our backs between TASK_TRACED and
> > + * __TASK_TRACED.
> > + *
> > + * This is safe because this, as well as any __TASK_TRACED
> > + * wakeups are under siglock.
> > + *
> > + * For any other case, a changed prev_state must be to
> > + * TASK_RUNNING, such that when it blocks, the load has
> > + * happened before the smp_mb().
> > + *
> > + * Also see the comment with deactivate_task().
> > + */
> > + SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED &&
> > + !(tmp_state & __TASK_TRACED)));
> > +
>
> IIUC if the state changed and isn't TASK_RUNNING it *has* to have
> __TASK_TRACED, so can't that be
>
> SCHED_WARN_ON(tmp_state && !(tmp_state & __TASK_TRACED));

Suppose task->state == TASK_UNINTERRUPTIBLE, and task != current, and
then someone goes and does task->state = __TASK_TRACED.

That is, your statement is correct given the current code, but we also
want to verify no new code comes along and does something 'creative'.

Or is the heat getting to me?

2020-07-20 14:03:48

by Oleg Nesterov

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On 07/20, Peter Zijlstra wrote:
>
> Also, is there any way to not have ptrace do this?

Well, we need to ensure that even SIGKILL can't wake the tracee up while
the debugger plays with its registers/etc.

> How performance
> critical is this ptrace path?

This is a slow path.

We can probably change ptrace_check_attach() to call ptrace_freeze_traced()
after wait_task_inactive(), but I would like to not do this... Because we
actually want to avoid wait_task_inactive() when possible.

Perhaps ptrace_freeze_traced() can take __task_rq_lock() to avoid the race
with __schedule()? No, it reads prev_state before rq_lock().

> Because I really hate having to add code
> to __schedule() to deal with this horrible thing.

Oh yes, I agree.

I have to admit, I do not understand the usage of prev_state in schedule(),
it looks really, really subtle...

Oleg.

2020-07-20 14:06:43

by Peter Zijlstra

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On Mon, Jul 20, 2020 at 01:26:23PM +0200, [email protected] wrote:
> kernel/sched/core.c | 34 ++++++++++++++++++++++++++++------
> 1 file changed, 28 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e15543cb84812..b5973d7fa521c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4100,9 +4100,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> */
> static void __sched notrace __schedule(bool preempt)
> {
> + unsigned long prev_state, tmp_state;
> struct task_struct *prev, *next;
> unsigned long *switch_count;
> - unsigned long prev_state;
> struct rq_flags rf;
> struct rq *rq;
> int cpu;
> @@ -4140,16 +4140,38 @@ static void __sched notrace __schedule(bool preempt)
> rq_lock(rq, &rf);
> smp_mb__after_spinlock();
>
> + /*
> + * We must re-load prev->state in case ttwu_remote() changed it
> + * before we acquired rq->lock.
> + */
> + tmp_state = prev->state;
> + if (unlikely(prev_state != tmp_state)) {
> + /*
> + * ptrace_{,un}freeze_traced() think it is cool to change
> + * ->state around behind our backs between TASK_TRACED and
> + * __TASK_TRACED.
> + *
> + * This is safe because this, as well as any __TASK_TRACED
> + * wakeups are under siglock.
> + *
> + * For any other case, a changed prev_state must be to
> + * TASK_RUNNING, such that when it blocks, the load has
> + * happened before the smp_mb().
> + *
> + * Also see the comment with deactivate_task().
> + */
> + SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED &&
> + !(tmp_state & __TASK_TRACED)));
> +
> + prev_state = tmp_state;

While trying to write a changelog for this thing, I can't convince
myself we don't need:

smp_mb();

here. Consider:

CPU0                              CPU1                              CPU2

                                  schedule()
                                    prev_state = prev->state;
                                    spin_lock(rq->lock);
                                    smp_mb__after_spin_lock();

ptrace_freeze_traced()
  spin_lock(siglock)
  task->state = __TASK_TRACED;
  spin_unlock(siglock);

                                    tmp_state = prev->state;
                                    if (prev_state != tmp_state)
                                      prev_state = tmp_state;

                                    /* NO SMP_MB */

                                    if (prev_state)
                                      deactivate_task()
                                        prev->on_rq = 0;

                                                                    spin_lock(siglock);
                                                                    ttwu()
                                                                      if (rq->on_rq && ...)
                                                                        goto unlock;
                                                                      smp_acquire__after_ctrl_dep();
                                                                      p->state = TASK_WAKING;

Loses the ordering we previously relied upon. That is, CPU1's
prev->state load and prev->on_rq store can get reordered vs CPU2.

OTOH, we have a control dependency on CPU1 as well; that should provide
LOAD->STORE ordering: after all, we only do the ->on_rq=0 store IFF we
see prev_state.

So that is:

	if (p->state)                  if (!p->on_rq)
	  p->on_rq = 0;                  p->state = TASK_WAKING

which matches a CTRL-DEP to a CTRL-DEP ...

But this then means we can simplify dbfb089d360 as well, but now my head
hurts.

> + }
> +
> /* Promote REQ to ACT */
> rq->clock_update_flags <<= 1;
> update_rq_clock(rq);
>
> switch_count = &prev->nivcsw;
> - /*
> - * We must re-load prev->state in case ttwu_remote() changed it
> - * before we acquired rq->lock.
> - */
> - if (!preempt && prev_state && prev_state == prev->state) {
> + if (!preempt && prev_state) {
> if (signal_pending_state(prev_state, prev)) {
> prev->state = TASK_RUNNING;
> } else {

2020-07-20 14:22:40

by Peter Zijlstra

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On Mon, Jul 20, 2020 at 04:02:24PM +0200, Oleg Nesterov wrote:
> I have to admit, I do not understand the usage of prev_state in schedule(),
> it looks really, really subtle...

Right, so commit dbfb089d360 solved a problem where schedule() re-read
prev->state vs prev->on_rq = 0. That is, schedule()'s dequeue and
ttwu()'s enqueue disagreed over sched_contributes_to_load, and as a
result load-accounting went wobbly.

Now, looking at that commit again, I might've solved the problem twice
:-P

So on the one hand, I provide ordering:

	LOAD p->state                  LOAD-ACQUIRE p->on_rq == 0
	MB
	STORE p->on_rq, 0              STORE p->state, TASK_WAKING

such that ttwu() will only change p->state after on_rq==0, which is
after loading p->state in schedule().

At the same time, I also had schedule() set
p->sched_contributes_to_load once, and then consistently used that value
throughout, without ever looking at p->state again, which too makes it
much harder to mess load-avg up.
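
(Roughly, the relevant hunk of dbfb089d360 in __schedule()'s deactivate path, quoted from memory and so only a sketch:

	prev->sched_contributes_to_load =
		(prev_state & TASK_UNINTERRUPTIBLE) &&
		!(prev_state & TASK_NOLOAD) &&
		!(prev->flags & PF_FROZEN);

	if (prev->sched_contributes_to_load)
		rq->nr_uninterruptible++;

i.e. the load contribution is computed once from the snapshotted prev_state, never from a fresh p->state load.)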


Now, the ordering in schedule() relies on doing the p->state load
before:

	spin_lock(rq->lock)
	smp_mb__after_spinlock();

and doing a re-load check after, with the assumption that if the reload
is different, it will not block.

That said, in a crossed email, I just proposed we could simplify all
this like so.. but now I need to go ask people to re-validate that
loadavg muck again :-/


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a2a244af9a53..437fc3b241f2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt)
 	local_irq_disable();
 	rcu_note_context_switch(preempt);
 
-	/* See deactivate_task() below. */
-	prev_state = prev->state;
-
 	/*
 	 * Make sure that signal_pending_state()->signal_pending() below
 	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
@@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt)
 	 * We must re-load prev->state in case ttwu_remote() changed it
 	 * before we acquired rq->lock.
 	 */
-	if (!preempt && prev_state && prev_state == prev->state) {
+	prev_state = prev->state;
+	if (!preempt && prev_state) {
 		if (signal_pending_state(prev_state, prev)) {
 			prev->state = TASK_RUNNING;
 		} else {
@@ -4237,10 +4235,12 @@ static void __sched notrace __schedule(bool preempt)
 
 	/*
 	 * __schedule()			ttwu()
-	 * prev_state = prev->state;	if (READ_ONCE(p->on_rq) && ...)
-	 * LOCK rq->lock		  goto out;
-	 * smp_mb__after_spinlock();	smp_acquire__after_ctrl_dep();
-	 * p->on_rq = 0;		p->state = TASK_WAKING;
+	 * if (prev_state)		if (p->on_rq && ...)
+	 *   p->on_rq = 0;		  goto out;
+	 *				smp_acquire__after_ctrl_dep();
+	 *				p->state = TASK_WAKING
+	 *
+	 * Where __schedule() and ttwu() have matching control dependencies.
 	 *
 	 * After this, schedule() must not care about p->state any more.
 	 */

2020-07-20 14:29:59

by Valentin Schneider

Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917


On 20/07/20 14:17, [email protected] wrote:
> On Mon, Jul 20, 2020 at 01:20:26PM +0100, Valentin Schneider wrote:
>> On 20/07/20 12:26, [email protected] wrote:
>
>> > + /*
>> > + * We must re-load prev->state in case ttwu_remote() changed it
>> > + * before we acquired rq->lock.
>> > + */
>> > + tmp_state = prev->state;
>> > + if (unlikely(prev_state != tmp_state)) {
>> > + /*
>> > + * ptrace_{,un}freeze_traced() think it is cool to change
>> > + * ->state around behind our backs between TASK_TRACED and
>> > + * __TASK_TRACED.
>> > + *
>> > + * This is safe because this, as well as any __TASK_TRACED
>> > + * wakeups are under siglock.
>> > + *
>> > + * For any other case, a changed prev_state must be to
>> > + * TASK_RUNNING, such that when it blocks, the load has
>> > + * happened before the smp_mb().
>> > + *
>> > + * Also see the comment with deactivate_task().
>> > + */
>> > + SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED &&
>> > + !(tmp_state & __TASK_TRACED)));
>> > +
>>
>> IIUC if the state changed and isn't TASK_RUNNING it *has* to have
>> __TASK_TRACED, so can't that be
>>
>> SCHED_WARN_ON(tmp_state && !(tmp_state & __TASK_TRACED));
>
> Suppose task->state == TASK_UNINTERRUPTIBLE, and task != current, and
> then someone goes and does task->state = __TASK_TRACED.
>
> That is, your statement is correct given the current code, but we also
> want to verify no new code comes along and does something 'creative'.
>

That is what I was trying to go for; AFAICT your approach only warns if
__TASK_TRACED gets removed between the two loads (IOW it was already
there). The way I was seeing it is:

- We only get here if prev_state != tmp_state; IOW we know we raced with
something
- if (tmp_state), then it wasn't with ttwu_remote()
- thus it must only be with ptrace shenanigans, IOW __TASK_TRACED must be
there.

Now, what I suggested still doesn't detect what you pointed out, or some
crazier thing that sets __TASK_TRACED *and* some other stuff. IIUC the
ptrace transformation does TASK_TRACED -> __TASK_TRACED, so we could have
it as:

/* TODO: name me */
#define foobar (TASK_TRACED ^ __TASK_TRACED)

...

/* not TASK_RUNNING; check against allowed transformations */
SCHED_WARN_ON(tmp_state && ((prev_state ^ tmp_state) & ~foobar));
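
Since TASK_TRACED is TASK_WAKEKILL | __TASK_TRACED, that mask is really
just TASK_WAKEKILL; a hypothetical walk-through, assuming the usual flag
values (TASK_UNINTERRUPTIBLE 0x0002, __TASK_TRACED 0x0008, TASK_WAKEKILL
0x0100):

	/* allowed ptrace flip: XOR == TASK_WAKEKILL, masked out */
	prev_state = TASK_TRACED;   tmp_state = __TASK_TRACED;
	/* (prev_state ^ tmp_state) & ~foobar == 0x0000 -> no warning */

	/* 'creative' transition: XOR has bits outside the mask */
	prev_state = TASK_UNINTERRUPTIBLE;   tmp_state = __TASK_TRACED;
	/* (prev_state ^ tmp_state) & ~foobar == 0x000a -> warning fires */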


That said...

> Or is the heat getting to me?

... that may apply to me as well :-)

2020-07-20 14:40:16

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On 07/20, Peter Zijlstra wrote:
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt)
> local_irq_disable();
> rcu_note_context_switch(preempt);
>
> - /* See deactivate_task() below. */
> - prev_state = prev->state;
> -
> /*
> * Make sure that signal_pending_state()->signal_pending() below
> * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> @@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt)
> * We must re-load prev->state in case ttwu_remote() changed it
> * before we acquired rq->lock.
> */
> - if (!preempt && prev_state && prev_state == prev->state) {
> + prev_state = prev->state;
> + if (!preempt && prev_state) {

Heh ;) Peter, you know what? I did the same change and tried to understand
why it is wrong and what I have missed.

Thanks, now I can relax. But my head hurts too, I'll probably try to re-read
this code and other emails from you tomorrow.

Oleg.

2020-07-20 15:36:53

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On 07/20, Oleg Nesterov wrote:
>
> On 07/20, Peter Zijlstra wrote:
> >
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt)
> > local_irq_disable();
> > rcu_note_context_switch(preempt);
> >
> > - /* See deactivate_task() below. */
> > - prev_state = prev->state;
> > -
> > /*
> > * Make sure that signal_pending_state()->signal_pending() below
> > * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> > @@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt)
> > * We must re-load prev->state in case ttwu_remote() changed it
> > * before we acquired rq->lock.
> > */
> > - if (!preempt && prev_state && prev_state == prev->state) {
> > + prev_state = prev->state;
> > + if (!preempt && prev_state) {
>
> Heh ;) Peter, you know what? I did the same change and tried to understand
> why it is wrong and what I have missed.
>
> Thanks, now I can relax. But my head hurts too, I'll probably try to re-read
> this code and other emails from you tomorrow.

Yes, I can no longer read this code today ;)

but now it seems to me that (in theory) we need READ_ONCE(prev->state) here
and probably WRITE_ONCE(on_rq) in deactivate_task() to ensure ctrl-dep?

Probably not, I got lost.

Oleg.

2020-07-20 16:51:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On Mon, Jul 20, 2020 at 05:35:15PM +0200, Oleg Nesterov wrote:
> On 07/20, Oleg Nesterov wrote:
> >
> > On 07/20, Peter Zijlstra wrote:
> > >
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt)
> > > local_irq_disable();
> > > rcu_note_context_switch(preempt);
> > >
> > > - /* See deactivate_task() below. */
> > > - prev_state = prev->state;
> > > -
> > > /*
> > > * Make sure that signal_pending_state()->signal_pending() below
> > > * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> > > @@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt)
> > > * We must re-load prev->state in case ttwu_remote() changed it
> > > * before we acquired rq->lock.
> > > */
> > > - if (!preempt && prev_state && prev_state == prev->state) {
> > > + prev_state = prev->state;
> > > + if (!preempt && prev_state) {
> >
> > Heh ;) Peter, you know what? I did the same change and tried to understand
> > why it is wrong and what I have missed.
> >
> > Thanks, now I can relax. But my head hurts too, I'll probably try to re-read
> > this code and other emails from you tomorrow.
>
> Yes, I can no longer read this code today ;)
>
> but now it seems to me that (in theory) we need READ_ONCE(prev->state) here
> and probably WRITE_ONCE(on_rq) in deactivate_task() to ensure ctrl-dep?
>
> Probably not, I got lost.

So, task_struct::state is declared volatile (we should probably 'fix'
that some day), so that doesn't require READ_ONCE() -- in fact, that
caused a bunch of re-reads in the old code which made the loadavg race
more likely.

->on_rq is only ever written 0, 1, or 2; there's no possible store-tearing.
But possibly, yes, WRITE_ONCE() would be nicer.
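
If we did want it, roughly (a sketch against the 5.8 deactivate_task(),
not a proposed patch):

	void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
	{
		/*
		 * WRITE_ONCE() only documents the intent; a plain store
		 * of 0/1/2 to an int cannot tear in practice.
		 */
		WRITE_ONCE(p->on_rq,
			   (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING);
		dequeue_task(rq, p, flags);
	}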

2020-07-21 04:54:39

by Paul Gortmaker

[permalink] [raw]
Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

[Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917] On 20/07/2020 (Mon 16:21) Peter Zijlstra wrote:

> On Mon, Jul 20, 2020 at 04:02:24PM +0200, Oleg Nesterov wrote:
> > I have to admit, I do not understand the usage of prev_state in schedule(),
> > it looks really, really subtle...
>
> Right, so commit dbfb089d360 solved a problem where schedule() re-read
> prev->state vs prev->on_rq = 0. That is, schedule()'s dequeue and
> ttwu()'s enqueue disagreed over sched_contributes_to_load, and as a
> result load-accounting went wobbly.
>
> Now, looking at that commit again, I might've solved the problem twice
> :-P

[...]

> That said, in a crossed email, I just proposed we could simplify all
> this like so.. but now I need to go ask people to re-validate that
> loadavg muck again :-/

After a two hour "quick" sanity test I then gave it a full 7h run (which
always seemed to break before dbfb089d360) and I didn't see any stuck
load average with master from today + this change.

Paul.

root@t5610:/home/paul/git/linux-head#
[1]+ Done nohup tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 24 --duration 420 --configs TREE03 --trust-make > /tmp/kvm.sh.out 2>&1
root@t5610:/home/paul/git/linux-head# cat /proc/version
Linux version 5.8.0-rc6-00001-g5714ee50bb43-dirty (paul@t5610) (gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #3 SMP Mon Jul 20 12:30:33 EDT 2020
root@t5610:/home/paul/git/linux-head# uptime
00:49:18 up 7:41, 2 users, load average: 0.01, 0.00, 0.63
root@t5610:/home/paul/git/linux-head#

--

>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a2a244af9a53..437fc3b241f2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt)
> local_irq_disable();
> rcu_note_context_switch(preempt);
>
> - /* See deactivate_task() below. */
> - prev_state = prev->state;
> -
> /*
> * Make sure that signal_pending_state()->signal_pending() below
> * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> @@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt)
> * We must re-load prev->state in case ttwu_remote() changed it
> * before we acquired rq->lock.
> */
> - if (!preempt && prev_state && prev_state == prev->state) {
> + prev_state = prev->state;
> + if (!preempt && prev_state) {
> if (signal_pending_state(prev_state, prev)) {
> prev->state = TASK_RUNNING;
> } else {
> @@ -4237,10 +4235,12 @@ static void __sched notrace __schedule(bool preempt)
>
> /*
> * __schedule() ttwu()
> - * prev_state = prev->state; if (READ_ONCE(p->on_rq) && ...)
> - * LOCK rq->lock goto out;
> - * smp_mb__after_spinlock(); smp_acquire__after_ctrl_dep();
> - * p->on_rq = 0; p->state = TASK_WAKING;
> + * if (prev_state) if (p->on_rq && ...)
> + * p->on_rq = 0; goto out;
> + * smp_acquire__after_ctrl_dep();
> + * p->state = TASK_WAKING
> + *
> + * Where __schedule() and ttwu() have matching control dependencies.
> *
> * After this, schedule() must not care about p->state any more.
> */

2020-07-21 08:38:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

On Tue, Jul 21, 2020 at 12:52:52AM -0400, Paul Gortmaker wrote:
> [Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917] On 20/07/2020 (Mon 16:21) Peter Zijlstra wrote:
>
> > On Mon, Jul 20, 2020 at 04:02:24PM +0200, Oleg Nesterov wrote:
> > > I have to admit, I do not understand the usage of prev_state in schedule(),
> > > it looks really, really subtle...
> >
> > Right, so commit dbfb089d360 solved a problem where schedule() re-read
> > prev->state vs prev->on_rq = 0. That is, schedule()'s dequeue and
> > ttwu()'s enqueue disagreed over sched_contributes_to_load, and as a
> > result load-accounting went wobbly.
> >
> > Now, looking at that commit again, I might've solved the problem twice
> > :-P
>
> [...]
>
> > That said, in a crossed email, I just proposed we could simplify all
> > this like so.. but now I need to go ask people to re-validate that
> > loadavg muck again :-/
>
> After a two hour "quick" sanity test I then gave it a full 7h run (which
> always seemed to break before dbfb089d360) and I didn't see any stuck
> load average with master from today + this change.
>
> Paul.
>
> root@t5610:/home/paul/git/linux-head#
> [1]+ Done nohup tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 24 --duration 420 --configs TREE03 --trust-make > /tmp/kvm.sh.out 2>&1
> root@t5610:/home/paul/git/linux-head# cat /proc/version
> Linux version 5.8.0-rc6-00001-g5714ee50bb43-dirty (paul@t5610) (gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #3 SMP Mon Jul 20 12:30:33 EDT 2020
> root@t5610:/home/paul/git/linux-head# uptime
> 00:49:18 up 7:41, 2 users, load average: 0.01, 0.00, 0.63
> root@t5610:/home/paul/git/linux-head#
>
> --

Thanks a lot for re-testing that, Paul, much appreciated!

Let me go write a Changelog for this then...

2020-07-21 09:15:35

by Valentin Schneider

[permalink] [raw]
Subject: Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917


On 20/07/20 15:21, Peter Zijlstra wrote:
> On Mon, Jul 20, 2020 at 04:02:24PM +0200, Oleg Nesterov wrote:
>> I have to admit, I do not understand the usage of prev_state in schedule(),
>> it looks really, really subtle...
>
> Right, so commit dbfb089d360 solved a problem where schedule() re-read
> prev->state vs prev->on_rq = 0. That is, schedule()'s dequeue and
> ttwu()'s enqueue disagreed over sched_contributes_to_load, and as a
> result load-accounting went wobbly.
>
> Now, looking at that commit again, I might've solved the problem twice
> :-P
>
> So on the one hand, I provide ordering:
>
> LOAD p->state LOAD-ACQUIRE p->on_rq == 0
> MB
> STORE p->on_rq, 0 STORE p->state, TASK_WAKING
>
> such that ttwu() will only change p->state, after on_rq==0, which is
> after loading p->state in schedule().
>
> At the same time, I also had schedule() set
> p->sched_contributes_to_load once, and then consistently used that value
> throughout, without ever looking at p->state again, which too makes it
> much harder to mess load-avg up.
>
>
> Now, the ordering in schedule() relies on doing the p->state load
> before:
>
> spin_lock(rq->lock)
> smp_mb__after_spinlock();
>
> and doing a re-load check after, with the assumption that if the reload
> is different, it will not block.
>
> That said, in a crossed email, I just proposed we could simplify all
> this like so.. but now I need to go ask people to re-validate that
> loadavg muck again :-/
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a2a244af9a53..437fc3b241f2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt)
> local_irq_disable();
> rcu_note_context_switch(preempt);
>
> - /* See deactivate_task() below. */
> - prev_state = prev->state;
> -
> /*
> * Make sure that signal_pending_state()->signal_pending() below
> * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> @@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt)
> * We must re-load prev->state in case ttwu_remote() changed it
> * before we acquired rq->lock.
> */
> - if (!preempt && prev_state && prev_state == prev->state) {
> + prev_state = prev->state;

AFAICT the ttwu/schedule ordering dance + the "trick" of updating
p->sched_contributes_to_load in __schedule() under rq lock ensures loadavg
accounting won't go awry here. I'm still a bit low on coffee, but that does
LGTM.


> + if (!preempt && prev_state) {
> if (signal_pending_state(prev_state, prev)) {
> prev->state = TASK_RUNNING;
> } else {
> @@ -4237,10 +4235,12 @@ static void __sched notrace __schedule(bool preempt)
>
> /*
> * __schedule() ttwu()
> - * prev_state = prev->state; if (READ_ONCE(p->on_rq) && ...)
> - * LOCK rq->lock goto out;
> - * smp_mb__after_spinlock(); smp_acquire__after_ctrl_dep();
> - * p->on_rq = 0; p->state = TASK_WAKING;
> + * if (prev_state) if (p->on_rq && ...)
> + * p->on_rq = 0; goto out;
> + * smp_acquire__after_ctrl_dep();
> + * p->state = TASK_WAKING
> + *
> + * Where __schedule() and ttwu() have matching control dependencies.
> *
> * After this, schedule() must not care about p->state any more.
> */

2020-07-21 12:13:48

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH] sched: Fix race against ptrace_freeze_trace()


There is apparently one site that violates the rule that only current
and ttwu() will modify task->state, namely ptrace_{,un}freeze_traced()
will change task->state for a remote task.

Oleg explains:

"TASK_TRACED/TASK_STOPPED was always protected by siglock. In
particular, ttwu(__TASK_TRACED) must be always called with siglock
held. That is why ptrace_freeze_traced() assumes it can safely do
s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock)."

This breaks the ordering scheme introduced by commit:

dbfb089d360b ("sched: Fix loadavg accounting race")

Specifically, the reload not matching no longer implies we don't have
to block.

Simplify things by noting that what we need is a LOAD->STORE ordering
and this can be provided by a control dependency.

So replace:

prev_state = prev->state;
raw_spin_lock(&rq->lock);
smp_mb__after_spinlock(); /* SMP-MB */
if (... && prev_state && prev_state == prev->state)
deactivate_task();

with:

prev_state = prev->state;
if (... && prev_state) /* CTRL-DEP */
deactivate_task();

Since that already implies the 'prev->state' load must be complete
before allowing the 'prev->on_rq = 0' store to become visible.

Fixes: dbfb089d360b ("sched: Fix loadavg accounting race")
Reported-by: Jiri Slaby <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Paul Gortmaker <[email protected]>
---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(b
local_irq_disable();
rcu_note_context_switch(preempt);

- /* See deactivate_task() below. */
- prev_state = prev->state;
-
/*
* Make sure that signal_pending_state()->signal_pending() below
* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
@@ -4219,11 +4216,16 @@ static void __sched notrace __schedule(b
update_rq_clock(rq);

switch_count = &prev->nivcsw;
+
/*
- * We must re-load prev->state in case ttwu_remote() changed it
- * before we acquired rq->lock.
+ * We must load prev->state once (task_struct::state is volatile), such
+ * that:
+ *
+ * - we form a control dependency vs deactivate_task() below.
+ * - ptrace_{,un}freeze_traced() can change ->state underneath us.
*/
- if (!preempt && prev_state && prev_state == prev->state) {
+ prev_state = prev->state;
+ if (!preempt && prev_state) {
if (signal_pending_state(prev_state, prev)) {
prev->state = TASK_RUNNING;
} else {
@@ -4237,10 +4239,12 @@ static void __sched notrace __schedule(b

/*
* __schedule() ttwu()
- * prev_state = prev->state; if (READ_ONCE(p->on_rq) && ...)
- * LOCK rq->lock goto out;
- * smp_mb__after_spinlock(); smp_acquire__after_ctrl_dep();
- * p->on_rq = 0; p->state = TASK_WAKING;
+ * if (prev_state) if (p->on_rq && ...)
+ * p->on_rq = 0; goto out;
+ * smp_acquire__after_ctrl_dep();
+ * p->state = TASK_WAKING
+ *
+ * Where __schedule() and ttwu() have matching control dependencies.
*
* After this, schedule() must not care about p->state any more.
*/

2020-07-21 14:32:09

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH] sched: Fix race against ptrace_freeze_trace()

On Tue, Jul 21, 2020 at 02:13:08PM +0200, [email protected] wrote:
>
> There is apparently one site that violates the rule that only current
> and ttwu() will modify task->state, namely ptrace_{,un}freeze_traced()
> will change task->state for a remote task.
>
> Oleg explains:
>
> "TASK_TRACED/TASK_STOPPED was always protected by siglock. In
> particular, ttwu(__TASK_TRACED) must be always called with siglock
> held. That is why ptrace_freeze_traced() assumes it can safely do
> s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock)."
>
> This breaks the ordering scheme introduced by commit:
>
> dbfb089d360b ("sched: Fix loadavg accounting race")
>
> Specifically, the reload not matching no longer implies we don't have
> to block.
>
> Simplify things by noting that what we need is a LOAD->STORE ordering
> and this can be provided by a control dependency.
>
> So replace:
>
> prev_state = prev->state;
> raw_spin_lock(&rq->lock);
> smp_mb__after_spinlock(); /* SMP-MB */
> if (... && prev_state && prev_state == prev->state)
> deactivate_task();
>
> with:
>
> prev_state = prev->state;
> if (... && prev_state) /* CTRL-DEP */
> deactivate_task();
>
> Since that already implies the 'prev->state' load must be complete
> before allowing the 'prev->on_rq = 0' store to become visible.
>
> Fixes: dbfb089d360b ("sched: Fix loadavg accounting race")
> Reported-by: Jiri Slaby <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Tested-by: Paul Gortmaker <[email protected]>
> ---

Thank you. I applied this on top of v5.8-rc6 and re-ran the strace test
suite successfully. So at least

Tested-by: Christian Brauner <[email protected]>

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(b
> local_irq_disable();
> rcu_note_context_switch(preempt);
>
> - /* See deactivate_task() below. */
> - prev_state = prev->state;
> -
> /*
> * Make sure that signal_pending_state()->signal_pending() below
> * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> @@ -4219,11 +4216,16 @@ static void __sched notrace __schedule(b
> update_rq_clock(rq);
>
> switch_count = &prev->nivcsw;
> +
> /*
> - * We must re-load prev->state in case ttwu_remote() changed it
> - * before we acquired rq->lock.
> + * We must load prev->state once (task_struct::state is volatile), such
> + * that:
> + *
> + * - we form a control dependency vs deactivate_task() below.
> + * - ptrace_{,un}freeze_traced() can change ->state underneath us.
> */
> - if (!preempt && prev_state && prev_state == prev->state) {
> + prev_state = prev->state;
> + if (!preempt && prev_state) {
> if (signal_pending_state(prev_state, prev)) {
> prev->state = TASK_RUNNING;
> } else {
> @@ -4237,10 +4239,12 @@ static void __sched notrace __schedule(b
>
> /*
> * __schedule() ttwu()
> - * prev_state = prev->state; if (READ_ONCE(p->on_rq) && ...)
> - * LOCK rq->lock goto out;
> - * smp_mb__after_spinlock(); smp_acquire__after_ctrl_dep();
> - * p->on_rq = 0; p->state = TASK_WAKING;
> + * if (prev_state) if (p->on_rq && ...)
> + * p->on_rq = 0; goto out;
> + * smp_acquire__after_ctrl_dep();
> + * p->state = TASK_WAKING
> + *
> + * Where __schedule() and ttwu() have matching control dependencies.
> *
> * After this, schedule() must not care about p->state any more.
> */

2020-07-21 15:42:02

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] sched: Fix race against ptrace_freeze_trace()

On 07/21, Peter Zijlstra wrote:
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(b
> local_irq_disable();
> rcu_note_context_switch(preempt);
>
> - /* See deactivate_task() below. */
> - prev_state = prev->state;
> -
> /*
> * Make sure that signal_pending_state()->signal_pending() below
> * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> @@ -4219,11 +4216,16 @@ static void __sched notrace __schedule(b
> update_rq_clock(rq);
>
> switch_count = &prev->nivcsw;
> +
> /*
> - * We must re-load prev->state in case ttwu_remote() changed it
> - * before we acquired rq->lock.
> + * We must load prev->state once (task_struct::state is volatile), such
> + * that:
> + *
> + * - we form a control dependency vs deactivate_task() below.
> + * - ptrace_{,un}freeze_traced() can change ->state underneath us.
> */
> - if (!preempt && prev_state && prev_state == prev->state) {
> + prev_state = prev->state;
> + if (!preempt && prev_state) {

Thanks! FWIW,

Acked-by: Oleg Nesterov <[email protected]>