2020-03-27 07:45:58

by kernel test robot

Subject: 6d25be5782 ("sched/core, workqueues: Distangle worker .."): [ 52.816697] WARNING: CPU: 0 PID: 14 at kernel/workqueue.c:882 wq_worker_sleeping

Greetings,

0day kernel testing robot got the below dmesg and the first bad commit is

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master

commit 6d25be5782e482eb93e3de0c94d0a517879377d0
Author: Thomas Gleixner <[email protected]>
AuthorDate: Wed Mar 13 17:55:48 2019 +0100
Commit: Ingo Molnar <[email protected]>
CommitDate: Tue Apr 16 16:55:15 2019 +0200

sched/core, workqueues: Distangle worker accounting from rq lock

The worker accounting for CPU bound workers is plugged into the core
scheduler code and the wakeup code. This is not a hard requirement and
can be avoided by keeping track of the state in the workqueue code
itself.

Keep track of the sleeping state in the worker itself and call the
notifier before entering the core scheduler. There might be false
positives when the task is woken between that call and actually
scheduling, but that's not really different from scheduling and being
woken immediately after switching away. When nr_running is updated when
the task is returning from schedule() then it is later compared when it
is done from ttwu().

[ bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira ]

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/ad2b29b5715f970bffc1a7026cabd6ff0b24076a.1532952814.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <[email protected]>

e2abb39811 sched/fair: Remove unneeded prototype of capacity_of()
6d25be5782 sched/core, workqueues: Distangle worker accounting from rq lock
9efcc4a129 afs: Fix unpinned address list during probing
+--------------------------------------------------------------------------------+------------+------------+------------+
| | e2abb39811 | 6d25be5782 | 9efcc4a129 |
+--------------------------------------------------------------------------------+------------+------------+------------+
| boot_successes | 997 | 806 | 27 |
| boot_failures | 112 | 104 | 9 |
| WARNING:at_kernel/rcu/rcutorture.c:#rcu_torture_fwd_prog | 61 | 52 | 2 |
| RIP:rcu_torture_fwd_prog | 63 | 57 | 2 |
| BUG:workqueue_lockup-pool | 12 | 16 | 1 |
| BUG:kernel_hang_in_early-boot_stage,last_printk:early_console_in_setup_code | 14 | 13 | 6 |
| BUG:kernel_timeout_in_early-boot_stage,last_printk:early_console_in_setup_code | 5 | | |
| Initiating_system_reboot | 1 | | |
| BUG:spinlock_recursion_on_CPU | 10 | 6 | |
| RIP:rcu_torture_one_read | 2 | | |
| INFO:rcu_preempt_self-detected_stall_on_CPU | 1 | 3 | |
| RIP:iov_iter_copy_from_user_atomic | 1 | 3 | |
| BUG:kernel_hang_in_boot_stage | 2 | | |
| RIP:__schedule | 4 | 2 | |
| BUG:kernel_hang_in_test_stage | 2 | | |
| RIP:rcutorture_seq_diff | 3 | | |
| BUG:spinlock_wrong_owner_on_CPU | 1 | | |
| BUG:workqueue_leaked_lock_or_atomic:rcu_torture_sta | 1 | | |
| BUG:sleeping_function_called_from_invalid_context_at_kernel/workqueue.c | 1 | | |
| WARNING:at_kernel/workqueue.c:#worker_set_flags | 1 | | |
| RIP:worker_set_flags | 1 | | |
| BUG:scheduling_while_atomic | 1 | | |
| BUG:unable_to_handle_kernel | 2 | 1 | |
| Oops:#[##] | 2 | 1 | |
| Kernel_panic-not_syncing:Fatal_exception | 6 | 1 | |
| kernel_BUG_at_lib/list_debug.c | 4 | | |
| invalid_opcode:#[##] | 4 | | |
| RIP:__list_add_valid | 4 | | |
| RIP:__rb_erase_color | 1 | | |
| INFO:rcu_preempt_detected_stalls_on_CPUs/tasks | 1 | 7 | |
| RIP:rcu_read_delay | 1 | 1 | |
| RIP:kvm_async_pf_task_wait | 1 | | |
| RIP:d_walk | 1 | | |
| WARNING:at_kernel/workqueue.c:#wq_worker_sleeping | 0 | 8 | 1 |
| RIP:wq_worker_sleeping | 0 | 8 | 1 |
| RIP:kthread_data | 0 | 2 | |
| RIP:to_kthread | 0 | 1 | |
| RIP:wq_worker_running | 0 | 5 | 1 |
| WARNING:at_kernel/workqueue.c:#worker_enter_idle | 0 | 3 | 1 |
| RIP:worker_enter_idle | 0 | 3 | 1 |
| BUG:kernel_timeout_in_boot_stage | 0 | 1 | |
| BUG:soft_lockup-CPU##stuck_for#s![rcu_torture_fak:#] | 0 | 1 | |
| RIP:preempt_count_equals | 0 | 2 | |
| Kernel_panic-not_syncing:softlockup:hung_tasks | 0 | 2 | |
| BUG:soft_lockup-CPU##stuck_for#s![rcu_torture_rea:#] | 0 | 1 | |
| RIP:ftrace_stub | 0 | 1 | |
| RIP:ftrace_likely_update | 0 | 3 | |
| RIP:simple_write_end | 0 | 1 | |
| RIP:delay_tsc | 0 | 2 | |
| RIP:check_preemption_disabled | 0 | 1 | |
+--------------------------------------------------------------------------------+------------+------------+------------+

If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>

[ 5.924439] igt_debug 0x0000000000000200-0x0000000000000600: 1024: used
[ 5.924883] igt_debug 0x0000000000000600-0x0000000000000a00: 1024: free
[ 5.925327] igt_debug 0x0000000000000a00-0x0000000000000e00: 1024: used
[ 5.925806] igt_debug 0x0000000000000e00-0x0000000000001000: 512: free
[ 5.926286] igt_debug total: 4096, used 2048 free 2048
[ 52.816697] WARNING: CPU: 0 PID: 14 at kernel/workqueue.c:882 wq_worker_sleeping+0x72/0x136
[ 52.821963] Modules linked in:
[ 52.821963] CPU: 0 PID: 14 Comm: kworker/0:1 Not tainted 5.1.0-rc3-00024-g6d25be5782e48 #1
[ 52.821963] Workqueue: rcu_gp wait_rcu_exp_gp
[ 52.821963] RIP: 0010:wq_worker_sleeping+0x72/0x136
[ 52.821963] Code: 48 8b 58 40 45 85 ed 41 0f 95 c6 31 c9 31 d2 44 89 f6 e8 c5 3f 0d 00 48 ff 05 71 aa 59 03 45 85 ed 74 17 48 ff 05 6d aa 59 03 <0f> 0b 48 ff 05 6c aa 59 03 48 ff 05 6d aa 59 03 31 c9 31 d2 44 89
[ 52.821963] RSP: 0000:ffffa66380073b78 EFLAGS: 00010202
[ 52.821963] RAX: ffff97fe37b08de0 RBX: ffffffff9c2ae960 RCX: 0000000000000000
[ 52.821963] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff9cd4e028
[ 52.821963] RBP: ffffa66380073b98 R08: 0000000000000000 R09: 0000000000000000
[ 52.821963] R10: ffffa66380073c90 R11: 0000000000000000 R12: ffff97fe37b08de0
[ 52.821963] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000005000
[ 52.821963] FS: 0000000000000000(0000) GS:ffffffff9c250000(0000) knlGS:0000000000000000
[ 52.821963] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 52.821963] CR2: 0000000000005000 CR3: 00000001b741a000 CR4: 00000000000006b0
[ 52.821963] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 52.821963] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 52.821963] Call Trace:
[ 52.821963] schedule+0x7a/0x1c0
[ 52.821963] kvm_async_pf_task_wait+0x253/0x300
[ 52.821963] do_async_page_fault+0xbf/0x141
[ 52.821963] ? kvm_async_pf_task_wait+0x5/0x300
[ 52.821963] ? do_async_page_fault+0xbf/0x141
[ 52.821963] async_page_fault+0x1e/0x30
[ 52.821963] RIP: 0010:to_kthread+0x0/0x81
[ 52.821963] Code: 03 48 ff 05 44 b0 59 03 89 de 31 c9 31 d2 48 c7 c7 a8 ea d4 9c e8 9f ec 0c 00 48 ff 05 33 b0 59 03 5b 41 5c 41 5d 41 5e 5d c3 <55> 48 ff 05 a2 a8 59 03 48 89 e5 41 55 41 54 53 45 31 ed 49 89 fc
[ 52.821963] RSP: 0000:ffffa66380073d68 EFLAGS: 00010206
[ 52.821963] RAX: 0000000000000000 RBX: ffff97fe37b52000 RCX: 0000000000000000
[ 52.821963] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff97fe37b52000
[ 52.821963] RBP: ffffa66380073d70 R08: 0000000000000000 R09: ffffffff9c2c0a18
[ 52.821963] R10: ffff97fe37b52060 R11: 0000000000000000 R12: 0000000000000000
[ 52.821963] R13: ffffffff9cd51418 R14: 0000000000002710 R15: 0000000000000000
[ 52.821963] ? kthread_data+0x15/0x22
[ 52.821963] wq_worker_running+0x15/0x53
[ 52.821963] schedule+0x1ab/0x1c0
[ 52.821963] schedule_timeout+0xd6/0x10a
[ 52.821963] ? init_timer_key+0x41/0x41
[ 52.821963] rcu_exp_sel_wait_wake+0x58c/0xae2
[ 52.821963] wait_rcu_exp_gp+0x19/0x22
[ 52.821963] process_one_work+0x387/0x5eb
[ 52.821963] ? worker_thread+0x4d2/0x5e3
[ 52.821963] worker_thread+0x401/0x5e3
[ 52.821963] ? rescuer_thread+0x492/0x492
[ 52.821963] kthread+0x19b/0x1aa
[ 52.821963] ? kthread_flush_work+0x175/0x175
[ 52.821963] ret_from_fork+0x3a/0x50
[ 52.821963] random: get_random_bytes called from init_oops_id+0x2d/0x4c with crng_init=0
[ 52.821963] ---[ end trace 0be0a8f9c0f268e1 ]---
[ 64.886064] rcu-torture: rtc: (____ptrval____) ver: 324 tfle: 0 rta: 324 rtaf: 0 rtf: 315 rtmbe: 0 rtbe: 0 rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 nt: 101 barrier: 0/0:0

# HH:MM RESULT GOOD BAD GOOD_BUT_DIRTY DIRTY_NOT_BAD
git bisect start v5.2 v5.1 --
git bisect bad 2c1212de6f9794a7becba5f219fa6ce8a8222c90 # 23:32 B 104 1 6 273 Merge tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
git bisect bad 06cbd26d312edfe4a83ff541c23f8f866265eb24 # 00:18 B 26 1 0 0 Merge tag 'nfs-for-5.2-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
git bisect bad 9f2e3a53f7ec9ef55e9d01bc29a6285d291c151e # 05:24 B 269 1 20 20 Merge tag 'for-5.2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
git bisect bad ba3934de557a2074b89d67e29e5ce119f042d057 # 08:23 B 131 1 11 11 Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 90489a72fba9529c85e051067ecb41183b8e982e # 16:37 G 900 0 74 74 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 82ac4043cac5d75d8cda79bc8a095f8306f35c75 # 17:45 B 245 1 22 22 Merge branch 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad a0e928ed7c603a47dca8643e58db224a799ff2c5 # 21:01 B 315 1 22 22 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad e00d4135751bfe786a9e26b5560b185ce3f9f963 # 23:36 B 193 1 16 16 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad bee9853932e90ce94bce4276ec6b7b06bc48070b # 02:33 B 262 1 24 24 sched/core: Fix typo in comment
git bisect good d8743230c9f4e92f370ecd2a90c680ddcede6ae5 # 08:53 G 902 0 97 97 sched/topology: Fix build_sched_groups() comment
git bisect bad 6d25be5782e482eb93e3de0c94d0a517879377d0 # 10:00 B 62 1 12 12 sched/core, workqueues: Distangle worker accounting from rq lock
git bisect good e2abb398115e9c33f3d1e25bf6d1d08badc58b13 # 23:19 G 900 0 89 89 sched/fair: Remove unneeded prototype of capacity_of()
# first bad commit: [6d25be5782e482eb93e3de0c94d0a517879377d0] sched/core, workqueues: Distangle worker accounting from rq lock
git bisect good e2abb398115e9c33f3d1e25bf6d1d08badc58b13 # 02:02 G 1000 0 21 110 sched/fair: Remove unneeded prototype of capacity_of()
# extra tests with debug options
git bisect good 6d25be5782e482eb93e3de0c94d0a517879377d0 # 09:37 G 900 0 147 147 sched/core, workqueues: Distangle worker accounting from rq lock
# extra tests on head commit of linus/master
git bisect bad 9efcc4a129363187c9bf15338692f107c5c9b6f0 # 10:25 B 30 1 7 7 afs: Fix unpinned address list during probing
# bad: [9efcc4a129363187c9bf15338692f107c5c9b6f0] afs: Fix unpinned address list during probing
# extra tests on linus/master
# duplicated: [9efcc4a129363187c9bf15338692f107c5c9b6f0] afs: Fix unpinned address list during probing
# extra tests on linux-next/master
# 119: [89295c59c1f063b533d071ca49d0fa0c0783ca6f] Add linux-next specific files for 20200326

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (15.54 kB)
dmesg-quantal-vm-quantal-13:20200326101641:x86_64-randconfig-f001-20200301:5.1.0-rc3-00024-g6d25be5782e48:1.gz (20.58 kB)
dmesg-quantal-vm-quantal-10:20200326113507:x86_64-randconfig-f001-20200301:5.1.0-rc3-00023-ge2abb398115e9c:1.gz (21.18 kB)
reproduce-quantal-vm-quantal-13:20200326101641:x86_64-randconfig-f001-20200301:5.1.0-rc3-00024-g6d25be5782e48:1 (979.00 B)
3285ffc82e1fb00caabde46c6bed6cb262b4d498:gcc-7:x86_64-randconfig-f001-20200301:WARNING:at_kernel_workqueue.c:_wq_worker_sleeping.xz (27.86 kB)
config-5.1.0-rc3-00024-g6d25be5782e48 (148.15 kB)
Subject: [PATCH] workqueue: Don't double assign worker->sleeping

The kernel test robot triggered a warning with the following race:
  task-ctx                              interrupt-ctx
   worker
    -> process_one_work()
      -> work_item()
        -> schedule();
          -> sched_submit_work()
            -> wq_worker_sleeping()
              -> ->sleeping = 1
                atomic_dec_and_test(nr_running)
          __schedule();                 *interrupt*
                                         async_page_fault()
                                         -> local_irq_enable();
                                         -> schedule();
                                           -> sched_submit_work()
                                             -> wq_worker_sleeping()
                                               -> if (WARN_ON(->sleeping)) return
                                           -> __schedule()
                                             -> sched_update_worker()
                                               -> wq_worker_running()
                                                 -> atomic_inc(nr_running);
                                                 -> ->sleeping = 0;

          -> sched_update_worker()
            -> wq_worker_running()
                 if (!->sleeping) return

In this context the warning is pointless; everything is fine.

However, if the interrupt occurs in wq_worker_sleeping() between reading and
setting `sleeping' i.e.

| if (WARN_ON_ONCE(worker->sleeping))
| return;
*interrupt*
| worker->sleeping = 1;

then pool->nr_running will be decremented twice in wq_worker_sleeping()
but it will be incremented only once in wq_worker_running().

Replace the assignment of `sleeping' with a cmpxchg_local() to ensure
that there is no double assignment of the variable. The variable is only
accessed from the local CPU. Remove the WARN statement because this
condition can be valid.

An alternative would be to move `->sleeping' to `->flags' as a new bit,
but this would require acquiring the pool->lock in wq_worker_running().

Fixes: 6d25be5782e48 ("sched/core, workqueues: Distangle worker accounting from rq lock")
Link: https://lkml.kernel.org/r/20200327074308.GY11705@shao2-debian
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
kernel/workqueue.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4e01c448b4b48..dc477a2a3ce30 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -846,11 +846,10 @@ void wq_worker_running(struct task_struct *task)
 {
         struct worker *worker = kthread_data(task);
 
-        if (!worker->sleeping)
+        if (cmpxchg_local(&worker->sleeping, 1, 0) == 0)
                 return;
         if (!(worker->flags & WORKER_NOT_RUNNING))
                 atomic_inc(&worker->pool->nr_running);
-        worker->sleeping = 0;
 }
 
 /**
@@ -875,10 +874,9 @@ void wq_worker_sleeping(struct task_struct *task)
 
         pool = worker->pool;
 
-        if (WARN_ON_ONCE(worker->sleeping))
+        if (cmpxchg_local(&worker->sleeping, 0, 1) == 1)
                 return;
 
-        worker->sleeping = 1;
         spin_lock_irq(&pool->lock);
 
         /*
--
2.26.0

Subject: [PATCH v2] workqueue: Remove the warning in wq_worker_sleeping()

The kernel test robot triggered a warning with the following race:
  task-ctx A                            interrupt-ctx B
   worker
    -> process_one_work()
      -> work_item()
        -> schedule();
          -> sched_submit_work()
            -> wq_worker_sleeping()
              -> ->sleeping = 1
                atomic_dec_and_test(nr_running)
          __schedule();                 *interrupt*
                                         async_page_fault()
                                         -> local_irq_enable();
                                         -> schedule();
                                           -> sched_submit_work()
                                             -> wq_worker_sleeping()
                                               -> if (WARN_ON(->sleeping)) return
                                           -> __schedule()
                                             -> sched_update_worker()
                                               -> wq_worker_running()
                                                 -> atomic_inc(nr_running);
                                                 -> ->sleeping = 0;

          -> sched_update_worker()
            -> wq_worker_running()
                 if (!->sleeping) return

In this context the warning is pointless; everything is fine.
An interrupt before wq_worker_sleeping() will perform the ->sleeping
assignment (0 -> 1 -> 0) twice.
An interrupt after wq_worker_sleeping() will trigger the warning and
nr_running will be decremented (by A) and incremented once (only by B, A
will skip it). This is the case until the ->sleeping is zeroed again in
wq_worker_running().

Remove the WARN statement because this condition may happen. Document
that preemption around wq_worker_sleeping() needs to be disabled to
protect ->sleeping and not just as an optimisation.

Fixes: 6d25be5782e48 ("sched/core, workqueues: Distangle worker accounting from rq lock")
Link: https://lkml.kernel.org/r/20200327074308.GY11705@shao2-debian
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---

v1…v2: - Drop the warning instead of using cmpxchg_local().
Tglx pointed out that wq_worker_sleeping() is already invoked
with disabled preemption so the race described can not happen.

- Document that it is required to have preemption disabled within
wq_worker_sleeping() to protect ->sleeping (against the race
described in v1).

kernel/sched/core.c | 3 ++-
kernel/workqueue.c | 6 ++++--
2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a9983da4408d..7181ea73e5566 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4112,7 +4112,8 @@ static inline void sched_submit_work(struct task_struct *tsk)
          * it wants to wake up a task to maintain concurrency.
          * As this function is called inside the schedule() context,
          * we disable preemption to avoid it calling schedule() again
-         * in the possible wakeup of a kworker.
+         * in the possible wakeup of a kworker and because wq_worker_sleeping()
+         * requires it.
          */
         if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
                 preempt_disable();
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4e01c448b4b48..39b968146f5f7 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -858,7 +858,8 @@ void wq_worker_running(struct task_struct *task)
  * @task: task going to sleep
  *
  * This function is called from schedule() when a busy worker is
- * going to sleep.
+ * going to sleep. Preemption needs to be disabled to protect ->sleeping
+ * assignment.
  */
 void wq_worker_sleeping(struct task_struct *task)
 {
@@ -875,7 +876,8 @@ void wq_worker_sleeping(struct task_struct *task)
 
         pool = worker->pool;
 
-        if (WARN_ON_ONCE(worker->sleeping))
+        /* Return if preempted before wq_worker_running() was reached */
+        if (worker->sleeping)
                 return;
 
         worker->sleeping = 1;
--
2.26.0

2020-04-01 03:25:49

by Lai Jiangshan

Subject: Re: [PATCH] workqueue: Don't double assign worker->sleeping

Hello

If I haven't missed any of the issues you have listed, it is a good and
straightforward fix, but I have a concern that cmpxchg_local() might have a
performance impact on some non-x86 arches.

The two issues, as you have listed:
1) WARN_ON_ONCE() on a valid condition when interrupted (async page fault)
2) wq_worker_running() can be interrupted (async page fault in a virtual
machine) and nr_running would be decreased twice.

For fixing issue one, we can just remove the WARN_ON_ONCE() as this patch
does. For fixing issue two, ->sleeping in wq_worker_running() can be checked
and modified with interrupts disabled. (We can't use a preempt-disabled
context here.)

thanks,
Lai

On Sat, Mar 28, 2020 at 1:53 AM Sebastian Andrzej Siewior
<[email protected]> wrote:
>
> The kernel test robot triggered a warning with the following race:
>   task-ctx                              interrupt-ctx
>    worker
>     -> process_one_work()
>       -> work_item()
>         -> schedule();
>           -> sched_submit_work()
>             -> wq_worker_sleeping()
>               -> ->sleeping = 1
>                 atomic_dec_and_test(nr_running)
>           __schedule();                 *interrupt*
>                                          async_page_fault()
>                                          -> local_irq_enable();
>                                          -> schedule();
>                                            -> sched_submit_work()
>                                              -> wq_worker_sleeping()
>                                                -> if (WARN_ON(->sleeping)) return
>                                            -> __schedule()
>                                              -> sched_update_worker()
>                                                -> wq_worker_running()
>                                                  -> atomic_inc(nr_running);
>                                                  -> ->sleeping = 0;
>
>           -> sched_update_worker()
>             -> wq_worker_running()
>                  if (!->sleeping) return
>
> In this context the warning is pointless everything is fine.
>
> However, if the interrupt occurs in wq_worker_sleeping() between reading and
> setting `sleeping' i.e.
>
> | if (WARN_ON_ONCE(worker->sleeping))
> | return;
> *interrupt*
> | worker->sleeping = 1;
>
> then pool->nr_running will be decremented twice in wq_worker_sleeping()
> but it will be incremented only once in wq_worker_running().
>
> Replace the assignment of `sleeping' with a cmpxchg_local() to ensure
> that there is no double assignment of the variable. The variable is only
> accessed from the local CPU. Remove the WARN statement because this
> condition can be valid.
>
> An alternative would be to move `->sleeping' to `->flags' as a new bit
> but this would require to acquire the pool->lock in wq_worker_running().
>
> Fixes: 6d25be5782e48 ("sched/core, workqueues: Distangle worker accounting from rq lock")
> Link: https://lkml.kernel.org/r/20200327074308.GY11705@shao2-debian
> Reported-by: kernel test robot <[email protected]>
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> ---
> kernel/workqueue.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 4e01c448b4b48..dc477a2a3ce30 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -846,11 +846,10 @@ void wq_worker_running(struct task_struct *task)
>  {
>          struct worker *worker = kthread_data(task);
> 
> -        if (!worker->sleeping)
> +        if (cmpxchg_local(&worker->sleeping, 1, 0) == 0)
>                  return;
>          if (!(worker->flags & WORKER_NOT_RUNNING))
>                  atomic_inc(&worker->pool->nr_running);
> -        worker->sleeping = 0;
>  }
> 
>  /**
> @@ -875,10 +874,9 @@ void wq_worker_sleeping(struct task_struct *task)
> 
>          pool = worker->pool;
> 
> -        if (WARN_ON_ONCE(worker->sleeping))
> +        if (cmpxchg_local(&worker->sleeping, 0, 1) == 1)
>                  return;
> 
> -        worker->sleeping = 1;
>          spin_lock_irq(&pool->lock);
> 
>          /*
> --
> 2.26.0
>

2020-04-01 03:44:51

by Lai Jiangshan

Subject: Re: [PATCH] workqueue: Don't double assign worker->sleeping

On Wed, Apr 1, 2020 at 11:22 AM Lai Jiangshan <[email protected]> wrote:
>
> Hello
>
> If I don't miss all the issues you have listed, it is a good and straightforward
> fix, but I have concern that cmpxchg_local() might have performance impact
> on some non-x86 arch.
>
> The two issues as you have listed:
> 1) WARN_ON_ONCE() on valid condition when interrupted(async-page-faulted)
> 2) wq_worker_running() can be interrupted(async-page-faulted in virtual machine)
> and nr_running would be decreased twice.

would be *increased* twice

I just saw the V2 patch; this issue is not listed there, but it needs to be fixed too.

>
> For fixing issue one, we can just remove WARN_ON_ONCE() as this patch.
> For fixing issue two, ->sleeping in wq_worker_running() can be checked&modified
> under irq-disabled. (we can't use preempt-disabled context here)
>
> thanks,
> Lai
>
> On Sat, Mar 28, 2020 at 1:53 AM Sebastian Andrzej Siewior
> <[email protected]> wrote:
> >
> > The kernel test robot triggered a warning with the following race:
> >   task-ctx                              interrupt-ctx
> >    worker
> >     -> process_one_work()
> >       -> work_item()
> >         -> schedule();
> >           -> sched_submit_work()
> >             -> wq_worker_sleeping()
> >               -> ->sleeping = 1
> >                 atomic_dec_and_test(nr_running)
> >           __schedule();                 *interrupt*
> >                                          async_page_fault()
> >                                          -> local_irq_enable();
> >                                          -> schedule();
> >                                            -> sched_submit_work()
> >                                              -> wq_worker_sleeping()
> >                                                -> if (WARN_ON(->sleeping)) return
> >                                            -> __schedule()
> >                                              -> sched_update_worker()
> >                                                -> wq_worker_running()
> >                                                  -> atomic_inc(nr_running);
> >                                                  -> ->sleeping = 0;
> >
> >           -> sched_update_worker()
> >             -> wq_worker_running()
> >                  if (!->sleeping) return
> >
> > In this context the warning is pointless everything is fine.
> >
> > However, if the interrupt occurs in wq_worker_sleeping() between reading and
> > setting `sleeping' i.e.
> >
> > | if (WARN_ON_ONCE(worker->sleeping))
> > | return;
> > *interrupt*
> > | worker->sleeping = 1;
> >
> > then pool->nr_running will be decremented twice in wq_worker_sleeping()
> > but it will be incremented only once in wq_worker_running().
> >
> > Replace the assignment of `sleeping' with a cmpxchg_local() to ensure
> > that there is no double assignment of the variable. The variable is only
> > accessed from the local CPU. Remove the WARN statement because this
> > condition can be valid.
> >
> > An alternative would be to move `->sleeping' to `->flags' as a new bit
> > but this would require to acquire the pool->lock in wq_worker_running().
> >
> > Fixes: 6d25be5782e48 ("sched/core, workqueues: Distangle worker accounting from rq lock")
> > Link: https://lkml.kernel.org/r/20200327074308.GY11705@shao2-debian
> > Reported-by: kernel test robot <[email protected]>
> > Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> > ---
> > kernel/workqueue.c | 6 ++----
> > 1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 4e01c448b4b48..dc477a2a3ce30 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -846,11 +846,10 @@ void wq_worker_running(struct task_struct *task)
> >  {
> >          struct worker *worker = kthread_data(task);
> > 
> > -        if (!worker->sleeping)
> > +        if (cmpxchg_local(&worker->sleeping, 1, 0) == 0)
> >                  return;
> >          if (!(worker->flags & WORKER_NOT_RUNNING))
> >                  atomic_inc(&worker->pool->nr_running);
> > -        worker->sleeping = 0;
> >  }
> > 
> >  /**
> > @@ -875,10 +874,9 @@ void wq_worker_sleeping(struct task_struct *task)
> > 
> >          pool = worker->pool;
> > 
> > -        if (WARN_ON_ONCE(worker->sleeping))
> > +        if (cmpxchg_local(&worker->sleeping, 0, 1) == 1)
> >                  return;
> > 
> > -        worker->sleeping = 1;
> >          spin_lock_irq(&pool->lock);
> > 
> >          /*
> > --
> > 2.26.0
> >

Subject: Re: [PATCH] workqueue: Don't double assign worker->sleeping

On 2020-04-01 11:44:06 [+0800], Lai Jiangshan wrote:
> On Wed, Apr 1, 2020 at 11:22 AM Lai Jiangshan <[email protected]> wrote:
> >
> > Hello
Hi Lai,


> > 2) wq_worker_running() can be interrupted(async-page-faulted in virtual machine)
> > and nr_running would be decreased twice.
>
> would be *increased* twice
>
> I just saw the V2 patch, this issue is not listed, but need to be fixed too.

| void wq_worker_running(struct task_struct *task)
| {
|         struct worker *worker = kthread_data(task);
|
|         if (!worker->sleeping)
|                 return;
|         if (!(worker->flags & WORKER_NOT_RUNNING))
|                 atomic_inc(&worker->pool->nr_running);
  *0
|         worker->sleeping = 0;
  *1
| }

So an interrupt
- before *0: the preempting caller drops out early in wq_worker_sleeping(),
so there is only one atomic_inc()

- after *1: the preempting task will invoke wq_worker_sleeping() and do a
dec() + inc().

What did I miss here?

Sebastian

2020-04-02 00:08:53

by Lai Jiangshan

Subject: Re: [PATCH] workqueue: Don't double assign worker->sleeping

On Wed, Apr 1, 2020 at 9:03 PM Sebastian Andrzej Siewior
<[email protected]> wrote:
>
> On 2020-04-01 11:44:06 [+0800], Lai Jiangshan wrote:
> > On Wed, Apr 1, 2020 at 11:22 AM Lai Jiangshan <[email protected]> wrote:
> > >
> > > Hello
> Hi Lai,
>
> …
> > > 2) wq_worker_running() can be interrupted(async-page-faulted in virtual machine)
> > > and nr_running would be decreased twice.
> >
> > would be *increased* twice
> >
> > I just saw the V2 patch, this issue is not listed, but need to be fixed too.
>
> | void wq_worker_running(struct task_struct *task)
> | {
> |         struct worker *worker = kthread_data(task);
> |
> |         if (!worker->sleeping)
> |                 return;
> |         if (!(worker->flags & WORKER_NOT_RUNNING))
> |                 atomic_inc(&worker->pool->nr_running);
>   *0
> |         worker->sleeping = 0;
>   *1
> | }
>
> So an interrupt
> - before *0, the preempting caller drop early in wq_worker_sleeping(), only one
> atomic_inc()

If it is preempted at *0, the preempting caller drops out early in
wq_worker_sleeping(), so there is no atomic decrement, only one
atomic_inc() in the preempting caller. The preempted point here,
wq_worker_running(), has already just done its atomic_inc(), so the
total number of atomic_inc() calls is two while the number of atomic
decrements is one.

>
> - after *1, the preempting task will invoke wq_worker_sleeping() and do
> dec() + inc().
>
> What did I miss here?
>
> Sebastian

Subject: Re: [PATCH] workqueue: Don't double assign worker->sleeping

On 2020-04-02 08:07:35 [+0800], Lai Jiangshan wrote:
> > > would be *increased* twice
> > >
> > > I just saw the V2 patch, this issue is not listed, but need to be fixed too.
> >
> > | void wq_worker_running(struct task_struct *task)
> > | {
> > |         struct worker *worker = kthread_data(task);
> > |
> > |         if (!worker->sleeping)
> > |                 return;
> > |         if (!(worker->flags & WORKER_NOT_RUNNING))
> > |                 atomic_inc(&worker->pool->nr_running);
> >   *0
> > |         worker->sleeping = 0;
> >   *1
> > | }
> >
> > So an interrupt
> > - before *0, the preempting caller drop early in wq_worker_sleeping(), only one
> > atomic_inc()
>
> If it is preempted on *0, the preempting caller drop early in
> wq_worker_sleeping()
> so there is no atomic decreasing, only one atomic_inc() in the
> preempting caller.
> The preempted point here, wq_worker_running(), has already just done
> atomic_inc(),
> the total number of atomic_inc() is two, while the number of atomic decreasing
> is one.

But in order to look at the same worker->sleeping it has to be the same
`task'. This cannot happen because the `worker' assignment is
per-thread.

Sebastian

2020-04-03 15:45:42

by Tejun Heo

Subject: Re: [PATCH v2] workqueue: Remove the warning in wq_worker_sleeping()

Hello,

On Sat, Mar 28, 2020 at 12:29:59AM +0100, Sebastian Andrzej Siewior wrote:
> The kernel test robot triggered a warning with the following race:
>   task-ctx A                            interrupt-ctx B
>    worker
>     -> process_one_work()
>       -> work_item()
>         -> schedule();
>           -> sched_submit_work()
>             -> wq_worker_sleeping()
>               -> ->sleeping = 1
>                 atomic_dec_and_test(nr_running)
>           __schedule();                 *interrupt*
>                                          async_page_fault()
>                                          -> local_irq_enable();
>                                          -> schedule();
>                                            -> sched_submit_work()
>                                              -> wq_worker_sleeping()
>                                                -> if (WARN_ON(->sleeping)) return
>                                            -> __schedule()
>                                              -> sched_update_worker()
>                                                -> wq_worker_running()
>                                                  -> atomic_inc(nr_running);
>                                                  -> ->sleeping = 0;
>
>           -> sched_update_worker()
>             -> wq_worker_running()
>                  if (!->sleeping) return
>
> In this context the warning is pointless everything is fine.

This is not a usual control flow, right? Can we annotate this case
specifically instead of weakening the sanity check for generic cases?

Thanks.

--
tejun

2020-04-03 17:50:18

by Daniel Jordan

Subject: Re: [PATCH v2] workqueue: Remove the warning in wq_worker_sleeping()

Hi Sebastian,

On Sat, Mar 28, 2020 at 12:29:59AM +0100, Sebastian Andrzej Siewior wrote:
> v1…v2: - Drop the warning instead of using cmpxchg_local().
> Tglx pointed out that wq_worker_sleeping() is already invoked
> with disabled preemption so the race described can not happen.

I guess you mean this race:

> However, if the interrupt occurs in wq_worker_sleeping() between reading and
> setting `sleeping' i.e.
>
> | if (WARN_ON_ONCE(worker->sleeping))
> | return;
> *interrupt*
> | worker->sleeping = 1;
>
> then pool->nr_running will be decremented twice in wq_worker_sleeping()
> but it will be incremented only once in wq_worker_running().

Why would preemption prevent it? Interrupts are still enabled.

What am I missing? :-)

Subject: Re: [PATCH v2] workqueue: Remove the warning in wq_worker_sleeping()

On 2020-04-03 13:45:00 [-0400], Daniel Jordan wrote:
> Why would preemption prevent it? Interrupts are still enabled.
>
> What am I missing? :-)

Preemption is disabled, which means no other task is allowed to be
scheduled.

Sebastian

2020-04-03 19:06:43

by Daniel Jordan

Subject: Re: [PATCH v2] workqueue: Remove the warning in wq_worker_sleeping()

On Fri, Apr 03, 2020 at 08:25:02PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-04-03 13:45:00 [-0400], Daniel Jordan wrote:
> > Why would preemption prevent it? Interrupts are still enabled.
> >
> > What am I missing? :-)
>
> Preemption is disabled which means no other task is allowed to be
> scheduled.

Aha, "preempt_count() > 1" in kvm_async_pf_task_wait().

Thanks.

Subject: Re: [PATCH v2] workqueue: Remove the warning in wq_worker_sleeping()

On 2020-04-03 10:53:26 [-0400], Tejun Heo wrote:
> Hello,
Hello Tejun,

> This is not a usual control flow, right?

The worker is blocked on something, and while invoking schedule() it
needs to be preempted by an interrupt which enables interrupts and
invokes schedule() as well.
Interrupt handlers are not allowed to enable interrupts in general.
Page-fault handlers do so, but they only happen while the calling
context triggered a page fault, and since the kernel is always mapped
that does not happen here.
The async page fault handler is the one exception here. I don't find
anything else while looking over x86's idtentry. So yes, this makes it a
highly unusual control flow.

> Can we annotate this case specifically
> instead of weakening the sanity check for generic cases?

puh. So if this
do_async_page_fault() -> do_page_fault()

never happens but only
do_async_page_fault() -> kvm_async_pf_task_wait()

then kvm_async_pf_task_wait() could invoke preempt_schedule() instead.
This would avoid the worker part (and the warning) but is only available
for PREEMPTION kernels. And I think the former case might happen.

> Thanks.
>

Sebastian

Subject: [tip: sched/urgent] workqueue: Remove the warning in wq_worker_sleeping()

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 62849a9612924a655c67cf6962920544aa5c20db
Gitweb: https://git.kernel.org/tip/62849a9612924a655c67cf6962920544aa5c20db
Author: Sebastian Andrzej Siewior <[email protected]>
AuthorDate: Sat, 28 Mar 2020 00:29:59 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Wed, 08 Apr 2020 11:35:20 +02:00

workqueue: Remove the warning in wq_worker_sleeping()

The kernel test robot triggered a warning with the following race:
  task-ctx A                         interrupt-ctx B
 worker
  -> process_one_work()
    -> work_item()
      -> schedule();
         -> sched_submit_work()
            -> wq_worker_sleeping()
               -> ->sleeping = 1
                atomic_dec_and_test(nr_running)
         __schedule();               *interrupt*
                                     async_page_fault()
                                     -> local_irq_enable();
                                     -> schedule();
                                        -> sched_submit_work()
                                           -> wq_worker_sleeping()
                                              -> if (WARN_ON(->sleeping)) return
                                        -> __schedule()
                                        -> sched_update_worker()
                                           -> wq_worker_running()
                                              -> atomic_inc(nr_running);
                                              -> ->sleeping = 0;

    -> sched_update_worker()
       -> wq_worker_running()
          if (!->sleeping) return

In this context the warning is pointless; everything is fine.
An interrupt before wq_worker_sleeping() will perform the ->sleeping
assignment (0 -> 1 -> 0) twice.
An interrupt after wq_worker_sleeping() will trigger the warning and
nr_running will be decremented (by A) and incremented once (only by B;
A will skip it). This is the case until ->sleeping is zeroed again in
wq_worker_running().

Remove the WARN statement because this condition may happen. Document
that preemption around wq_worker_sleeping() needs to be disabled to
protect ->sleeping and not just as an optimisation.

Fixes: 6d25be5782e48 ("sched/core, workqueues: Distangle worker accounting from rq lock")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: https://lkml.kernel.org/r/20200327074308.GY11705@shao2-debian
---
kernel/sched/core.c | 3 ++-
kernel/workqueue.c | 6 ++++--
2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f6b329b..c3d12e3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4120,7 +4120,8 @@ static inline void sched_submit_work(struct task_struct *tsk)
* it wants to wake up a task to maintain concurrency.
* As this function is called inside the schedule() context,
* we disable preemption to avoid it calling schedule() again
- * in the possible wakeup of a kworker.
+ * in the possible wakeup of a kworker and because wq_worker_sleeping()
+ * requires it.
*/
if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
preempt_disable();
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3816a18..891ccad 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -858,7 +858,8 @@ void wq_worker_running(struct task_struct *task)
* @task: task going to sleep
*
* This function is called from schedule() when a busy worker is
- * going to sleep.
+ * going to sleep. Preemption needs to be disabled to protect ->sleeping
+ * assignment.
*/
void wq_worker_sleeping(struct task_struct *task)
{
@@ -875,7 +876,8 @@ void wq_worker_sleeping(struct task_struct *task)

pool = worker->pool;

- if (WARN_ON_ONCE(worker->sleeping))
+ /* Return if preempted before wq_worker_running() was reached */
+ if (worker->sleeping)
return;

worker->sleeping = 1;