2015-02-06 17:12:04

by Rabin Vincent

Subject: Soft lockups in __cancel_work_timer()

Hi Tejun,

The comment at the top of try_to_grab_pending() says that the -ENOENT
state "may persist for arbitrarily long".

If two threads call cancel_delayed_work_sync() at the same time, one
of them can end up spinning forever in the retry loop in
__cancel_work_timer() without ever blocking in the flush_work() call
inside it.

The spinning thread sees -ENOENT from try_to_grab_pending() because
the work is being cancelled, and it sees a false return from
flush_work()/start_flush_work() because no worker is executing the
work.
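
For reference, this is roughly the loop in question (a sketch of the
current __cancel_work_timer() retry loop, reconstructed from the diff
context below; see kernel/workqueue.c for the real thing):

	do {
		ret = try_to_grab_pending(work, is_dwork, &flags);
		/*
		 * If someone else is canceling, wait for the same event
		 * it would be waiting for before retrying.
		 */
		if (unlikely(ret == -ENOENT))
			flush_work(work);	/* false if nothing to flush */
	} while (unlikely(ret < 0));		/* -ENOENT is < 0: loop again */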

task1                   task2                   worker

                                                add to busy hash
                                                clear pending
                                                exec work

__cancel_work_timer()
 try_to_grab_pending()
  get pending, return 0
  set work cancelling
 flush_work()
  wait_for_completion()

                                                remove from busy_hash

                        __cancel_work_timer()
                         while (forever) {
                          try_to_grab_pending()
                           return -ENOENT due to cancelling
                          flush_work()
                           return false due to already gone
                         }


On kernels with CONFIG_PREEMPT disabled, this causes a permanent soft
lockup of the system.

Adding a cond_resched() to the loop in __cancel_work_timer(), as
below, helps the simple !PREEMPT case, but does not completely fix
the problem. The problem can still be seen if the thread which gets
the -ENOENT runs with, for example, the SCHED_FIFO scheduling class,
since a cond_resched() from a SCHED_FIFO task will not yield the CPU
to a lower-priority canceler. This happens both on systems with
CONFIG_PREEMPT enabled and on !PREEMPT with the cond_resched().

We've seen this problem with the cancel_delayed_work() call in
jffs2_sync_fs(), but I've attached a testing patch which forces the
problem on current mainline without the need for jffs2.
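
(The attachment is the real reproducer; what follows is only a
hypothetical sketch of the general shape of such a test, i.e. a module
with two kthreads racing cancel_delayed_work_sync() on the same
delayed work. Names and structure here are illustrative, not taken
from the attached patch.)

	#include <linux/module.h>
	#include <linux/workqueue.h>
	#include <linux/kthread.h>
	#include <linux/err.h>

	static struct delayed_work dwork;
	static struct task_struct *t1, *t2;

	static void dwork_fn(struct work_struct *work)
	{
		/* Body is irrelevant; we only care about cancel vs. cancel. */
	}

	static int canceler(void *unused)
	{
		while (!kthread_should_stop()) {
			queue_delayed_work(system_wq, &dwork, 1);
			/* Two of these racing is what livelocks on !PREEMPT. */
			cancel_delayed_work_sync(&dwork);
		}
		return 0;
	}

	static int __init repro_init(void)
	{
		INIT_DELAYED_WORK(&dwork, dwork_fn);
		t1 = kthread_run(canceler, NULL, "cancel-race-1");
		if (IS_ERR(t1))
			return PTR_ERR(t1);
		t2 = kthread_run(canceler, NULL, "cancel-race-2");
		if (IS_ERR(t2)) {
			kthread_stop(t1);
			return PTR_ERR(t2);
		}
		return 0;
	}

	static void __exit repro_exit(void)
	{
		kthread_stop(t1);
		kthread_stop(t2);
		cancel_delayed_work_sync(&dwork);
	}

	module_init(repro_init);
	module_exit(repro_exit);
	MODULE_LICENSE("GPL");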

Suggestions?

Thanks.

/Rabin

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index beeeac9..904289f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2741,6 +2741,9 @@ static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
 		 */
 		if (unlikely(ret == -ENOENT))
 			flush_work(work);
+
+		if (ret < 0)
+			cond_resched();
 	} while (unlikely(ret < 0));
 
 	/* tell other tasks trying to grab @work to back off */


Attachments:
force-wq-lockup.patch (3.58 kB)

2015-02-06 18:01:42

by Tejun Heo

Subject: Re: Soft lockups in __cancel_work_timer()

Hello,

On Fri, Feb 06, 2015 at 06:11:56PM +0100, Rabin Vincent wrote:
> task1                   task2                   worker
>
>                                                 add to busy hash
>                                                 clear pending
>                                                 exec work
>
> __cancel_work_timer()
>  try_to_grab_pending()
>   get pending, return 0
>   set work cancelling
>  flush_work()
>   wait_for_completion()
>
>                                                 remove from busy_hash
>
>                         __cancel_work_timer()
>                          while (forever) {
>                           try_to_grab_pending()
>                            return -ENOENT due to cancelling
>                           flush_work()
>                            return false due to already gone
>                          }
>
>
> On kernels with CONFIG_PREEMPT disabled, this causes a permanent soft
> lockup of the system.

Ah, drat.

> Adding a cond_resched() to the loop in __cancel_work_timer(), as
> below, helps the simple !PREEMPT case, but does not completely fix
> the problem. The problem can still be seen if the thread which gets
> the -ENOENT runs with, for example, the SCHED_FIFO scheduling class,
> both on systems with CONFIG_PREEMPT enabled and on !PREEMPT with the
> cond_resched().
>
> We've seen this problem with the cancel_delayed_work() call in
> jffs2_sync_fs(), but I've attached a testing patch which forces the
> problem on current mainline without the need for jffs2.
...
> @@ -2741,6 +2741,9 @@ static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
>  		 */
>  		if (unlikely(ret == -ENOENT))
>  			flush_work(work);
> +
> +		if (ret < 0)
> +			cond_resched();
>  	} while (unlikely(ret < 0));

Well, an obvious thing to do would be

	if (unlikely(ret == -ENOENT)) {
		if (!flush_work(work))
			schedule_timeout(1);
	}

It was gonna block for the work item anyway so I don't think
schedule_timeout() there is a problem. That said, given that we're
guaranteed to be able to dereference @work there, there probably is a
better way. I'll think more about it.

Thanks.

--
tejun

2015-02-09 16:15:33

by Tejun Heo

Subject: [PATCH for-3.20-fixes] workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE

Hello,

Can you please verify that the following patch fixes the issue?

Thanks.

-------- 8< --------
cancel[_delayed]_work_sync() are implemented using
__cancel_work_timer() which grabs the PENDING bit using
try_to_grab_pending() and then flushes the work item with PENDING set
to prevent the on-going execution of the work item from requeueing
itself.

try_to_grab_pending() can always grab the PENDING bit without
blocking except when someone else is doing the above flushing during
cancelation, in which case it returns -ENOENT. __cancel_work_timer()
currently responds by invoking flush_work(). The assumption is that
the completion of the work item is what the other canceling task would
be waiting for too, so waiting for the same condition and retrying
should allow forward progress without excessive busy looping.

Unfortunately, this doesn't work if preemption is disabled or the
latter task has real time priority. Let's say task A just got woken
up from flush_work() by the completion of the target work item. If,
before task A starts executing, task B gets scheduled and invokes
__cancel_work_timer() on the same work item, its try_to_grab_pending()
will return -ENOENT as the work item is still being canceled by task A
and flush_work() will also immediately return false as the work item
is no longer executing. This puts task B in a busy loop possibly
preventing task A from executing and clearing the canceling state on
the work item leading to a hang.

task A                  task B                  worker

                                                executing work
__cancel_work_timer()
 try_to_grab_pending()
 set work CANCELING
 flush_work()
  block for work completion
                                                completion, wakes up A
                        __cancel_work_timer()
                         while (forever) {
                          try_to_grab_pending()
                           -ENOENT as work is being canceled
                          flush_work()
                           false as work is no longer executing
                         }

This patch removes the possible hang by updating __cancel_work_timer()
to explicitly wait for clearing of CANCELING rather than invoking
flush_work() after try_to_grab_pending() fails with -ENOENT. The
explicit wait uses the matching bit waitqueue for the CANCELING bit.

Link: http://lkml.kernel.org/g/[email protected]

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Rabin Vincent <[email protected]>
Cc: [email protected]
---
 include/linux/workqueue.h |    3 ++-
 kernel/workqueue.c        |   36 ++++++++++++++++++++++++++++++++----
 2 files changed, 34 insertions(+), 5 deletions(-)

--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -70,7 +70,8 @@ enum {
 	/* data contains off-queue information when !WORK_STRUCT_PWQ */
 	WORK_OFFQ_FLAG_BASE	= WORK_STRUCT_COLOR_SHIFT,
 
-	WORK_OFFQ_CANCELING	= (1 << WORK_OFFQ_FLAG_BASE),
+	__WORK_OFFQ_CANCELING	= WORK_OFFQ_FLAG_BASE,
+	WORK_OFFQ_CANCELING	= (1 << __WORK_OFFQ_CANCELING),
 
 	/*
 	 * When a work item is off queue, its high bits point to the last
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2730,17 +2730,35 @@ EXPORT_SYMBOL_GPL(flush_work);
 
 static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
 {
+	wait_queue_head_t *waitq = bit_waitqueue(&work->data,
+						 __WORK_OFFQ_CANCELING);
 	unsigned long flags;
 	int ret;
 
 	do {
 		ret = try_to_grab_pending(work, is_dwork, &flags);
 		/*
-		 * If someone else is canceling, wait for the same event it
-		 * would be waiting for before retrying.
+		 * If someone else is already canceling, wait for it to
+		 * finish.  flush_work() doesn't work for PREEMPT_NONE
+		 * because we may get scheduled between @work's completion
+		 * and the other canceling task resuming and clearing
+		 * CANCELING - flush_work() will return false immediately
+		 * as @work is no longer busy, try_to_grab_pending() will
+		 * return -ENOENT as @work is still being canceled and the
+		 * other canceling task won't be able to clear CANCELING as
+		 * we're hogging the CPU.
+		 *
+		 * Explicitly wait for completion using a bit waitqueue.
+		 * We can't use wait_on_bit() as the CANCELING bit may get
+		 * recycled to point to pwq if @work gets re-queued.
 		 */
-		if (unlikely(ret == -ENOENT))
-			flush_work(work);
+		if (unlikely(ret == -ENOENT)) {
+			DEFINE_WAIT(wait);
+			prepare_to_wait(waitq, &wait, TASK_UNINTERRUPTIBLE);
+			if (work_is_canceling(work))
+				schedule();
+			finish_wait(waitq, &wait);
+		}
 	} while (unlikely(ret < 0));
 
 	/* tell other tasks trying to grab @work to back off */
@@ -2749,6 +2767,16 @@ static bool __cancel_work_timer(struct w
 
 	flush_work(work);
 	clear_work_data(work);
+
+	/*
+	 * Paired with prepare_to_wait() above so that either
+	 * waitqueue_active() is visible here or !work_is_canceling() is
+	 * visible there.
+	 */
+	smp_mb();
+	if (waitqueue_active(waitq))
+		wake_up(waitq);
+
 	return ret;
 }
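
A note on the pairing in that last hunk: prepare_to_wait() queues the
waiter and, via set_current_state(), implies a full barrier before the
work_is_canceling() check, while the canceling task's smp_mb() sits
between clearing CANCELING and checking waitqueue_active(). Whichever
side runs second is guaranteed to see the other side's write, so
either the waiter sees !work_is_canceling() and skips schedule(), or
the canceler sees the queued waiter and wakes it. As a sketch of the
two orderings:

	/*
	 * waiter (retry loop)              canceler (owns PENDING)
	 * -------------------              -----------------------
	 * enqueue on waitq                 clear CANCELING
	 * barrier (set_current_state)      smp_mb()
	 * test work_is_canceling()         test waitqueue_active()
	 * schedule() only if canceling     wake_up() only if waiter queued
	 */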

2015-02-09 16:30:00

by Jesper Nilsson

Subject: Re: [PATCH for-3.20-fixes] workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE

On Mon, Feb 09, 2015 at 05:15:27PM +0100, Tejun Heo wrote:
> Hello,

Hi,

> Can you please verify that the following patch fixes the issue?

Rabin will have to report if it fixes it for his synthetic case,
but I'll try it in my "real-world" jffs2 sync problem, and report
after a couple of hours.

> Thanks.

/Jesper

> [... full patch quoted verbatim, snipped ...]

/^JN - Jesper Nilsson
--
Jesper Nilsson -- [email protected]

2015-02-10 08:35:32

by Jesper Nilsson

Subject: Re: [PATCH for-3.20-fixes] workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE

On Mon, Feb 09, 2015 at 05:29:55PM +0100, Jesper Nilsson wrote:
> On Mon, Feb 09, 2015 at 05:15:27PM +0100, Tejun Heo wrote:
> > Hello,
>
> Hi,
>
> > Can you please verify that the following patch fixes the issue?
>
> Rabin will have to report if it fixes it for his synthetic case,
> but I'll try it in my "real-world" jffs2 sync problem, and report
> after a couple of hours.

I let the test run overnight and got no failures. Before the patch,
I got at most an hour of testing before a lockup.

Tested-by: Jesper Nilsson <[email protected]>

/^JN - Jesper Nilsson
--
Jesper Nilsson -- [email protected]

2015-02-10 09:33:40

by Rabin Vincent

Subject: Re: [PATCH for-3.20-fixes] workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE

On Mon, Feb 09, 2015 at 05:15:27PM +0100, Tejun Heo wrote:
> This patch removes the possible hang by updating __cancel_work_timer()
> to explicitly wait for clearing of CANCELING rather than invoking
> flush_work() after try_to_grab_pending() fails with -ENOENT. The
> explicit wait uses the matching bit waitqueue for the CANCELING bit.
>
> Link: http://lkml.kernel.org/g/[email protected]
>
> Signed-off-by: Tejun Heo <[email protected]>

Fixes my synthetic test:

Tested-by: Rabin Vincent <[email protected]>

Thanks.

/Rabin