2010-02-03 11:28:08

by Simon Kagstrom

Subject: [PATCH] core: workqueue: BUG_ON on workqueue recursion

When a workqueue is flushed from workqueue context (recursively), the
system enters a strange state where seemingly random things (those that
depend on the global workqueue) start misbehaving. For example, for us
the console and logins lock up while the web server continues running.

Since the system becomes unstable, change this to a BUG_ON instead.

Signed-off-by: Simon Kagstrom <[email protected]>
---
kernel/workqueue.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index dee4865..e617d29 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -482,7 +482,7 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
 	int active = 0;
 	struct wq_barrier barr;

-	WARN_ON(cwq->thread == current);
+	BUG_ON(cwq->thread == current);

 	spin_lock_irq(&cwq->lock);
 	if (!list_empty(&cwq->worklist) || cwq->current_work != NULL) {
--
1.6.0.4


2010-02-03 19:47:26

by Oleg Nesterov

Subject: Re: [PATCH] core: workqueue: BUG_ON on workqueue recursion

On 02/03, Simon Kagstrom wrote:
>
> When a workqueue is flushed from workqueue context (recursively), the
> system enters a strange state where seemingly random things (those that
> depend on the global workqueue) start misbehaving. For example, for us
> the console and logins lock up while the web server continues running.
>
> Since the system becomes unstable, change this to a BUG_ON instead.

I agree with this patch. We are going to deadlock anyway: if the
condition is true, the caller is cwq->current_work, which means
flush_cpu_workqueue() will insert the barrier and then hang waiting
for it.

However,

> @@ -482,7 +482,7 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
>  	int active = 0;
>  	struct wq_barrier barr;
>
> -	WARN_ON(cwq->thread == current);
> +	BUG_ON(cwq->thread == current);

Another option is to change the code to do

	if (WARN_ON(cwq->thread == current))
		return;

This gives the kernel a chance to survive after the warning.

What do you think?

Oleg.
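
For readers following the thread, the deadlock described above can be modelled in user space. The sketch below is a toy model, not kernel code. `struct model_task`, `struct model_cwq` and `model_flush()` are hypothetical names, and the single `barrier_pending` flag is an assumed stand-in for the real barrier work item and its completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a task and a per-CPU workqueue. */
struct model_task {
	const char *name;
};

struct model_cwq {
	const struct model_task *thread; /* the only task that runs queued work */
	bool barrier_pending;            /* a flush barrier sits on the worklist */
};

enum model_flush_result {
	MODEL_FLUSH_DONE,     /* barrier ran, flusher woke up */
	MODEL_FLUSH_DEADLOCK  /* flusher waits on a barrier it must run itself */
};

/*
 * Model of a flush: queue a barrier, then wait until the worker has run it.
 * The worker can only reach the barrier after its current work returns.
 */
static enum model_flush_result
model_flush(struct model_cwq *cwq, const struct model_task *caller)
{
	assert(cwq != NULL && caller != NULL);

	cwq->barrier_pending = true;

	if (caller == cwq->thread) {
		/*
		 * The caller is the worker's current work. It now sleeps
		 * until the barrier runs, but the barrier cannot run until
		 * the caller returns: neither side can make progress.
		 */
		return MODEL_FLUSH_DEADLOCK;
	}

	/* A different task is flushing: the worker drains up to the barrier. */
	cwq->barrier_pending = false;
	return MODEL_FLUSH_DONE;
}
```

Calling `model_flush()` with a caller other than `cwq->thread` completes and clears the barrier. Calling it with `cwq->thread` itself leaves the barrier pending forever, which is the hang that both the BUG_ON and the WARN_ON-and-return variants try to avoid.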

2010-02-04 02:02:53

by Lai Jiangshan

Subject: Re: [PATCH] core: workqueue: BUG_ON on workqueue recursion

Simon Kagstrom wrote:
> When a workqueue is flushed from workqueue context (recursively), the
> system enters a strange state where seemingly random things (those that
> depend on the global workqueue) start misbehaving. For example, for us
> the console and logins lock up while the web server continues running.
>
> Since the system becomes unstable, change this to a BUG_ON instead.

From a design point of view, we should disallow this recursion when
using workqueues.

I like BUG_ON. But this is usually not fatal when it happens, and
most developers would prefer to let the system keep running.

Acked-by: Lai Jiangshan <[email protected]>

>
> Signed-off-by: Simon Kagstrom <[email protected]>
> ---
> kernel/workqueue.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index dee4865..e617d29 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -482,7 +482,7 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
>  	int active = 0;
>  	struct wq_barrier barr;
>
> -	WARN_ON(cwq->thread == current);
> +	BUG_ON(cwq->thread == current);
>
>  	spin_lock_irq(&cwq->lock);
>  	if (!list_empty(&cwq->worklist) || cwq->current_work != NULL) {

2010-02-04 02:07:33

by Tejun Heo

Subject: Re: [PATCH] core: workqueue: BUG_ON on workqueue recursion

Hello,

On 02/04/2010 04:43 AM, Oleg Nesterov wrote:
> On 02/03, Simon Kagstrom wrote:
>>
>> When a workqueue is flushed from workqueue context (recursively), the
>> system enters a strange state where seemingly random things (those that
>> depend on the global workqueue) start misbehaving. For example, for us
>> the console and logins lock up while the web server continues running.
>>
>> Since the system becomes unstable, change this to a BUG_ON instead.
>
> I agree with this patch. We are going to deadlock anyway: if the
> condition is true, the caller is cwq->current_work, which means
> flush_cpu_workqueue() will insert the barrier and then hang waiting
> for it.
>
> However,
>
>> @@ -482,7 +482,7 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
>>  	int active = 0;
>>  	struct wq_barrier barr;
>>
>> -	WARN_ON(cwq->thread == current);
>> +	BUG_ON(cwq->thread == current);
>
> Another option is to change the code to do
>
> 	if (WARN_ON(cwq->thread == current))
> 		return;
>
> This gives the kernel a chance to survive after the warning.
>
> What do you think?

Yeah, I like this one better too. Even solely for debugging,
WARN_ON() is better, as users often don't have a reliable way to
gather the kernel log after a BUG_ON().

Thanks.

--
tejun

2010-02-04 08:02:25

by Simon Kagstrom

Subject: [PATCH v2] core: workqueue: return on workqueue recursion

When a workqueue is flushed from workqueue context (recursively), the
system enters a strange state where seemingly random things (those that
depend on the global workqueue) start misbehaving. For example, for us
the console and logins lock up while the web server continues running.

The system becomes unstable because the flush barrier blocks the
workqueue: the worker ends up waiting on a barrier that only it can
run. This patch instead warns and returns early if the workqueue is
flushed recursively, which keeps the workqueue alive.

Signed-off-by: Simon Kagstrom <[email protected]>
---
ChangeLog:
* Instead of BUG_ON, warn and return on recursive calls as suggested
  by Oleg Nesterov and Tejun Heo

kernel/workqueue.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index dee4865..49f8fa7 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -482,7 +482,8 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
 	int active = 0;
 	struct wq_barrier barr;

-	WARN_ON(cwq->thread == current);
+	if (WARN_ON(cwq->thread == current))
+		return 1;

 	spin_lock_irq(&cwq->lock);
 	if (!list_empty(&cwq->worklist) || cwq->current_work != NULL) {
--
1.6.0.4
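
The warn-and-return pattern in this version relies on WARN_ON() evaluating to the truth value of its condition, so it can sit directly inside an `if`. A minimal user-space sketch of that pattern follows. `MODEL_WARN_ON`, `model_warn_on()`, `model_warn_count` and `model_flush_guarded()` are hypothetical names, and returning 0 on the normal path is an assumption made for the illustration; only the early `return 1` mirrors the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Number of warnings emitted so far (illustration only). */
static int model_warn_count;

/* Report the failed expectation and hand the condition back to the caller. */
static int model_warn_on(int cond, const char *expr, const char *file, int line)
{
	assert(expr != NULL && file != NULL);

	if (cond) {
		model_warn_count++;
		fprintf(stderr, "WARNING: %s at %s:%d\n", expr, file, line);
	}
	return cond;
}

/* User-space stand-in for WARN_ON(): warns, then yields 0 or 1. */
#define MODEL_WARN_ON(cond) \
	model_warn_on(!!(cond), #cond, __FILE__, __LINE__)

/*
 * Guarded flush: if the worker is flushing its own queue, warn and bail
 * out instead of queueing a barrier that could never be run.
 */
static int model_flush_guarded(const void *worker, const void *caller)
{
	if (MODEL_WARN_ON(worker == caller))
		return 1; /* same value the patch returns on recursion */

	/* The barrier would be queued and waited on here. */
	return 0; /* assumed "nothing went wrong" value for this sketch */
}
```

With two distinct tasks, `model_flush_guarded()` returns 0 and prints nothing. With the worker as its own caller it prints one warning and returns 1, so the caller keeps running rather than hanging.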

2010-02-04 10:54:14

by Oleg Nesterov

Subject: Re: [PATCH v2] core: workqueue: return on workqueue recursion

On 02/04, Simon Kagstrom wrote:
>
> When a workqueue is flushed from workqueue context (recursively), the
> system enters a strange state where seemingly random things (those that
> depend on the global workqueue) start misbehaving. For example, for us
> the console and logins lock up while the web server continues running.
>
> The system becomes unstable because the flush barrier blocks the
> workqueue: the worker ends up waiting on a barrier that only it can
> run. This patch instead warns and returns early if the workqueue is
> flushed recursively, which keeps the workqueue alive.
>
> Signed-off-by: Simon Kagstrom <[email protected]>

Acked-by: Oleg Nesterov <[email protected]>

> ---
> ChangeLog:
> * Instead of BUG_ON, warn and return on recursive calls as suggested
>   by Oleg Nesterov and Tejun Heo
>
> kernel/workqueue.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index dee4865..49f8fa7 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -482,7 +482,8 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
>  	int active = 0;
>  	struct wq_barrier barr;
>
> -	WARN_ON(cwq->thread == current);
> +	if (WARN_ON(cwq->thread == current))
> +		return 1;
>
>  	spin_lock_irq(&cwq->lock);
>  	if (!list_empty(&cwq->worklist) || cwq->current_work != NULL) {
> --
> 1.6.0.4
>

2010-02-12 08:42:46

by Tejun Heo

Subject: Re: [PATCH v2] core: workqueue: return on workqueue recursion

On 02/04/2010 05:02 PM, Simon Kagstrom wrote:
> When a workqueue is flushed from workqueue context (recursively), the
> system enters a strange state where seemingly random things (those that
> depend on the global workqueue) start misbehaving. For example, for us
> the console and logins lock up while the web server continues running.
>
> The system becomes unstable because the flush barrier blocks the
> workqueue: the worker ends up waiting on a barrier that only it can
> run. This patch instead warns and returns early if the workqueue is
> flushed recursively, which keeps the workqueue alive.
>
> Signed-off-by: Simon Kagstrom <[email protected]>

Applied to the wq tree. Will push out when the merge window opens.

Thanks.

--
tejun