2014-06-26 19:50:47

by Austin Schuh

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Wed, May 21, 2014 at 12:33 AM, Richard Weinberger
<[email protected]> wrote:
> CC'ing RT folks
>
> On Wed, May 21, 2014 at 8:23 AM, Austin Schuh <[email protected]> wrote:
>> On Tue, May 13, 2014 at 7:29 PM, Austin Schuh <[email protected]> wrote:
>>> Hi,
>>>
>>> I am observing a filesystem lockup with XFS on a CONFIG_PREEMPT_RT
>>> patched kernel. I have currently only triggered it using dpkg. Dave
>>> Chinner on the XFS mailing list suggested that it was a rt-kernel
>>> workqueue issue as opposed to a XFS problem after looking at the
>>> kernel messages.

I've got a 100% reproducible test case that doesn't involve a
filesystem. I wrote a module that triggers the bug when the device is
written to, making it easy to enable tracing during the event and
capture everything.

It looks like rw_semaphores don't trigger wq_worker_sleeping to run
when work goes to sleep on a rw_semaphore. This only happens with the
RT patches, not with the mainline kernel. I'm foreseeing a second
deadlock/bug coming into play shortly. If a task holding the work
pool spinlock gets preempted, and we need to schedule more work from
another worker thread which was just blocked by a mutex, we'll then
end up trying to go to sleep on 2 locks at once.
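
For illustration, here is a minimal, hypothetical sketch of that
reproduction pattern (this is NOT the attached killer_module.c, whose
contents aren't shown here): a write() handler holds an rw_semaphore and
waits for a second work item, while the first work item on the same pool
blocks on that rw_semaphore. If the pool is never told that the first
worker went to sleep, the second item never runs and the writer hangs.

/*
 * Hypothetical reproducer sketch, assuming a misc device named
 * "lockup_trigger".  Work item A blocks on test_rwsem (held by the
 * writer); work item B is needed to complete b_done so the writer can
 * release the semaphore.  If wq_worker_sleeping() does not fire when A
 * blocks, no second worker runs B and the write() never returns.
 */
#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/workqueue.h>
#include <linux/rwsem.h>
#include <linux/completion.h>
#include <linux/fs.h>

static DECLARE_RWSEM(test_rwsem);
static DECLARE_COMPLETION(b_done);	/* single-shot, for simplicity */
static struct workqueue_struct *test_wq;

static void work_a_fn(struct work_struct *w)
{
	down_read(&test_rwsem);		/* blocks: the writer holds it */
	up_read(&test_rwsem);
}

static void work_b_fn(struct work_struct *w)
{
	complete(&b_done);		/* lets the writer finish */
}

static DECLARE_WORK(work_a, work_a_fn);
static DECLARE_WORK(work_b, work_b_fn);

static ssize_t trigger_write(struct file *f, const char __user *buf,
			     size_t len, loff_t *off)
{
	down_write(&test_rwsem);
	queue_work(test_wq, &work_a);	/* worker runs A, goes to sleep */
	queue_work(test_wq, &work_b);	/* needs another worker to run */
	wait_for_completion(&b_done);	/* hangs if B is never run */
	up_write(&test_rwsem);
	return len;
}

static const struct file_operations trigger_fops = {
	.owner	= THIS_MODULE,
	.write	= trigger_write,
};

static struct miscdevice trigger_dev = {
	.minor	= MISC_DYNAMIC_MINOR,
	.name	= "lockup_trigger",
	.fops	= &trigger_fops,
};

static int __init trigger_init(void)
{
	test_wq = alloc_workqueue("lockup_test_wq", 0, 0);
	if (!test_wq)
		return -ENOMEM;
	return misc_register(&trigger_dev);
}

static void __exit trigger_exit(void)
{
	misc_deregister(&trigger_dev);
	destroy_workqueue(test_wq);
}

module_init(trigger_init);
module_exit(trigger_exit);
MODULE_LICENSE("GPL");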

That is getting a bit deep into the scheduler for me... Any
suggestions on how to fix it?

Austin


Attachments:
killer_module.c (4.08 kB)

2014-06-26 22:35:17

by Thomas Gleixner

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Thu, 26 Jun 2014, Austin Schuh wrote:
> On Wed, May 21, 2014 at 12:33 AM, Richard Weinberger
> <[email protected]> wrote:
> > CC'ing RT folks
> >
> > On Wed, May 21, 2014 at 8:23 AM, Austin Schuh <[email protected]> wrote:
> >> On Tue, May 13, 2014 at 7:29 PM, Austin Schuh <[email protected]> wrote:
> >>> Hi,
> >>>
> >>> I am observing a filesystem lockup with XFS on a CONFIG_PREEMPT_RT
> >>> patched kernel. I have currently only triggered it using dpkg. Dave
> >>> Chinner on the XFS mailing list suggested that it was a rt-kernel
> >>> workqueue issue as opposed to a XFS problem after looking at the
> >>> kernel messages.
>
> I've got a 100% reproducible test case that doesn't involve a
> filesystem. I wrote a module that triggers the bug when the device is
> written to, making it easy to enable tracing during the event and
> capture everything.
>
> It looks like rw_semaphores don't trigger wq_worker_sleeping to run
> when work goes to sleep on a rw_semaphore. This only happens with the
> RT patches, not with the mainline kernel. I'm foreseeing a second
> deadlock/bug coming into play shortly. If a task holding the work
> pool spinlock gets preempted, and we need to schedule more work from
> another worker thread which was just blocked by a mutex, we'll then
> end up trying to go to sleep on 2 locks at once.

I remember vaguely that I've seen and analyzed that quite some time
ago. I can't page in all the gory details right now, but I'll have a
look at how the related code changed in the last couple of years
tomorrow morning, with an awake brain.

Steven, you did some analysis on that IIRC, or was that just related
to rw_locks?

Thanks,

tglx



2014-06-27 00:08:10

by Austin Schuh

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Thu, Jun 26, 2014 at 3:35 PM, Thomas Gleixner <[email protected]> wrote:
> On Thu, 26 Jun 2014, Austin Schuh wrote:
>> On Wed, May 21, 2014 at 12:33 AM, Richard Weinberger
>> <[email protected]> wrote:
>> > CC'ing RT folks
>> >
>> > On Wed, May 21, 2014 at 8:23 AM, Austin Schuh <[email protected]> wrote:
>> >> On Tue, May 13, 2014 at 7:29 PM, Austin Schuh <[email protected]> wrote:
>> >>> Hi,
>> >>>
>> >>> I am observing a filesystem lockup with XFS on a CONFIG_PREEMPT_RT
>> >>> patched kernel. I have currently only triggered it using dpkg. Dave
>> >>> Chinner on the XFS mailing list suggested that it was a rt-kernel
>> >>> workqueue issue as opposed to a XFS problem after looking at the
>> >>> kernel messages.
>>
>> I've got a 100% reproducible test case that doesn't involve a
>> filesystem. I wrote a module that triggers the bug when the device is
>> written to, making it easy to enable tracing during the event and
>> capture everything.
>>
>> It looks like rw_semaphores don't trigger wq_worker_sleeping to run
>> when work goes to sleep on a rw_semaphore. This only happens with the
>> RT patches, not with the mainline kernel. I'm foreseeing a second
>> deadlock/bug coming into play shortly. If a task holding the work
>> pool spinlock gets preempted, and we need to schedule more work from
>> another worker thread which was just blocked by a mutex, we'll then
>> end up trying to go to sleep on 2 locks at once.
>
> I remember vaguely that I've seen and analyzed that quite some time
> ago. I can't page in all the gory details right now, but I'll have a
> look at how the related code changed in the last couple of years
> tomorrow morning, with an awake brain.
>
> Steven, you did some analysis on that IIRC, or was that just related
> to rw_locks?
>
> Thanks,
>
> tglx

If I'm reading the rt patch correctly, wq_worker_sleeping was moved
out of __schedule to sched_submit_work. It looks like that changes
the conditions under which wq_worker_sleeping is called. It used to
be called whenever a task was going to sleep (I think). It looks like
it is called now if the task is going to sleep, and if the task isn't
blocked on a PI mutex (I think).

If I try the following experiment

static inline void sched_submit_work(struct task_struct *tsk)
{
+ if (tsk->state && tsk->flags & PF_WQ_WORKER) {
+ wq_worker_sleeping(tsk);
+ return;
+ }

and then remove the call later in the function, I am able to pass my test.

Unfortunately, I then get a recursive pool spinlock BUG_ON after a
while (as I would expect), and it all blows up.

I'm not sure where to go from there. Any changes to the workpool to
try to fix that will be hard, or could affect latency significantly.

Austin

2014-06-27 03:22:54

by Mike Galbraith

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Thu, 2014-06-26 at 17:07 -0700, Austin Schuh wrote:

> If I'm reading the rt patch correctly, wq_worker_sleeping was moved
> out of __schedule to sched_submit_work. It looks like that changes
> the conditions under which wq_worker_sleeping is called. It used to
> be called whenever a task was going to sleep (I think). It looks like
> it is called now if the task is going to sleep, and if the task isn't
> blocked on a PI mutex (I think).

Re-entanglement does turn your killer into a happy camper, but that
patch is too lovely to just surrender without a squabble.

-Mike

2014-06-27 12:57:41

by Mike Galbraith

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Thu, 2014-06-26 at 17:07 -0700, Austin Schuh wrote:

> I'm not sure where to go from there. Any changes to the workpool to
> try to fix that will be hard, or could affect latency significantly.

Oh what the hell, I'm out of frozen shark, may as well stock up. Nobody
else has posted spit yet. I _know_ Steven has some shark, I saw tglx
toss him a chunk.

It's the same root cause as -rt letting tasks schedule off with plugged
IO, leading to deadlock scenarios. Extending the way I dealt with that
to deal with workqueue forward progress requirements.. works.. though it
could perhaps use a face lift, tummy tuck.. and minor body-ectomy.

Subject: rtmutex: move pre/post schedule() progress guarantees to before we schedule

Queued IO can lead to IO deadlock should a task require wakeup from a task
which is blocked on that queued IO, which is why we usually pull the plug
while scheduling off. In RT, pulling the plug could lead us to block on
a contended sleeping lock while n the process of blocking. Bad idea, so
we don't, but that leaves us with various flavors of IO stall like below.

ext3: dbench1 queues a buffer, blocks on journal mutex, its plug is not
pulled. dbench2 mutex owner is waiting for kjournald, who is waiting for
the buffer queued by dbench1. Game over.

The exact same situation can occur with workqueues. We must notify the
workqueue management that a worker is blocking, as if we don't, we can
lock the box up completely. It's too late to do that once we have arrived
in schedule(), as we can't take a sleeping lock.

Therefore, move progress guarantee work up to before we reach the point of
no return, and tell the scheduler that we're handling it early.

Signed-off-by: Mike Galbraith <[email protected]>
---
include/linux/sched.h | 2 +
kernel/locking/rtmutex.c | 59 +++++++++++++++++++++++++++++++++++++++++++----
kernel/sched/core.c | 19 +++++++++++----
3 files changed, 72 insertions(+), 8 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1256,6 +1256,8 @@ struct task_struct {
/* Revert to default priority/policy when forking */
unsigned sched_reset_on_fork:1;
unsigned sched_contributes_to_load:1;
+ unsigned sched_presched:1;
+ unsigned rtmutex_presched:1;

pid_t pid;
pid_t tgid;
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -23,6 +23,7 @@
#include <linux/sched/deadline.h>
#include <linux/timer.h>
#include <linux/ww_mutex.h>
+#include <linux/blkdev.h>

#include "rtmutex_common.h"

@@ -782,6 +783,41 @@ void rt_mutex_adjust_pi(struct task_stru
}

#ifdef CONFIG_PREEMPT_RT_FULL
+#include "../workqueue_internal.h"
+
+static inline void rt_mutex_presched(struct task_struct *tsk)
+{
+ /* Recursive handling of presched work is a very bad idea. */
+ if (tsk->rtmutex_presched || tsk->sched_presched)
+ return;
+
+ /* Tell the scheduler we handle pre/post schedule() work. */
+ tsk->rtmutex_presched = 1;
+
+ /*
+ * If a worker went to sleep, notify and ask workqueue whether
+ * it wants to wake up a task to maintain concurrency.
+ */
+ if (tsk->flags & PF_WQ_WORKER)
+ wq_worker_sleeping(tsk);
+
+ /*
+ * If we are going to sleep and we have plugged IO queued,
+ * make sure to submit it to avoid deadlocks.
+ */
+ if (blk_needs_flush_plug(tsk))
+ blk_schedule_flush_plug(tsk);
+}
+
+static inline void rt_mutex_postsched(struct task_struct *tsk)
+{
+ if (!tsk->rtmutex_presched)
+ return;
+ if (tsk->flags & PF_WQ_WORKER)
+ wq_worker_running(tsk);
+ tsk->rtmutex_presched = 0;
+}
+
/*
* preemptible spin_lock functions:
*/
@@ -857,13 +893,23 @@ static void noinline __sched rt_spin_lo
struct rt_mutex_waiter waiter, *top_waiter;
int ret;

+ /*
+ * We can't do presched work if we're already holding a lock
+ * else we can deadlock. eg, if we're holding slab_lock,
+ * ksoftirqd can block while processing BLOCK_SOFTIRQ after
+ * having acquired q->queue_lock. If _we_ then block on
+ * that q->queue_lock while flushing our plug, deadlock.
+ */
+ if (__migrate_disabled(self) < 2)
+ rt_mutex_presched(self);
+
rt_mutex_init_waiter(&waiter, true);

raw_spin_lock(&lock->wait_lock);

if (__try_to_take_rt_mutex(lock, self, NULL, STEAL_LATERAL)) {
raw_spin_unlock(&lock->wait_lock);
- return;
+ goto out;
}

BUG_ON(rt_mutex_owner(lock) == self);
@@ -928,6 +974,8 @@ static void noinline __sched rt_spin_lo
raw_spin_unlock(&lock->wait_lock);

debug_rt_mutex_free_waiter(&waiter);
+out:
+ rt_mutex_postsched(self);
}

/*
@@ -1274,18 +1322,20 @@ rt_mutex_slowlock(struct rt_mutex *lock,
int detect_deadlock, struct ww_acquire_ctx *ww_ctx)
{
struct rt_mutex_waiter waiter;
+ struct task_struct *self = current;
int ret = 0;

+ rt_mutex_presched(self);
rt_mutex_init_waiter(&waiter, false);

raw_spin_lock(&lock->wait_lock);

/* Try to acquire the lock again: */
- if (try_to_take_rt_mutex(lock, current, NULL)) {
+ if (try_to_take_rt_mutex(lock, self, NULL)) {
if (ww_ctx)
ww_mutex_account_lock(lock, ww_ctx);
raw_spin_unlock(&lock->wait_lock);
- return 0;
+ goto out;
}

set_current_state(state);
@@ -1322,7 +1372,8 @@ rt_mutex_slowlock(struct rt_mutex *lock,
hrtimer_cancel(&timeout->timer);

debug_rt_mutex_free_waiter(&waiter);
-
+out:
+ rt_mutex_postsched(self);
return ret;
}

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2915,11 +2915,18 @@ static void __sched __schedule(void)
goto need_resched;
}

-static inline void sched_submit_work(struct task_struct *tsk)
+static inline void sched_presched_work(struct task_struct *tsk)
{
+ /* It's being handled by rtmutex code */
+ if (tsk->rtmutex_presched)
+ return;
+
if (!tsk->state || tsk_is_pi_blocked(tsk))
return;

+ /* Tell rtmutex presched code that we're handling it. */
+ tsk->sched_presched = 1;
+
/*
* If a worker went to sleep, notify and ask workqueue whether
* it wants to wake up a task to maintain concurrency.
@@ -2935,19 +2942,23 @@ static inline void sched_submit_work(str
blk_schedule_flush_plug(tsk);
}

-static inline void sched_update_worker(struct task_struct *tsk)
+static inline void sched_postsched_work(struct task_struct *tsk)
{
+ /* It's being handled by rtmutex code */
+ if (tsk->rtmutex_presched)
+ return;
if (tsk->flags & PF_WQ_WORKER)
wq_worker_running(tsk);
+ tsk->sched_presched = 0;
}

asmlinkage void __sched schedule(void)
{
struct task_struct *tsk = current;

- sched_submit_work(tsk);
+ sched_presched_work(tsk);
__schedule();
- sched_update_worker(tsk);
+ sched_postsched_work(tsk);
}
EXPORT_SYMBOL(schedule);


2014-06-27 14:02:00

by Steven Rostedt

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Fri, 27 Jun 2014 14:57:36 +0200
Mike Galbraith <[email protected]> wrote:

> On Thu, 2014-06-26 at 17:07 -0700, Austin Schuh wrote:
>
> > I'm not sure where to go from there. Any changes to the workpool to
> > try to fix that will be hard, or could affect latency significantly.
>
> Oh what the hell, I'm out of frozen shark, may as well stock up. Nobody
> else has posted spit yet. I _know_ Steven has some shark, I saw tglx
> toss him a chunk.
>
> It's the same root cause as -rt letting tasks schedule off with plugged
> IO, leading to deadlock scenarios. Extending the way I dealt with that
> to deal with workqueue forward progress requirements.. works.. though it

> could perhaps use a face lift, tummy tuck.. and minor body-ectomy.

Yeah, I'd say ;-)

>
> Subject: rtmutex: move pre/post schedule() progress guarantees to before we schedule
>
> Queued IO can lead to IO deadlock should a task require wakeup from a task
> which is blocked on that queued IO, which is why we usually pull the plug
> while scheduling off. In RT, pulling the plug could lead us to block on
> a contended sleeping lock while n the process of blocking. Bad idea, so

"in the process"

> we don't, but that leaves us with various flavors of IO stall like below.
>
> ext3: dbench1 queues a buffer, blocks on journal mutex, its plug is not
> pulled. dbench2 mutex owner is waiting for kjournald, who is waiting for
> the buffer queued by dbench1. Game over.
>
> The exact same situation can occur with workqueues. We must notify the
> workqueue management that a worker is blocking, as if we don't, we can
> lock the box up completely. It's too late to do that once we have arrived
> in schedule(), as we can't take a sleeping lock.
>
> Therefore, move progress guarantee work up to before we reach the point of
> no return, and tell the scheduler that we're handling it early.
>
> Signed-off-by: Mike Galbraith <[email protected]>
> ---
> include/linux/sched.h | 2 +
> kernel/locking/rtmutex.c | 59 +++++++++++++++++++++++++++++++++++++++++++----
> kernel/sched/core.c | 19 +++++++++++----
> 3 files changed, 72 insertions(+), 8 deletions(-)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1256,6 +1256,8 @@ struct task_struct {
> /* Revert to default priority/policy when forking */
> unsigned sched_reset_on_fork:1;
> unsigned sched_contributes_to_load:1;
> + unsigned sched_presched:1;
> + unsigned rtmutex_presched:1;
>
> pid_t pid;
> pid_t tgid;
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -23,6 +23,7 @@
> #include <linux/sched/deadline.h>
> #include <linux/timer.h>
> #include <linux/ww_mutex.h>
> +#include <linux/blkdev.h>
>
> #include "rtmutex_common.h"
>
> @@ -782,6 +783,41 @@ void rt_mutex_adjust_pi(struct task_stru
> }
>
> #ifdef CONFIG_PREEMPT_RT_FULL
> +#include "../workqueue_internal.h"
> +
> +static inline void rt_mutex_presched(struct task_struct *tsk)
> +{
> + /* Recursive handling of presched work is a very bad idea. */

The above comment can be simplified to just:

/* Recursion protection */

> + if (tsk->rtmutex_presched || tsk->sched_presched)
> + return;
> +
> + /* Tell the scheduler we handle pre/post schedule() work. */
> + tsk->rtmutex_presched = 1;
> +
> + /*
> + * If a worker went to sleep, notify and ask workqueue whether
> + * it wants to wake up a task to maintain concurrency.
> + */
> + if (tsk->flags & PF_WQ_WORKER)
> + wq_worker_sleeping(tsk);
> +
> + /*
> + * If we are going to sleep and we have plugged IO queued,
> + * make sure to submit it to avoid deadlocks.
> + */
> + if (blk_needs_flush_plug(tsk))
> + blk_schedule_flush_plug(tsk);
> +}
> +
> +static inline void rt_mutex_postsched(struct task_struct *tsk)
> +{
> + if (!tsk->rtmutex_presched)
> + return;
> + if (tsk->flags & PF_WQ_WORKER)
> + wq_worker_running(tsk);
> + tsk->rtmutex_presched = 0;
> +}
> +
> /*
> * preemptible spin_lock functions:
> */
> @@ -857,13 +893,23 @@ static void noinline __sched rt_spin_lo
> struct rt_mutex_waiter waiter, *top_waiter;
> int ret;
>
> + /*
> + * We can't do presched work if we're already holding a lock
> + * else we can deadlock. eg, if we're holding slab_lock,
> + * ksoftirqd can block while processing BLOCK_SOFTIRQ after
> + * having acquired q->queue_lock. If _we_ then block on
> + * that q->queue_lock while flushing our plug, deadlock.
> + */
> + if (__migrate_disabled(self) < 2)
> + rt_mutex_presched(self);
> +
> rt_mutex_init_waiter(&waiter, true);
>
> raw_spin_lock(&lock->wait_lock);
>
> if (__try_to_take_rt_mutex(lock, self, NULL, STEAL_LATERAL)) {
> raw_spin_unlock(&lock->wait_lock);
> - return;
> + goto out;
> }
>
> BUG_ON(rt_mutex_owner(lock) == self);
> @@ -928,6 +974,8 @@ static void noinline __sched rt_spin_lo
> raw_spin_unlock(&lock->wait_lock);
>
> debug_rt_mutex_free_waiter(&waiter);
> +out:
> + rt_mutex_postsched(self);
> }
>
> /*
> @@ -1274,18 +1322,20 @@ rt_mutex_slowlock(struct rt_mutex *lock,
> int detect_deadlock, struct ww_acquire_ctx *ww_ctx)
> {
> struct rt_mutex_waiter waiter;
> + struct task_struct *self = current;
> int ret = 0;
>
> + rt_mutex_presched(self);
> rt_mutex_init_waiter(&waiter, false);
>
> raw_spin_lock(&lock->wait_lock);
>
> /* Try to acquire the lock again: */
> - if (try_to_take_rt_mutex(lock, current, NULL)) {
> + if (try_to_take_rt_mutex(lock, self, NULL)) {
> if (ww_ctx)
> ww_mutex_account_lock(lock, ww_ctx);
> raw_spin_unlock(&lock->wait_lock);
> - return 0;
> + goto out;
> }
>
> set_current_state(state);
> @@ -1322,7 +1372,8 @@ rt_mutex_slowlock(struct rt_mutex *lock,
> hrtimer_cancel(&timeout->timer);
>
> debug_rt_mutex_free_waiter(&waiter);
> -
> +out:
> + rt_mutex_postsched(self);
> return ret;
> }
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2915,11 +2915,18 @@ static void __sched __schedule(void)
> goto need_resched;
> }
>
> -static inline void sched_submit_work(struct task_struct *tsk)
> +static inline void sched_presched_work(struct task_struct *tsk)
> {
> + /* It's being handled by rtmutex code */
> + if (tsk->rtmutex_presched)
> + return;
> +
> if (!tsk->state || tsk_is_pi_blocked(tsk))
> return;
>
> + /* Tell rtmutex presched code that we're handling it. */
> + tsk->sched_presched = 1;
> +
> /*
> * If a worker went to sleep, notify and ask workqueue whether
> * it wants to wake up a task to maintain concurrency.
> @@ -2935,19 +2942,23 @@ static inline void sched_submit_work(str
> blk_schedule_flush_plug(tsk);
> }
>
> -static inline void sched_update_worker(struct task_struct *tsk)
> +static inline void sched_postsched_work(struct task_struct *tsk)
> {
> + /* It's being handled by rtmutex code */
> + if (tsk->rtmutex_presched)
> + return;
> if (tsk->flags & PF_WQ_WORKER)
> wq_worker_running(tsk);
> + tsk->sched_presched = 0;
> }
>
> asmlinkage void __sched schedule(void)
> {
> struct task_struct *tsk = current;
>
> - sched_submit_work(tsk);
> + sched_presched_work(tsk);
> __schedule();
> - sched_update_worker(tsk);
> + sched_postsched_work(tsk);
> }

This seems like a lot of hacks. I'm wondering if it would work if we
just have the rt_spin_lock_slowlock not call schedule(), but call
__schedule() directly. I mean it would keep with the mainline paradigm
as spinlocks don't sleep there, and one going to sleep in the -rt
kernel is similar to it being preempted by a very long NMI.

Does a spin_lock going to sleep really need to do all the presched and
postsched work?


-- Steve



> EXPORT_SYMBOL(schedule);
>
>

2014-06-27 14:24:58

by Thomas Gleixner

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Thu, 26 Jun 2014, Austin Schuh wrote:
> If I'm reading the rt patch correctly, wq_worker_sleeping was moved
> out of __schedule to sched_submit_work. It looks like that changes
> the conditions under which wq_worker_sleeping is called. It used to
> be called whenever a task was going to sleep (I think). It looks like
> it is called now if the task is going to sleep, and if the task isn't
> blocked on a PI mutex (I think).
>
> If I try the following experiment
>
> static inline void sched_submit_work(struct task_struct *tsk)
> {
> + if (tsk->state && tsk->flags & PF_WQ_WORKER) {
> + wq_worker_sleeping(tsk);
> + return;
> + }
>
> and then remove the call later in the function, I am able to pass my test.
>
> Unfortunately, I then get a recursive pool spinlock BUG_ON after a
> while (as I would expect), and it all blows up.

Well, that's why the check for !pi_blocked_on is there.

> I'm not sure where to go from there. Any changes to the workpool to
> try to fix that will be hard, or could affect latency significantly.

Completely untested patch below.

Thanks,

tglx

--------------------->
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8749d20..621329a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2651,9 +2651,8 @@ need_resched:

static inline void sched_submit_work(struct task_struct *tsk)
{
- if (!tsk->state || tsk_is_pi_blocked(tsk))
+ if (!tsk->state)
return;
-
/*
* If a worker went to sleep, notify and ask workqueue whether
* it wants to wake up a task to maintain concurrency.
@@ -2661,6 +2660,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
if (tsk->flags & PF_WQ_WORKER)
wq_worker_sleeping(tsk);

+
+ if (tsk_is_pi_blocked(tsk))
+ return;
+
/*
* If we are going to sleep and we have plugged IO queued,
* make sure to submit it to avoid deadlocks.
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index be0ef50..0ca9d97 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -126,6 +126,8 @@ enum {
* cpu or grabbing pool->lock is enough for read access. If
* POOL_DISASSOCIATED is set, it's identical to L.
*
+ * On RT we need the extra protection via rt_lock_idle_list()
+ *
* MG: pool->manager_mutex and pool->lock protected. Writes require both
* locks. Reads can happen under either lock.
*
@@ -409,6 +411,31 @@ static void copy_workqueue_attrs(struct workqueue_attrs *to,
if (({ assert_rcu_or_wq_mutex(wq); false; })) { } \
else

+#ifdef CONFIG_PREEMPT_RT_BASE
+static inline void rt_lock_idle_list(struct worker_pool *pool)
+{
+ preempt_disable();
+}
+static inline void rt_unlock_idle_list(struct worker_pool *pool)
+{
+ preempt_enable();
+}
+static inline void sched_lock_idle_list(struct worker_pool *pool) { }
+static inline void sched_unlock_idle_list(struct worker_pool *pool) { }
+#else
+static inline void rt_lock_idle_list(struct worker_pool *pool) { }
+static inline void rt_unlock_idle_list(struct worker_pool *pool) { }
+static inline void sched_lock_idle_list(struct worker_pool *pool)
+{
+ spin_lock_irq(&pool->lock);
+}
+static inline void sched_unlock_idle_list(struct worker_pool *pool)
+{
+ spin_unlock_irq(&pool->lock);
+}
+#endif
+
+
#ifdef CONFIG_DEBUG_OBJECTS_WORK

static struct debug_obj_descr work_debug_descr;
@@ -801,10 +828,16 @@ static struct worker *first_worker(struct worker_pool *pool)
*/
static void wake_up_worker(struct worker_pool *pool)
{
- struct worker *worker = first_worker(pool);
+ struct worker *worker;
+
+ rt_lock_idle_list(pool);
+
+ worker = first_worker(pool);

if (likely(worker))
wake_up_process(worker->task);
+
+ rt_unlock_idle_list(pool);
}

/**
@@ -832,7 +865,7 @@ void wq_worker_running(struct task_struct *task)
*/
void wq_worker_sleeping(struct task_struct *task)
{
- struct worker *next, *worker = kthread_data(task);
+ struct worker *worker = kthread_data(task);
struct worker_pool *pool;

/*
@@ -849,25 +882,18 @@ void wq_worker_sleeping(struct task_struct *task)
return;

worker->sleeping = 1;
- spin_lock_irq(&pool->lock);
+
/*
* The counterpart of the following dec_and_test, implied mb,
* worklist not empty test sequence is in insert_work().
* Please read comment there.
- *
- * NOT_RUNNING is clear. This means that we're bound to and
- * running on the local cpu w/ rq lock held and preemption
- * disabled, which in turn means that none else could be
- * manipulating idle_list, so dereferencing idle_list without pool
- * lock is safe.
*/
if (atomic_dec_and_test(&pool->nr_running) &&
!list_empty(&pool->worklist)) {
- next = first_worker(pool);
- if (next)
- wake_up_process(next->task);
+ sched_lock_idle_list(pool);
+ wake_up_worker(pool);
+ sched_unlock_idle_list(pool);
}
- spin_unlock_irq(&pool->lock);
}

/**
@@ -1571,7 +1597,9 @@ static void worker_enter_idle(struct worker *worker)
worker->last_active = jiffies;

/* idle_list is LIFO */
+ rt_lock_idle_list(pool);
list_add(&worker->entry, &pool->idle_list);
+ rt_unlock_idle_list(pool);

if (too_many_workers(pool) && !timer_pending(&pool->idle_timer))
mod_timer(&pool->idle_timer, jiffies + IDLE_WORKER_TIMEOUT);
@@ -1604,7 +1632,9 @@ static void worker_leave_idle(struct worker *worker)
return;
worker_clr_flags(worker, WORKER_IDLE);
pool->nr_idle--;
+ rt_lock_idle_list(pool);
list_del_init(&worker->entry);
+ rt_unlock_idle_list(pool);
}

/**
@@ -1849,7 +1879,9 @@ static void destroy_worker(struct worker *worker)
*/
get_task_struct(worker->task);

+ rt_lock_idle_list(pool);
list_del_init(&worker->entry);
+ rt_unlock_idle_list(pool);
worker->flags |= WORKER_DIE;

idr_remove(&pool->worker_idr, worker->id);

2014-06-27 17:35:00

by Mike Galbraith

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Fri, 2014-06-27 at 10:01 -0400, Steven Rostedt wrote:

> This seems like a lot of hacks.

It is exactly that, lacking proper pooper-scooper, show rt kernel how to
not step in it.

> I'm wondering if it would work if we
> just have the rt_spin_lock_slowlock not call schedule(), but call
> __schedule() directly. I mean it would keep with the mainline paradigm
> as spinlocks don't sleep there, and one going to sleep in the -rt
> kernel is similar to it being preempted by a very long NMI.

Problem being that we do sleep there, do need wakeup. I have a hack
that turns them back into spinning locks, but it.. works too :)

> Does a spin_lock going to sleep really need to do all the presched and
> postsched work?

It would be lovely if we didn't have to do any of that. On the IO bit,
I haven't seen hard evidence that the spinlock bit is absolutely
required (better not be, it doesn't guarantee anything), but the
combined hack did kill IO deadlock of multiple filesystems.

-Mike

2014-06-27 17:54:19

by Steven Rostedt

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Fri, 27 Jun 2014 19:34:53 +0200
Mike Galbraith <[email protected]> wrote:

> On Fri, 2014-06-27 at 10:01 -0400, Steven Rostedt wrote:
>
> > This seems like a lot of hacks.
>
> It is exactly that, lacking proper pooper-scooper, show rt kernel how to
> not step in it.
>
> > I'm wondering if it would work if we
> > just have the rt_spin_lock_slowlock not call schedule(), but call
> > __schedule() directly. I mean it would keep with the mainline paradigm
> > as spinlocks don't sleep there, and one going to sleep in the -rt
> > kernel is similar to it being preempted by a very long NMI.
>
> Problem being that we do sleep there, do need wakeup. I have a hack
> that turns them back into spinning locks, but it.. works too :)

Why do we need the wakeup? the owner of the lock should wake it up
shouldn't it?


-- Steve

2014-06-27 18:07:59

by Mike Galbraith

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Fri, 2014-06-27 at 13:54 -0400, Steven Rostedt wrote:
> On Fri, 27 Jun 2014 19:34:53 +0200
> Mike Galbraith <[email protected]> wrote:
>
> > On Fri, 2014-06-27 at 10:01 -0400, Steven Rostedt wrote:
> >
> > > This seems like a lot of hacks.
> >
> > It is exactly that, lacking proper pooper-scooper, show rt kernel how to
> > not step in it.
> >
> > > I'm wondering if it would work if we
> > > just have the rt_spin_lock_slowlock not call schedule(), but call
> > > __schedule() directly. I mean it would keep with the mainline paradigm
> > > as spinlocks don't sleep there, and one going to sleep in the -rt
> > > kernel is similar to it being preempted by a very long NMI.
> >
> > Problem being that we do sleep there, do need wakeup. I have a hack
> > that turns them back into spinning locks, but it.. works too :)
>
> Why do we need the wakeup? the owner of the lock should wake it up
> shouldn't it?

True, but that can take ages.

-Mike

2014-06-27 18:19:08

by Steven Rostedt

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Fri, 27 Jun 2014 20:07:54 +0200
Mike Galbraith <[email protected]> wrote:

> > Why do we need the wakeup? the owner of the lock should wake it up
> > shouldn't it?
>
> True, but that can take ages.

Can it? If the workqueue is of some higher priority, it should boost
the process that owns the lock. Otherwise it just waits like anything
else does.

I much rather keep the paradigm of the mainline kernel than to add a
bunch of hacks that can cause more unforeseen side effects that may
cause other issues.

Remember, this would only be for spinlocks converted into a rtmutex,
not for normal mutex or other sleeps. In mainline, the wake up still
would not happen so why are we waking it up here?

This seems similar to the BKL crap we had to deal with as well. If we
were going to sleep because we were blocked on a spinlock converted
rtmutex we could not release and retake the BKL because we would end up
blocked on two locks. Instead, we made sure that the spinlock would not
release or take the BKL. It kept with the paradigm of mainline and
worked. Sucked, but it worked.

-- Steve

2014-06-27 19:11:57

by Mike Galbraith

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Fri, 2014-06-27 at 14:19 -0400, Steven Rostedt wrote:
> On Fri, 27 Jun 2014 20:07:54 +0200
> Mike Galbraith <[email protected]> wrote:
>
> > > Why do we need the wakeup? the owner of the lock should wake it up
> > > shouldn't it?
> >
> > True, but that can take ages.
>
> Can it? If the workqueue is of some higher priority, it should boost
> the process that owns the lock. Otherwise it just waits like anything
> else does.

No, not like everything else, preempt a lock holder. IO that starts
moving upon context switch in a stock kernel can rot for ages in rt when
an IO packing task bumps into a lock held by a preempted task. When
there's a prio delta on a lock, sure, PI kicks in to help a high prio
task.. but there is no PI help for a high priority task waiting on IO
stuck behind a low prio task's plug.

(you're talking workqueue, I'm talking IO, but the two meet in things
like raid too. You can't prioritize workqueues, and PI doesn't really
have a lot to do with the general issue of things happening or not
happening at context switch time, and consequences thereof)

> I much rather keep the paradigm of the mainline kernel than to add a
> bunch of hacks that can cause more unforeseen side effects that may
> cause other issues.

The paradigm of mainline is to start IO on context switch, reason for
that is IO deadlock prevention. We need to follow paradigm somehow.
That somehow really wants to be a tad prettier and more guaranteed than
my somehow :)

-Mike

2014-06-28 01:18:50

by Austin Schuh

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Fri, Jun 27, 2014 at 11:19 AM, Steven Rostedt <[email protected]> wrote:
> On Fri, 27 Jun 2014 20:07:54 +0200
> Mike Galbraith <[email protected]> wrote:
>
>> > Why do we need the wakeup? the owner of the lock should wake it up
>> > shouldn't it?
>>
>> True, but that can take ages.
>
> Can it? If the workqueue is of some higher priority, it should boost
> the process that owns the lock. Otherwise it just waits like anything
> else does.
>
> I much rather keep the paradigm of the mainline kernel than to add a
> bunch of hacks that can cause more unforeseen side effects that may
> cause other issues.
>
> Remember, this would only be for spinlocks converted into a rtmutex,
> not for normal mutex or other sleeps. In mainline, the wake up still
> would not happen so why are we waking it up here?
>
> This seems similar to the BKL crap we had to deal with as well. If we
> were going to sleep because we were blocked on a spinlock converted
> rtmutex we could not release and retake the BKL because we would end up
> blocked on two locks. Instead, we made sure that the spinlock would not
> release or take the BKL. It kept with the paradigm of mainline and
> worked. Sucked, but it worked.
>
> -- Steve

Sounds like you are arguing that we should disable preemption (or
whatever the right mechanism is) while holding the pool lock?

Workqueues spin up more threads when work that they are executing
blocks. This is done through hooks in the scheduler. This means that
we have to acquire the pool lock when work blocks on a lock in order
to see if there is more work and whether or not we need to spin up a
new thread.

It would be more context switches, but I wonder if we could kick the
workqueue logic completely out of the scheduler into a thread. Have
the scheduler increment/decrement an atomic pool counter, and wake up
the monitoring thread to spawn new threads when needed? That would
get rid of the recursive pool lock problem, and should reduce
scheduler latency if we would need to spawn a new thread.
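
As a rough sketch of that shape (hypothetical code, written as if it
lived in kernel/workqueue.c; pool->monitor_task and the monitor thread
below don't exist today, and the real manager logic would need more
care than this):

/*
 * Hypothetical sketch: the scheduler-side hook only touches an atomic
 * counter and wakes a per-pool monitor kthread; pool->lock is taken by
 * the monitor, outside the scheduler.
 */
static void wq_note_worker_sleeping(struct worker_pool *pool)
{
	/* Called from the scheduler hook; takes no pool->lock here. */
	if (atomic_dec_and_test(&pool->nr_running))
		wake_up_process(pool->monitor_task);	/* hypothetical field */
}

static int wq_pool_monitor(void *data)
{
	struct worker_pool *pool = data;

	while (!kthread_should_stop()) {
		set_current_state(TASK_INTERRUPTIBLE);
		schedule();
		__set_current_state(TASK_RUNNING);

		spin_lock_irq(&pool->lock);
		if (!atomic_read(&pool->nr_running) &&
		    !list_empty(&pool->worklist))
			wake_up_worker(pool);	/* or create a new worker */
		spin_unlock_irq(&pool->lock);
	}
	return 0;
}

The extra wakeups are the cost raised below: every time a busy worker
blocks you pay a wakeup and a context switch into the monitor.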

Austin

2014-06-28 03:32:18

by Mike Galbraith

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Fri, 2014-06-27 at 18:18 -0700, Austin Schuh wrote:

> It would be more context switches, but I wonder if we could kick the
> workqueue logic completely out of the scheduler into a thread. Have
> the scheduler increment/decrement an atomic pool counter, and wake up
> the monitoring thread to spawn new threads when needed? That would
> get rid of the recursive pool lock problem, and should reduce
> scheduler latency if we would need to spawn a new thread.

I was wondering the same thing, and not only for workqueue, but also the
plug pulling. It's kind of a wart to have that stuff sitting in the
heart of the scheduler in the first place, would be nice if it just went
away. When a task can't help itself, you _could_ wake a proxy to do that
for you. Trouble is, I can imagine that being a heck of a lot of
context switches with some loads.. and who's gonna help the helper when
he blocks while trying to help?

-Mike

2014-06-28 04:51:20

by Mike Galbraith

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Fri, 2014-06-27 at 16:24 +0200, Thomas Gleixner wrote:

> Completely untested patch below.

It's no longer completely untested, killer_module is no longer a killer.
I'll let box (lockdep etc is enabled) chew on it a while, no news is
good news as usual.

-Mike

2014-06-28 06:21:22

by Austin Schuh

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Fri, Jun 27, 2014 at 8:32 PM, Mike Galbraith
<[email protected]> wrote:
> On Fri, 2014-06-27 at 18:18 -0700, Austin Schuh wrote:
>
>> It would be more context switches, but I wonder if we could kick the
>> workqueue logic completely out of the scheduler into a thread. Have
>> the scheduler increment/decrement an atomic pool counter, and wake up
>> the monitoring thread to spawn new threads when needed? That would
>> get rid of the recursive pool lock problem, and should reduce
>> scheduler latency if we would need to spawn a new thread.
>
> I was wondering the same thing, and not only for workqueue, but also the
> plug pulling. It's kind of a wart to have that stuff sitting in the
> heart of the scheduler in the first place, would be nice if it just went
> away. When a task can't help itself, you _could_ wake a proxy to do that
> for you. Trouble is, I can imagine that being a heck of a lot of
> context switches with some loads.. and who's gonna help the helper when
> he blocks while trying to help?
>
> -Mike

For workqueues, as long as the helper doesn't block on a lock which
requires the work queue to be freed up, it will eventually become
unblocked and make progress. The helper _should_ only need the pool
lock, which will wake the helper back up when it is available again.
Nothing should go to sleep in an un-recoverable way with the work pool
lock held.

To drop the extra context switch, you could have a minimum of 2 worker
threads around at all times, and have the management thread start the
work and delegate to the next management thread. That thread would
then wait for the first thread to block, spawn a new thread, and then
start the next piece of work. Definitely a bit more complicated.

Austin

2014-06-28 07:11:23

by Mike Galbraith

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Fri, 2014-06-27 at 23:20 -0700, Austin Schuh wrote:

> For workqueues, as long as the helper doesn't block on a lock which
> requires the work queue to be freed up, it will eventually become
> unblocked and make progress. The helper _should_ only need the pool
> lock, which will wake the helper back up when it is available again.
> Nothing should go to sleep in an un-recoverable way with the work pool
> lock held.

Well, Thomas killed taking a lock from within the core of a lock, so
that wart shrank back to microscopic. Applying that same ointment to
the plug puller gizmo and schedule_work() would be most excellent.

-Mike

2014-07-01 00:12:47

by Austin Schuh

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <[email protected]> wrote:
> On Thu, 26 Jun 2014, Austin Schuh wrote:
>> If I'm reading the rt patch correctly, wq_worker_sleeping was moved
>> out of __schedule to sched_submit_work. It looks like that changes
>> the conditions under which wq_worker_sleeping is called. It used to
>> be called whenever a task was going to sleep (I think). It looks like
>> it is called now if the task is going to sleep, and if the task isn't
>> blocked on a PI mutex (I think).
>>
>> If I try the following experiment
>>
>> static inline void sched_submit_work(struct task_struct *tsk)
>> {
>> + if (tsk->state && tsk->flags & PF_WQ_WORKER) {
>> + wq_worker_sleeping(tsk);
>> + return;
>> + }
>>
>> and then remove the call later in the function, I am able to pass my test.
>>
>> Unfortunately, I then get a recursive pool spinlock BUG_ON after a
>> while (as I would expect), and it all blows up.
>
> Well, that's why the check for !pi_blocked_on is there.
>
>> I'm not sure where to go from there. Any changes to the workpool to
>> try to fix that will be hard, or could affect latency significantly.
>
> Completely untested patch below.
>
> Thanks,
>
> tglx
>
> --------------------->
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8749d20..621329a 100644
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index be0ef50..0ca9d97 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -849,25 +882,18 @@ void wq_worker_sleeping(struct task_struct *task)
> return;
>
> worker->sleeping = 1;
> - spin_lock_irq(&pool->lock);
> +
> /*
> * The counterpart of the following dec_and_test, implied mb,
> * worklist not empty test sequence is in insert_work().
> * Please read comment there.
> - *
> - * NOT_RUNNING is clear. This means that we're bound to and
> - * running on the local cpu w/ rq lock held and preemption
> - * disabled, which in turn means that none else could be
> - * manipulating idle_list, so dereferencing idle_list without pool
> - * lock is safe.
> */
> if (atomic_dec_and_test(&pool->nr_running) &&
> !list_empty(&pool->worklist)) {

What guarantees that this pool->worklist access is safe? Preemption
isn't disabled.

I'm seeing some inconsistency when accessing the idle list. You seem
to only disable preemption when writing to the idle list? I'm unsure
if I'm missing something, or what. With that question in mind, I took
a stab at adding locking around all the idle_list calls.

I went through and double checked to make sure that rt_lock_idle_list
and rt_unlock_idle_list surround all idle_list or entry accesses. The
following are places where I think you should be locking, but aren't.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8900da8..314a098 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -778,8 +805,12 @@ static bool too_many_workers(struct worker_pool *pool)
* nr_idle and idle_list may disagree if idle rebinding is in
* progress. Never return %true if idle_list is empty.
*/
- if (list_empty(&pool->idle_list))
+ rt_lock_idle_list(pool);
+ if (list_empty(&pool->idle_list)) {
+ rt_unlock_idle_list(pool);
return false;
+ }
+ rt_unlock_idle_list(pool);

return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
}
@@ -788,7 +819,7 @@ static bool too_many_workers(struct worker_pool *pool)
* Wake up functions.
*/

-/* Return the first worker. Safe with preemption disabled */
+/* Return the first worker. Preemption must be disabled */
static struct worker *first_worker(struct worker_pool *pool)
{
if (unlikely(list_empty(&pool->idle_list)))
@@ -1567,10 +1598,16 @@ static void worker_enter_idle(struct worker *worker)
{
struct worker_pool *pool = worker->pool;

- if (WARN_ON_ONCE(worker->flags & WORKER_IDLE) ||
- WARN_ON_ONCE(!list_empty(&worker->entry) &&
- (worker->hentry.next || worker->hentry.pprev)))
- return;
+ if (WARN_ON_ONCE(worker->flags & WORKER_IDLE)) return;
+
+ rt_lock_idle_list(pool);
+ if (WARN_ON_ONCE(!list_empty(&worker->entry))) {
+ rt_unlock_idle_list(pool);
+ if (worker->hentry.next || worker->hentry.pprev)
+ return;
+ } else {
+ rt_unlock_idle_list(pool);
+ }

/* can't use worker_set_flags(), also called from start_worker() */
worker->flags |= WORKER_IDLE;
@@ -1882,7 +1925,9 @@ static void idle_worker_timeout(unsigned long __pool)
unsigned long expires;

/* idle_list is kept in LIFO order, check the last one */
+ rt_lock_idle_list(pool);
worker = list_entry(pool->idle_list.prev, struct worker, entry);
+ rt_unlock_idle_list(pool);
expires = worker->last_active + IDLE_WORKER_TIMEOUT;

if (time_before(jiffies, expires))
@@ -2026,7 +2071,9 @@ static bool maybe_destroy_workers(struct
worker_pool *pool)
struct worker *worker;
unsigned long expires;

+ rt_lock_idle_list(pool);
worker = list_entry(pool->idle_list.prev, struct worker, entry);
+ rt_unlock_idle_list(pool);
expires = worker->last_active + IDLE_WORKER_TIMEOUT;

if (time_before(jiffies, expires)) {
@@ -2299,7 +2346,9 @@ woke_up:
/* am I supposed to die? */
if (unlikely(worker->flags & WORKER_DIE)) {
spin_unlock_irq(&pool->lock);
+ rt_lock_idle_list(pool);
WARN_ON_ONCE(!list_empty(&worker->entry));
+ rt_unlock_idle_list(pool);
worker->task->flags &= ~PF_WQ_WORKER;
return 0;
}
@@ -3584,8 +3633,17 @@ static void put_unbound_pool(struct worker_pool *pool)
mutex_lock(&pool->manager_mutex);
spin_lock_irq(&pool->lock);

- while ((worker = first_worker(pool)))
- destroy_worker(worker);
+ while (true) {
+ rt_lock_idle_list(pool);
+ if ((worker = first_worker(pool))) {
+ destroy_worker(worker);
+ } else {
+ break;
+ }
+ rt_unlock_idle_list(pool);
+ }
+ rt_unlock_idle_list(pool);
+
WARN_ON(pool->nr_workers || pool->nr_idle);

spin_unlock_irq(&pool->lock);

So far, my test machines are staying up, which is progress. I have a
couple hours of runtime on the unmodified patch, and 1/2 hour ish on
my modifications.

Thanks!
Austin

2014-07-01 00:53:52

by Austin Schuh

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Mon, Jun 30, 2014 at 5:12 PM, Austin Schuh <[email protected]> wrote:
> On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <[email protected]> wrote:
>> On Thu, 26 Jun 2014, Austin Schuh wrote:
>>> If I'm reading the rt patch correctly, wq_worker_sleeping was moved
>>> out of __schedule to sched_submit_work. It looks like that changes
>>> the conditions under which wq_worker_sleeping is called. It used to
>>> be called whenever a task was going to sleep (I think). It looks like
>>> it is called now if the task is going to sleep, and if the task isn't
>>> blocked on a PI mutex (I think).
>>>
>>> If I try the following experiment
>>>
>>> static inline void sched_submit_work(struct task_struct *tsk)
>>> {
>>> + if (tsk->state && tsk->flags & PF_WQ_WORKER) {
>>> + wq_worker_sleeping(tsk);
>>> + return;
>>> + }
>>>
>>> and then remove the call later in the function, I am able to pass my test.
>>>
>>> Unfortunately, I then get a recursive pool spinlock BUG_ON after a
>>> while (as I would expect), and it all blows up.
>>
>> Well, that's why the check for !pi_blocked_on is there.
>>
>>> I'm not sure where to go from there. Any changes to the workpool to
>>> try to fix that will be hard, or could affect latency significantly.
>>
>> Completely untested patch below.
>>
>> Thanks,
>>
>> tglx
>>
>> --------------------->
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 8749d20..621329a 100644
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index be0ef50..0ca9d97 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -849,25 +882,18 @@ void wq_worker_sleeping(struct task_struct *task)
>> return;
>>
>> worker->sleeping = 1;
>> - spin_lock_irq(&pool->lock);
>> +
>> /*
>> * The counterpart of the following dec_and_test, implied mb,
>> * worklist not empty test sequence is in insert_work().
>> * Please read comment there.
>> - *
>> - * NOT_RUNNING is clear. This means that we're bound to and
>> - * running on the local cpu w/ rq lock held and preemption
>> - * disabled, which in turn means that none else could be
>> - * manipulating idle_list, so dereferencing idle_list without pool
>> - * lock is safe.
>> */
>> if (atomic_dec_and_test(&pool->nr_running) &&
>> !list_empty(&pool->worklist)) {
>
> What guarantees that this pool->worklist access is safe? Preemption
> isn't disabled.

I think I might have an answer for my own question, but I would
appreciate someone else double-checking. If list_empty erroneously
returns that there is work to do when there isn't work to do, we wake
up an extra worker which then goes back to sleep. Not a big loss. If
list_empty erroneously returns that there isn't work to do when there
is, this should only be because someone is modifying the work list.
When they finish, as far as I can tell, all callers then check to see
if a worker needs to be started up, and start one.

Austin

2014-07-01 03:02:10

by Austin Schuh

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <[email protected]> wrote:
> Completely untested patch below.

By chance, I found this in my boot logs. I'll do some more startup
testing tomorrow.

Jun 30 19:54:40 vpc5 kernel: [ 0.670955] ------------[ cut here ]------------
Jun 30 19:54:40 vpc5 kernel: [ 0.670962] WARNING: CPU: 0 PID: 4 at
kernel/workqueue.c:1604 worker_enter_idle+0x65/0x16b()
Jun 30 19:54:40 vpc5 kernel: [ 0.670970] Modules linked in:
Jun 30 19:54:40 vpc5 kernel: [ 0.670973] CPU: 0 PID: 4 Comm:
kworker/0:0 Not tainted 3.14.3-rt4abs+ #8
Jun 30 19:54:40 vpc5 kernel: [ 0.670974] Hardware name: CompuLab
Intense-PC/Intense-PC, BIOS CR_2.2.0.377 X64 04/10/2013
Jun 30 19:54:40 vpc5 kernel: [ 0.670983] 0000000000000009
ffff88040ce75de8 ffffffff81510faf 0000000000000002
Jun 30 19:54:40 vpc5 kernel: [ 0.670985] 0000000000000000
ffff88040ce75e28 ffffffff81042085 0000000000000001
Jun 30 19:54:40 vpc5 kernel: [ 0.670987] ffffffff81057a60
ffff88042d406900 ffff88042da63fc0 ffff88042da64030
Jun 30 19:54:40 vpc5 kernel: [ 0.670988] Call Trace:
Jun 30 19:54:40 vpc5 kernel: [ 0.670995] [<ffffffff81510faf>]
dump_stack+0x4f/0x7c
Jun 30 19:54:40 vpc5 kernel: [ 0.670999] [<ffffffff81042085>]
warn_slowpath_common+0x81/0x9c
Jun 30 19:54:40 vpc5 kernel: [ 0.671002] [<ffffffff81057a60>] ?
worker_enter_idle+0x65/0x16b
Jun 30 19:54:40 vpc5 kernel: [ 0.671005] [<ffffffff810420ba>]
warn_slowpath_null+0x1a/0x1c
Jun 30 19:54:40 vpc5 kernel: [ 0.671007] [<ffffffff81057a60>]
worker_enter_idle+0x65/0x16b
Jun 30 19:54:40 vpc5 kernel: [ 0.671010] [<ffffffff8105a0a9>]
worker_thread+0x1b3/0x22b
Jun 30 19:54:40 vpc5 kernel: [ 0.671013] [<ffffffff81059ef6>] ?
rescuer_thread+0x293/0x293
Jun 30 19:54:40 vpc5 kernel: [ 0.671015] [<ffffffff81059ef6>] ?
rescuer_thread+0x293/0x293
Jun 30 19:54:40 vpc5 kernel: [ 0.671018] [<ffffffff8105f7ab>]
kthread+0xdc/0xe4
Jun 30 19:54:40 vpc5 kernel: [ 0.671022] [<ffffffff8105f6cf>] ?
flush_kthread_worker+0xe1/0xe1
Jun 30 19:54:40 vpc5 kernel: [ 0.671025] [<ffffffff8151586c>]
ret_from_fork+0x7c/0xb0
Jun 30 19:54:40 vpc5 kernel: [ 0.671027] [<ffffffff8105f6cf>] ?
flush_kthread_worker+0xe1/0xe1
Jun 30 19:54:40 vpc5 kernel: [ 0.671029] ---[ end trace 0000000000000001 ]---

2014-07-01 19:33:16

by Austin Schuh

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Mon, Jun 30, 2014 at 8:01 PM, Austin Schuh <[email protected]> wrote:
> On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <[email protected]> wrote:
>> Completely untested patch below.
>
> By chance, I found this in my boot logs. I'll do some more startup
> testing tomorrow.
>
> Jun 30 19:54:40 vpc5 kernel: [ 0.670955] ------------[ cut here ]------------
> Jun 30 19:54:40 vpc5 kernel: [ 0.670962] WARNING: CPU: 0 PID: 4 at
> kernel/workqueue.c:1604 worker_enter_idle+0x65/0x16b()
> Jun 30 19:54:40 vpc5 kernel: [ 0.670970] Modules linked in:
> Jun 30 19:54:40 vpc5 kernel: [ 0.670973] CPU: 0 PID: 4 Comm:
> kworker/0:0 Not tainted 3.14.3-rt4abs+ #8
> Jun 30 19:54:40 vpc5 kernel: [ 0.670974] Hardware name: CompuLab
> Intense-PC/Intense-PC, BIOS CR_2.2.0.377 X64 04/10/2013
> Jun 30 19:54:40 vpc5 kernel: [ 0.670983] 0000000000000009
> ffff88040ce75de8 ffffffff81510faf 0000000000000002
> Jun 30 19:54:40 vpc5 kernel: [ 0.670985] 0000000000000000
> ffff88040ce75e28 ffffffff81042085 0000000000000001
> Jun 30 19:54:40 vpc5 kernel: [ 0.670987] ffffffff81057a60
> ffff88042d406900 ffff88042da63fc0 ffff88042da64030
> Jun 30 19:54:40 vpc5 kernel: [ 0.670988] Call Trace:
> Jun 30 19:54:40 vpc5 kernel: [ 0.670995] [<ffffffff81510faf>]
> dump_stack+0x4f/0x7c
> Jun 30 19:54:40 vpc5 kernel: [ 0.670999] [<ffffffff81042085>]
> warn_slowpath_common+0x81/0x9c
> Jun 30 19:54:40 vpc5 kernel: [ 0.671002] [<ffffffff81057a60>] ?
> worker_enter_idle+0x65/0x16b
> Jun 30 19:54:40 vpc5 kernel: [ 0.671005] [<ffffffff810420ba>]
> warn_slowpath_null+0x1a/0x1c
> Jun 30 19:54:40 vpc5 kernel: [ 0.671007] [<ffffffff81057a60>]
> worker_enter_idle+0x65/0x16b
> Jun 30 19:54:40 vpc5 kernel: [ 0.671010] [<ffffffff8105a0a9>]
> worker_thread+0x1b3/0x22b
> Jun 30 19:54:40 vpc5 kernel: [ 0.671013] [<ffffffff81059ef6>] ?
> rescuer_thread+0x293/0x293
> Jun 30 19:54:40 vpc5 kernel: [ 0.671015] [<ffffffff81059ef6>] ?
> rescuer_thread+0x293/0x293
> Jun 30 19:54:40 vpc5 kernel: [ 0.671018] [<ffffffff8105f7ab>]
> kthread+0xdc/0xe4
> Jun 30 19:54:40 vpc5 kernel: [ 0.671022] [<ffffffff8105f6cf>] ?
> flush_kthread_worker+0xe1/0xe1
> Jun 30 19:54:40 vpc5 kernel: [ 0.671025] [<ffffffff8151586c>]
> ret_from_fork+0x7c/0xb0
> Jun 30 19:54:40 vpc5 kernel: [ 0.671027] [<ffffffff8105f6cf>] ?
> flush_kthread_worker+0xe1/0xe1
> Jun 30 19:54:40 vpc5 kernel: [ 0.671029] ---[ end trace 0000000000000001 ]---

Bug in my extra locking... Sorry for the noise. The second diff is a
cleaner way of destroying the workers.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8900da8..590cc26 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1567,10 +1602,16 @@ static void worker_enter_idle(struct worker *worker)
{
struct worker_pool *pool = worker->pool;

- if (WARN_ON_ONCE(worker->flags & WORKER_IDLE) ||
- WARN_ON_ONCE(!list_empty(&worker->entry) &&
- (worker->hentry.next || worker->hentry.pprev)))
+ if (WARN_ON_ONCE(worker->flags & WORKER_IDLE)) return;
+
+ rt_lock_idle_list(pool);
+ if (WARN_ON_ONCE(!list_empty(&worker->entry) &&
+ (worker->hentry.next || worker->hentry.pprev))) {
+ rt_unlock_idle_list(pool);
return;
+ } else {
+ rt_unlock_idle_list(pool);
+ }

/* can't use worker_set_flags(), also called from start_worker() */
worker->flags |= WORKER_IDLE;
@@ -3584,8 +3637,14 @@ static void put_unbound_pool(struct worker_pool *pool)
mutex_lock(&pool->manager_mutex);
spin_lock_irq(&pool->lock);

- while ((worker = first_worker(pool)))
+ rt_lock_idle_list(pool);
+ while ((worker = first_worker(pool))) {
+ rt_unlock_idle_list(pool);
destroy_worker(worker);
+ rt_lock_idle_list(pool);
+ }
+ rt_unlock_idle_list(pool);
+
WARN_ON(pool->nr_workers || pool->nr_idle);

spin_unlock_irq(&pool->lock);

2014-07-03 23:08:35

by Austin Schuh

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Tue, Jul 1, 2014 at 12:32 PM, Austin Schuh <[email protected]> wrote:
> On Mon, Jun 30, 2014 at 8:01 PM, Austin Schuh <[email protected]> wrote:
>> On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <[email protected]> wrote:
>>> Completely untested patch below.

I've tested it and looked it over now, and feel pretty confident in
the patch. Thanks Thomas!

I've been running this fix with my extra lock steps for a couple days
now on 3 machines, and haven't seen any issues. On each of the
machines, I've got a CAN card generating about 1 CPU worth of load
from interrupts, and the rest of the cores are tied up putting work on
the work queues with the killer_module.

What next?

Austin

2014-07-04 04:42:08

by Mike Galbraith

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Thu, 2014-07-03 at 16:08 -0700, Austin Schuh wrote:
> On Tue, Jul 1, 2014 at 12:32 PM, Austin Schuh <[email protected]> wrote:
> > On Mon, Jun 30, 2014 at 8:01 PM, Austin Schuh <[email protected]> wrote:
> >> On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <[email protected]> wrote:
> >>> Completely untested patch below.
>
> I've tested it and looked it over now, and feel pretty confident in
> the patch. Thanks Thomas!

FWIW, I flogged my boxen hard with the virgin tglx patch applied again
after merging 3.14-rt5 with 3.14.10. It remained thoroughly boring.

-Mike

2014-07-05 20:27:00

by Thomas Gleixner

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Mon, 30 Jun 2014, Austin Schuh wrote:
> I think I might have an answer for my own question, but I would
> appreciate someone else double-checking. If list_empty erroneously
> returns that there is work to do when there isn't work to do, we wake
> up an extra worker which then goes back to sleep. Not a big loss. If
> list_empty erroneously returns that there isn't work to do when there
> is, this should only be because someone is modifying the work list.
> When they finish, as far as I can tell, all callers then check to see
> if a worker needs to be started up, and start one.

Precisely.

2014-07-06 04:56:06

by Austin Schuh

Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT

On Sat, Jul 5, 2014 at 1:26 PM, Thomas Gleixner <[email protected]> wrote:
> On Mon, 30 Jun 2014, Austin Schuh wrote:
>> I think I might have an answer for my own question, but I would
>> appreciate someone else double-checking. If list_empty erroneously
>> returns that there is work to do when there isn't work to do, we wake
>> up an extra worker which then goes back to sleep. Not a big loss. If
>> list_empty erroneously returns that there isn't work to do when there
>> is, this should only be because someone is modifying the work list.
>> When they finish, as far as I can tell, all callers then check to see
>> if a worker needs to be started up, and start one.
>
> Precisely.

A comment there when you put together a polished patch for inclusion
would be awesome. I'm assuming that you will put the patch together?