2011-06-21 23:46:45

by Peter Zijlstra

Subject: [RFC][PATCH 1/3] sched, block: Move unplug

Thomas found that we're doing a horrendous amount of work in that scheduler
unplug hook while having preempt and IRQs disabled.

Move it to the head of schedule() where both preemption and IRQs are enabled
such that we don't get these silly long IRQ/preempt disable times.

This allows us to remove a lot of special magic in the unplug path,
simplifying that code as a bonus.

Cc: Jens Axboe <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
block/blk-core.c | 28 ++++++++--------------------
include/linux/blkdev.h | 12 ++----------
kernel/sched.c | 26 ++++++++++++++++----------
3 files changed, 26 insertions(+), 40 deletions(-)

Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c
+++ linux-2.6/block/blk-core.c
@@ -2655,25 +2655,13 @@ static int plug_rq_cmp(void *priv, struc
* additional stack usage in driver dispatch, in places where the originally
* plugger did not intend it.
*/
-static void queue_unplugged(struct request_queue *q, unsigned int depth,
- bool from_schedule)
+static void queue_unplugged(struct request_queue *q, unsigned int depth)
__releases(q->queue_lock)
{
- trace_block_unplug(q, depth, !from_schedule);
-
- /*
- * If we are punting this to kblockd, then we can safely drop
- * the queue_lock before waking kblockd (which needs to take
- * this lock).
- */
- if (from_schedule) {
- spin_unlock(q->queue_lock);
- blk_run_queue_async(q);
- } else {
- __blk_run_queue(q);
- spin_unlock(q->queue_lock);
- }
+ trace_block_unplug(q, depth, true);

+ __blk_run_queue(q);
+ spin_unlock(q->queue_lock);
}

static void flush_plug_callbacks(struct blk_plug *plug)
@@ -2694,7 +2682,7 @@ static void flush_plug_callbacks(struct
}
}

-void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
+void blk_flush_plug_list(struct blk_plug *plug)
{
struct request_queue *q;
unsigned long flags;
@@ -2732,7 +2720,7 @@ void blk_flush_plug_list(struct blk_plug
* This drops the queue lock
*/
if (q)
- queue_unplugged(q, depth, from_schedule);
+ queue_unplugged(q, depth);
q = rq->q;
depth = 0;
spin_lock(q->queue_lock);
@@ -2752,14 +2740,14 @@ void blk_flush_plug_list(struct blk_plug
* This drops the queue lock
*/
if (q)
- queue_unplugged(q, depth, from_schedule);
+ queue_unplugged(q, depth);

local_irq_restore(flags);
}

void blk_finish_plug(struct blk_plug *plug)
{
- blk_flush_plug_list(plug, false);
+ blk_flush_plug_list(plug);

if (plug == current->plug)
current->plug = NULL;
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h
+++ linux-2.6/include/linux/blkdev.h
@@ -870,22 +870,14 @@ struct blk_plug_cb {

extern void blk_start_plug(struct blk_plug *);
extern void blk_finish_plug(struct blk_plug *);
-extern void blk_flush_plug_list(struct blk_plug *, bool);
+extern void blk_flush_plug_list(struct blk_plug *);

static inline void blk_flush_plug(struct task_struct *tsk)
{
struct blk_plug *plug = tsk->plug;

if (plug)
- blk_flush_plug_list(plug, false);
-}
-
-static inline void blk_schedule_flush_plug(struct task_struct *tsk)
-{
- struct blk_plug *plug = tsk->plug;
-
- if (plug)
- blk_flush_plug_list(plug, true);
+ blk_flush_plug_list(plug);
}

static inline bool blk_needs_flush_plug(struct task_struct *tsk)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4209,6 +4209,20 @@ pick_next_task(struct rq *rq)
BUG(); /* the idle class will always have a runnable task */
}

+static inline void sched_submit_work(void)
+{
+ struct task_struct *tsk = current;
+
+ if (tsk->state && !(preempt_count() & PREEMPT_ACTIVE)) {
+ /*
+ * If we are going to sleep and we have plugged IO
+ * queued, make sure to submit it to avoid deadlocks.
+ */
+ if (blk_needs_flush_plug(tsk))
+ blk_flush_plug(tsk);
+ }
+}
+
/*
* schedule() is the main scheduler function.
*/
@@ -4219,6 +4233,8 @@ asmlinkage void __sched schedule(void)
struct rq *rq;
int cpu;

+ sched_submit_work();
+
need_resched:
preempt_disable();
cpu = smp_processor_id();
@@ -4253,16 +4269,6 @@ asmlinkage void __sched schedule(void)
if (to_wakeup)
try_to_wake_up_local(to_wakeup);
}
-
- /*
- * If we are going to sleep and we have plugged IO
- * queued, make sure to submit it to avoid deadlocks.
- */
- if (blk_needs_flush_plug(prev)) {
- raw_spin_unlock(&rq->lock);
- blk_schedule_flush_plug(prev);
- raw_spin_lock(&rq->lock);
- }
}
switch_count = &prev->nvcsw;
}


2011-06-22 07:01:45

by Jens Axboe

Subject: Re: [RFC][PATCH 1/3] sched, block: Move unplug

On 2011-06-22 01:34, Peter Zijlstra wrote:
> Thomas found that we're doing a horrendous amount of work in that scheduler
> unplug hook while having preempt and IRQs disabled.
>
> Move it to the head of schedule() where both preemption and IRQs are enabled
> such that we don't get these silly long IRQ/preempt disable times.
>
> This allows us to remove a lot of special magic in the unplug path,
> simplifying that code as a bonus.

The major change here is moving the queue running inline, instead of
punting to a thread. The worry is/was that we risk blowing the stack if
something ends up blocking inadvertently further down the call path.
Since it's the unlikely way to unplug, a bit of latency was acceptable
to prevent this problem.

I'm curious why you made that change? It seems orthogonal to the change
you are actually describing in the commit message.

--
Jens Axboe

2011-06-22 13:53:45

by Thomas Gleixner

Subject: Re: [RFC][PATCH 1/3] sched, block: Move unplug

On Wed, 22 Jun 2011, Jens Axboe wrote:

> On 2011-06-22 01:34, Peter Zijlstra wrote:
> > Thomas found that we're doing a horrendous amount of work in that scheduler
> > unplug hook while having preempt and IRQs disabled.
> >
> > Move it to the head of schedule() where both preemption and IRQs are enabled
> > such that we don't get these silly long IRQ/preempt disable times.
> >
> > This allows us to remove a lot of special magic in the unplug path,
> > simplifying that code as a bonus.
>
> The major change here is moving the queue running inline, instead of
> punting to a thread. The worry is/was that we risk blowing the stack if
> something ends up blocking inadvertently further down the call path.

Is that a real problem or just a "we have no clue what might happen"
countermeasure? The plug list should not be magically refilled once
it's split off so this should not recurse endlessly, right? If it does
then we better fix it at the root cause of the problem and not by
adding some last resort band aid into the scheduler code.

If the stack usage of that whole block code is the real issue, then we
probably need to keep that "delegate to async" workaround [sigh!], but
definitely outside of the scheduler core code.

> Since it's the unlikely way to unplug, a bit of latency was acceptable
> to prevent this problem.

It's not at all acceptable. There is no reason to hook stuff which
runs perfectly fine in preemptible code into the irq disabled region
of the scheduler internals.

> I'm curious why you made that change? It seems orthogonal to the change
> you are actually describing in the commit message.

Right, it should be split into two separate commits, one moving the
stuff out from the irq disabled region and the other removing that
from_schedule hackery. The latter can be dropped.

Thanks,

tglx

2011-06-22 14:02:06

by Jens Axboe

Subject: Re: [RFC][PATCH 1/3] sched, block: Move unplug

On 2011-06-22 15:53, Thomas Gleixner wrote:
> On Wed, 22 Jun 2011, Jens Axboe wrote:
>
>> On 2011-06-22 01:34, Peter Zijlstra wrote:
>>> Thomas found that we're doing a horrendous amount of work in that scheduler
>>> unplug hook while having preempt and IRQs disabled.
>>>
>>> Move it to the head of schedule() where both preemption and IRQs are enabled
>>> such that we don't get these silly long IRQ/preempt disable times.
>>>
>>> This allows us to remove a lot of special magic in the unplug path,
>>> simplifying that code as a bonus.
>>
>> The major change here is moving the queue running inline, instead of
>> punting to a thread. The worry is/was that we risk blowing the stack if
>> something ends up blocking inadvertently further down the call path.
>
> Is that a real problem or just a "we have no clue what might happen"
> countermeasure? The plug list should not be magically refilled once
> it's split off so this should not recurse endlessly, right? If it does
> then we better fix it at the root cause of the problem and not by
> adding some last resort band aid into the scheduler code.

It is supposedly a real problem, not just an inkling. It's not about
recursing indefinitely, the plug is fairly bounded. But the IO dispatch
path can be pretty deep, and if you hit that deep inside the reclaim or
file system write path, then you get dangerously close. Dave Chinner
posted some numbers in the 2.6.39-rc1 time frame showing how close we
got.

The scheduler hook has nothing to do with this, we need that regardless.
My objection was the conversion from async to sync run, something that
wasn't even mentioned in the patch description (yet it was the most
interesting part of the change).

> If the stack usage of that whole block code is the real issue, then we
> probably need to keep that "delegate to async" workaround [sigh!], but
> definitely outside of the scheduler core code.

Placement of the call is also orthogonal. The only requirements are
really:

- IFF the process is going to sleep, flush the plug list

Nothing more, nothing less. We can tolerate false positives, but as a
general rule it should only happen when the process goes to sleep.

>> Since it's the unlikely way to unplug, a bit of latency was acceptable
>> to prevent this problem.
>
> It's not at all acceptable. There is no reason to hook stuff which
> runs perfectly fine in preemptible code into the irq disabled region
> of the scheduler internals.

We are talking past each other again. Flushing on going to sleep is
needed. Placement of that call was pretty much left in the hands of the
scheduler people. I personally don't care where it's put, as long as it
does what is needed.

>> I'm curious why you made that change? It seems orthogonal to the change
>> you are actually describing in the commit message.
>
> Right, it should be split into two separate commits, one moving the
> stuff out from the irq disabled region and the other removing that
> from_schedule hackery. The latter can be dropped.

Exactly.

--
Jens Axboe

2011-06-22 14:30:43

by Thomas Gleixner

Subject: Re: [RFC][PATCH 1/3] sched, block: Move unplug

On Wed, 22 Jun 2011, Jens Axboe wrote:
> On 2011-06-22 15:53, Thomas Gleixner wrote:
> > On Wed, 22 Jun 2011, Jens Axboe wrote:
> > Is that a real problem or just a "we have no clue what might happen"
> > countermeasure? The plug list should not be magically refilled once
> > it's split off so this should not recurse endlessly, right? If it does
> > then we better fix it at the root cause of the problem and not by
> > adding some last resort band aid into the scheduler code.
>
> It is supposedly a real problem, not just an inkling. It's not about
> recursing indefinitely, the plug is fairly bounded. But the IO dispatch
> path can be pretty deep, and if you hit that deep inside the reclaim or
> file system write path, then you get dangerously close. Dave Chinner
> posted some numbers in the 2.6.39-rc1 time frame showing how close we
> got.

Fair enough.

> We are talking past each other again. Flushing on going to sleep is
> needed. Placement of that call was pretty much left in the hands of the
> scheduler people. I personally don't care where it's put, as long as it
> does what is needed.

Ok. So we move it out and keep the from_schedule flag so that code
does not go down the IO path from there.

Thanks,

tglx

2011-06-22 14:38:53

by Peter Zijlstra

Subject: Re: [RFC][PATCH 1/3] sched, block: Move unplug

On Wed, 2011-06-22 at 16:30 +0200, Thomas Gleixner wrote:
> > It is supposedly a real problem, not just an inkling. It's not about
> > recursing indefinitely, the plug is fairly bounded. But the IO dispatch
> > path can be pretty deep, and if you hit that deep inside the reclaim or
> > file system write path, then you get dangerously close. Dave Chinner
> > posted some numbers in the 2.6.39-rc1 time frame showing how close we
> > got.
>
> Fair enough.

> Ok. So we move it out and keep the from_schedule flag so that code
> does not go down the IO path from there.

Won't punting the plug to a worker thread wreck all kinds of io
accounting due to the wrong task doing the actual io submission?

2011-06-22 15:09:14

by Vivek Goyal

Subject: Re: [RFC][PATCH 1/3] sched, block: Move unplug

On Wed, Jun 22, 2011 at 04:38:01PM +0200, Peter Zijlstra wrote:
> On Wed, 2011-06-22 at 16:30 +0200, Thomas Gleixner wrote:
> > > It is supposedly a real problem, not just an inkling. It's not about
> > > recursing indefinitely, the plug is fairly bounded. But the IO dispatch
> > > path can be pretty deep, and if you hit that deep inside the reclaim or
> > > file system write path, then you get dangerously close. Dave Chinner
> > > posted some numbers in the 2.6.39-rc1 time frame showing how close we
> > > got.
> >
> > Fair enough.
>
> > Ok. So we move it out and keep the from_schedule flag so that code
> > does not go down the IO path from there.
>
> Won't punting the plug to a worker thread wreck all kinds of io
> accounting due to the wrong task doing the actual io submission?

I think all the accounting will be done in the IO submission path, while
the IO is added to the plug. This is just the plug flush, so it should
not have any effect on accounting.

Thanks
Vivek

2011-06-22 16:04:46

by Jens Axboe

Subject: Re: [RFC][PATCH 1/3] sched, block: Move unplug

On 2011-06-22 17:08, Vivek Goyal wrote:
> On Wed, Jun 22, 2011 at 04:38:01PM +0200, Peter Zijlstra wrote:
>> On Wed, 2011-06-22 at 16:30 +0200, Thomas Gleixner wrote:
>>>> It is supposedly a real problem, not just an inkling. It's not about
>>>> recursing indefinitely, the plug is fairly bounded. But the IO dispatch
>>>> path can be pretty deep, and if you hit that deep inside the reclaim or
>>>> file system write path, then you get dangerously close. Dave Chinner
>>>> posted some numbers in the 2.6.39-rc1 time frame showing how close we
>>>> got.
>>>
>>> Fair enough.
>>
>>> Ok. So we move it out and keep the from_schedule flag so that code
>>> does not go down the IO path from there.
>>
>> Won't punting the plug to a worker thread wreck all kinds of io
>> accounting due to the wrong task doing the actual io submission?
>
> I think all the accounting will be done in the IO submission path, while
> the IO is added to the plug. This is just the plug flush, so it should
> not have any effect on accounting.

Exactly, this is just the insert operation, so no worries there. The
requests are fully "formulated".

--
Jens Axboe