2015-02-05 21:51:43

by NeilBrown

Subject: Re: RAID1 might_sleep() warning on 3.19-rc7

On Thu, 05 Feb 2015 15:27:58 -0500 Tony Battersby <[email protected]>
wrote:

> I get the might_sleep() warning below when writing some data to an ext3
> filesystem on a RAID1. But everything works OK, so there is no actual
> problem, just a warning.
>
> I see that there has been a fix for a might_sleep() warning in md/bitmap
> since 3.19-rc7, but this is a different warning.

Hi Tony,
this is another false positive caused by

commit 8eb23b9f35aae413140d3fda766a98092c21e9b0
Author: Peter Zijlstra <[email protected]>
Date: Wed Sep 24 10:18:55 2014 +0200

sched: Debug nested sleeps


It is even described in that commit:

Another observed problem is calling a blocking function from
schedule()->sched_submit_work()->blk_schedule_flush_plug() which will
then destroy the task state for the actual __schedule() call that
comes after it.

That is exactly what is happening here. However I don't think that is an
"observed problem" but rather an "observed false-positive".

If nothing inside the outer loop blocks, then in particular
generic_make_request will not be called, so nothing will be added to the
queue that blk_schedule_flush_plug flushes.
So the first time through the loop, a call to 'schedule()' may not actually
block, but every subsequent time it will.
So there is no actual problem here.
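
For reference, the loop in question looks roughly like this - a paraphrase of
__wait_on_bit()/bit_wait_io() from the 3.19-era kernel, not a verbatim quote.
Only the first pass can find anything on the plug list, so the might_sleep()
in generic_make_request() can fire at most once per wait:

int __sched __wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
                          wait_bit_action_f *action, unsigned mode)
{
        int ret = 0;

        do {
                prepare_to_wait(wq, &q->wait, mode);    /* state != TASK_RUNNING */
                if (test_bit(q->key.bit_nr, q->key.flags))
                        ret = (*action)(&q->key);       /* bit_wait_io() -> io_schedule() */
        } while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
        finish_wait(wq, &q->wait);
        return ret;
}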

So I'd be inclined to add sched_annotate_sleep() in blk_flush_plug_list().

Peter: what do you think is the best way to silence this warning?

Thanks,
NeilBrown



>
> ---
>
> > cat /proc/mdstat
> Personalities : [raid1]
> md0 : active raid1 sda1[0] sdb1[1]
> 1959884 blocks super 1.0 [2/2] [UU]
>
> unused devices: <none>
>
> ---
>
> > grep md0 /proc/mounts
> /dev/md0 / ext3 rw,noatime,errors=continue,barrier=1,data=journal 0 0
>
> ---
>
> WARNING: CPU: 3 PID: 1069 at kernel/sched/core.c:7300 __might_sleep+0x82/0x90()
> do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff8028faa1>] prepare_to_wait+0x31/0xa0
> Modules linked in: iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi igb i2c_algo_bit ptp pps_core mptsas mptscsih mptbase pm80xx libsas mpt2sas scsi_transport_sas raid_class sg coretemp eeprom w83795 i2c_i801
> CPU: 3 PID: 1069 Comm: kjournald Not tainted 3.19.0-rc7 #1
> Hardware name: Supermicro X8DTH-i/6/iF/6F/X8DTH, BIOS 2.1b 05/04/12
> 0000000000001c84 ffff88032f1df608 ffffffff80645918 0000000000001c84
> ffff88032f1df658 ffff88032f1df648 ffffffff8025ea6b ffff8800bb0b4d58
> 0000000000000000 00000000000006f6 ffffffff80942b6f ffff8803317b8a00
> Call Trace:
> [<ffffffff80645918>] dump_stack+0x4f/0x6f
> [<ffffffff8025ea6b>] warn_slowpath_common+0x8b/0xd0
> [<ffffffff8025eb51>] warn_slowpath_fmt+0x41/0x50
> [<ffffffff8028faa1>] ? prepare_to_wait+0x31/0xa0
> [<ffffffff8028faa1>] ? prepare_to_wait+0x31/0xa0
> [<ffffffff8027ee62>] __might_sleep+0x82/0x90
> [<ffffffff803bee06>] generic_make_request_checks+0x36/0x2d0
> [<ffffffff802943ed>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff803bf0b3>] generic_make_request+0x13/0x100
> [<ffffffff8054983b>] raid1_unplug+0x12b/0x170
> [<ffffffff803c1302>] blk_flush_plug_list+0xa2/0x230
> [<ffffffff80294315>] ? trace_hardirqs_on_caller+0x105/0x1d0
> [<ffffffff80646760>] ? bit_wait_timeout+0x70/0x70
> [<ffffffff80646383>] io_schedule+0x43/0x80
> [<ffffffff80646787>] bit_wait_io+0x27/0x50
> [<ffffffff80646a7d>] __wait_on_bit+0x5d/0x90
> [<ffffffff803bf160>] ? generic_make_request+0xc0/0x100
> [<ffffffff80646760>] ? bit_wait_timeout+0x70/0x70
> [<ffffffff80646bc3>] out_of_line_wait_on_bit+0x73/0x90
> [<ffffffff8028f680>] ? wake_atomic_t_function+0x40/0x40
> [<ffffffff8034b60f>] __wait_on_buffer+0x3f/0x50
> [<ffffffff8034df18>] __bread_gfp+0xa8/0xd0
> [<ffffffff80388d45>] ext3_get_branch+0x95/0x140
> [<ffffffff80389716>] ext3_get_blocks_handle+0xb6/0xca0
> [<ffffffff8029760c>] ? __lock_acquire+0x50c/0xc30
> [<ffffffff803114b2>] ? __slab_alloc+0x212/0x560
> [<ffffffff80294315>] ? trace_hardirqs_on_caller+0x105/0x1d0
> [<ffffffff8038a3a8>] ext3_get_block+0xa8/0x100
> [<ffffffff80349bba>] generic_block_bmap+0x3a/0x40
> [<ffffffff8038956d>] ext3_bmap+0x7d/0x90
> [<ffffffff80333e2c>] bmap+0x1c/0x20
> [<ffffffff8039ee70>] journal_bmap+0x30/0xa0
> [<ffffffff8039f238>] journal_next_log_block+0x78/0xa0
> [<ffffffff8039a637>] journal_commit_transaction+0x657/0x13e0
> [<ffffffff802aaa87>] ? lock_timer_base+0x37/0x70
> [<ffffffff802ab0c0>] ? get_next_timer_interrupt+0x240/0x240
> [<ffffffff8039e632>] kjournald+0xf2/0x210
> [<ffffffff8028f600>] ? woken_wake_function+0x10/0x10
> [<ffffffff8039e540>] ? commit_timeout+0x10/0x10
> [<ffffffff80279e2e>] kthread+0xee/0x120
> [<ffffffff80279d40>] ? __init_kthread_worker+0x70/0x70
> [<ffffffff8064b56c>] ret_from_fork+0x7c/0xb0
> [<ffffffff80279d40>] ? __init_kthread_worker+0x70/0x70
> ---[ end trace 27f081e879dfbb12 ]---



2015-02-06 11:39:39

by Peter Zijlstra

Subject: Re: RAID1 might_sleep() warning on 3.19-rc7

On Fri, Feb 06, 2015 at 08:51:33AM +1100, NeilBrown wrote:
> That is exactly what is happening here. However I don't think that is an
> "observed problem" but rather an "observed false-positive".
>
> If nothing inside the outer loop blocks, then in particular
> generic_make_request will not be called, so nothing will be added to the
> queue that blk_schedule_flush_plug flushes.
> So the first time through the loop, a call to 'schedule()' may not actually
> block, but every subsequent time it will.
> So there is no actual problem here.
>
> So I'd be inclined to add sched_annotate_sleep() in blk_flush_plug_list().
>
> Peter: what do you think is the best way to silence this warning?

> > Call Trace:

> > [<ffffffff8027ee62>] __might_sleep+0x82/0x90
> > [<ffffffff803bee06>] generic_make_request_checks+0x36/0x2d0
> > [<ffffffff803bf0b3>] generic_make_request+0x13/0x100
> > [<ffffffff8054983b>] raid1_unplug+0x12b/0x170
> > [<ffffffff803c1302>] blk_flush_plug_list+0xa2/0x230
> > [<ffffffff80646383>] io_schedule+0x43/0x80
> > [<ffffffff80646787>] bit_wait_io+0x27/0x50

Well, I don't know. I don't particularly like the whole blk_flush_plug()
thing scheduling while on its way to schedule. If you ever end up
calling io_schedule() from it there's 'fun'.

Also, how likely is it to actually schedule when doing all that? This
block layer stuff is somewhat impenetrable for me, too many callbacks.

You have some words on how it's unlikely, but I can't even find _where_
it would schedule :/ All I see is a loop calling ->make_request_fn() and
god only knows where that ends up.

So there appear to be two blk_flush_plug() variants, one with an
@from_schedule = true, which seems to really try not to schedule, which
seems to suggest the 'false' one (the one above) is meant to schedule?

If scheduling is the rule rather than the exception, the above is
properly broken.

But again, I don't know.

If you're confident that scheduling is rare for _ALL_ (current and
future) block device implementations, not just the raid one, then you
can annotate blk_flush_plug_list() I suppose.

Otherwise I would suggest adding them one at a time in whatever blk
device thing likes to go schedule on us. Also, add a comment that
explains why it's rare for the future us who need to look at it again.

2015-02-09 01:14:10

by NeilBrown

Subject: Re: RAID1 might_sleep() warning on 3.19-rc7

On Fri, 6 Feb 2015 12:39:30 +0100 Peter Zijlstra <[email protected]> wrote:

> On Fri, Feb 06, 2015 at 08:51:33AM +1100, NeilBrown wrote:
> > That is exactly what is happening here. However I don't think that is an
> > "observed problem" but rather an "observed false-positive".
> >
> > If nothing inside the outer loop blocks, then in particular
> > generic_make_request will not be called, so nothing will be added to the
> > queue that blk_schedule_flush_plug flushes.
> > So the first time through the loop, a call to 'schedule()' may not actually
> > block, but every subsequent time it will.
> > So there is no actual problem here.
> >
> > So I'd be inclined to add sched_annotate_sleep() in blk_flush_plug_list().
> >
> > Peter: what do you think is the best way to silence this warning?
>
> > > Call Trace:
>
> > > [<ffffffff8027ee62>] __might_sleep+0x82/0x90
> > > [<ffffffff803bee06>] generic_make_request_checks+0x36/0x2d0
> > > [<ffffffff803bf0b3>] generic_make_request+0x13/0x100
> > > [<ffffffff8054983b>] raid1_unplug+0x12b/0x170
> > > [<ffffffff803c1302>] blk_flush_plug_list+0xa2/0x230
> > > [<ffffffff80646383>] io_schedule+0x43/0x80
> > > [<ffffffff80646787>] bit_wait_io+0x27/0x50
>
> Well, I don't know. I don't particularly like the whole blk_flush_plug()
> thing scheduling while on its way to schedule. If you ever end up
> calling io_schedule() from it there's 'fun'.
>
> Also, how likely is it to actually schedule when doing all that? This
> block layer stuff is somewhat impenetrable for me, too many callbacks.
>
> You have some words on how it's unlikely, but I can't even find _where_
> it would schedule :/ All I see is a loop calling ->make_request_fn() and
> god only knows where that ends up.
>
> So there appear to be two blk_flush_plug() variants, one with an
> @from_schedule = true, which seems to really try not to schedule, which
> seems to suggest the 'false' one (the one above) is meant to schedule?
>
> If scheduling is the rule rather than the exception, the above is
> properly broken.
>
> But again, I don't know.

I had to re-read the code (And your analysis) a couple of times to be sure ...

As you say, when schedule() calls blk_schedule_flush_plug(), the
@from_schedule=true variant is used and the unplug code doesn't block.
So there is no problem there.
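
Both variants are just thin wrappers around blk_flush_plug_list(); roughly,
from include/linux/blkdev.h of that era (shown from memory, so treat the
details as approximate):

static inline void blk_flush_plug(struct task_struct *tsk)
{
        struct blk_plug *plug = tsk->plug;

        if (plug)
                blk_flush_plug_list(plug, false);       /* may block */
}

static inline void blk_schedule_flush_plug(struct task_struct *tsk)
{
        struct blk_plug *plug = tsk->plug;

        if (plug)
                blk_flush_plug_list(plug, true);        /* must not block */
}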

However, when io_schedule() explicitly calls blk_flush_plug(), then
@from_schedule=false variant is used, and the unplug functions are allowed to
allocate memory and block and maybe even call mempool_alloc() which might
call io_schedule().

This shouldn't be a problem as blk_flush_plug() spliced out the plug list, so
any recursive call will find an empty list and do nothing.

Worst case is that a wait_event loop that calls io_schedule() (i.e.
wait_on_bit_io()) might not block in the first call to io_schedule()
if the unplugging needed to wait. Every subsequent call will block as
required as there is nothing else to add requests to the plug queue.

So as long as wait_on_bit_io() can cope with a single false wakeup (which it
can), there is no problem here.
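
The splice-out that makes this safe happens before anything is dispatched; a
trimmed sketch of blk_flush_plug_list() from block/blk-core.c of that era
(paraphrased, blk-mq details omitted):

void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
{
        LIST_HEAD(list);

        /* run the unplug callbacks, e.g. raid1_unplug() */
        flush_plug_callbacks(plug, from_schedule);

        if (list_empty(&plug->list))
                return;

        /* plug->list is emptied here, before any request is sent on, so a
         * nested blk_flush_plug() finds nothing left to do */
        list_splice_init(&plug->list, &list);

        /* ... requests on the local 'list' are then dispatched ... */
}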

>
> If you're confident that scheduling is rare for _ALL_ (current and
> future) block device implementations, not just the raid one, then you
> can annotate blk_flush_plug_list() I suppose.
>
> Otherwise I would suggest adding them one at a time in whatever blk
> device thing likes to go schedule on us. Also, add a comment that
> explains why it's rare for the future us who need to look at it again.

It isn't that scheduling is "rare" - it is that it can only occur once in a
loop which doesn't expect it.

So I propose the following, though I haven't tested it.

Signed-off-by: NeilBrown <[email protected]>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e628cb11b560..b0f12ab3df23 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4374,6 +4374,11 @@ void __sched io_schedule(void)

delayacct_blkio_start();
atomic_inc(&rq->nr_iowait);
+ /* Any sleeping in blk_flush_plug() should not
+ * trigger the "do not call blocking ops" warning
+ * as it can only happen once in a wait_event loop.
+ */
+ sched_annotate_sleep();
blk_flush_plug(current);
current->in_iowait = 1;
schedule();
@@ -4390,6 +4395,11 @@ long __sched io_schedule_timeout(long timeout)

delayacct_blkio_start();
atomic_inc(&rq->nr_iowait);
+ /* Any sleeping in blk_flush_plug() should not
+ * trigger the "do not call blocking ops" warning
+ * as it can only happen once in a wait_event loop.
+ */
+ sched_annotate_sleep();
blk_flush_plug(current);
current->in_iowait = 1;
ret = schedule_timeout(timeout);



2015-02-09 09:10:09

by Peter Zijlstra

Subject: Re: RAID1 might_sleep() warning on 3.19-rc7

On Mon, Feb 09, 2015 at 12:13:57PM +1100, NeilBrown wrote:
> I had to re-read the code (And your analysis) a couple of times to be sure ...

Sorry :-)

> However, when io_schedule() explicitly calls blk_flush_plug(), then
> @from_schedule=false variant is used, and the unplug functions are allowed to
> allocate memory and block and maybe even call mempool_alloc() which might
> call io_schedule().
>
> This shouldn't be a problem as blk_flush_plug() spliced out the plug list, so
> any recursive call will find an empty list and do nothing.

Unless, something along the way stuck something back on, right? So
should we stick an:

WARN_ON(current->in_iowait);

somewhere near where things are added to this plug list? (and move the
blk_flush_plug() call inside of where that's actually true of course).
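
Purely to illustrate the suggestion - not a real patch, and the helper name
here is invented; in the 3.19 non-mq path the list_add_tail() sits in
blk_queue_bio():

/* hypothetical placement of the suggested check: complain if a request is
 * being plugged while the task is already inside io_schedule() */
static void plug_add_request(struct blk_plug *plug, struct request *rq)
{
        WARN_ON(current->in_iowait);
        list_add_tail(&rq->queuelist, &plug->list);
}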

> Worst case is that a wait_event loop that calls io_schedule() (i.e.
> wait_on_bit_io()) might not block in the first call to io_schedule()
> if the unplugging needed to wait. Every subsequent call will block as
> required as there is nothing else to add requests to the plug queue.

Again, assuming @cond will not actually stick something on this list.
Which if we add the above we'll get warned about.

> It isn't that scheduling is "rare" - it is that it can only occur once in a
> loop which doesn't expect it.

With the above WARN stuck in, agreed.

> So I propose the following, though I haven't tested it.
>
> Signed-off-by: NeilBrown <[email protected]>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e628cb11b560..b0f12ab3df23 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4374,6 +4374,11 @@ void __sched io_schedule(void)
>
> delayacct_blkio_start();
> atomic_inc(&rq->nr_iowait);
> + /* Any sleeping in blk_flush_plug() should not
> + * trigger the "do not call blocking ops" warning
> + * as it can only happen once in a wait_event loop.
> + */

Might I suggest the 'regular' multi-line comment style, and a reference
to the above WARN that makes everything actually work?

/*
* multi-line
* comments have an empty
* line at the start... As per CodingStyle ch. 8
*/

> + sched_annotate_sleep();
> blk_flush_plug(current);

Also, at this point, should we put it in blk_flush_plug()?


The only thing that really goes wrong then is if people 'forget' to put
a loop around io_schedule().

2015-02-10 02:50:31

by NeilBrown

Subject: Re: RAID1 might_sleep() warning on 3.19-rc7

On Mon, 9 Feb 2015 10:10:00 +0100 Peter Zijlstra <[email protected]> wrote:

> On Mon, Feb 09, 2015 at 12:13:57PM +1100, NeilBrown wrote:
> > I had to re-read the code (And your analysis) a couple of times to be sure ...
>
> Sorry :-)

My point was that actually reading it (rather than assuming I knew what it
said) helped!

>
> > However, when io_schedule() explicitly calls blk_flush_plug(), then
> > @from_schedule=false variant is used, and the unplug functions are allowed to
> > allocate memory and block and maybe even call mempool_alloc() which might
> > call io_schedule().
> >
> > This shouldn't be a problem as blk_flush_plug() spliced out the plug list, so
> > any recursive call will find an empty list and do nothing.
>
> Unless, something along the way stuck something back on, right? So
> should we stick an:
>
> WARN_ON(current->in_iowait);
>
> somewhere near where things are added to this plug list? (and move the
> blk_flush_plug() call inside of where that's actually true of course).

No, I don't think so.

It is certainly possible that some request on plug->cb_list could add
something to plug->list - which is processed after ->cb_list.

I think the best way to think about this is that the *problem* was that a
wait_event loop could spin without making any progress. So any time that
clear forward progress is made, it is safe to sleep without necessitating the
warning. Hence sched_annotate_sleep() is reasonable.
blk_flush_plug() will definitely have dispatched some requests if it
might_sleep(), so the sleep is OK.
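
(For reference, sched_annotate_sleep() is cheap: in kernels of this vintage it
is just a macro that clears the recorded state-change site so the nested
might_sleep() check stays quiet. Quoted from memory, so treat as approximate:

# define sched_annotate_sleep()	(current->task_state_change = 0)

under CONFIG_DEBUG_ATOMIC_SLEEP, and a no-op otherwise.)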


>
> The only thing that really goes wrong then is if people 'forget' to put
> a loop around io_schedule().

You mean like in congestion_wait() ??

Though that is mostly called inside a loop...


NeilBrown

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 92f4b4b288dd..7334be27823d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1111,6 +1111,14 @@ static inline void blk_flush_plug(struct task_struct *tsk)
{
struct blk_plug *plug = tsk->plug;

+ /*
+ * Any sleeping in blk_flush_plug() should not
+ * trigger the "do not call blocking ops" warning
+ * as it makes clear forward progress (requests are
+ * dispatched) and so it will not cause indefinite
+ * looping in a higher level wait loop.
+ */
+ sched_annotate_sleep();
if (plug)
blk_flush_plug_list(plug, false);
}



2015-02-10 09:29:47

by Peter Zijlstra

Subject: Re: RAID1 might_sleep() warning on 3.19-rc7

On Tue, Feb 10, 2015 at 01:50:17PM +1100, NeilBrown wrote:
> On Mon, 9 Feb 2015 10:10:00 +0100 Peter Zijlstra <[email protected]> wrote:
> > > However, when io_schedule() explicitly calls blk_flush_plug(), then
> > > @from_schedule=false variant is used, and the unplug functions are allowed to
> > > allocate memory and block and maybe even call mempool_alloc() which might
> > > call io_schedule().
> > >
> > > This shouldn't be a problem as blk_flush_plug() spliced out the plug list, so
> > > any recursive call will find an empty list and do nothing.
> >
> > Unless, something along the way stuck something back on, right? So
> > should we stick an:
> >
> > WARN_ON(current->in_iowait);
> >
> > somewhere near where things are added to this plug list? (and move the
> > blk_flush_plug() call inside of where that's actually true of course).
>
> No, I don't think so.
>
> It is certainly possible that some request on plug->cb_list could add
> something to plug->list - which is processed after ->cb_list.
>
> I think the best way to think about this is that the *problem* was that a
> wait_event loop could spin without making any progress. So any time that
> clear forward progress is made, it is safe to sleep without necessitating the
> warning. Hence sched_annotate_sleep() is reasonable.
> blk_flush_plug() will definitely have dispatched some requests if it
> might_sleep(), so the sleep is OK.

Well, yes, but you forget that this gets us back into recursion land.
io_schedule() calling io_schedule() calling io_schedule() and *boom*
stack overflow -> dead machine.
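
Spelled out, the chain being worried about (the mempool_alloc() step is the
one you raised earlier about unplug callbacks allocating memory):

/*
 *   io_schedule()
 *     blk_flush_plug(current)              // @from_schedule = false
 *       blk_flush_plug_list()
 *         unplug callbacks / ->make_request_fn()
 *           mempool_alloc()                // under memory pressure ...
 *             io_schedule_timeout()        // ... and around we go again
 */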

We must either guarantee io_schedule() will never call io_schedule() or
that io_schedule() itself will not add new work to the current plug such
that calling io_schedule() itself will not recurse on the blk stuff.

Pick either option, but pick one.

Without providing such a guarantee I'm not comfortable making this warn
go away.

2015-02-10 11:01:28

by Peter Zijlstra

Subject: Re: RAID1 might_sleep() warning on 3.19-rc7

On Tue, Feb 10, 2015 at 10:29:36AM +0100, Peter Zijlstra wrote:
> On Tue, Feb 10, 2015 at 01:50:17PM +1100, NeilBrown wrote:
> > On Mon, 9 Feb 2015 10:10:00 +0100 Peter Zijlstra <[email protected]> wrote:
> > > > However, when io_schedule() explicitly calls blk_flush_plug(), then
> > > > @from_schedule=false variant is used, and the unplug functions are allowed to
> > > > allocate memory and block and maybe even call mempool_alloc() which might
> > > > call io_schedule().

Note that as it stands recursively calling io_schedule() is already
broken. Things like delayacct_blkio_{start,end}() do not nest properly.

2015-02-13 05:26:13

by NeilBrown

Subject: Re: RAID1 might_sleep() warning on 3.19-rc7

On Tue, 10 Feb 2015 10:29:36 +0100 Peter Zijlstra <[email protected]>
wrote:

> On Tue, Feb 10, 2015 at 01:50:17PM +1100, NeilBrown wrote:
> > On Mon, 9 Feb 2015 10:10:00 +0100 Peter Zijlstra <[email protected]> wrote:
> > > > However, when io_schedule() explicitly calls blk_flush_plug(), then
> > > > @from_schedule=false variant is used, and the unplug functions are allowed to
> > > > allocate memory and block and maybe even call mempool_alloc() which might
> > > > call io_schedule().
> > > >
> > > > This shouldn't be a problem as blk_flush_plug() spliced out the plug list, so
> > > > any recursive call will find an empty list and do nothing.
> > >
> > > Unless, something along the way stuck something back on, right? So
> > > should we stick an:
> > >
> > > WARN_ON(current->in_iowait);
> > >
> > > somewhere near where things are added to this plug list? (and move the
> > > blk_flush_plug() call inside of where that's actually true of course).
> >
> > No, I don't think so.
> >
> > It is certainly possible that some request on plug->cb_list could add
> > something to plug->list - which is processed after ->cb_list.
> >
> > I think the best way to think about this is that the *problem* was that a
> > wait_event loop could spin without making any progress. So any time that
> > clear forward progress is made, it is safe to sleep without necessitating the
> > warning. Hence sched_annotate_sleep() is reasonable.
> > blk_flush_plug() will definitely have dispatched some requests if it
> > might_sleep(), so the sleep is OK.
>
> Well, yes, but you forget that this gets us back into recursion land.
> io_schedule() calling io_schedule() calling io_schedule() and *boom*
> stack overflow -> dead machine.
>
> We must either guarantee io_schedule() will never call io_schedule() or
> that io_schedule() itself will not add new work to the current plug such
> that calling io_schedule() itself will not recurse on the blk stuff.
>
> Pick either option, but pick one.

I choose ... Buzz Lightyear !!!

Sorry, got carried away there. Uhhmm. I think I pick a/ (but I expect I'll
find a goat... ho hum).

Does this look credible?

Thanks,
NeilBrown


From: NeilBrown <[email protected]>
Date: Fri, 13 Feb 2015 15:49:17 +1100
Subject: [PATCH] sched: prevent recursion in io_schedule()

io_schedule() calls blk_flush_plug() which, depending on the
contents of current->plug, can initiate arbitrary blk-io requests.

Note that this contrasts with blk_schedule_flush_plug() which requires
all non-trivial work to be handed off to a separate thread.

This makes it possible for io_schedule() to recurse, and initiating
block requests could possibly call mempool_alloc() which, in times of
memory pressure, uses io_schedule().

Apart from any stack usage issues, io_schedule() will not behave
correctly when called recursively as delayacct_blkio_start() does
not allow for repeated calls.

So:
- use in_iowait to detect recursion. Set it earlier, and restore
it to the old value.
- move the call to "raw_rq" after the call to blk_flush_plug().
As this is some sort of per-cpu thing, we want some chance that
we are on the right CPU.
- When io_schedule() is called recursively, use blk_schedule_flush_plug()
which cannot further recurse.
- as this makes io_schedule() a lot more complex and as io_schedule()
must match io_schedule_timeout(), put all the changes in io_schedule_timeout()
and make io_schedule() a simple wrapper for that.

Signed-off-by: NeilBrown <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Peter Zijlstra <[email protected]>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1f37fe7f77a4..90f3de8bc7ca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4420,30 +4420,27 @@ EXPORT_SYMBOL_GPL(yield_to);
*/
void __sched io_schedule(void)
{
- struct rq *rq = raw_rq();
-
- delayacct_blkio_start();
- atomic_inc(&rq->nr_iowait);
- blk_flush_plug(current);
- current->in_iowait = 1;
- schedule();
- current->in_iowait = 0;
- atomic_dec(&rq->nr_iowait);
- delayacct_blkio_end();
+ io_schedule_timeout(MAX_SCHEDULE_TIMEOUT);
}
EXPORT_SYMBOL(io_schedule);

long __sched io_schedule_timeout(long timeout)
{
- struct rq *rq = raw_rq();
+ struct rq *rq;
long ret;
+ int old_iowait = current->in_iowait;
+
+ current->in_iowait = 1;
+ if (old_iowait)
+ blk_schedule_flush_plug(current);
+ else
+ blk_flush_plug(current);

delayacct_blkio_start();
+ rq = raw_rq();
atomic_inc(&rq->nr_iowait);
- blk_flush_plug(current);
- current->in_iowait = 1;
ret = schedule_timeout(timeout);
- current->in_iowait = 0;
+ current->in_iowait = old_iowait;
atomic_dec(&rq->nr_iowait);
delayacct_blkio_end();
return ret;



2015-02-13 08:32:56

by Peter Zijlstra

Subject: Re: RAID1 might_sleep() warning on 3.19-rc7

On Fri, Feb 13, 2015 at 04:26:00PM +1100, NeilBrown wrote:
> I choose ... Buzz Lightyear !!!

Great choice!

> From: NeilBrown <[email protected]>
> Date: Fri, 13 Feb 2015 15:49:17 +1100
> Subject: [PATCH] sched: prevent recursion in io_schedule()
>
> io_schedule() calls blk_flush_plug() which, depending on the
> contents of current->plug, can initiate arbitrary blk-io requests.
>
> Note that this contrasts with blk_schedule_flush_plug() which requires
> all non-trivial work to be handed off to a separate thread.
>
> This makes it possible for io_schedule() to recurse, and initiating
> block requests could possibly call mempool_alloc() which, in times of
> memory pressure, uses io_schedule().
>
> Apart from any stack usage issues, io_schedule() will not behave
> correctly when called recursively as delayacct_blkio_start() does
> not allow for repeated calls.

Which seems to still be an issue with this patch.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1f37fe7f77a4..90f3de8bc7ca 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4420,30 +4420,27 @@ EXPORT_SYMBOL_GPL(yield_to);
> */
> void __sched io_schedule(void)
> {
> + io_schedule_timeout(MAX_SCHEDULE_TIMEOUT);
> }
> EXPORT_SYMBOL(io_schedule);

Might as well move it to sched.h as an inline or so..

> long __sched io_schedule_timeout(long timeout)
> {
> + struct rq *rq;
> long ret;
> + int old_iowait = current->in_iowait;
> +
> + current->in_iowait = 1;
> + if (old_iowait)
> + blk_schedule_flush_plug(current);
> + else
> + blk_flush_plug(current);
>
> delayacct_blkio_start();
> + rq = raw_rq();
> atomic_inc(&rq->nr_iowait);
> ret = schedule_timeout(timeout);
> + current->in_iowait = old_iowait;
> atomic_dec(&rq->nr_iowait);
> delayacct_blkio_end();
> return ret;

Like said, that will still recursively call delayacct_blkio_*() and would
increase nr_iowait for a second time; while arguably it's still the same
one io-wait instance.

So would a little something like:

long __sched io_schedule_timeout(long timeout)
{
        struct rq *rq;
        long ret;

        /*
         * Recursive io_schedule() call; make sure to not recurse
         * on the blk_flush_plug() stuff again.
         */
        if (unlikely(current->in_iowait)) {
                /*
                 * Our parent io_schedule() call will already have done
                 * all the required io-wait accounting.
                 */
                blk_schedule_flush_plug(current);
                return schedule_timeout(timeout);
        }

        current->in_iowait = 1;
        delayacct_blkio_start();
        rq = raw_rq();
        atomic_inc(&rq->nr_iowait);
        blk_flush_plug(current);
        ret = schedule_timeout(timeout);
        atomic_dec(&rq->nr_iowait);
        delayacct_blkio_end();
        current->in_iowait = 0;

        return ret;
}

not make more sense?

2015-02-13 08:50:06

by NeilBrown

Subject: Re: RAID1 might_sleep() warning on 3.19-rc7

On Fri, 13 Feb 2015 09:32:50 +0100 Peter Zijlstra <[email protected]>
wrote:

> On Fri, Feb 13, 2015 at 04:26:00PM +1100, NeilBrown wrote:
> > I choose ... Buzz Lightyear !!!
>
> Great choice!
>
> > From: NeilBrown <[email protected]>
> > Date: Fri, 13 Feb 2015 15:49:17 +1100
> > Subject: [PATCH] sched: prevent recursion in io_schedule()
> >
> > io_schedule() calls blk_flush_plug() which, depending on the
> > contents of current->plug, can initiate arbitrary blk-io requests.
> >
> > Note that this contrasts with blk_schedule_flush_plug() which requires
> > all non-trivial work to be handed off to a separate thread.
> >
> > This makes it possible for io_schedule() to recurse, and initiating
> > block requests could possibly call mempool_alloc() which, in times of
> > memory pressure, uses io_schedule().
> >
> > Apart from any stack usage issues, io_schedule() will not behave
> > correctly when called recursively as delayacct_blkio_start() does
> > not allow for repeated calls.
>
> Which seems to still be an issue with this patch.
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1f37fe7f77a4..90f3de8bc7ca 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4420,30 +4420,27 @@ EXPORT_SYMBOL_GPL(yield_to);
> > */
> > void __sched io_schedule(void)
> > {
> > + io_schedule_timeout(MAX_SCHEDULE_TIMEOUT);
> > }
> > EXPORT_SYMBOL(io_schedule);
>
> Might as well move it to sched.h as an inline or so..
>
> > long __sched io_schedule_timeout(long timeout)
> > {
> > + struct rq *rq;
> > long ret;
> > + int old_iowait = current->in_iowait;
> > +
> > + current->in_iowait = 1;
> > + if (old_iowait)
> > + blk_schedule_flush_plug(current);
> > + else
> > + blk_flush_plug(current);
> >
> > delayacct_blkio_start();
> > + rq = raw_rq();
> > atomic_inc(&rq->nr_iowait);
> > ret = schedule_timeout(timeout);
> > + current->in_iowait = old_iowait;
> > atomic_dec(&rq->nr_iowait);
> > delayacct_blkio_end();
> > return ret;
>
> Like said, that will still recursively call delayacct_blkio_*() and would
> increase nr_iowait for a second time; while arguably it's still the same
> one io-wait instance.

No it doesn't. There is no "blk_flush_plug" call between the
delayacct_blkio_*() calls.

I've moved blk_flush_plug to the beginning of the function.

>
> So would a little something like:
>
> long __sched io_schedule_timeout(long timeout)
> {
>         struct rq *rq;
>         long ret;
>
>         /*
>          * Recursive io_schedule() call; make sure to not recurse
>          * on the blk_flush_plug() stuff again.
>          */
>         if (unlikely(current->in_iowait)) {
>                 /*
>                  * Our parent io_schedule() call will already have done
>                  * all the required io-wait accounting.
>                  */
>                 blk_schedule_flush_plug(current);
>                 return schedule_timeout(timeout);
>         }
>
>         current->in_iowait = 1;
>         delayacct_blkio_start();
>         rq = raw_rq();
>         atomic_inc(&rq->nr_iowait);
>         blk_flush_plug(current);
>         ret = schedule_timeout(timeout);
>         atomic_dec(&rq->nr_iowait);
>         delayacct_blkio_end();
>         current->in_iowait = 0;
>
>         return ret;
> }
>
> not make more sense?

That does make a similar amount of sense at least....

I wondered if it really makes sense to call blk_flush_plug with nr_iowait
elevated and delayacct_blkio active. blk_flush_plug() could call schedule()
for non-"io" reasons and maybe that could upset stuff???

I don't really know. I'm happy with your version. I don't suppose anyone
else is paying attention and could give a third opinion....

Thanks,
NeilBrown






2015-02-13 10:27:52

by Peter Zijlstra

Subject: Re: RAID1 might_sleep() warning on 3.19-rc7

On Fri, Feb 13, 2015 at 07:49:53PM +1100, NeilBrown wrote:
> > Like said, that will still recursively call delayacct_blkio_*() and would
> > increase nr_iowait for a second time; while arguably it's still the same
> > one io-wait instance.
>
> No it doesn't. There is no "blk_flush_plug" call between the
> delayacct_blkio_*() calls.

Duh, clearly I needed to still wake up :/

> I've moved blk_flush_plug to the beginning of the function.

> I wondered if it really makes sense to call blk_flush_plug with nr_iowait
> elevated and delayacct_blkio active. blk_flush_plug() could call schedule()
> for non-"io" reasons and maybe that could upset stuff???

Yeah, good question that. Lemme ponder that a bit.

> I don't really know. I'm happy with your version. I don't suppose anyone
> else is paying attention and could give a third opinion....

:-)

2015-02-13 14:48:19

by Peter Zijlstra

Subject: Re: RAID1 might_sleep() warning on 3.19-rc7

On Fri, Feb 13, 2015 at 11:27:46AM +0100, Peter Zijlstra wrote:

> > I've moved blk_flush_plug to the beginning of the function.
>
> > I wondered if it really makes sense to call blk_flush_plug with nr_iowait
> > elevated and delayacct_blkio active. blk_flush_plug() could call schedule()
> > for non-"io" reasons and maybe that could upset stuff???
>
> Yeah, good question that. Lemme ponder that a bit.

Yes, I think your version makes most sense as, as you say, even a regular
schedule() call nested in my version would go towards blk delayacct --
and I doubt that was the intent; even though the current kernel works
that way.

I'll move the now rudimentary io_schedule() into sched.h as an inline.

2015-02-18 01:09:48

by NeilBrown

Subject: Re: RAID1 might_sleep() warning on 3.19-rc7

On Fri, 13 Feb 2015 15:48:03 +0100 Peter Zijlstra <[email protected]>
wrote:

> On Fri, Feb 13, 2015 at 11:27:46AM +0100, Peter Zijlstra wrote:
>
> > > I've moved blk_flush_plug to the beginning of the function.
> >
> > > I wondered if it really makes sense to call blk_flush_plug with nr_iowait
> > > elevated and delayacct_blkio active. blk_flush_plug() could call schedule()
> > > for non-"io" reasons and maybe that could upset stuff???
> >
> > Yeah, good question that. Lemme ponder that a bit.
>
> Yes, I think your version makes most sense as, as you say, even a regular
> schedule() call nested in my version would go towards blk delayacct --
> and I doubt that was the intent; even though the current kernel works
> that way.
>
> I'll move the now rudimentary io_schedule() into sched.h as an inline.

Thanks.

Are you OK with this going to -stable for 3.19?

NeilBrown



2015-02-18 13:47:37

by Peter Zijlstra

Subject: Re: RAID1 might_sleep() warning on 3.19-rc7

On Wed, Feb 18, 2015 at 12:09:36PM +1100, NeilBrown wrote:
> On Fri, 13 Feb 2015 15:48:03 +0100 Peter Zijlstra <[email protected]>
> wrote:
>
> > On Fri, Feb 13, 2015 at 11:27:46AM +0100, Peter Zijlstra wrote:
> >
> > > > I've moved blk_flush_plug to the beginning of the function.
> > >
> > > > I wondered if it really makes sense to call blk_flush_plug with nr_iowait
> > > > elevated and delayacct_blkio active. blk_flush_plug() could call schedule()
> > > > for non-"io" reasons and maybe that could upset stuff???
> > >
> > > Yeah, good question that. Lemme ponder that a bit.
> >
> > Yes, I think your version makes most sense as, as you say, even a regular
> > schedule() call nested in my version would go towards blk delayacct --
> > and I doubt that was the intent; even though the current kernel works
> > that way.
> >
> > I'll move the now rudimentary io_schedule() into sched.h as an inline.
>
> Thanks.
>
> Are you OK with this going to -stable for 3.19?

OK, let's do that.

Subject: [tip:sched/core] sched: Prevent recursion in io_schedule()

Commit-ID: 9cff8adeaa34b5d2802f03f89803da57856b3b72
Gitweb: http://git.kernel.org/tip/9cff8adeaa34b5d2802f03f89803da57856b3b72
Author: NeilBrown <[email protected]>
AuthorDate: Fri, 13 Feb 2015 15:49:17 +1100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 18 Feb 2015 14:27:44 +0100

sched: Prevent recursion in io_schedule()

io_schedule() calls blk_flush_plug() which, depending on the
contents of current->plug, can initiate arbitrary blk-io requests.

Note that this contrasts with blk_schedule_flush_plug() which requires
all non-trivial work to be handed off to a separate thread.

This makes it possible for io_schedule() to recurse, and initiating
block requests could possibly call mempool_alloc() which, in times of
memory pressure, uses io_schedule().

Apart from any stack usage issues, io_schedule() will not behave
correctly when called recursively as delayacct_blkio_start() does
not allow for repeated calls.

So:
- use ->in_iowait to detect recursion. Set it earlier, and restore
it to the old value.
- move the call to "raw_rq" after the call to blk_flush_plug().
As this is some sort of per-cpu thing, we want some chance that
we are on the right CPU.
- When io_schedule() is called recursively, use blk_schedule_flush_plug()
which cannot further recurse.
- as this makes io_schedule() a lot more complex and as io_schedule()
must match io_schedule_timeout(), put all the changes in io_schedule_timeout()
and make io_schedule() a simple wrapper for that.

Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[ Moved the now rudimentary io_schedule() into sched.h. ]
Cc: Jens Axboe <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Tony Battersby <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 10 +++++++---
kernel/sched/core.c | 31 ++++++++++++-------------------
2 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef..cb5cdc7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -363,9 +363,6 @@ extern void show_regs(struct pt_regs *);
*/
extern void show_stack(struct task_struct *task, unsigned long *sp);

-void io_schedule(void);
-long io_schedule_timeout(long timeout);
-
extern void cpu_init (void);
extern void trap_init(void);
extern void update_process_times(int user);
@@ -422,6 +419,13 @@ extern signed long schedule_timeout_uninterruptible(signed long timeout);
asmlinkage void schedule(void);
extern void schedule_preempt_disabled(void);

+extern long io_schedule_timeout(long timeout);
+
+static inline void io_schedule(void)
+{
+ io_schedule_timeout(MAX_SCHEDULE_TIMEOUT);
+}
+
struct nsproxy;
struct user_namespace;

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c314000..daaea92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4358,36 +4358,29 @@ EXPORT_SYMBOL_GPL(yield_to);
* This task is about to go to sleep on IO. Increment rq->nr_iowait so
* that process accounting knows that this is a task in IO wait state.
*/
-void __sched io_schedule(void)
-{
- struct rq *rq = raw_rq();
-
- delayacct_blkio_start();
- atomic_inc(&rq->nr_iowait);
- blk_flush_plug(current);
- current->in_iowait = 1;
- schedule();
- current->in_iowait = 0;
- atomic_dec(&rq->nr_iowait);
- delayacct_blkio_end();
-}
-EXPORT_SYMBOL(io_schedule);
-
long __sched io_schedule_timeout(long timeout)
{
- struct rq *rq = raw_rq();
+ int old_iowait = current->in_iowait;
+ struct rq *rq;
long ret;

+ current->in_iowait = 1;
+ if (old_iowait)
+ blk_schedule_flush_plug(current);
+ else
+ blk_flush_plug(current);
+
delayacct_blkio_start();
+ rq = raw_rq();
atomic_inc(&rq->nr_iowait);
- blk_flush_plug(current);
- current->in_iowait = 1;
ret = schedule_timeout(timeout);
- current->in_iowait = 0;
+ current->in_iowait = old_iowait;
atomic_dec(&rq->nr_iowait);
delayacct_blkio_end();
+
return ret;
}
+EXPORT_SYMBOL(io_schedule_timeout);

/**
* sys_sched_get_priority_max - return maximum RT priority.