2008-11-10 13:12:18

by Mikulas Patocka

Subject: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

Hi

The bug is correctly analyzed.

Regarding the patch --- you need to wake up md->wait when the decrement
brings md->pending to zero, so that dm_suspend is not left waiting forever.
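
I.e. the tail of dm_any_congested() in your patch would become something
like this (untested sketch; atomic_dec_and_test() combines the decrement
with the test for zero):

done:
        if (atomic_dec_and_test(&md->pending))
                /* last pending reference is gone: wake dm_suspend */
                wake_up(&md->wait);
        return r;
}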

Otherwise the patch is correct, but it contains some unneeded parts: if
md->pending is used to block removal, we no longer need to hold a
reference on the table at all. And if the code is already called under a
spinlock (from the upstream kernel), we could execute everything under the
md->map_lock spinlock and drop the reference counting altogether. But the
code can take an indefinite amount of time, so it would be best not to
call it under a spinlock in the first place.

For upstream Linux developers: you are holding a spinlock and calling
bdi*_congested functions that can take an indefinite amount of time (there
are even users reporting 50 or so disks in one logical volume). I think it
would be good to move these calls out of spinlocks.
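
The general shape I mean, sketched with made-up names --- take a
reference under the lock, then do the slow query outside it:

static int example_congested(struct example *e, int bdi_bits)
{
        struct backing_dev_info *bdi;
        int r;

        spin_lock(&e->lock);
        bdi = e->bdi;                   /* snapshot under the lock */
        example_get(e);                 /* made-up: pin e across the unlock */
        spin_unlock(&e->lock);

        r = bdi_congested(bdi, bdi_bits);       /* slow part, no lock held */

        example_put(e);
        return r;
}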

What are the general locking rules for these block queue upcalls
(congested_fn, unplug_fn, merge_bvec_fn and others)? It looks like there
may be more bugs like this in other upcalls. Can you please specify
somewhere what an upcall can and cannot do?

Mikulas

On Wed, 5 Nov 2008, Chandra Seetharaman wrote:

> dm_any_congested() just checks for DMF_BLOCK_IO and has no
> code to make sure that suspend waits for dm_any_congested() to
> complete.
>
> This leads to a case where table_destroy() and free_multipath()
> and some other sleeping functions are called (through dm_table_put())
> from dm_any_congested().
>
> This leads to 2 problems:
> 1. Sleeping functions are called from the congested code, whose caller
> holds a spinlock.
> 2. An ABBA deadlock between pdflush and multipathd. The two locks
> in contention are the inode_lock and the kernel lock.
> Here is the crash analysis:
> PID: 16879 TASK: ffff81013a06a140 CPU: 3 COMMAND: "pdflush"
> Owns inode_lock and waiting for kernel_sem
>
> PID: 8299 TASK: ffff81024f03e100 CPU: 2 COMMAND: "multipathd"
> Owns kernel_sem and waiting for inode_lock.
>
>
> PID: 8299 TASK: ffff81024f03e100 CPU: 2 COMMAND: "multipathd"
> #0 [ffff81024ad338c8] schedule at ffffffff8128534c
> #1 [ffff81024ad33980] rt_spin_lock_slowlock at ffffffff81286d15 << waiting for inode_lock
> #2 [ffff81024ad33a40] __rt_spin_lock at ffffffff812873b0
> #3 [ffff81024ad33a50] rt_spin_lock at ffffffff812873bb
> #4 [ffff81024ad33a60] ifind_fast at ffffffff810c32a1
> #5 [ffff81024ad33a90] iget_locked at ffffffff810c3be8
> #6 [ffff81024ad33ad0] proc_get_inode at ffffffff810ebaff
> #7 [ffff81024ad33b10] proc_lookup at ffffffff810f082a
> #8 [ffff81024ad33b40] proc_root_lookup at ffffffff810ec312
> #9 [ffff81024ad33b70] do_lookup at ffffffff810b78f7
> #10 [ffff81024ad33bc0] __link_path_walk at ffffffff810b9a93
> #11 [ffff81024ad33c60] link_path_walk at ffffffff810b9fc1
> #12 [ffff81024ad33d30] path_walk at ffffffff810ba073
> #13 [ffff81024ad33d40] do_path_lookup at ffffffff810ba37a
> #14 [ffff81024ad33d90] __path_lookup_intent_open at ffffffff810baeb0
> #15 [ffff81024ad33de0] path_lookup_open at ffffffff810baf60
> #16 [ffff81024ad33df0] open_namei at ffffffff810bb071
> #17 [ffff81024ad33e80] do_filp_open at ffffffff810ae610
> #18 [ffff81024ad33f30] do_sys_open at ffffffff810ae67f
> #19 [ffff81024ad33f70] sys_open at ffffffff810ae729
>
>
> PID: 16879 TASK: ffff81013a06a140 CPU: 3 COMMAND: "pdflush"
> #0 [ffff810023063aa0] schedule at ffffffff8128534c
> #1 [ffff810023063b58] rt_mutex_slowlock at ffffffff81286ac5 << waiting for the kernel lock
> #2 [ffff810023063c28] rt_mutex_lock at ffffffff81285fb4
> #3 [ffff810023063c38] rt_down at ffffffff8105fec7
> #4 [ffff810023063c58] lock_kernel at ffffffff81287b8c
> #5 [ffff810023063c78] __blkdev_put at ffffffff810d5d31
> #6 [ffff810023063cb8] blkdev_put at ffffffff810d5e68
> #7 [ffff810023063cc8] close_dev at ffffffff8819e547
> #8 [ffff810023063ce8] dm_put_device at ffffffff8819e579
> #9 [ffff810023063d08] free_priority_group at ffffffff881c0e86
> #10 [ffff810023063d58] free_multipath at ffffffff881c0f11
> #11 [ffff810023063d78] multipath_dtr at ffffffff881c0f73
> #12 [ffff810023063d98] dm_table_put at ffffffff8819e347
> #13 [ffff810023063dc8] dm_any_congested at ffffffff8819d074
> #14 [ffff810023063df8] sync_sb_inodes at ffffffff810cd451
> #15 [ffff810023063e38] writeback_inodes at ffffffff810cd7b5
> #16 [ffff810023063e68] background_writeout at ffffffff8108be38
> #17 [ffff810023063ed8] pdflush at ffffffff8108c79a
> #18 [ffff810023063f28] kthread at ffffffff81051477
> #19 [ffff810023063f48] kernel_thread at ffffffff8100d048
>
> crash> kernel_sem
> kernel_sem = $5 = {
>   count = {
>     counter = 0
>   },
>   lock = {
>     wait_lock = {
>       raw_lock = {
>         slock = 34952
>       },
>       break_lock = 0
>     },
>     wait_list = {
>       prio_list = {
>         next = 0xffff810023063b88,
>         prev = 0xffff81014941fdc0
>       },
>       node_list = {
>         next = 0xffff810023063b98,
>         prev = 0xffff81014d04ddd0
>       }
>     },
>     owner = 0xffff81024f03e102 << multipathd owns it (task 0xffff81024f03e100;
>                                   the last two bits are flags, so mask them to 0)
>   }
> }
> crash> inode_lock
> inode_lock = $6 = {
>   lock = {
>     wait_lock = {
>       raw_lock = {
>         slock = 3341
>       },
>       break_lock = 0
>     },
>     wait_list = {
>       prio_list = {
>         next = 0xffff81024ad339a0,
>         prev = 0xffff8100784fdd78
>       },
>       node_list = {
>         next = 0xffff81024ad339b0,
>         prev = 0xffff81024afb3a70
>       }
>     },
>     owner = 0xffff81013a06a142 << pdflush owns it (task 0xffff81013a06a140)
>   },
>   break_lock = 0
> }
> crash>
> ----------------
>
> Signed-off-by: Chandra Seetharaman <[email protected]>
> ----
>
> Index: linux-2.6.28-rc3/drivers/md/dm.c
> ===================================================================
> --- linux-2.6.28-rc3.orig/drivers/md/dm.c
> +++ linux-2.6.28-rc3/drivers/md/dm.c
> @@ -937,16 +937,21 @@ static void dm_unplug_all(struct request
>
> static int dm_any_congested(void *congested_data, int bdi_bits)
> {
> - int r;
> + int r = bdi_bits;
> struct mapped_device *md = (struct mapped_device *) congested_data;
> - struct dm_table *map = dm_get_table(md);
> + struct dm_table *map;
>
> - if (!map || test_bit(DMF_BLOCK_IO, &md->flags))
> - r = bdi_bits;
> - else
> - r = dm_table_any_congested(map, bdi_bits);
> + atomic_inc(&md->pending);
> + if (test_bit(DMF_BLOCK_IO, &md->flags))
> + goto done;
>
> - dm_table_put(map);
> + map = dm_get_table(md);
> + if (map) {
> + r = dm_table_any_congested(map, bdi_bits);
> + dm_table_put(map);
> + }
> +done:
> + atomic_dec(&md->pending);
> return r;
> }
>
>
>
> --
> dm-devel mailing list
> [email protected]
> https://www.redhat.com/mailman/listinfo/dm-devel
>


2008-11-10 13:54:18

by Christoph Hellwig

Subject: Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
> For upstream Linux developers: you are holding a spinlock and calling
> bdi*_congested functions that can take an indefinite amount of time (there
> are even users reporting 50 or so disks in one logical volume). I think it
> would be good to move these calls out of spinlocks.

Umm, they shouldn't block that long, as that completely defeats their
purpose. These functions are mostly used to avoid throwing more I/O at
a congested device if pdflush could do more useful things instead. But
if it blocks in those functions anyway we wouldn't have to bother using
them. Do you have more details about the use cases where this happens
and where the routines spend so much time?

2008-11-10 14:19:38

by Mikulas Patocka

Subject: Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

On Mon, 10 Nov 2008, Christoph Hellwig wrote:

> On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
> > For upstream Linux developers: you are holding a spinlock and calling
> > bdi*_congested functions that can take an indefinite amount of time (there
> > are even users reporting 50 or so disks in one logical volume). I think it
> > would be good to move these calls out of spinlocks.
>
> Umm, they shouldn't block that long, as that completely defeats their
> purpose. These functions are mostly used to avoid throwing more I/O at
> a congested device if pdflush could do more useful things instead. But
> if it blocks in those functions anyway we wouldn't have to bother using
> them. Do you have more details about the use cases where this happens
> and where the routines spend so much time?

For device mapper, congested_fn asks every device in the tree and ORs
their bits together --- so if the user has 50 devices, it asks them all.

For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the
same --- asking every device.

If you have a better idea how to implement congested_fn, say it.
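
For reference, the walk looks roughly like this (simplified from
drivers/md/dm-table.c in 2.6.28):

int dm_table_any_congested(struct dm_table *t, int bdi_bits)
{
        struct dm_dev *dd;
        struct list_head *devices = dm_table_get_devices(t);
        int r = 0;

        list_for_each_entry(dd, devices, list) {
                struct request_queue *q = bdev_get_queue(dd->bdev);

                /* one query per underlying device: O(n) in device count */
                if (q)
                        r |= bdi_congested(&q->backing_dev_info, bdi_bits);
        }

        return r;
}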

Mikulas

2008-11-10 14:27:04

by Peter Zijlstra

Subject: Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

On Mon, 2008-11-10 at 09:19 -0500, Mikulas Patocka wrote:
> On Mon, 10 Nov 2008, Christoph Hellwig wrote:
>
> > On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
> > > For upstream Linux developers: you are holding a spinlock and calling
> > > bdi*_congested functions that can take an indefinite amount of time (there
> > > are even users reporting 50 or so disks in one logical volume). I think it
> > > would be good to move these calls out of spinlocks.
> >
> > Umm, they shouldn't block that long, as that completely defeats their
> > purpose. These functions are mostly used to avoid throwing more I/O at
> > a congested device if pdflush could do more useful things instead. But
> > if it blocks in those functions anyway we wouldn't have to bother using
> > them. Do you have more details about the use cases where this happens
> > and where the routines spend so much time?
>
> For device mapper, congested_fn asks every device in the tree and ORs
> their bits together --- so if the user has 50 devices, it asks them all.
>
> For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the
> same --- asking every device.
>
> If you have a better idea how to implement congested_fn, say it.

Fix the infrastructure by adding a function call so that you can have
the individual devices report their congestion state to the aggregate.

Then congested_fn can return a valid state in O(1) because the state is
kept up to date by the individual state changes.

IOW, add a set_congested_fn() and clear_congested_fn().
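
A hypothetical sketch of that push model (all names invented here; the
bit names are the ones 2.6.28 has in backing-dev.h):

struct agg_congestion {
        atomic_t nr_read_congested;     /* members currently read-congested */
        atomic_t nr_write_congested;    /* members currently write-congested */
};

void agg_set_congested(struct agg_congestion *agg, int bdi_bits)
{
        if (bdi_bits & (1 << BDI_read_congested))
                atomic_inc(&agg->nr_read_congested);
        if (bdi_bits & (1 << BDI_write_congested))
                atomic_inc(&agg->nr_write_congested);
}

void agg_clear_congested(struct agg_congestion *agg, int bdi_bits)
{
        if (bdi_bits & (1 << BDI_read_congested))
                atomic_dec(&agg->nr_read_congested);
        if (bdi_bits & (1 << BDI_write_congested))
                atomic_dec(&agg->nr_write_congested);
}

/* the aggregate's congested_fn becomes an O(1) read, no device walk */
int agg_any_congested(void *data, int bdi_bits)
{
        struct agg_congestion *agg = data;
        int r = 0;

        if (atomic_read(&agg->nr_read_congested))
                r |= 1 << BDI_read_congested;
        if (atomic_read(&agg->nr_write_congested))
                r |= 1 << BDI_write_congested;

        return r & bdi_bits;
}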

2008-11-10 14:27:35

by Alasdair G Kergon

Subject: Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

On Mon, Nov 10, 2008 at 08:54:01AM -0500, Christoph Hellwig wrote:
> On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
> > For upstream Linux developers: you are holding a spinlock and calling
> > bdi*_congested functions that can take an indefinite amount of time (there
> > are even users reporting 50 or so disks in one logical volume). I think it
> > would be good to move these calls out of spinlocks.
> Umm, they shouldn't block that long, as that completely defeats their
> purpose.

Indeed - the blocking was a bug for which there's a patch, but that doesn't
deal with how the function should be operating in the first place.

- If one device is found to be congested, why bother checking the remaining
devices?

- If the device is suspended, the response should be that it is congested, I'd
have thought.
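
Both points could be folded into the device walk with a couple of lines
(untested sketch; DMF_BLOCK_IO and the walk are as in 2.6.28 dm, table
reference handling omitted for brevity):

static int dm_any_congested_sketch(struct mapped_device *md,
                                   struct dm_table *map, int bdi_bits)
{
        struct dm_dev *dd;
        int r = 0;

        if (test_bit(DMF_BLOCK_IO, &md->flags))
                return bdi_bits;        /* suspended: report fully congested */

        list_for_each_entry(dd, dm_table_get_devices(map), list) {
                struct request_queue *q = bdev_get_queue(dd->bdev);

                if (q)
                        r |= bdi_congested(&q->backing_dev_info, bdi_bits);
                if ((r & bdi_bits) == bdi_bits)
                        break;          /* every requested bit set: stop asking */
        }

        return r;
}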

Alasdair
--
[email protected]

2008-11-10 14:32:51

by Mikulas Patocka

Subject: Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)



On Mon, 10 Nov 2008, Peter Zijlstra wrote:

> On Mon, 2008-11-10 at 09:19 -0500, Mikulas Patocka wrote:
> > On Mon, 10 Nov 2008, Christoph Hellwig wrote:
> >
> > > On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
> > > > For upstream Linux developers: you are holding a spinlock and calling
> > > > bdi*_congested functions that can take an indefinite amount of time (there
> > > > are even users reporting 50 or so disks in one logical volume). I think it
> > > > would be good to move these calls out of spinlocks.
> > >
> > > Umm, they shouldn't block that long, as that completely defeats their
> > > purpose. These functions are mostly used to avoid throwing more I/O at
> > > a congested device if pdflush could do more useful things instead. But
> > > if it blocks in those functions anyway we wouldn't have to bother using
> > > them. Do you have more details about the use cases where this happens
> > > and where the routines spend so much time?
> >
> > For device mapper, congested_fn asks every device in the tree and ORs
> > their bits together --- so if the user has 50 devices, it asks them all.
> >
> > For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the
> > same --- asking every device.
> >
> > If you have a better idea how to implement congested_fn, say it.
>
> Fix the infrastructure by adding a function call so that you can have
> the individual devices report their congestion state to the aggregate.
>
> Then congested_fn can return a valid state in O(1) because the state is
> kept up to date by the individual state changes.
>
> IOW, add a set_congested_fn() and clear_congested_fn().

If you have a physical disk that has many LVM volumes on it, you end up in
a situation where a congestion state change on the disk is reported to all
the volumes. So it just creates the O(n) problem on the other side.

Mikulas

2008-11-10 14:35:39

by Alasdair G Kergon

Subject: Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

On Mon, Nov 10, 2008 at 03:26:50PM +0100, Peter Zijlstra wrote:
> Fix the infrastructure by adding a function call so that you can have
> the individual devices report their congestion state to the aggregate.

Which requires knowing/accessing some device hierarchy, which would be nice
for other things too (like propagating notification of changes to device
properties such as size, hardsect size, etc.).

Alasdair
--
[email protected]

2008-11-10 14:40:34

by Mikulas Patocka

Subject: Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)



On Mon, 10 Nov 2008, Alasdair G Kergon wrote:

> On Mon, Nov 10, 2008 at 08:54:01AM -0500, Christoph Hellwig wrote:
> > On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
> > > For upstream Linux developers: you are holding a spinlock and calling
> > > bdi*_congested functions that can take an indefinite amount of time (there
> > > are even users reporting 50 or so disks in one logical volume). I think it
> > > would be good to move these calls out of spinlocks.
> > Umm, they shouldn't block that long, as that completely defeats their
> > purpose.
>
> Indeed - the blocking was a bug for which there's a patch, but that doesn't
> deal with how the function should be operating in the first place.

To me it looks like the bug is more severe --- a reference to the table
can be held in other upcalls too, and in many other dm code paths.

I'm considering whether we could call the table destructor at just one
place (which would wait until all references are gone), instead of running
the destructor at each dm_table_put point (there are many of these points
and it's hard to check all of them for correctness).
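
Something like this (untested sketch; t->holders is dm's existing
reference count, the wait queue is invented for illustration):

void dm_table_put(struct dm_table *t)
{
        if (atomic_dec_and_test(&t->holders))
                wake_up(&t->holders_waitq);     /* never frees: safe in any context */
}

void dm_table_destroy(struct dm_table *t)
{
        dm_table_put(t);        /* drop the creator's reference */
        wait_event(t->holders_waitq, !atomic_read(&t->holders));
        table_destroy(t);       /* the one place the sleeping destructors run */
}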

> - If one device is found to be congested, why bother checking the remaining
> devices?

Yes, that could be improved. But it doesn't solve the O(n) problem.

> - If the device is suspended, the response should be that it is
> congested, I'd have thought.

Yes, but these congestion upcalls are used only for optimization, and the
device is suspended for such a short time that it hardly matters whether
we optimize I/O access during that brief moment or not.

Mikulas

> Alasdair
> --
> [email protected]
>

2008-11-10 14:47:22

by Peter Zijlstra

Subject: Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

On Mon, 2008-11-10 at 09:32 -0500, Mikulas Patocka wrote:
>
> On Mon, 10 Nov 2008, Peter Zijlstra wrote:
>
> > On Mon, 2008-11-10 at 09:19 -0500, Mikulas Patocka wrote:
> > > On Mon, 10 Nov 2008, Christoph Hellwig wrote:
> > >
> > > > On Mon, Nov 10, 2008 at 08:11:51AM -0500, Mikulas Patocka wrote:
> > > > > For upstream Linux developers: you are holding a spinlock and calling
> > > > > bdi*_congested functions that can take an indefinite amount of time (there
> > > > > are even users reporting 50 or so disks in one logical volume). I think it
> > > > > would be good to move these calls out of spinlocks.
> > > >
> > > > Umm, they shouldn't block that long, as that completely defeats their
> > > > purpose. These functions are mostly used to avoid throwing more I/O at
> > > > a congested device if pdflush could do more useful things instead. But
> > > > if it blocks in those functions anyway we wouldn't have to bother using
> > > > them. Do you have more details about the use cases where this happens
> > > > and where the routines spend so much time?
> > >
> > > For device mapper, congested_fn asks every device in the tree and ORs
> > > their bits together --- so if the user has 50 devices, it asks them all.
> > >
> > > For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the
> > > same --- asking every device.
> > >
> > > If you have a better idea how to implement congested_fn, say it.
> >
> > Fix the infrastructure by adding a function call so that you can have
> > the individual devices report their congestion state to the aggregate.
> >
> > Then congested_fn can return a valid state in O(1) because the state is
> > kept up to date by the individual state changes.
> >
> > IOW, add a set_congested_fn() and clear_congested_fn().
>
> If you have a physical disk that has many LVM volumes on it, you end up in
> a situation where a congestion state change on the disk is reported to all
> the volumes. So it just creates the O(n) problem on the other side.

*sigh* I can almost understand why people want to use lvm to combine
multiple disks, but why make the partition thing even worse...

2008-11-10 14:56:17

by Mikulas Patocka

Subject: Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

On Mon, 10 Nov 2008, Peter Zijlstra wrote:

> > If you have a physical disk that has many LVM volumes on it, you end up in
> > a situation where a congestion state change on the disk is reported to all
> > the volumes. So it just creates the O(n) problem on the other side.
>
> *sigh* I can almost understand why people want to use lvm to combine
> multiple disks, but why make the partition thing even worse...

To protect the services from each other --- so that a mail storm can't
overflow into users' directories, users filling up their space can't
corrupt the system updates, logs are kept safe even if another partition
overflows, etc.

I just know some admins who prefer to have separate filesystems instead of
quotas for this purpose.

Mikulas

2008-11-11 18:19:11

by Christoph Hellwig

Subject: Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

On Mon, Nov 10, 2008 at 09:19:21AM -0500, Mikulas Patocka wrote:
> For device mapper, congested_fn asks every device in the tree and ORs
> their bits together --- so if the user has 50 devices, it asks them all.
>
> For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the
> same --- asking every device.
>
> If you have a better idea how to implement congested_fn, say it.

Well, even asking 50 devices shouldn't take that long normally. Do you
take some sleeping lock that might block forever? In that case I would
just return congested for that device as it would delay pdflush
otherwise.

2008-11-11 18:21:19

by Christoph Hellwig

Subject: Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

On Mon, Nov 10, 2008 at 02:27:15PM +0000, Alasdair G Kergon wrote:
> Indeed - the blocking was a bug for which there's a patch, but that doesn't
> deal with how the function should be operating in the first place.
>
> - If one device is found to be congested, why bother checking the remaining
> devices?
>
> - If the device is suspended, the response should be that it is congested, I'd
> have thought.

Yes. The device-congested check is quick and dirty - if you encounter
anything slow, just return congested (if I/O would hit that long delay
too), or not congested if you can't really know.
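
In code, that rule is basically a trylock (hypothetical driver, made-up
names):

static int foo_congested(void *data, int bdi_bits)
{
        struct foo_dev *fd = data;
        int r;

        if (!spin_trylock(&fd->lock))
                return bdi_bits;        /* can't tell cheaply: claim congested */

        r = __foo_any_congested(fd, bdi_bits);  /* the quick O(1) check */
        spin_unlock(&fd->lock);

        return r;
}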

2008-11-11 18:22:21

by Christoph Hellwig

Subject: Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

On Mon, Nov 10, 2008 at 09:40:18AM -0500, Mikulas Patocka wrote:
> > - If the device is suspended, the response should be that it is
> > congested, I'd have thought.
>
> Yes, but these congestion upcalls are used only for optimization, and the
> device is suspended for such a short time that it hardly matters whether
> we optimize I/O access during that brief moment or not.

We use the device-congested check to avoid blocking pdflush on a device
that can't currently accept I/O. For a suspended device, congested is
always correct; pdflush will just come back a little later and in the
meantime write out stuff that doesn't delay it.

2008-11-11 22:09:00

by Mikulas Patocka

Subject: Re: Queue upcall locking (was: [dm-devel] [RFC][PATCH] fix dm_any_congested() to properly sync up with suspend code path)

On Tue, 11 Nov 2008, Christoph Hellwig wrote:

> On Mon, Nov 10, 2008 at 09:19:21AM -0500, Mikulas Patocka wrote:
> > For device mapper, congested_fn asks every device in the tree and make OR
> > of their bits --- so if the user has 50 devices, it asks them all.
> >
> > For md-linear, md-raid0, md-raid1, md-raid10 and md-multipath it does the
> > same --- asking every device.
> >
> > If you have a better idea how to implement congested_fn, say it.
>
> Well, even asking 50 devices shouldn't take that long normally.

There are some atomic operations inside congested_fn.

I remember that there was some database benchmark on a setup with 48
devices. Just dropping one spinlock from the unplug function resulted in
an overall improvement (not in microbenchmarks, in the whole run) of 18%.
The slowdown with unplug is similar to that with congested_fn --- unplug
walks all the physical devices for a given logical device and unplugs them.

So simplifying congested_fn might help too. Do you have an idea how often
it is called? I thought about caching the return value for one jiffy.
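
The cache could be as small as this (untested sketch; the two md fields
and __dm_any_congested(), the uncached walk, are invented for
illustration --- the race on the fields is harmless because congestion
is only advisory):

static int dm_any_congested_cached(struct mapped_device *md, int bdi_bits)
{
        if (!time_after(jiffies, md->congested_stamp))
                return md->congested_cache & bdi_bits;  /* still the same jiffy */

        /* refresh: query all bits once, remember the answer for one jiffy */
        md->congested_cache = __dm_any_congested(md, ~0);
        md->congested_stamp = jiffies;

        return md->congested_cache & bdi_bits;
}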

> Do you take some sleeping lock that might block forever? In that case I
> would just return congested for that device as it would delay pdflush
> otherwise.

Sometimes we do, and we shouldn't, because congested_fn is called with a
spinlock held. That's what I'm working on now.

BTW, how "realtime" do you expect the Linux kernel to be? Is there some
effort to keep every spinlock held only for a bounded time (so that
someone could one day calculate the hold times of all spinlocks, take the
maximum, and give Linux a "realtime" label)? If so, then the bdi_congested
call should be moved out of spinlocks. If there are already many places
where a spinlock can be taken and a list of unbounded length walked inside
it, it isn't worth optimizing for this.

Mikulas