Hi,
Currently, there is a restriction on resizing device-mapper device:
it cannot be resized if 'noflush' option is given for suspend.
Since both multipath-tools (dm-multipath) and LVM2 mirror use
'noflush' suspend, the restriction limits the capability of
those tools.
The following patches remove the restriction.
[PATCH] (1/3) Add bdlookup()
[PATCH] (2/3) Use bdlookup() on noflush suspend
[PATCH] (3/3) Allow resizing table load on noflush suspend
The patches were tested on i686 and ia64 with LVM2,
for both 'noflush' case and normal case.
Background:
- For some device-mapper targets (multipath and mirror),
the mapping table sometimes has to be replaced to cope with device
failure.
OTOH, device-mapper flushes all pending I/Os upon table replacement
and may result in I/O errors, if there are device failures.
'noflush' suspend is used to let dm queue the pending I/Os
instead of flushing them.
Since it's not possible for user space program to tell whether
the suspend could cause I/O error, they always use
'noflush' to suspend mirror/multipath targets.
- Currently resizing is disabled for 'noflush' suspend.
Resizing occurs in the course of table replacement.
To resize the device under use, device-mapper needs to get its
bdev inode. However, using bdget() in this case could cause deadlock
by waiting for I_LOCK where an I/O process holding I_LOCK is
waiting for completion of table replacement.
Thanks,
--
Jun'ichi Nomura, NEC Corporation of America
On Wed, Oct 24, 2007 at 05:25:17PM -0400, Jun'ichi Nomura wrote:
> - For some device-mapper targets (multipath and mirror),
> the mapping table sometimes has to be replaced to cope with device
> failure.
> OTOH, device-mapper flushes all pending I/Os upon table replacement
> and may result in I/O errors, if there are device failures.
> 'noflush' suspend is used to let dm queue the pending I/Os
> instead of flushing them.
> Since it's not possible for user space program to tell whether
> the suspend could cause I/O error, they always use
> 'noflush' to suspend mirror/multipath targets.
>
> - Currently resizing is disabled for 'noflush' suspend.
> Resizing occurs in the course of table replacement.
> To resize the device under use, device-mapper needs to get its
> bdev inode. However, using bdget() in this case could cause deadlock
> by waiting for I_LOCK where an I/O process holding I_LOCK is
> waiting for completion of table replacement.
Before reviewing the details of the proposed workaround, I'd like to see
a deeper analysis of the problem to see that there isn't a cleaner way
to resolve this.
For example:
Question) What are the realistic situations we must support that lead to
a resize during table reload with I/O outstanding?
- The resize is the purpose of the reload; noflush is only set to avoid losing
I/O if a path should fail. So any outstanding I/O may be expected to be
consistent with both the old and new sizes of the device. E.g. If it's
beyond the end of a shrinking device and userspace cared about not
losing that I/O, it would have waited for that I/O to be flushed
*before* issuing the resize. If the I/O is beyond the end of the
existing device but within the new size, userspace would have waited for
the resize operation to complete before allowing the new I/O to be
issued.
=> Is it OK for device-mapper to handle the device size check
internally, rejecting any I/O that falls beyond the end of the table (it
already must do this lookup anyway), and to update the size recorded in
the inode later, after I/O is flowing through the device again, but (of
course) before reporting that the resize operation is complete?
I.e. does it eliminate deadlocks if the bdget() and i_size_write()
happen after the 'resume'?
Alasdair
--
[email protected]
Hi Alasdair,
Alasdair G Kergon wrote:
> Before reviewing the details of the proposed workaround, I'd like to see
> a deeper analysis of the problem to see that there isn't a cleaner way
> to resolve this.
OK. Let me try.
> For example:
>
> Question) What are the realistic situations we must support that lead to
> a resize during table reload with I/O outstanding?
'noflush' is currently the only option for userspace to suspend without
risking otherwise-avoidable I/O lost, because failure of underlying
device might occur *after* the userspace started normal suspend
but *before* the flushing completes. (normal suspend = flushing suspend)
E.g. if userspace wants to resize dm-mpath device with queue_if_no_path,
it has to use 'noflush' suspend on table swapping.
Otherwise, path failures during the suspend will cause I/O error,
though queue_if_no_path is specified.
Other possible solution would be allowing the suspend to fail if it
can't flush all I/O and letting userspace to retry after fixing
the device failure.
> - The resize is the purpose of the reload; noflush is only set to avoid losing
> I/O if a path should fail. So any outstanding I/O may be expected to be
> consistent with both the old and new sizes of the device. E.g. If it's
> beyond the end of a shrinking device and userspace cared about not
> losing that I/O, it would have waited for that I/O to be flushed
> *before* issuing the resize. If the I/O is beyond the end of the
> existing device but within the new size, userspace would have waited for
> the resize operation to complete before allowing the new I/O to be
> issued.
If userspace cares about not losing I/O, it should wait for the I/O
before trying to shrink the device size.
After the shrinking started, any I/O beyond the new end of the device
would have a possibility of lost.
Reducing the size of the device while actively running I/O on it
would anyway have a possibility of losing some of the I/O.
> If the I/O is beyond the end of the
> existing device but within the new size, userspace would have waited for
> the resize operation to complete before allowing the new I/O to be
> issued.
Issuing the I/O beyond the end of the existing device would get error.
If the issuer knows the device will be extended, it should wait for
the completion of the extention.
If it doesn't know, such I/O won't be issued.
> => Is it OK for device-mapper to handle the device size check
> internally, rejecting any I/O that falls beyond the end of the table (it
I think this check is needed in current dm, regardless of noflush.
> already must do this lookup anyway), and to update the size recorded in
> the inode later, after I/O is flowing through the device again, but (of
> course) before reporting that the resize operation is complete?
> I.e. does it eliminate deadlocks if the bdget() and i_size_write()
> happen after the 'resume'?
There is no guarantee that the I/O flowing through the device again.
The table might need be replaced again, but to do that, the resume
should have been completed to let the userspace know it.
bdget() in noflush suspend has a possibility of stall.
Once it stalls, the remedy is userspace to replace the table with
working one. However, since bdget() occurs inside of suspend_lock,
it's not possible to run other instance of suspend/resume.
OTOH, calling bdget() and i_size_write() outside of the lock
can cause race with other table swapping and may result in setting
wrong device size.
Also, bdget() allocates new bdev inode if there isn't.
But dm wants to access bdev inode only when it exists in memory.
So something like bdlookup() will fit for the purpose, IMO.
Thanks,
--
Jun'ichi Nomura, NEC Corporation of America
On Thu, Oct 25, 2007 at 10:18:02AM -0400, Jun'ichi Nomura wrote:
> There is no guarantee that the I/O flowing through the device again.
> The table might need be replaced again, but to do that, the resume
> should have been completed to let the userspace know it.
Then the first attempt to set the size could be made to fail
(because it could not get the lock immediately) and the
size could be set after the second resume instead.
- Setting the size would lag behind the actual size the dm table was
supporting, but (given the usage cases discussed) this would not matter.
> bdget() in noflush suspend has a possibility of stall.
So we cannot avoid fixing that: we require immediate return
with failure instead of waiting.
> OTOH, calling bdget() and i_size_write() outside of the lock
> can cause race with other table swapping and may result in setting
> wrong device size.
If the size setting is removed from the lock, then it becomes
"set the inode size to match the current size of the table" and
races would not matter - each "set size" attempt would set it
to the instantaneous live table size, not a cached value that
could be out-of-date.
Alasdair
--
[email protected]
Hi Alasdair,
Thanks for the immediate reply.
So as far as I understand, the point is:
1. it's preferable to resize inode after the resume, if possible
2. it's necessary to have nowait version of bdget() to avoid stall
For the 1st item, I guess the reason is that we want to make the
table swapping code deterministic. Is it correct?
Even if we don't have to wait for bdget(), you like to have the resizing
code after the resume?
For the 2nd item, the patch included bdlookup() for that purpose.
What do you think about it?
I added below some additional comments and questions.
Alasdair G Kergon wrote:
> On Thu, Oct 25, 2007 at 10:18:02AM -0400, Jun'ichi Nomura wrote:
>> There is no guarantee that the I/O flowing through the device again.
>> The table might need be replaced again, but to do that, the resume
>> should have been completed to let the userspace know it.
>
> Then the first attempt to set the size could be made to fail
> (because it could not get the lock immediately) and the
> size could be set after the second resume instead.
>
> - Setting the size would lag behind the actual size the dm table was
> supporting, but (given the usage cases discussed) this would not matter.
>
>> bdget() in noflush suspend has a possibility of stall.
>
> So we cannot avoid fixing that: we require immediate return
> with failure instead of waiting.
bdlookup() is the almost nowait version of bdget()
("almost" means it may wait for I_NEW turned off in case there is
bdget() doing its latter half. But it never stalls.)
Do you think we need something else?
>> OTOH, calling bdget() and i_size_write() outside of the lock
>> can cause race with other table swapping and may result in setting
>> wrong device size.
>
> If the size setting is removed from the lock, then it becomes
> "set the inode size to match the current size of the table" and
> races would not matter - each "set size" attempt would set it
> to the instantaneous live table size, not a cached value that
> could be out-of-date.
Even though, I think we still want set_capacity() during the table
swapping. So we need to call dm_table_get_size() twice. Is it ok?
Also, do you want to move the resizing code for normal suspend, too?
Otherwise, the code will be duplicated and I don't think you like it.
Thanks,
--
Jun'ichi Nomura, NEC Corporation of America
Alasdair,
Jun'ichi Nomura wrote:
> So as far as I understand, the point is:
> 1. it's preferable to resize inode after the resume, if possible
Attached is a modified version of the (2/3) patch.
- The logic is basically the same as before.
- Moved the set-size function outside of the table swapping.
- Moved the uevent to the end of resize from the end of resume.
- Lightly tested on LVM2 mirror resizing.
- (1/3) and (3/3) are unchanged.
What do you think about this?
---
drivers/md/dm-ioctl.c | 4 ++
drivers/md/dm.c | 61 ++++++++++++++++++++++++++----------------
include/linux/device-mapper.h | 5 +++
3 files changed, 46 insertions(+), 24 deletions(-)
Index: linux-2.6.23.work/drivers/md/dm.c
===================================================================
--- linux-2.6.23.work.orig/drivers/md/dm.c
+++ linux-2.6.23.work/drivers/md/dm.c
@@ -1099,31 +1099,11 @@ static void event_callback(void *context
wake_up(&md->eventq);
}
-static void __set_size(struct mapped_device *md, sector_t size)
-{
- set_capacity(md->disk, size);
-
- mutex_lock(&md->suspended_bdev->bd_inode->i_mutex);
- i_size_write(md->suspended_bdev->bd_inode, (loff_t)size << SECTOR_SHIFT);
- mutex_unlock(&md->suspended_bdev->bd_inode->i_mutex);
-}
-
static int __bind(struct mapped_device *md, struct dm_table *t)
{
struct request_queue *q = md->queue;
- sector_t size;
-
- size = dm_table_get_size(t);
-
- /*
- * Wipe any geometry if the size of the table changed.
- */
- if (size != get_capacity(md->disk))
- memset(&md->geometry, 0, sizeof(md->geometry));
- if (md->suspended_bdev)
- __set_size(md, size);
- if (size == 0)
+ if (!dm_table_get_size(t))
return 0;
dm_table_get(t);
@@ -1497,8 +1477,6 @@ int dm_resume(struct mapped_device *md)
dm_table_unplug_all(map);
- kobject_uevent(&md->disk->kobj, KOBJ_CHANGE);
-
r = 0;
out:
@@ -1508,6 +1486,43 @@ out:
return r;
}
+void dm_set_size(struct mapped_device *md)
+{
+ struct dm_table *map;
+ struct block_device *bdev;
+ sector_t size;
+
+ down(&md->suspend_lock);
+
+ map = dm_get_table(md);
+ if (!map)
+ goto out;
+
+ size = dm_table_get_size(map);
+
+ /*
+ * Wipe any geometry if the size of the table changed.
+ */
+ if (size != get_capacity(md->disk))
+ memset(&md->geometry, 0, sizeof(md->geometry));
+
+ set_capacity(md->disk, size);
+
+ bdev = bdlookup_disk(md->disk, 0);
+ if (bdev) {
+ mutex_lock(&bdev->bd_mutex);
+ i_size_write(bdev->bd_inode, (loff_t)size << SECTOR_SHIFT);
+ mutex_unlock(&bdev->bd_mutex);
+ bdput(bdev);
+ }
+
+ kobject_uevent(&md->disk->kobj, KOBJ_CHANGE);
+
+out:
+ dm_table_put(map);
+ up(&md->suspend_lock);
+}
+
/*-----------------------------------------------------------------
* Event notification.
*---------------------------------------------------------------*/
Index: linux-2.6.23.work/drivers/md/dm-ioctl.c
===================================================================
--- linux-2.6.23.work.orig/drivers/md/dm-ioctl.c
+++ linux-2.6.23.work/drivers/md/dm-ioctl.c
@@ -840,8 +840,10 @@ static int do_resume(struct dm_ioctl *pa
if (dm_suspended(md))
r = dm_resume(md);
- if (!r)
+ if (!r) {
+ dm_set_size(md);
r = __dev_status(md, param);
+ }
dm_put(md);
return r;
Index: linux-2.6.23.work/include/linux/device-mapper.h
===================================================================
--- linux-2.6.23.work.orig/include/linux/device-mapper.h
+++ linux-2.6.23.work/include/linux/device-mapper.h
@@ -198,6 +198,11 @@ int dm_noflush_suspending(struct dm_targ
int dm_get_geometry(struct mapped_device *md, struct hd_geometry *geo);
int dm_set_geometry(struct mapped_device *md, struct hd_geometry *geo);
+/*
+ * Update device size
+ */
+void dm_set_size(struct mapped_device *md);
+
/*-----------------------------------------------------------------
* Functions for manipulating device-mapper tables.
On Thu, Oct 25, 2007 at 02:46:17PM -0400, Jun'ichi Nomura wrote:
> So as far as I understand, the point is:
> 1. it's preferable to resize inode after the resume, if possible
Not quite - I'm not expressing a preference yet.
I'm saying the patches you presented were one option to resolve the
problem, but it was not obvious to me they are the only option and I
wanted to see a deeper analysis (keeping in mind the general direction
I want the device-mapper code to be heading in).
In other words, an analysis that draws out the conflict at the heart of
this problem, and whether the problem we have with this locking is
intrinsic to the current design or just a matter of implementation that
can be fixed with some re-ordering or tweaks to the usage rules or types
of the locks.
Alasdair
--
[email protected]