Hello,
These patches provide a common representation of dependencies
between stacked devices (dm and md) in sysfs.
For example, if dm-0 maps to sda, we have the following symlinks:
/sys/block/dm-0/slaves/sda --> /sys/block/sda
/sys/block/sda/holders/dm-0 --> /sys/block/dm-0
This makes it easier for user space tools/scripts to find out
device dependencies.
Consider a complicated but quite normal situation like the one below:
We have a logical volume (dm-2) on md raid1 (md0), which is
built upon dm-multipath (dm-0, dm-1) on FC disks (sda .. sdd).
dm-2
 +-- md0
      |-- dm-0
      |    |-- sda
      |    +-- sdb
      |
      +-- dm-1
           |-- sdc
           +-- sdd
Though md0, dm-0, dm-1 and sd[a-d] contain the same LVM2 metadata,
LVM2 should pick up md0 as PV, not dm-0, dm-1 and sdXs.
mdadm should build md0 from dm-0 and dm-1, not from sdXs.
Similar things will happen on 'mount' and 'fsck' if we use
file system labels instead of LVM2.
Currently, each tool determines these relationships itself by
combining information such as the existence of md metadata
and the dm dependency ioctl.
With the patches, symlinks are created as shown below:
/sys/block/dm-2/slaves/md0 --> /sys/block/md0
/sys/block/md0/holders/dm-2 --> /sys/block/dm-2
/sys/block/md0/slaves/dm-1 --> /sys/block/dm-1
/sys/block/md0/slaves/dm-0 --> /sys/block/dm-0
/sys/block/dm-0/holders/md0 --> /sys/block/md0
/sys/block/dm-0/slaves/sda --> /sys/block/sda
/sys/block/sda/holders/dm-0 --> /sys/block/dm-0
...
Thus we only need to check the "holders" directory of a device
to decide whether the device is used by dm/md.
We can also walk down the "slaves" directories to collect
the devices composing a given dm/md device.
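
A minimal userspace sketch of those two lookups, assuming the
/sys/block/<dev>/{slaves,holders} layout proposed above. The function
names and the sysfs_root parameter are hypothetical and exist only so
the walk can be tried against a scratch directory:

```python
import os

def _entries(dev, subdir, sysfs_root="/sys/block"):
    """List the names under <sysfs_root>/<dev>/<subdir>, or [] if absent."""
    try:
        return sorted(os.listdir(os.path.join(sysfs_root, dev, subdir)))
    except FileNotFoundError:
        return []

def holders(dev, sysfs_root="/sys/block"):
    """Devices stacked directly on top of dev."""
    return _entries(dev, "holders", sysfs_root)

def slaves(dev, sysfs_root="/sys/block"):
    """Devices that dev is directly built on."""
    return _entries(dev, "slaves", sysfs_root)

def is_held(dev, sysfs_root="/sys/block"):
    """True if some dm/md device is using dev."""
    return bool(holders(dev, sysfs_root))

def collect_components(dev, sysfs_root="/sys/block"):
    """Walk down the slaves directories and return every device below dev."""
    seen = set()
    stack = [dev]
    while stack:
        for s in slaves(stack.pop(), sysfs_root):
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return sorted(seen)
```

With the example tree above, collect_components("dm-2") would visit
md0, then dm-0 and dm-1, then the four sd devices.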
The idea was raised on dm-devel by Lars last year,
but I couldn't find any follow-up with an actual implementation.
https://www.redhat.com/archives/dm-devel/2005-April/msg00040.html
Any comments?
--
Jun'ichi "Nick" Nomura, NEC Solutions (America), Inc.
This patch provides common functions to create symlinks in sysfs
between a stacked device and its slaves.
I placed the functions in fs/block_dev.c as some of them are
used privately by bd_claim().
I'm not sure whether they would be better placed in another file.
--
Jun'ichi Nomura, NEC Solutions (America), Inc.
This patch modifies the dm driver to create symlinks to/from
underlying devices.
--
Jun'ichi Nomura, NEC Solutions (America), Inc.
This patch modifies the md driver to create symlinks to/from
underlying devices.
--
Jun'ichi Nomura, NEC Solutions (America), Inc.
Make sure you test this properly under low memory situations.
On Fri, Feb 17, 2006 at 01:01:48PM -0500, Jun'ichi Nomura wrote:
> This patch provides common functions to create symlinks in sysfs
> between stacked device and its slaves.
dm_swap_table() mustn't block waiting for memory to become
free (except in a controlled way e.g. with a mempool, but
it would need more than that here).
Here, dm_swap_table() leads to kmalloc() getting called
in sysfs_add_link().
[e.g. Consider the extreme case where the dm device
you're changing is your swap device. While dm_swap_table()
runs, no I/O will get through to your swap device.]
If you can't avoid the sysfs code allocating memory, then
you must find a way of doing it before the dm suspend or
after the dm resume.
e.g. Do the sysfs memory allocations for the links prior to
the dm suspend [which may have happened in a previous system call]
and then use a different function to move them into place during
dm_swap_table() without performing further memory allocations?
[Lazy workaround is to set PF_MEMALLOC again...]
Alasdair
--
[email protected]
On Fri, Feb 17, 2006 at 01:00:17PM -0500, Jun'ichi Nomura wrote:
> These patches provide common representation of dependencies
> between stacked devices (dm and md) in sysfs.
I'm neutral on this change so long as it can be done without
introducing problems for device-mapper.
> Though md0, dm-0, dm-1 and sd[a-d] contain same LVM2 meta data,
> LVM2 should pick up md0 as PV, not dm-0, dm-1 and sdXs.
> mdadm should build md0 from dm-0 and dm-1, not from sdXs.
> Similar things will happen on 'mount' and 'fsck' if we use
> file system labels instead of LVM2.
I can't speak for the 'mount' code base, but I don't think it'll
make any significant difference to LVM2 - we'd still have to do
all the same device scanning as we do now because we have to be
aware of md devices defined in on-disk metadata regardless of
whether or not the kernel knows about them at the time the
command is run.
> Currently, these relationships are determined by each tool
> combining information like the existence of md metadata
> and dm dependency ioctl.
And attempts to open a device exclusively. That's one check LVM2
does before running 'pvcreate' on a device.
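
A rough illustration of such an exclusive-open check, not LVM2's
actual code. It relies on the Linux behaviour that open() with O_EXCL
(and without O_CREAT) on a block device fails with EBUSY when the
device is in use. The function name and the injectable opener
parameter are hypothetical:

```python
import errno
import os

def is_exclusively_held(path, opener=os.open):
    """Return True if an exclusive open of the block device reports EBUSY.

    `opener` defaults to os.open; it is a parameter only so the check
    can be exercised without a real block device.
    """
    try:
        fd = opener(path, os.O_RDONLY | os.O_EXCL)
    except OSError as e:
        if e.errno == errno.EBUSY:
            return True   # someone else (dm, md, a mount, ...) holds it
        raise             # any other failure is a real error
    os.close(fd)
    return False
```

A caller might run is_exclusively_held("/dev/sda") and refuse to
initialise the device when it returns True.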
> thus we only need to check "holders" directory of the device
> to decide whether the device is used by dm/md.
> Also we can walk down the "slaves" directories to collect
> the devices composing the given dm/md device.
For device-mapper devices, 'dmsetup deps' and 'dmsetup ls --tree'
already give you this information reasonably efficiently.
Would others find the proposal useful for non-dm devices?
And rather than adding code just to dm and md, would it be better
to implement it by enhancing bd_claim()?
Alasdair
--
[email protected]
Hi Alasdair,
Thank you for the comments.
Alasdair G Kergon wrote:
> Make sure you test this properly under low memory situations.
OK.
> dm_swap_table() mustn't block waiting for memory to become
> free (except in a controlled way e.g. with a mempool, but
> it would need more than that here).
>
> Here, dm_swap_table() leads to kmalloc() getting called
> in sysfs_add_link().
I moved the sysfs_add_link() to the last part of dm_resume().
Directory creation/deletion has also been moved into the alloc_dev()/
free_dev() functions, as kobject_register()/kobject_unregister() may
allocate memory.
I didn't change the symlink removal part, as it doesn't
allocate memory.
Do you think this is an acceptable approach?
--
Jun'ichi Nomura, NEC Solutions (America), Inc.
Hi,
Alasdair G Kergon wrote:
> I can't speak for the 'mount' code base, but I don't think it'll
> make any significant difference to LVM2 - we'd still have to do
> all the same device scanning as we do now because we have to be
> aware of md devices defined in on-disk metadata regardless of
> whether or not the kernel knows about them at the time the
> command is run.
Actually, as you say, LVM2 already does the relationship analysis
correctly by itself. So it's not a 'good' example...
The point was that dm and md have a similar dependency
structure, but currently we have to scan all devices to
find out the upward relationships, using different methods
for dm and md.
>>thus we only need to check "holders" directory of the device
>>to decide whether the device is used by dm/md.
>>Also we can walk down the "slaves" directories to collect
>>the devices composing the given dm/md device.
>
> For device-mapper devices, 'dmsetup deps' and ls --tree already
> gives you this information reasonably efficiently.
Speaking about the efficiency, 'dmsetup ls --tree' works well.
However, I haven't yet found an efficient way to implement
'dmsetup info --tree -o inverted dm-0', for example.
The deps ioctl provides downward information for a given dm device,
but there is no method for obtaining upward information.
Providing a reverse-deps ioctl in dm may be an alternative solution,
but it still wouldn't provide the holders of non-dm devices.
So I feel the sysfs solution is appealing.
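
To make the cost concrete: with only downward information, a tool
has to collect the dependencies of every device and invert the whole
map before it can answer "who holds dm-0?". A small sketch of that
inversion follows. The deps dictionary is hypothetical input (e.g.
gathered via the deps ioctl for dm and by other means for md), and
the function names are made up for illustration:

```python
from collections import defaultdict

def invert_deps(deps):
    """Turn {device: [devices it is built on]} into {device: [its holders]}."""
    holders = defaultdict(set)
    for dev, below in deps.items():
        for slave in below:
            holders[slave].add(dev)
    return {dev: sorted(h) for dev, h in holders.items()}

def all_holders(dev, holders):
    """Every device stacked above dev, directly or indirectly."""
    found = set()
    stack = [dev]
    while stack:
        for h in holders.get(stack.pop(), []):
            if h not in found:
                found.add(h)
                stack.append(h)
    return sorted(found)
```

With "holders" directories in sysfs, the same upward walk needs only
the directories of the devices actually on the path, not a scan of
everything.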
> Would others find the proposal useful for non-dm devices?
I would appreciate comments from others as well.
> And rather than adding code just to dm and md, would it be better
> to implement it by enhancing bd_claim()?
It may be possible if I can extend bd_claim() to accept an
additional parameter: all dm devices use the same 'holder'
signature for bd_claim(), but the actual owner of the claim must
be identified in order to create the symlinks.
--
Jun'ichi Nomura, NEC Solutions (America), Inc.
On Feb 17, 2006, at 14:42, Alasdair G Kergon wrote:
> On Fri, Feb 17, 2006 at 01:00:17PM -0500, Jun'ichi Nomura wrote:
>> Though md0, dm-0, dm-1 and sd[a-d] contain same LVM2 meta data,
>> LVM2 should pick up md0 as PV, not dm-0, dm-1 and sdXs. mdadm
>> should build md0 from dm-0 and dm-1, not from sdXs. Similar things
>> will happen on 'mount' and 'fsck' if we use file system labels
>> instead of LVM2.
>
> I can't speak for the 'mount' code base, but I don't think it'll
> make any significant difference to LVM2 - we'd still have to do all
> the same device scanning as we do now because we have to be aware
> of md devices defined in on-disk metadata regardless of whether or
> not the kernel knows about them at the time the command is run.
Aha! This is a very valid reason why we should export partition
types from the kernel to userspace: Partitions/devices that appear
to have 2 different filesystems/formats. The _kernel_ cannot
reliably tell which to use. On the other hand, a properly configured
_userspace_ initramfs could use configured partition-type
information, a small config file, and a user-configurable detection
algorithm to figure out that the device is _actually_ the first
segment of an ext3-on-LVM-on-RAID1, instead of a raw ext3, and mount
it appropriately. Now, this requires that the admin correctly
specify the partition types, but that seems a bit more reliable than
depending on the probe-order to get things right.
Cheers,
Kyle Moffett
--
Unix was not designed to stop people from doing stupid things,
because that would also stop them from doing clever things.
-- Doug Gwyn
On Fri, Feb 17, 2006 at 08:03:48PM -0500, Jun'ichi Nomura wrote:
> I moved the sysfs_add_link() to the last part of dm_resume().
Test with trees of devices too - where a whole tree is suspended -
I don't think you can allocate anywhere in dm_swap_table()
without PF_MEMALLOC (which I recently removed and am reluctant
to reinstate).
Have you considered if anything is feasible based around bd_claim()?
Doesn't it make more sense for the links to be set up at table
load time - i.e. superset of both tables if present?
Alasdair
--
[email protected]
On Fri, Feb 17, 2006 at 08:21:32PM -0500, Jun'ichi Nomura wrote:
> Speaking about the efficiency, 'dmsetup ls --tree' works well.
> However, I haven't yet found an efficient way to implement
> 'dmsetup info --tree -o inverted dm-0', for example.
Indeed - but what needs this that doesn't also need to scan
everything? mount?
Alasdair
--
[email protected]
On Friday February 17, [email protected] wrote:
> Hello,
>
> These patches provide common representation of dependencies
> between stacked devices (dm and md) in sysfs.
> For example, if dm-0 maps to sda, we have the following symlinks;
> /sys/block/dm-0/slaves/sda --> /sys/block/sda
> /sys/block/sda/holders/dm-0 --> /sys/block/dm-0
I'm happy with the idea of having these links.
I agree that it would be nice to have this very strongly based on the
bd_claim infrastructure.
It would be really nice if bd_claim took a "kobject *" rather than a
"void *" and put a link in there. This would be easy for dm and md,
but awkward for other claimers like filesystems and open file
descriptors as they don't currently have kobjects.
Possibly an extra flag that says if the 'holder' is a kobject or not,
and if it is, appropriate symlinks are created...??
NeilBrown
Thanks Alasdair and Neil,
Alasdair G Kergon wrote:
> Test with trees of devices too - where a whole tree is suspended -
Suspending maps in the tree and reloading one of them?
I'll try that.
> I don't think you can allocate anywhere in dm_swap_table()
> without PF_MEMALLOC (which I recently removed and am reluctant
> to reinstate).
I understand your reluctance and I don't want to revive it either.
I think moving sysfs_add_link() outside of dm_swap_table() solves
this. Am I right?
Or do you want to eliminate the possibility that sysfs_remove_symlink()
may require memory allocation in future?
Anyway, I'll look into a bd_claim-based approach.
> Have you considered if anything is feasible based around bd_claim()?
> Doesn't it make more sense for the links to be set up at table
> load time - i.e. superset of both tables if present?
I think it makes sense, but I have some difficulty with it.
What I once considered was extending bd_claim() like this:
bd_claim_with_owner(bdev, void *holder, struct kobject *owner)
where "owner" is a kobject for the "slaves" directory.
We may have the object embedded in the gendisk structure.
Then we can create symlinks like:
/sys/block/<bdev>/holders/<owner> --> /sys/block/<owner>
/sys/block/<owner>/slaves/<bdev> --> /sys/block/<bdev>
This should work for md.
However, dm needs more than that because of its flexibility.
Because multiple dm devices can hold one device, and one dm device
can hold a device twice (i.e. current table and new table),
we need to reference-count on a per-relationship basis, not per
slave device.
This might be solved by allocating a management struct in bd_claim()
to reference-count the relationship.
I'll try this. Comments are welcome.
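
The per-relationship counting can be modelled in a few lines of
userspace Python. This is only a toy model of the bookkeeping, not
kernel code: the class and method names are hypothetical, and a set
of path strings stands in for the real sysfs symlinks:

```python
class LinkTable:
    """Reference-count (owner, bdev) pairs; links exist while count > 0."""

    def __init__(self):
        self._refs = {}      # (owner, bdev) -> number of outstanding claims
        self.links = set()   # stand-in for the symlinks present in sysfs

    @staticmethod
    def _paths(owner, bdev):
        return (f"/sys/block/{bdev}/holders/{owner}",
                f"/sys/block/{owner}/slaves/{bdev}")

    def claim(self, owner, bdev):
        key = (owner, bdev)
        count = self._refs.get(key, 0)
        if count == 0:
            # First claim for this relationship: create both links.
            self.links.update(self._paths(owner, bdev))
        self._refs[key] = count + 1

    def release(self, owner, bdev):
        key = (owner, bdev)
        count = self._refs[key]          # KeyError if never claimed
        if count == 1:
            # Last claim gone: remove both links.
            del self._refs[key]
            self.links.difference_update(self._paths(owner, bdev))
        else:
            self._refs[key] = count - 1
```

Here dm-0 claiming sda twice (current and new table) and then
dropping one claim leaves its links in place, and a second holder
such as dm-1 is tracked independently of dm-0.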
--
Jun'ichi Nomura, NEC Solutions (America), Inc.
Alasdair G Kergon wrote:
>>Speaking about the efficiency, 'dmsetup ls --tree' works well.
>>However, I haven't yet found an efficient way to implement
>>'dmsetup info --tree -o inverted dm-0', for example.
>
> Indeed - but what needs this that doesn't also need to scan
> everything? mount?
mount, fsck and other blkid-based tools could be optimized with it.
However, what I had in mind was system administration tasks, such as
simply using dmsetup or looking at /sys to check where a device belongs.
--
Jun'ichi Nomura, NEC Solutions (America), Inc.
On Tue, Feb 21, 2006 at 10:33:40AM -0500, Jun'ichi Nomura wrote:
> Alasdair G Kergon wrote:
> >Test with trees of devices too - where a whole tree is suspended -
> Suspending maps in the tree and reloading one of them?
Reload a complete tree of devices like lvm2 does:
It loads inactive tables wherever it needs to in the tree,
then suspends the devices in the correct order (according to
the dependencies of the live tables to avoid ever 'trapping' I/O
between two devices), then resumes them in order.
> >I don't think you can allocate anywhere in dm_swap_table()
> >without PF_MEMALLOC (which I recently removed and am reluctant
> >to reinstate).
> I understand your reluctance and I don't want to revive it either.
> I think moving sysfs_add_link() outside of dm_swap_table() solves
> this. Am I right?
I should have said: try hard to avoid allocations in any code run
during the 'DM_SUSPEND' ioctl - if you really have to, your options
include PF_MEMALLOC or a mempool, as appropriate.
> Or do you want to eliminate the possibility that sysfs_remove_symlink()
> may require memory allocation in future?
Either that, or:
> Anyway, I'll look into a bd_claim-based approach.
This dodges the allocation problem because it happens in the DM_TABLE_LOAD
ioctl where I was able to remove the restriction recently.
Alasdair
--
[email protected]