2008-08-25 20:48:39

by Karel Zak

[permalink] [raw]
Subject: [PATCH] blkid: optimize dm_device_is_leaf() usage

* devname.c: probe_one(): call dm_device_is_leaf() with *constant*
devno argument in the list_for_each_safe loop does not make sense at
all.

* devname.c: probe_one(): call dm_device_is_leaf() for already
checked DM devices is unnecessary. DM devices are already checked
in dm_probe_all(). The dm_probe_all() function never returns slave
devices.

* devname.c: dm_device_is_leaf(): stop checking for dependences when
a first dependence is detected.

Performance test:

# for i in $(seq 1 100); do dmsetup create $i --table "0 1 zero"; done

new version:

# blkid &> /dev/null

real 0m0.267s
user 0m0.047s
sys 0m0.212s

old version:

# blkid &> /dev/null

real 0m2.149s
user 0m0.291s
sys 0m1.820s

Signed-off-by: Karel Zak <[email protected]>
---
lib/blkid/devname.c | 37 ++++++++++++++++++++-----------------
1 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/lib/blkid/devname.c b/lib/blkid/devname.c
index 86fd44c..4967b96 100644
--- a/lib/blkid/devname.c
+++ b/lib/blkid/devname.c
@@ -133,25 +133,27 @@ static void probe_one(blkid_cache cache, const char *ptname,
const char **dir;
char *devname = NULL;

- /* See if we already have this device number in the cache. */
- list_for_each_safe(p, pnext, &cache->bic_devs) {
- blkid_dev tmp = list_entry(p, struct blkid_struct_dev,
- bid_devs);
#ifdef HAVE_DEVMAPPER
- if (!dm_device_is_leaf(devno))
- continue;
+ /* call dm_device_is_leaf() for non-DM device only */
+ if (pri == BLKID_PRI_DM || dm_device_is_leaf(devno))
#endif
- if (tmp->bid_devno == devno) {
- if (only_if_new && !access(tmp->bid_name, F_OK))
- return;
- dev = blkid_verify(cache, tmp);
- if (dev && (dev->bid_flags & BLKID_BID_FL_VERIFIED))
- break;
- dev = 0;
+ {
+ /* See if we already have this device number in the cache. */
+ list_for_each_safe(p, pnext, &cache->bic_devs) {
+ blkid_dev tmp = list_entry(p, struct blkid_struct_dev,
+ bid_devs);
+ if (tmp->bid_devno == devno) {
+ if (only_if_new && !access(tmp->bid_name, F_OK))
+ return;
+ dev = blkid_verify(cache, tmp);
+ if (dev && (dev->bid_flags & BLKID_BID_FL_VERIFIED))
+ break;
+ dev = 0;
+ }
}
+ if (dev && dev->bid_devno == devno)
+ goto set_pri;
}
- if (dev && dev->bid_devno == devno)
- goto set_pri;

/*
* Take a quick look at /dev/ptname for the device number. We check
@@ -271,9 +273,10 @@ static int dm_device_is_leaf(const dev_t dev)
do {
names = (struct dm_names *) ((char *)names + next);

- if (dm_device_has_dep(dev, names->name))
+ if (dm_device_has_dep(dev, names->name)) {
ret = 0;


2008-08-26 12:24:08

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] blkid: optimize dm_device_is_leaf() usage

Hi Karel,

Thanks! Your patch forced me to look much more closely at the device
mapper code, and after looking at it, I've been led to the inescapable
conclusion that the whole thing is total cr*p. :-)

My only excuse was that at the time when I accepted the patch, I
wasn't using a device-mapper based system, and couldn't do live
testing of the code, so I didn't realize how much of what it was doing
was totally unnecessary.

What I've commited into the git tree completely rips out the need to
use the device mapper library, and instead relies on the already
existing and well-tested code paths that supports non-dm devices.
This has the nice side benefit that initrd's that want to use blkid
will no longer need to pull in libdevmapper, libselinux, and libsepol
--- or at least, if initrd's need them, it will no longer be on
/sbin/blkid's account. And, we can yank all of the configure
machinery for including libdevmapper, et. al., from e2fsprogs.

It has an even more salutory performance benefit (about 3-6 times
faster than yours, I estimate, since we don't end up calling into
libdevmapper *at* *all*). On an X61s laptop, with:

# for i in $(seq 1 100); do dmsetup create $i --table "0 1 zero"; done

old version:

# time /sbin/blkid >& /dev/null

real 0m4.227s
user 0m0.623s
sys 0m3.406s

new version:

# time ./misc/blkid >& /dev/null

real 0m0.080s
user 0m0.000s
sys 0m0.010s

- Ted


commit 3f66ecf24e896377997b909edef040be98ac76b3
Author: Theodore Ts'o <[email protected]>
Date: Tue Aug 26 08:13:56 2008 -0400

libblkid: Optimize devicemapper support

This commit works by removing all calls from libdevicemapper
altogether, and using the standard support for "normal" non-dm
devices.

It depends on dm devices being placed in /dev/mapper (but the previous
code had this dependency anyway), and /proc/partitions containing dm
devices.

We don't actually rip out the libdevicemapper code in this commit, but
just disable it via #undef HAVE_DEVMAPPER, just so it's easier to
review and understand the changes that were made. A subsequent commit
will remove the libdevicemapper code, as well as unexport blkid_devdirs.

Thanks to Karel Zak for inspiring me to look at the dm code in blkid,
so I could realize how much it deserved to ripped out by its roots. :-)

Signed-off-by: "Theodore Ts'o" <[email protected]>

diff --git a/lib/blkid/devname.c b/lib/blkid/devname.c
index 86fd44c..5f9c228 100644
--- a/lib/blkid/devname.c
+++ b/lib/blkid/devname.c
@@ -38,6 +38,8 @@

#include "blkidP.h"

+#undef HAVE_DEVMAPPER
+
#ifdef HAVE_DEVMAPPER
#include <libdevmapper.h>
#endif
@@ -122,6 +124,9 @@ blkid_dev blkid_get_dev(blkid_cache cache, const char *devname, int flags)
static int dm_device_is_leaf(const dev_t dev);
#endif

+/* Directories where we will try to search for device names */
+static const char *dirlist[] = { "/dev", "/dev/mapper", "/devfs", "/devices", NULL };
+
/*
* Probe a single block device to add to the device cache.
*/
@@ -159,17 +164,17 @@ static void probe_one(blkid_cache cache, const char *ptname,
* the stat information doesn't check out, use blkid_devno_to_devname()
* to find it via an exhaustive search for the device major/minor.
*/
- for (dir = blkid_devdirs; *dir; dir++) {
+ for (dir = dirlist; *dir; dir++) {
struct stat st;
char device[256];

- sprintf(device, "%s/%s", *dir, ptname);
if ((dev = blkid_get_dev(cache, device, BLKID_DEV_FIND)) &&
dev->bid_devno == devno)
goto set_pri;

if (stat(device, &st) == 0 && S_ISBLK(st.st_mode) &&
st.st_rdev == devno) {
+ sprintf(device, "%s/%s", *dir, ptname);
devname = blkid_strdup(device);
break;
}
@@ -183,10 +188,14 @@ static void probe_one(blkid_cache cache, const char *ptname,
free(devname);

set_pri:
- if (!pri && !strncmp(ptname, "md", 2))
- pri = BLKID_PRI_MD;
- if (dev)
- dev->bid_pri = pri;
+ if (dev) {
+ if (pri)
+ dev->bid_pri = pri;
+ else if (!strncmp(dev->bid_name, "/dev/mapper/", 11))
+ dev->bid_pri = BLKID_PRI_DM;
+ else if (!strncmp(ptname, "md", 2))
+ dev->bid_pri = BLKID_PRI_MD;
+ }
return;
}


2008-08-26 13:51:07

by Karel Zak

[permalink] [raw]
Subject: Re: [PATCH] blkid: optimize dm_device_is_leaf() usage

On Tue, Aug 26, 2008 at 08:24:05AM -0400, Theodore Tso wrote:
> commit 3f66ecf24e896377997b909edef040be98ac76b3
> Author: Theodore Ts'o <[email protected]>
> Date: Tue Aug 26 08:13:56 2008 -0400
>
> libblkid: Optimize devicemapper support
>
> This commit works by removing all calls from libdevicemapper
> altogether, and using the standard support for "normal" non-dm
> devices.
>
> It depends on dm devices being placed in /dev/mapper (but the previous
> code had this dependency anyway), and /proc/partitions containing dm
> devices.

Well, I see few problems:

* /proc/partitions containing internal dm device names (e.g. dm-0).
The libdevmapper provides translation from internal to the "real"
names (e.g /dev/mapper/foo). I guess (hope:-) /sys provides the
real names too.

* we probably need to resolve dependencies for multi-path devices
where the same FS is accessable by more than one physical device.
If I good remember it was the original purpose for DM support
in libblkid.

# mount LABEL=blabla /mnt

has to mount the "right" device. I guess that only DM is able to
answer which device is the "right" one ;-)

The /sys/block/<devname>/slaves/ provides information about
dependencies.

I see these two things as critical for fsck, mount, ...

>
> +/* Directories where we will try to search for device names */
> +static const char *dirlist[] = { "/dev", "/dev/mapper", "/devfs", "/devices", NULL };

I think "/dev/mapper" does not make any sense here, ...because the
names from /proc/partitions are in dm-<N> format, but the names in
/dev/mapper uses different format.

> - if (!pri && !strncmp(ptname, "md", 2))
> - pri = BLKID_PRI_MD;
> - if (dev)
> - dev->bid_pri = pri;
> + if (dev) {
> + if (pri)
> + dev->bid_pri = pri;
> + else if (!strncmp(dev->bid_name, "/dev/mapper/", 11))
> + dev->bid_pri = BLKID_PRI_DM;

the same problem

> + else if (!strncmp(ptname, "md", 2))
> + dev->bid_pri = BLKID_PRI_MD;
> + }
> return;
> }

Karel

--
Karel Zak <[email protected]>

2008-08-26 14:47:23

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] blkid: optimize dm_device_is_leaf() usage

On Tue, Aug 26, 2008 at 03:51:02PM +0200, Karel Zak wrote:
> Well, I see few problems:
>
> * /proc/partitions containing internal dm device names (e.g. dm-0).
> The libdevmapper provides translation from internal to the "real"
> names (e.g /dev/mapper/foo). I guess (hope:-) /sys provides the
> real names too.

You're right. So seaching for /dev/mapper/dm-n doesn't make any
sense; adding /dev/mapper to the dirlist doesn't help, and in fact is
a waste of time. However, the patch actually *did* work, and the
reason why it does is because we are also are searching /dev/mapper by
device number, and so we are finding the device name that way.


> * we probably need to resolve dependencies for multi-path devices
> where the same FS is accessable by more than one physical device.
> If I good remember it was the original purpose for DM support
> in libblkid.
>
> # mount LABEL=blabla /mnt
>
> has to mount the "right" device. I guess that only DM is able to
> answer which device is the "right" one ;-)

I don't think you mean multipath support in terms of where there are
multiple paths to the same physical device ala fiber channel, but
rather where are multiple devices which are built on each other,
right? So where /dev/sda is used to create the LVM PV's, and so on,
right?

> The /sys/block/<devname>/slaves/ provides information about
> dependencies.
>
> I see these two things as critical for fsck, mount, ...

Right. And here, we are solving the problem already in that we prefer
the /dev/mapper/* names over names like /dev/sda and /dev/sdb. That's
what the priority field is all about.

What we don't solve is the problem where one devicemapper device is
used to build another device mapper device. This could happen in a
number of circumstances. You might have some wierd circumstance where
/dev/mapper/part1 and /dev/mapper/part2 are glued together to make
/dev/mapper/whole-filesystem. Why you might do this instead of simply
using something like lvextend is beyond me, but that is something
legitimate can be done with the low-level device mapper primitives.

But, #1, there are times when picking a leaf dm device over a non-leaf
dm device is not the right thing to do (which would be the case when
you make a live snapshot of a filesystem), and #2, your patch only
checks non-leaf dm devices for non-dm devices, probably because of #1.

So with both of our patches, we have the problem where we could pick
the wrong dm device if the user builds one dm device on top of another
dm device. (Although for transient devices, such as temporary
snapshots, if the permanent devices is already in /etc/blkid.tab, the
cache makes us win since we'll prefer the device already in the cache
file to the other.)

But in terms of making sure we pick the device mapper file over the
raw file, which is the 99.99% common case, we do the right thing, and
we can do the right thing without calling the devmapper library.

> > +/* Directories where we will try to search for device names */
> > +static const char *dirlist[] = { "/dev", "/dev/mapper", "/devfs", "/devices", NULL };
>
> I think "/dev/mapper" does not make any sense here, ...because the
> names from /proc/partitions are in dm-<N> format, but the names in
> /dev/mapper uses different format.

Yep, I should remove that.

> > + if (dev) {
> > + if (pri)
> > + dev->bid_pri = pri;
> > + else if (!strncmp(dev->bid_name, "/dev/mapper/", 11))
> > + dev->bid_pri = BLKID_PRI_DM;
>
> the same problem

This does work, because we do find the /dev/mapper name via a
brute-force search of /dev looking for a matching devno when we call
blkid_devno_to_devname(). What I *can* do is do a special search of
/dev/mapper first, but instead of looking for /dev/mapper/<ptname>, to
do a readdir search of /dev/mapper looking for the matching devno.
That's merely an optimization, which doesn't really matter for Linux
since stat's are so fast. But it's probably worth doing.

- Ted

2008-08-26 18:04:45

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] blkid: optimize dm_device_is_leaf() usage

Here's the new patch.

- Ted

commit 2e6c2735e19ec19821eeff1cbdf11c09c25540f0
Author: Theodore Ts'o <[email protected]>
Date: Tue Aug 26 08:13:56 2008 -0400

libblkid: Optimize devicemapper support

This commit works by removing all calls from libdevmapper altogether,
and using the standard support for "normal" non-dm devices.

It depends on dm devices being placed in /dev/mapper (but the previous
code had this dependency anyway), and /proc/partitions containing dm
devices.

We don't actually rip out the libdevmapper code in this commit, but
just disable it via #undef HAVE_DEVMAPPER, just so it's easier to
review and understand the fundamental code changes. A subsequent
commit will remove the libdevmapper code, as well as unexport
the blkid_devdirs string array.

Thanks to Karel Zak for inspiring me to look at the dm code in blkid,
so I could realize how much it deserved to ripped out by its roots. :-)

Signed-off-by: "Theodore Ts'o" <[email protected]>

diff --git a/lib/blkid/blkidP.h b/lib/blkid/blkidP.h
index dfe0eca..9d1e459 100644
--- a/lib/blkid/blkidP.h
+++ b/lib/blkid/blkidP.h
@@ -153,6 +153,13 @@ extern void blkid_debug_dump_dev(blkid_dev dev);
extern void blkid_debug_dump_tag(blkid_tag tag);
#endif

+/* devno.c */
+struct dir_list {
+ char *name;
+ struct dir_list *next;
+};
+extern void blkid__scan_dir(char *, dev_t, struct dir_list **, char **);
+
/* lseek.c */
extern blkid_loff_t blkid_llseek(int fd, blkid_loff_t offset, int whence);

diff --git a/lib/blkid/devname.c b/lib/blkid/devname.c
index 86fd44c..ec3cff3 100644
--- a/lib/blkid/devname.c
+++ b/lib/blkid/devname.c
@@ -38,6 +38,8 @@

#include "blkidP.h"

+#undef HAVE_DEVMAPPER
+
#ifdef HAVE_DEVMAPPER
#include <libdevmapper.h>
#endif
@@ -122,6 +124,9 @@ blkid_dev blkid_get_dev(blkid_cache cache, const char *devname, int flags)
static int dm_device_is_leaf(const dev_t dev);
#endif

+/* Directories where we will try to search for device names */
+static const char *dirlist[] = { "/dev", "/devfs", "/devices", NULL };
+
/*
* Probe a single block device to add to the device cache.
*/
@@ -159,7 +164,7 @@ static void probe_one(blkid_cache cache, const char *ptname,
* the stat information doesn't check out, use blkid_devno_to_devname()
* to find it via an exhaustive search for the device major/minor.
*/
- for (dir = blkid_devdirs; *dir; dir++) {
+ for (dir = dirlist; *dir; dir++) {
struct stat st;
char device[256];

@@ -174,6 +179,9 @@ static void probe_one(blkid_cache cache, const char *ptname,
break;
}
}
+ /* Do a short-cut scan of /dev/mapper first */
+ if (!devname)
+ blkid__scan_dir("/dev/mapper", devno, 0, &devname);
if (!devname) {
devname = blkid_devno_to_devname(devno);
if (!devname)
@@ -183,10 +191,14 @@ static void probe_one(blkid_cache cache, const char *ptname,
free(devname);

set_pri:
- if (!pri && !strncmp(ptname, "md", 2))
- pri = BLKID_PRI_MD;
- if (dev)
- dev->bid_pri = pri;
+ if (dev) {
+ if (pri)
+ dev->bid_pri = pri;
+ else if (!strncmp(dev->bid_name, "/dev/mapper/", 11))
+ dev->bid_pri = BLKID_PRI_DM;
+ else if (!strncmp(ptname, "md", 2))
+ dev->bid_pri = BLKID_PRI_MD;
+ }
return;
}

diff --git a/lib/blkid/devno.c b/lib/blkid/devno.c
index 61b34bf..1962c8d 100644
--- a/lib/blkid/devno.c
+++ b/lib/blkid/devno.c
@@ -33,11 +33,6 @@

#include "blkidP.h"

-struct dir_list {
- char *name;
- struct dir_list *next;
-};
-
char *blkid_strndup(const char *s, int length)
{
char *ret;
@@ -95,8 +90,8 @@ static void free_dirlist(struct dir_list **list)
*list = NULL;
}

-static void scan_dir(char *dirname, dev_t devno, struct dir_list **list,
- char **devname)
+void blkid__scan_dir(char *dirname, dev_t devno, struct dir_list **list,
+ char **devname)
{
DIR *dir;
struct dirent *dp;
@@ -127,7 +122,7 @@ static void scan_dir(char *dirname, dev_t devno, struct dir_list **list,
path, *devname));
break;
}
- if (S_ISDIR(st.st_mode) && !lstat(path, &st) &&
+ if (list && S_ISDIR(st.st_mode) && !lstat(path, &st) &&
S_ISDIR(st.st_mode))
add_to_dirlist(path, list);
}
@@ -161,7 +156,7 @@ char *blkid_devno_to_devname(dev_t devno)

list = list->next;
DBG(DEBUG_DEVNO, printf("directory %s\n", current->name));
- scan_dir(current->name, devno, &new_list, &devname);
+ blkid__scan_dir(current->name, devno, &new_list, &devname);
free(current->name);
free(current);
if (devname)

2008-08-26 19:44:49

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] blkid: optimize dm_device_is_leaf() usage

On Aug 26, 2008 10:47 -0400, Theodore Ts'o wrote:
> I don't think you mean multipath support in terms of where there are
> multiple paths to the same physical device ala fiber channel, but
> rather where are multiple devices which are built on each other,
> right? So where /dev/sda is used to create the LVM PV's, and so on,
> right?

No, in fact DM has actual mutliple-paths-to-the-same-device support,
via "dm-multipath".

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-08-26 20:00:31

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] blkid: optimize dm_device_is_leaf() usage

On Tue, Aug 26, 2008 at 01:44:45PM -0600, Andreas Dilger wrote:
> On Aug 26, 2008 10:47 -0400, Theodore Ts'o wrote:
> > I don't think you mean multipath support in terms of where there are
> > multiple paths to the same physical device ala fiber channel, but
> > rather where are multiple devices which are built on each other,
> > right? So where /dev/sda is used to create the LVM PV's, and so on,
> > right?
>
> No, in fact DM has actual mutliple-paths-to-the-same-device support,
> via "dm-multipath".

Well, *that* we have, as long as the parent devices are real
(non-devicemapper) devices.

So if /dev/sdc and /dev/sdg are both paths to the same filesystem, and
dm-multipath has created /dev/mapper/filesystem as the multipath
device to that filesystem, any devices with /dev/mapper have priority
over non-dm devices, so /dev/mapper/filesystem will get returned
first.

- Ted

2008-08-26 20:47:42

by Karel Zak

[permalink] [raw]
Subject: Re: [PATCH] blkid: optimize dm_device_is_leaf() usage

On Tue, Aug 26, 2008 at 10:47:21AM -0400, Theodore Tso wrote:
> On Tue, Aug 26, 2008 at 03:51:02PM +0200, Karel Zak wrote:
> > Well, I see few problems:
> >
> > * /proc/partitions containing internal dm device names (e.g. dm-0).
> > The libdevmapper provides translation from internal to the "real"
> > names (e.g /dev/mapper/foo). I guess (hope:-) /sys provides the
> > real names too.
>
> You're right. So seaching for /dev/mapper/dm-n doesn't make any
> sense; adding /dev/mapper to the dirlist doesn't help, and in fact is
> a waste of time. However, the patch actually *did* work, and the
> reason why it does is because we are also are searching /dev/mapper by
> device number, and so we are finding the device name that way.

OK.

> I don't think you mean multipath support in terms of where there are
> multiple paths to the same physical device ala fiber channel, but

Ignore this point. You are right. The physical devices are slaves to
the final DM device (when dm-multipath is on). BTW, I look forward to
see multiple paths vs. udev (e.g. /dev/disk/by-* ) :-)

> What we don't solve is the problem where one devicemapper device is
> used to build another device mapper device. This could happen in a
> number of circumstances. You might have some wierd circumstance where
> /dev/mapper/part1 and /dev/mapper/part2 are glued together to make
> /dev/mapper/whole-filesystem. Why you might do this instead of simply
> using something like lvextend is beyond me, but that is something
> legitimate can be done with the low-level device mapper primitives.

There is worse scenario (thanks to Milan Broz from DM camp):

dmsetup create x --table "0 100 linear /dev/sdb 0"
dmsetup create y --table "0 100 linear /dev/mapper/x 0"
dmsetup create z --table "0 100 linear /dev/mapper/y 0"

# dmsetup ls --tree
z (254:3)
`-y (254:2)
`-x (254:1)
`- (8:16)

it means all these devices are exactly same, but

mount LABEL=foo

has to mount /dev/mapper/z (from top of the tree). The sdb, x and
y should be invisible for the mount(8).

> But, #1, there are times when picking a leaf dm device over a non-leaf
> dm device is not the right thing to do (which would be the case when
> you make a live snapshot of a filesystem), and #2, your patch only
> checks non-leaf dm devices for non-dm devices, probably because of #1.
>
> So with both of our patches, we have the problem where we could pick
> the wrong dm device if the user builds one dm device on top of another

I don't think so. The dm_probe_all() function never returns any
DM device which is slave to any other device. It means it always
returns the device from top of the hierarchy.

All devices from dm_probe_all() have greater priority than other
stuff from /proc/partitions (for example dm-N devs).

So back to your example...

/dev/mapper/part1 + /dev/mapper/part2 = /dev/mapper/whole-filesystem

the /dev/mapper/part1 and /dev/mapper/part2 will be visible for the
library (e.g. blkid.tab), but with *smaller priority* than
/dev/mapper/whole-filesystem.

In your non-libdevmapper implementation you need to check
/sys/block/dm-N/holders/ and prefer devices without holders.

I think we can ignore this minor problem for now. I'll try to found a
better solution for dependencies resolution without libdevmapper. My
wish is to avoid libdevmapper in libfsprobe.

> > > + if (dev) {
> > > + if (pri)
> > > + dev->bid_pri = pri;
> > > + else if (!strncmp(dev->bid_name, "/dev/mapper/", 11))

what about "if (major(devno) == DMMAJOR)" rather than strcmp()?

> > > + dev->bid_pri = BLKID_PRI_DM;
> >
> > the same problem
>
> This does work, because we do find the /dev/mapper name via a
> brute-force search of /dev looking for a matching devno when we call
> blkid_devno_to_devname(). What I *can* do is do a special search of
> /dev/mapper first, but instead of looking for /dev/mapper/<ptname>, to
> do a readdir search of /dev/mapper looking for the matching devno.

Not elegant, but... good enough :-)

It would be nice to have /sys/block/dm-N/name where you can translate
the internal dm-N name to the real device name. Alasdair? Milan? :-)

Karel

--
Karel Zak <[email protected]>

2008-08-26 23:32:27

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] blkid: optimize dm_device_is_leaf() usage

On Tue, Aug 26, 2008 at 10:47:37PM +0200, Karel Zak wrote:
> There is worse scenario (thanks to Milan Broz from DM camp):
>
> dmsetup create x --table "0 100 linear /dev/sdb 0"
> dmsetup create y --table "0 100 linear /dev/mapper/x 0"
> dmsetup create z --table "0 100 linear /dev/mapper/y 0"
>
> # dmsetup ls --tree
> z (254:3)
> `-y (254:2)
> `-x (254:1)
> `- (8:16)
>
> it means all these devices are exactly same, but
>
> mount LABEL=foo
>
> has to mount /dev/mapper/z (from top of the tree). The sdb, x and
> y should be invisible for the mount(8).

Sure, but consider what happens when you create a snapshot (either
read-only or read-write) of an existing filesystem? In that case,
both the parent and the child filesystem is mountable, and if the
child filesystem is transient, the praent one may not want to be
transient.

In fact, suppose the scenario is a virtualization scenario, where you
create the parent, create child snapshots, then use "tune2fs -U random
-L virt1 /dev/mapper/snap1" ; "tune2fs -U random -L virt2
/dev/mapper/snap2" and so on, so each of the child snapshots have
their own independent identity separate from the parent, it may very
*well* be the case that the parent device should be visible to mount.

I don't think we can make the general argument that the leaf device is
always mountable, and anything above it is *never* mountable. Maybe
that's the default, but it's certainly not always true.

I'm beginning the right answer is we need some assist from the device
mapper infrastructure, where when we create the device mapper device,
we specify whether or not it is mountable, and this information is
made available somehow, either by trying to sneaking it into
/proc/partitions (which will be tricky without breaking legacy
programs), or by making it visible in /sys.

> I think we can ignore this minor problem for now. I'll try to found a
> better solution for dependencies resolution without libdevmapper. My
> wish is to avoid libdevmapper in libfsprobe.

Speaking of which, what is your plan for caching versus non-caching in
libfsprobe? It seems to me that if you are going to be caching,
you'll just be re-inventing blkid. If you don't cache, you'll either
(a) have to iterate over all possible devices, which is what we did
before blkid (it was Ric Wheeler pointed out to me this problem and I
wrote blkid in response to his request, because it becomes a problem
if you have hundred of LUN's getting exported by a large EMC storage
array :-), or (b) do what vol_id does, which is depend on
/dev/disk/by-label and /dev/disk/by-uuid, which has the charming
Windows-like attribute of not getting updated until the next reboot
--- which means after you create a new filesystem or swap device on an
existing device, or change a label or UUID using tune2fs, vol_id never
notices until the next reboot or until you physically unplug and
reinsert the device.

Or is the answer that you expect libfsprobe to only do filesystem
type, uuid, and label detection, and not solve the "find a device
given a uuid/label" problem?

> > This does work, because we do find the /dev/mapper name via a
> > brute-force search of /dev looking for a matching devno when we call
> > blkid_devno_to_devname(). What I *can* do is do a special search of
> > /dev/mapper first, but instead of looking for /dev/mapper/<ptname>, to
> > do a readdir search of /dev/mapper looking for the matching devno.
>
> Not elegant, but... good enough :-)
>
> It would be nice to have /sys/block/dm-N/name where you can translate
> the internal dm-N name to the real device name. Alasdair? Milan? :-)

Or maybe the right answer is /proc/partitions should only export
devicemapper devices that are "supposed" to be visible to mount, and
instead of exporting dm-0, dm-1...., we export the real name via
/proc/partitions? Or do you not want to have the user-visible name
get pushed into the kernel?

- Ted

2008-08-27 00:19:50

by Karel Zak

[permalink] [raw]
Subject: Re: [PATCH] blkid: optimize dm_device_is_leaf() usage

On Tue, Aug 26, 2008 at 07:32:25PM -0400, Theodore Tso wrote:
> On Tue, Aug 26, 2008 at 10:47:37PM +0200, Karel Zak wrote:
> > There is worse scenario (thanks to Milan Broz from DM camp):
> >
> > dmsetup create x --table "0 100 linear /dev/sdb 0"
> > dmsetup create y --table "0 100 linear /dev/mapper/x 0"
> > dmsetup create z --table "0 100 linear /dev/mapper/y 0"
> >
> > # dmsetup ls --tree
> > z (254:3)
> > `-y (254:2)
> > `-x (254:1)
> > `- (8:16)
> >
> > it means all these devices are exactly same, but
> >
> > mount LABEL=foo
> >
> > has to mount /dev/mapper/z (from top of the tree). The sdb, x and
> > y should be invisible for the mount(8).

Well, s/invisible/less important/. Sorry.

> Sure, but consider what happens when you create a snapshot (either
> read-only or read-write) of an existing filesystem? In that case,

That's misunderstanding. I'm talking about LABEL/UUId resolution
where we need *priorities* for duplicate tags. I think dep-tree is
good enough for this purpose.

The snapshot (or arbitrary other device) is mountable when you
explicitly use device "mount /dev/mapper/the_snapshot".

> both the parent and the child filesystem is mountable, and if the
> child filesystem is transient, the praent one may not want to be
> transient.

[...]
> Speaking of which, what is your plan for caching versus non-caching in
> libfsprobe? It seems to me that if you are going to be caching,

Both. I think you remember our (+ Kay Sievers) discussion about it.
We need a library which provides both ways. The smart way (cache,
dependencies, ...) for mount(8) and others standard utils, and the
low-level way for udev (no cache, direct FS probing, ...).

> you'll just be re-inventing blkid. If you don't cache, you'll either

Hehe.. I will directly copy code from blkid and vol_id. It's open
source. I needn't re-inventing ;-)

> (a) have to iterate over all possible devices, which is what we did
> before blkid (it was Ric Wheeler pointed out to me this problem and I
> wrote blkid in response to his request, because it becomes a problem
> if you have hundred of LUN's getting exported by a large EMC storage
> array :-), or (b) do what vol_id does, which is depend on
> /dev/disk/by-label and /dev/disk/by-uuid, which has the charming
> Windows-like attribute of not getting updated until the next reboot

What about fix mkfs tools and send relevant events to udev?

> > It would be nice to have /sys/block/dm-N/name where you can translate
> > the internal dm-N name to the real device name. Alasdair? Milan? :-)
>
> Or maybe the right answer is /proc/partitions should only export
> devicemapper devices that are "supposed" to be visible to mount, and

Yes, I have no clue why the dm-N crap is in /proc/partitions.
Probably any legacy...

> instead of exporting dm-0, dm-1...., we export the real name via
> /proc/partitions? Or do you not want to have the user-visible name
> get pushed into the kernel?

I'm wrong person for these questions. DM guys are in CC: :-)

Karel

--
Karel Zak <[email protected]>

2008-08-27 01:21:23

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] blkid: optimize dm_device_is_leaf() usage

On Wed, Aug 27, 2008 at 02:19:42AM +0200, Karel Zak wrote:
>
> That's misunderstanding. I'm talking about LABEL/UUId resolution
> where we need *priorities* for duplicate tags. I think dep-tree is
> good enough for this purpose.

OK, so what you're saying is that that a leaf dm-device is always more
important (and should therefore have a higher priority) than a
non-leaf device.

But I'm not sure that is *still* not always the right thing to do.
Suppose someone creates a snapshot of a device in order to run e2fsck,
or to do create a coherent snapshot. Now suppose the machine crashes
while the snapshot still exists; even though the read-only snapshot is
the "leaf" device, you don't want to try use that snapshot to be
mounted as the root filesystem.

There are a number of solutions of course --- most of which do not
require adding more smarts into blkid (or some other probing library).
We could make the scripts that create the snapshots update the UUID
and LABEL of the snapshot, although that means adding some kind
filesystem-specific hook to lvcreate. Or we could create the concept
of "ephemeral snapshots" that don't survive a reboot. Or we could try
to mark certain LVM volumes with explicit priorities that would be
pulled into blkid (possibly via the dm interfaces). Or we could try
to put a lot of that smarts into the blkid library.

Personally, I like the idea of emphermal snapshots as the best
system-wide solution, but the point is we need to think about this not
from a single component's point of view, but what is the best solution
from a systems perspective.

> Both. I think you remember our (+ Kay Sievers) discussion about it.
> We need a library which provides both ways. The smart way (cache,
> dependencies, ...) for mount(8) and others standard utils, and the
> low-level way for udev (no cache, direct FS probing, ...).

Sure, but if that's the case, we already have most of the "smart way"
from blkid. What's the point of making fsprobe re-invent the caching
solution? I could just point blkid at fsprobe.

>
> What about fix mkfs tools and send relevant events to udev?
>

Well, for one, it means changing every single mkfs/mkswap/tune*fs
program on the system; so it seems more than a bit of kludge.
Sometimes these tools are run by an unprivileged user, so there are
some security problems have to be carefully thought out. Obviously
you can't just tell udev the new label and uuid, since the source
might be untrusted. The userspace program send an event to udevd that
a device has changed, but that means that you have to allow an
unprivileged process to kick udevd and then reprobe a device, which
means there is a possibility of a denial of service attack. It also
makes it easier to exploit any potential buffer overrun, since the
attacker can now set up the buggered block device image and then
politely ask udev to call fsprobe (possibly running with root prives)
to access the attack image.

Of course, if there is a security bug in a filesystem probing code,
we're in deep trouble if there are user-writable devices. A push
model just means that the fs probing code can get triggered at a time
of the attacker's choosing, as opposed at some point in the future.

I also don't think it's realistic to assume that we can sweep through
all of the possibile tool that creates or modifies filesystem/swap
superblocks in order for the tool to work correctly. If a tool sends
a hint that to udev so that the cache and /dev/disk/* can get updated,
that may be fine; but we shouldn't be dependent on that hint. (And
maybe system administrators or security officers will have policies
not allowing non-privileged users running the tool to send that hint.)

BTW, I think /dev/disk/by-label is a really nasty idea. Suppose you
have hundreds of LUN's and/or device mapper devices. Now suppose we
change the label, or become aware that the label of a device has
changed. Right now, if you don't know the previous label (which will
often be the case), the only way to install the new label and remove
the old label is to iterate over all of the symlinks in
/dev/disk/by-label and calling readlink() on each one. It's
convenient and maybe it's a nice to have on disktop systems, but if
you have a huge number of devices, it's not very scalable at all.

- Ted

2008-08-27 04:40:57

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] blkid: optimize dm_device_is_leaf() usage

> I think we can ignore this minor problem for now. I'll try to found a
> better solution for dependencies resolution without libdevmapper. My
> wish is to avoid libdevmapper in libfsprobe.

Here's a patch that I've been working on which gives a priority bonus
to dm leaf devices, without needing libdevmapper. As I've mentioned
before, I'm not 100% convinced this is always the right thing. But
it's probably a not-half-bad hueristic...

- Ted

diff --git a/lib/blkid/devname.c b/lib/blkid/devname.c
index 48a1dcc..2b3855b 100644
--- a/lib/blkid/devname.c
+++ b/lib/blkid/devname.c
@@ -25,6 +25,7 @@
#if HAVE_SYS_TYPES_H
#include <sys/types.h>
#endif
+#include <dirent.h>
#if HAVE_SYS_STAT_H
#include <sys/stat.h>
#endif
@@ -117,6 +118,38 @@ blkid_dev blkid_get_dev(blkid_cache cache, const char *devname, int flags)
/* Directories where we will try to search for device names */
static const char *dirlist[] = { "/dev", "/devfs", "/devices", NULL };

+static int is_dm_leaf(const char *devname)
+{
+ struct dirent *de, *d_de;
+ DIR *dir, *d_dir;
+ char path[256];
+ int ret = 1;
+
+ if ((dir = opendir("/sys/block")) == NULL)
+ return 0;
+ while ((de = readdir(dir)) != NULL) {
+ if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, "..") ||
+ !strcmp(de->d_name, devname) ||
+ strncmp(de->d_name, "dm-", 3) ||
+ strlen(de->d_name) > sizeof(path)-32)
+ continue;
+ sprintf(path, "/sys/block/%s/slaves", de->d_name);
+ if ((d_dir = opendir(path)) == NULL)
+ continue;
+ while ((d_de = readdir(d_dir)) != NULL) {
+ if (!strcmp(d_de->d_name, devname)) {
+ ret = 0;
+ break;
+ }
+ }
+ closedir(d_dir);
+ if (!ret)
+ break;
+ }
+ closedir(dir);
+ return ret;
+}
+
/*
* Probe a single block device to add to the device cache.
*/
@@ -180,9 +213,11 @@ set_pri:
if (dev) {
if (pri)
dev->bid_pri = pri;
- else if (!strncmp(dev->bid_name, "/dev/mapper/", 11))
+ else if (!strncmp(dev->bid_name, "/dev/mapper/", 11)) {
dev->bid_pri = BLKID_PRI_DM;
- else if (!strncmp(ptname, "md", 2))
+ if (is_dm_leaf(ptname))
+ dev->bid_pri += 5;
+ } else if (!strncmp(ptname, "md", 2))
dev->bid_pri = BLKID_PRI_MD;
}
return;
@@ -198,7 +233,6 @@ set_pri:
* safe thing to do?)
*/
#ifdef VG_DIR
-#include <dirent.h>
static dev_t lvm_get_devno(const char *lvm_device)
{
FILE *lvf;

2008-08-27 07:26:46

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] blkid: optimize dm_device_is_leaf() usage

On Aug 27, 2008 02:19 +0200, Karel Zak wrote:
> On Tue, Aug 26, 2008 at 07:32:25PM -0400, Theodore Tso wrote:
> > you'll just be re-inventing blkid. If you don't cache, you'll either
>
> Hehe.. I will directly copy code from blkid and vol_id. It's open
> source. I needn't re-inventing ;-)

Couldn't you just change libblkid to export the probe functions? It
always makes me cringe when code like this is copied, because I just
_know_ one or the other will become out of date, and it will take
twice as much effort to keep them in sync. I'd rather see people
doing "high value" work instead of watching for and copying patches
around.

> > Or maybe the right answer is /proc/partitions should only export
> > devicemapper devices that are "supposed" to be visible to mount, and
>
> Yes, I have no clue why the dm-N crap is in /proc/partitions.
> Probably any legacy...

I've wanted that for so long... It's always a pain that LVM tools
work by using /dev/vgfoo/lvfoo (which is fine) but /proc/partitions
doesn't give any clue to what each dm-X device is.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-08-27 08:10:35

by Karel Zak

[permalink] [raw]
Subject: Re: [PATCH] blkid: optimize dm_device_is_leaf() usage

On Wed, Aug 27, 2008 at 01:26:43AM -0600, Andreas Dilger wrote:
> On Aug 27, 2008 02:19 +0200, Karel Zak wrote:
> > On Tue, Aug 26, 2008 at 07:32:25PM -0400, Theodore Tso wrote:
> > > you'll just be re-inventing blkid. If you don't cache, you'll either
> >
> > Hehe.. I will directly copy code from blkid and vol_id. It's open
> > source. I needn't re-inventing ;-)
>
> Couldn't you just change libblkid to export the probe functions? It
> always makes me cringe when code like this is copied, because I just
> _know_ one or the other will become out of date, and it will take
> twice as much effort to keep them in sync. I'd rather see people
> doing "high value" work instead of watching for and copying patches
> around.

Yes, good point. Frankly, I think about it in last weeks. I have done
work on the low-lever part of libfsprobe -- it shouldn't be a problem
port this code to libblkid. The advantage will be a library that is
usable for udev and backwardly compatible for the current blkid
applications. I will try it...

Karel

--
Karel Zak <[email protected]>

2008-08-27 08:32:58

by Karel Zak

[permalink] [raw]
Subject: Re: [PATCH] blkid: optimize dm_device_is_leaf() usage

On Wed, Aug 27, 2008 at 12:40:54AM -0400, Theodore Tso wrote:
> > I think we can ignore this minor problem for now. I'll try to found a
> > better solution for dependencies resolution without libdevmapper. My
> > wish is to avoid libdevmapper in libfsprobe.
>
> Here's a patch that I've been working on which gives a priority bonus
> to dm leaf devices, without needing libdevmapper. As I've mentioned
> before, I'm not 100% convinced this is always the right thing. But
> it's probably a not-half-bad hueristic...
[...]
>
> +static int is_dm_leaf(const char *devname)
> +{
> + struct dirent *de, *d_de;
> + DIR *dir, *d_dir;
> + char path[256];
> + int ret = 1;
> +
> + if ((dir = opendir("/sys/block")) == NULL)
> + return 0;
> + while ((de = readdir(dir)) != NULL) {
> + if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, "..") ||
> + !strcmp(de->d_name, devname) ||
> + strncmp(de->d_name, "dm-", 3) ||
> + strlen(de->d_name) > sizeof(path)-32)
> + continue;
> + sprintf(path, "/sys/block/%s/slaves", de->d_name);
> + if ((d_dir = opendir(path)) == NULL)
> + continue;
> + while ((d_de = readdir(d_dir)) != NULL) {
> + if (!strcmp(d_de->d_name, devname)) {
> + ret = 0;
> + break;
> + }
> + }
> + closedir(d_dir);
> + if (!ret)
> + break;
> + }
> + closedir(dir);
> + return ret;
> +}

Seems good. This patch and the brute-force for conversion from dm-N to
real names is the way how replace the HAVE_DEVMAPPER crap in libblkid.
Thanks!

Karel

--
Karel Zak <[email protected]>