This patch adds ZFS filesystem detection to libblkid.
It probes for VDEV_BOOT_MAGIC in the first 2 ZFS labels in big-endian
and little-endian formats.
Unfortunately the probe table doesn't support probing from the end of
the device; otherwise we could also probe the 3rd and 4th labels (in
case the first 2 labels were accidentally overwritten).
Eventually we would set the UUID from the ZFS pool GUID and the LABEL tag
from the pool name, but that requires parsing an XDR encoding of the pool
configuration, which is not trivial.
Signed-off-by: Ricardo M. Correia <[email protected]>
Signed-off-by: Andreas Dilger <[email protected]>
---
diff --git a/lib/blkid/probe.c b/lib/blkid/probe.c
index c8bc840..cad0860 100644
--- a/lib/blkid/probe.c
+++ b/lib/blkid/probe.c
@@ -774,6 +774,18 @@ static int probe_jfs(struct blkid_probe *probe,
return 0;
}
+static int probe_zfs(struct blkid_probe *probe, struct blkid_magic *id,
+ unsigned char *buf)
+{
+ char *vdev_label;
+ const char *pool_name = 0;
+
+ /* read nvpair data for pool name, pool GUID (complex) */
+ //blkid_set_tag(probe->dev, "LABEL", pool_name, sizeof(pool_name));
+ //set_uuid(probe->dev, pool_guid, 0);
+ return 0;
+}
+
static int probe_luks(struct blkid_probe *probe,
struct blkid_magic *id __BLKID_ATTR((unused)),
unsigned char *buf)
@@ -1095,15 +1107,6 @@ static int probe_lvm2(struct blkid_probe *probe,
return 0;
}
/*
- * BLKID_BLK_OFFS is at least as large as the highest bim_kboff defined
- * in the type_array table below + bim_kbalign.
- *
- * When probing for a lot of magics, we handle everything in 1kB buffers so
- * that we don't have to worry about reading each combination of block sizes.
- */
-#define BLKID_BLK_OFFS 64 /* currently reiserfs */
-
-/*
 * Various filesystem magics that we can check for. Note that kboff and
 * sboff are in kilobytes and bytes respectively. All magics are in
 * byte strings so we don't worry about endian issues.
@@ -1153,6 +1156,10 @@ static struct blkid_magic type_array[] = {
{ "iso9660", 32, 1, 5, "CD001", probe_iso9660 },
{ "iso9660", 32, 9, 5, "CDROM", probe_iso9660 },
{ "jfs", 32, 0, 4, "JFS1", probe_jfs },
+ { "zfs", 8, 0, 8, "\0\0\x02\xf5\xb0\x07\xb1\x0c",
probe_zfs },
+ { "zfs", 8, 0, 8, "\x0c\xb1\x07\xb0\xf5\x02\0\0",
probe_zfs },
+ { "zfs", 264, 0, 8, "\0\0\x02\xf5\xb0\x07\xb1\x0c",
probe_zfs },
+ { "zfs", 264, 0, 8, "\x0c\xb1\x07\xb0\xf5\x02\0\0",
probe_zfs },
{ "hfsplus", 1, 0, 2, "BD", probe_hfsplus },
{ "hfsplus", 1, 0, 2, "H+", 0 },
{ "hfs", 1, 0, 2, "BD", 0 },
@@ -1288,7 +1295,7 @@ try_again:
if (!buf)
continue;
- if (memcmp(id->bim_magic, buf + (id->bim_sboff&0x3ff),
+ if (memcmp(id->bim_magic, buf + (id->bim_sboff & 0x3ff),
id->bim_len))
continue;
@@ -1318,7 +1325,7 @@ try_again:
dev = 0;
goto found_type;
}
-
+
found_type:
if (dev && type) {
dev->bid_devno = st.st_rdev;
@@ -1327,7 +1334,7 @@ found_type:
cache->bic_flags |= BLKID_BIC_FL_CHANGED;
blkid_set_tag(dev, "TYPE", type, 0);
-
+
DBG(DEBUG_PROBE, printf("%s: devno 0x%04llx, type %s\n",
dev->bid_name, (long long)st.st_rdev, type));
}
On Fri, May 30, 2008 at 07:20:47PM +0100, Ricardo M. Correia wrote:
> This patch adds ZFS filesystem detection to libblkid.
It'd be nice to add ZFS detection to libvolume_id from the udev package.
> Unfortunately the probe table doesn't support probing from the end of
> the device; otherwise we could also probe the 3rd and 4th labels (in
> case the first 2 labels were accidentally overwritten).
This shouldn't be a problem in libvolume_id.
> +static int probe_zfs(struct blkid_probe *probe, struct blkid_magic *id,
> + unsigned char *buf)
> +{
> + char *vdev_label;
> + const char *pool_name = 0;
> +
> + /* read nvpair data for pool name, pool GUID (complex) */
> + //blkid_set_tag(probe->dev, "LABEL", pool_name, sizeof(pool_name));
> + //set_uuid(probe->dev, pool_guid, 0);
C++ ?
> - if (memcmp(id->bim_magic, buf + (id->bim_sboff&0x3ff),
> + if (memcmp(id->bim_magic, buf + (id->bim_sboff & 0x3ff),
???
Karel
--
Karel Zak <[email protected]>
Hi Karel,
On Mon, 2008-06-02 at 09:41 +0200, Karel Zak wrote:
> It'd be nice to add ZFS detection to libvolume_id from the udev package.
Interesting, I didn't know libvolume_id existed.
Is there a reason why there should be 2 different libraries doing what
appears to be essentially the same thing?
> > + /* read nvpair data for pool name, pool GUID (complex) */
> > + //blkid_set_tag(probe->dev, "LABEL", pool_name, sizeof(pool_name));
> > + //set_uuid(probe->dev, pool_guid, 0);
>
> C++ ?
Actually, those are valid C99 tokens for comments. I don't see any style
guide in the source of e2fsprogs, and they do appear to be used in some
places (like lib/uuid/uuid_gen_nt.c), but if necessary that can be
easily changed.
> > - if (memcmp(id->bim_magic, buf + (id->bim_sboff&0x3ff),
> > + if (memcmp(id->bim_magic, buf + (id->bim_sboff & 0x3ff),
>
> ???
That's unrelated to the purpose of this patch; it seems to be just a
style fix (inherited from Andreas' patch) :-)
Thanks,
Ricardo
--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
Phone +351.214134023 / x58723
Mobile +351.912590825
Email [email protected]
On Mon, Jun 02, 2008 at 04:01:03PM +0100, Ricardo M. Correia wrote:
> Hi Karel,
>
> On Mon, 2008-06-02 at 09:41 +0200, Karel Zak wrote:
>
> > It'd be nice to add ZFS detection to libvolume_id from the udev package.
>
>
> Interesting, I didn't know libvolume_id existed.
> Is there a reason why there should be 2 different libraries doing what
> appears to be essentially the same thing?
I'm just merging these two libraries into libfsprobe :-)
http://git.kernel.org/?p=utils/util-linux-ng/util-linux-ng.git;a=shortlog;h=topic/fsprobe
Karel
--
Karel Zak <[email protected]>
On Jun 02, 2008 22:58 +0200, Karel Zak wrote:
> On Mon, Jun 02, 2008 at 04:01:03PM +0100, Ricardo M. Correia wrote:
> > Hi Karel,
> >
> > On Mon, 2008-06-02 at 09:41 +0200, Karel Zak wrote:
> >
> > > It'd be nice to add ZFS detection to libvolume_id from the udev package.
> >
> >
> > Interesting, I didn't know libvolume_id existed.
> > Is there a reason why there should be 2 different libraries doing what
> > appears to be essentially the same thing?
>
> I'm just merging these two libraries into libfsprobe :-)
> http://git.kernel.org/?p=utils/util-linux-ng/util-linux-ng.git;a=shortlog;h=topic/fsprobe
Hrm, having "fs" in the name is a bit unfortunate, since several of the
things that libblkid detects are not filesystems (e.g. MD, swap).
Similarly, libvolume_id also does not contain "fs" in the name...
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Mon, Jun 02, 2008 at 03:38:48PM -0600, Andreas Dilger wrote:
> On Jun 02, 2008 22:58 +0200, Karel Zak wrote:
> > On Mon, Jun 02, 2008 at 04:01:03PM +0100, Ricardo M. Correia wrote:
> > > Hi Karel,
> > >
> > > On Mon, 2008-06-02 at 09:41 +0200, Karel Zak wrote:
> > >
> > > > It'd be nice to add ZFS detection to libvolume_id from the udev package.
> > >
> > >
> > > Interesting, I didn't know libvolume_id existed.
> > > Is there a reason why there should be 2 different libraries doing what
> > > appears to be essentially the same thing?
> >
> > I'm just merging these two libraries into libfsprobe :-)
> > http://git.kernel.org/?p=utils/util-linux-ng/util-linux-ng.git;a=shortlog;h=topic/fsprobe
>
> Hrm, having "fs" in the name is a bit unfortunate, since several of the
> things that libblkid detects are not filesystems (e.g. MD, swap).
> Similarly, libvolume_id also does not contain "fs" in the name...
Any suggestion? (Frankly, I don't care about names...)
Karel
--
Karel Zak <[email protected]>
On Jun 03, 2008 00:31 +0200, Karel Zak wrote:
> On Mon, Jun 02, 2008 at 03:38:48PM -0600, Andreas Dilger wrote:
> > On Jun 02, 2008 22:58 +0200, Karel Zak wrote:
> > > On Mon, Jun 02, 2008 at 04:01:03PM +0100, Ricardo M. Correia wrote:
> > > > Hi Karel,
> > > >
> > > > On Mon, 2008-06-02 at 09:41 +0200, Karel Zak wrote:
> > > >
> > > > > It'd be nice to add ZFS detection to libvolume_id from the udev package.
> > > >
> > > >
> > > > Interesting, I didn't know libvolume_id existed.
> > > > Is there a reason why there should be 2 different libraries doing what
> > > > appears to be essentially the same thing?
> > >
> > > I'm just merging these two libraries into libfsprobe :-)
> > > http://git.kernel.org/?p=utils/util-linux-ng/util-linux-ng.git;a=shortlog;h=topic/fsprobe
> >
> > Hrm, having "fs" in the name is a bit unfortunate, since several of the
> > things that libblkid detects are not filesystems (e.g. MD, swap).
> > Similarly, libvolume_id also does not contain "fs" in the name...
>
> Any suggestion? (Frankly, I don't care about names...)
To be honest, I'd prefer if you just kept the existing name(s) and merged
the functionality of the two libraries into one. That avoids the need to
re-code either set of applications, and also avoids the risk of yet
_another_ library, in addition to libblkid and libvolume_id, that is used
in a few places each.
Failing that, how about "libdevprobe"?
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Mon, Jun 02, 2008 at 10:58:09PM +0200, Karel Zak wrote:
>
> I'm just merging these two libraries into libfsprobe :-)
> http://git.kernel.org/?p=utils/util-linux-ng/util-linux-ng.git;a=shortlog;h=topic/fsprobe
>
Note that libfsprobe does not currently have proper support for ext4
and ext4dev filesystem types.
- Ted
On Fri, May 30, 2008 at 07:20:47PM +0100, Ricardo M. Correia wrote:
> This patch adds ZFS filesystem detection to libblkid.
>
> It probes for VDEV_BOOT_MAGIC in the first 2 ZFS labels in big-endian
> and little-endian formats.
Applied, with a few minor changes.
- Ted
Ricardo M. Correia wrote:
> This patch adds ZFS filesystem detection to libblkid.
>
> It probes for VDEV_BOOT_MAGIC in the first 2 ZFS labels in big-endian
> and little-endian formats.
> Unfortunately the probe table doesn't support probing from the end of
> the device; otherwise we could also probe the 3rd and 4th labels (in
> case the first 2 labels were accidentally overwritten).
>
> Eventually we would set the UUID from the ZFS pool GUID and the LABEL tag
> from the pool name, but that requires parsing an XDR encoding of the pool
> configuration, which is not trivial.
Hi Ricardo:
I've got a bug on fedora now where a partition w/ zfs on it is being
recognized as ext3. VDEV_BOOT_MAGIC isn't anywhere in the first 1M of
the device, though an old ext3 superblock is intact. :(
# hexdump -C sda2-first-1M | grep "0c b1 07 b0 f5 02 00 00"
# hexdump -C sda2-first-1M | grep "00 00 02 f5 b0 07 b1 0c"
#
(so a plea - can you guys change mkfs to zero old signatures? maybe
256k at the front & end of the device in question?)
there is apparently an uberblock signature in it though:
# hexdump -C sda2-first-1M | grep "0c b1 ba 00"
00015e30 30 e8 ff ff 85 c0 5b 75 48 8b 06 35 0c b1 ba 00
does blkid need more work here, I guess?
Thanks,
-Eric
Eric Sandeen wrote:
> Ricardo M. Correia wrote:
>> This patch adds ZFS filesystem detection to libblkid.
>>
>> It probes for VDEV_BOOT_MAGIC in the first 2 ZFS labels in big-endian
>> and little-endian formats.
>> Unfortunately the probe table doesn't support probing from the end of
>> the device; otherwise we could also probe the 3rd and 4th labels (in
>> case the first 2 labels were accidentally overwritten).
>>
>> Eventually we would set the UUID from the ZFS pool GUID and the LABEL tag
>> from the pool name, but that requires parsing an XDR encoding of the pool
>> configuration, which is not trivial.
>
> Hi Ricardo:
>
> I've got a bug on fedora now where a partition w/ zfs on it is being
> recognized as ext3. VDEV_BOOT_MAGIC isn't anywhere in the first 1M of
> the device, though an old ext3 superblock is intact. :(
>
> # hexdump -C sda2-first-1M | grep "0c b1 07 b0 f5 02 00 00"
> # hexdump -C sda2-first-1M | grep "00 00 02 f5 b0 07 b1 0c"
> #
>
> (so a plea - can you guys change mkfs to zero old signatures? maybe
> 256k at the front & end of the device in question?)
>
> there is apparently an uberblock signature in it though:
>
> # hexdump -C sda2-first-1M | grep "0c b1 ba 00"
> 00015e30 30 e8 ff ff 85 c0 5b 75 48 8b 06 35 0c b1 ba 00
>
> does blkid need more work here, I guess?
And from another report, another user's zfs partition:
# hexdump -C first_400K | grep "0c b1 07 b0 f5 02 00 00"
# hexdump -C first_400K | grep "00 00 02 f5 b0 07 b1 0c"
# hexdump -C first_400K | grep "0c b1 ba 00"
00015e30 30 e8 ff ff 85 c0 5b 75 48 8b 06 35 0c b1 ba 00
Should we be looking for 0x00bab10c at offset 00015e30?
Thanks,
-Eric
On Apr 04, 2009 08:04 -0500, Eric Sandeen wrote:
> And from another report, another user's zfs partition:
>
> # hexdump -C first_400K | grep "0c b1 07 b0 f5 02 00 00"
> # hexdump -C first_400K | grep "00 00 02 f5 b0 07 b1 0c"
> # hexdump -C first_400K | grep "0c b1 ba 00"
> 00015e30 30 e8 ff ff 85 c0 5b 75 48 8b 06 35 0c b1 ba 00
>
> Should we be looking for 0x00bab10c at offset 00015e30?
In probe.c I see the magic (that I've been assuming is correct, because
it isn't really that easy to read):
{ "zfs", 8, 0, 8, "\0\0\x02\xf5\xb0\x07\xb1\x0c", probe_zfs },
{ "zfs", 8, 0, 8, "\x0c\xb1\x07\xb0\xf5\x02\0\0", probe_zfs },
{ "zfs", 264, 0, 8, "\0\0\x02\xf5\xb0\x07\xb1\x0c", probe_zfs },
{ "zfs", 264, 0, 8, "\x0c\xb1\x07\xb0\xf5\x02\0\0", probe_zfs },
but looking at this I have no idea what this magic value is supposed
to represent, from reading the ondiskformat0822.pdf (ZFS spec) document.
According to that, the __u64 ub_magic is supposed to be 0x0000000000bab10c
(big endian) or 0x0cb1ba0000000000 (little endian), at a 128kB offset and
then at each of the 128 1kB slots thereafter (assuming the filesystem has
been used enough to write 128 transactions). This should repeat again at
256kB + 128kB.
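To make that layout concrete, a minimal standalone scan (an untested
sketch, independent of libblkid) would look for ub_magic in either byte
order across both arrays:

	#include <string.h>
	#include <unistd.h>

	static const unsigned char ub_be[8] = { 0, 0, 0, 0, 0, 0xba, 0xb1, 0x0c };
	static const unsigned char ub_le[8] = { 0x0c, 0xb1, 0xba, 0, 0, 0, 0, 0 };

	/* labels 0 and 1 start at 0 and 256kB; each uberblock array starts
	 * 128kB into the label and holds 128 slots of 1kB */
	static int zfs_scan_uberblocks(int fd)
	{
		unsigned char buf[8];
		int label, slot;

		for (label = 0; label < 2; label++)
			for (slot = 0; slot < 128; slot++) {
				off_t off = label * 262144 + 131072 + slot * 1024;

				if (pread(fd, buf, 8, off) != 8)
					continue;
				if (!memcmp(buf, ub_be, 8) || !memcmp(buf, ub_le, 8))
					return 1;	/* found ub_magic */
			}
		return 0;
	}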
In the zfs.img.bz2 that I have (and sent Karel recently for util-linux-ng)
I don't see the ub_magic at the right offset. The NVpair data IS
at the right 16kB offset, and contains all of the expected data (version,
name, pool_guid, etc) so it isn't just an issue of the device having the
wrong offset. The ub_magic is at 0x21000, 0x21400, ..., and for the
second überblock at 0x61000, 0x61400, so this is off by 0x1000 or 4kB.
I _suppose_ there is no hard requirement that the ub_magic is present in
the first überblock slot at 128kB, but that does make it harder to find.
In theory we would need to add 256 magic value checks, which seems
unreasonable. Ricardo, do you know why the zfs.img.bz2 has bad überblocks
for the first 4 slots?
In any case, the ZFS magic above is completely broken and needs fixing,
as does the patch I sent Karel for util-linux-ng. For e2fsprogs it seems
it should be something like the following:
{ "zfs", 128, 0, 8, "\0\0\0\0\0\xba\xb1\xc",probe_zfs },
{ "zfs", 128, 0, 8, "\xc\xb1\xba\0\0\0\0\0",probe_zfs },
{ "zfs", 132, 0, 8, "\0\0\0\0\0\xba\xb1\xc",probe_zfs },
{ "zfs", 132, 0, 8, "\xc\xb1\xba\0\0\0\0\0",probe_zfs },
{ "zfs", 136, 0, 8, "\0\0\0\0\0\xba\xb1\xc",probe_zfs },
{ "zfs", 136, 0, 8, "\xc\xb1\xba\0\0\0\0\0",probe_zfs },
{ "zfs", 384, 0, 8, "\0\0\0\0\0\xba\xb1\xc",probe_zfs },
{ "zfs", 384, 0, 8, "\xc\xb1\xba\0\0\0\0\0",probe_zfs },
{ "zfs", 388, 0, 8, "\0\0\0\0\0\xba\xb1\xc",probe_zfs },
{ "zfs", 388, 0, 8, "\xc\xb1\xba\0\0\0\0\0",probe_zfs },
{ "zfs", 392, 0, 8, "\0\0\0\0\0\xba\xb1\xc",probe_zfs },
{ "zfs", 392, 0, 8, "\xc\xb1\xba\0\0\0\0\0",probe_zfs },
and for util-linux-ng it should be like the following:
{ .magic = "\0\0\0\0\0\xba\xb1\x0c", .len = 8, .kboff = 128, .sboff = 0 },
{ .magic = "\x0c\xb1\xba\0\0\0\0\0", .len = 8, .kboff = 128, .sboff = 0 },
{ .magic = "\0\0\0\0\0\xba\xb1\x0c", .len = 8, .kboff = 132, .sboff = 0 },
{ .magic = "\x0c\xb1\xba\0\0\0\0\0", .len = 8, .kboff = 132, .sboff = 0 },
{ .magic = "\0\0\0\0\0\xba\xb1\x0c", .len = 8, .kboff = 136, .sboff = 0 },
{ .magic = "\x0c\xb1\xba\0\0\0\0\0", .len = 8, .kboff = 136, .sboff = 0 },
{ .magic = "\0\0\0\0\0\xba\xb1\x0c", .len = 8, .kboff = 384, .sboff = 0 },
{ .magic = "\x0c\xb1\xba\0\0\0\0\0", .len = 8, .kboff = 384, .sboff = 0 },
{ .magic = "\0\0\0\0\0\xba\xb1\x0c", .len = 8, .kboff = 388, .sboff = 0 },
{ .magic = "\x0c\xb1\xba\0\0\0\0\0", .len = 8, .kboff = 388, .sboff = 0 },
{ .magic = "\0\0\0\0\0\xba\xb1\x0c", .len = 8, .kboff = 392, .sboff = 0 },
{ .magic = "\x0c\xb1\xba\0\0\0\0\0", .len = 8, .kboff = 392, .sboff = 0 },
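(In both tables the offsets are überblock slots 0, 4 and 8 of the arrays
that start at 128kB and 384kB, i.e. in the first and second labels.)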
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas Dilger wrote:
> On Apr 04, 2009 08:04 -0500, Eric Sandeen wrote:
>> And from another report, another user's zfs partition:
>>
>> # hexdump -C first_400K | grep "0c b1 07 b0 f5 02 00 00"
>> # hexdump -C first_400K | grep "00 00 02 f5 b0 07 b1 0c"
>> # hexdump -C first_400K | grep "0c b1 ba 00"
>> 00015e30 30 e8 ff ff 85 c0 5b 75 48 8b 06 35 0c b1 ba 00
>>
>> Should we be looking for 0x00bab10c at offset 00015e30?
>
> In probe.c I see the magic (that I've been assuming is correct, because
> it isn't really that easy to read):
>
> { "zfs", 8, 0, 8, "\0\0\x02\xf5\xb0\x07\xb1\x0c", probe_zfs },
> { "zfs", 8, 0, 8, "\x0c\xb1\x07\xb0\xf5\x02\0\0", probe_zfs },
> { "zfs", 264, 0, 8, "\0\0\x02\xf5\xb0\x07\xb1\x0c", probe_zfs },
> { "zfs", 264, 0, 8, "\x0c\xb1\x07\xb0\xf5\x02\0\0", probe_zfs },
Yep and that's different from the patch you originally submitted, Andreas...
> but looking at this I have no idea what this magic value is supposed
> to represent, from reading the ondiskformat0822.pdf (ZFS spec) document.
VDEV_BOOT_MAGIC ...
333 /* ZFS boot block */
334 #define VDEV_BOOT_MAGIC 0x2f5b007b10cULL
but I can't find much in that spec about where it's supposed to be, at
least from a very quick skim.
I'll let you IBM er... SUN guys sort it out ;)
-Eric
On Apr 04, 2009 16:46 -0500, Eric Sandeen wrote:
> Andreas Dilger wrote:
> > On Apr 04, 2009 08:04 -0500, Eric Sandeen wrote:
> >> And from another report, another user's zfs partition:
> >>
> >> # hexdump -C first_400K | grep "0c b1 07 b0 f5 02 00 00"
> >> # hexdump -C first_400K | grep "00 00 02 f5 b0 07 b1 0c"
> >> # hexdump -C first_400K | grep "0c b1 ba 00"
> >> 00015e30 30 e8 ff ff 85 c0 5b 75 48 8b 06 35 0c b1 ba 00
> >>
> >> Should we be looking for 0x00bab10c at offset 00015e30?
> >
> > In probe.c I see the magic (that I've been assuming is correct, because
> > it isn't really that easy to read):
> >
> > { "zfs", 8, 0, 8, "\0\0\x02\xf5\xb0\x07\xb1\x0c", probe_zfs },
> > { "zfs", 8, 0, 8, "\x0c\xb1\x07\xb0\xf5\x02\0\0", probe_zfs },
> > { "zfs", 264, 0, 8, "\0\0\x02\xf5\xb0\x07\xb1\x0c", probe_zfs },
> > { "zfs", 264, 0, 8, "\x0c\xb1\x07\xb0\xf5\x02\0\0", probe_zfs },
>
> Yep and that's different from the patch you originally submitted, Andreas...
>
> > but looking at this I have no idea what this magic value is supposed
> > to represent, from reading the ondiskformat0822.pdf (ZFS spec) document.
>
> VDEV_BOOT_MAGIC ...
>
> 333 /* ZFS boot block */
> 334 #define VDEV_BOOT_MAGIC 0x2f5b007b10cULL
>
> but I can't find much in that spec about where it's supposed to be, at
> least from a very quick skim.
According to ondiskformat0822.pdf this is supposed to be at an 8kB
offset in each of the 4 per-disk labels, but it isn't clear that this
will be in every ZFS filesystem:
Section 1.3.2: Boot Block Header
The boot block header is an 8K structure that is reserved for
future use. The contents of this block will be described in a
future appendix of this paper.
so I'd prefer to just stick with the überblock magic, which should
be present in all versions of ZFS.
> I'll let you IBM er... SUN guys sort it out ;)
I'll defer to Ricardo, he is the ZFS expert.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Hi,
On Sat, 2009-04-04 at 15:25 -0600, Andreas Dilger wrote:
> I _suppose_ there is no hard requirement that the ub_magic is present in
> the first überblock slot at 128kB, but that does make it harder to find.
> In theory we would need to add 256 magic value checks, which seems
> unreasonable. Ricardo, do you know why the zfs.img.bz2 has bad überblocks
> for the first 4 slots?
Your supposition is correct - there's no requirement that the first
uberblock that gets written to the uberblock array has to be in the
first slot.
The reason that this image has bad uberblocks in the first 4 slots is
that, in the current ZFS implementation, when you create a ZFS pool, the
first uberblock that gets written to disk has txg number 4, and the slot
that gets chosen for each uberblock is "txg_nr % nr_of_uberblock_slots".
So in fact, it's not that the first 4 uberblocks are bad, it's just that
the first 4 slots don't have any uberblocks in them yet.
However, even though currently it's txg nr 4 that gets written first,
this is an implementation-specific detail that we cannot (or should not)
rely upon.
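As a cross-check against your image: with 128 1kB slots per array, txg 4
lands in slot 4 % 128 = 4, i.e. at 128kB + 4kB = 0x21000 in the first
label and at 256kB + 128kB + 4kB = 0x61000 in the second - exactly the
offsets where you found the ub_magic.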
So I think you're (mostly) right - in theory, a correct implementation
would have to search all the uberblock slots in all 4 labels (2 at
the beginning of the partition and 2 at the end), for a total of 512
magic offsets, but this is not easy to do with libblkid because it only
looks for the magic values at hard-coded offsets (as opposed to being
able to implement a routine that looks for a filesystem, which could use
a simple "for" statement).
This is why I decided to change your patch to look for VDEV_BOOT_MAGIC,
which I assumed was always there in the same place, but apparently this
does not seem to be the case.
Eric, do you know how this ZFS pool/filesystem was created?
Specifically, which Solaris/OpenSolaris version/build, or maybe zfs-fuse
version? Also, details about which partitioning scheme is being used and
whether this is a root pool would also help a lot.
BTW, I also agree that it would be useful for ext3's mkfs to zero-out
the first and last 512 KB of the partition, to get rid of the ZFS labels
and magic values, although if it detects these magic values, it would be
quite useful for mkfs to refuse to format the partition, forcing the
user to specify some "--force" flag (like "zpool create" does), or at
least ask the user for confirmation (if mkfs is being used in
interactive mode), to avoid accidental data destruction.
If this is not done, then maybe leaving the ZFS labels intact could be
better, so that the user has a chance to recover (some/most) of his
data, in case he made a mistake.
Cheers,
Ricardo
Ricardo M. Correia wrote:
...
> Eric, do you know how this ZFS pool/filesystem was created?
> Specifically, which Solaris/OpenSolaris version/build, or maybe zfs-fuse
> version? Also, details about which partitioning scheme is being used and
> whether this is a root pool would also help a lot.
Can you perhaps just chime in on these bugs & ask? You speak zfs better
than I do...
https://bugzilla.redhat.com/show_bug.cgi?id=494070
https://bugzilla.redhat.com/show_bug.cgi?id=490795
Thanks,
-Eric
On Mon, Apr 06, 2009 at 08:22:38PM +0100, Ricardo M. Correia wrote:
> So I think you're (mostly) right - in theory, a correct implementation
> would have to search all the uberblock slots in all 4 labels (2 at
> the beginning of the partition and 2 at the end), for a total of 512
> magic offsets, but this is not easy to do with libblkid because it only
> looks for the magic values at hard-coded offsets (as opposed to being
> able to implement a routine that looks for a filesystem, which could use
> a simple "for" statement).
Note that the new libblkid version (in util-linux-ng) could use a
simple "for" statement (see ufs.c or *_raid.c), but the "magic value
at hard-coded offset" is a better (cheaper) method.
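For illustration, such a loop could look something like this (an untested
sketch -- it assumes the blkid_probe_get_buffer() helper that the other
probers use; the exact names in the fsprobe branch may differ):

	static int probe_zfs(blkid_probe pr, const struct blkid_idmag *mag)
	{
		int label, slot;

		for (label = 0; label < 2; label++)
			for (slot = 0; slot < 128; slot++) {
				/* uberblock arrays at 128kB and at 256kB + 128kB */
				unsigned char *ub = blkid_probe_get_buffer(pr,
						label * 262144 + 131072 + slot * 1024, 8);

				if (!ub)
					continue;
				if (!memcmp(ub, "\0\0\0\0\0\xba\xb1\x0c", 8) ||
				    !memcmp(ub, "\x0c\xb1\xba\0\0\0\0\0", 8))
					return 0;	/* found an uberblock */
			}
		return 1;	/* nothing found */
	}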
Karel
--
Karel Zak <[email protected]>
On Apr 06, 2009 20:22 +0100, Ricardo Correia wrote:
> On Sat, 2009-04-04 at 15:25 -0600, Andreas Dilger wrote:
> > I _suppose_ there is no hard requirement that the ub_magic is present in
> > the first überblock slot at 128kB, but that does make it harder to find.
> > In theory we would need to add 256 magic value checks, which seems
> > unreasonable. Ricardo, do you know why the zfs.img.bz2 has bad überblocks
> > for the first 4 slots?
>
> Your supposition is correct - there's no requirement that the first
> uberblock that gets written to the uberblock array has to be in the
> first slot.
>
> The reason that this image has bad uberblocks in the first 4 slots is
> that, in the current ZFS implementation, when you create a ZFS pool, the
> first uberblock that gets written to disk has txg number 4, and the slot
> that gets chosen for each uberblock is "txg_nr % nr_of_uberblock_slots".
>
> So in fact, it's not that the first 4 uberblocks are bad, it's just that
> the first 4 slots don't have any uberblocks in them yet.
>
> However, even though currently it's txg nr 4 that gets written first,
> this is an implementation-specific detail that we cannot (or should not)
> rely upon.
So my proposal to check the 0th, 4th, and 8th überblock in both
the first and second VDEV label should be pretty safe.
> BTW, I also agree that it would be useful for ext3's mkfs to zero-out
> the first and last 512 KB of the partition, to get rid of the ZFS labels
> and magic values, although if it detects these magic values, it would be
> quite useful for mkfs to refuse to format the partition, forcing the
> user to specify some "--force" flag (like "zpool create" does), or at
> least ask the user for confirmation (if mkfs is being used in
> interactive mode), to avoid accidental data destruction.
>
> If this is not done, then maybe leaving the ZFS labels intact could be
> better, so that the user has a chance to recover (some/most) of his
> data, in case he made a mistake.
Well, if ZFS is currently using this filesystem, then the kernel will
have the block device open O_EXCL, which will prevent the mkfs from
happening.
Whether it will be a feature to "--force" mkfs to overwrite an ext3
superblock or ZFS superblock is questionable. The problem with needing
"--force" is that people tend to hard-code this into their scripts
(so that their script always works) and then due to only having a single
"--force" flag it also forces other, possibly more destructive, behaviour
(e.g. --force is needed to mke2fs on a file so that it can be mounted
via loopback, but will also force mke2fs on a filesystem that actually
IS in use, etc).
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Hi Eric,
On Mon, 2009-04-06 at 13:13 -0700, Eric Sandeen wrote:
> Can you perhaps just chime in on these bugs & ask? You speak zfs better
> than I do...
>
> https://bugzilla.redhat.com/show_bug.cgi?id=494070
> https://bugzilla.redhat.com/show_bug.cgi?id=490795
Sorry for taking a while to get back to you.
I've just looked at these bugs and everything seems to make more sense
now. Those partitions are actually Solaris partitions, which contain
their own partition table inside.
So in bug 494070, the /dev/sda2 partition is internally subdivided into
what Solaris calls "slices" (which are similar to partitions) and then
the root ZFS pool/filesystem is stored inside one of these slices. So in
fact, the ZFS pool is not directly in /dev/sda2, it's somewhere inside
it. This is why the ZFS magic numbers don't seem to be in the right
place.
These slices don't seem to be visible to that Linux system, which I
suspect is due to the kernel not being compiled with Solaris partition
table support. If it were, other partitions (including the ZFS one)
would show up (if my assumption is correct).
So from looking at the libblkid magic offsets, it seems that the ext3 magic
value is stored between 1K and 2K, which means that it will fall into the
boot slice, not in the ZFS slice. So AFAICS this bug doesn't have
anything to do with ZFS, i.e., the same thing would happen if the root
filesystem of the OpenSolaris installation was UFS.
It'd be nice if OpenSolaris zeroed the boot slice when it is installed,
but on the other hand, should the Anaconda installer fail just because
it can't mount a (possibly corrupted/leftover) filesystem?
I can file a bug for OpenSolaris if you feel strongly about this.
Thanks,
Ricardo
Ricardo M. Correia wrote:
> Hi Eric,
>
> On Mon, 2009-04-06 at 13:13 -0700, Eric Sandeen wrote:
>> Can you perhaps just chime in on these bugs & ask? You speak zfs better
>> than I do...
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=494070
>> https://bugzilla.redhat.com/show_bug.cgi?id=490795
>
> Sorry for taking a while to get back to you.
>
> I've just looked at these bugs and everything seems to make more sense
> now. Those partitions are actually Solaris partitions, which contain
> their own partition table inside.
>
> So in bug 494070, the /dev/sda2 partition is internally subdivided into
> what Solaris calls "slices" (which are similar to partitions) and then
> the root ZFS pool/filesystem is stored inside one of these slices. So in
> fact, the ZFS pool is not directly in /dev/sda2, it's somewhere inside
> it. This is why the ZFS magic numbers don't seem to be in the right
> place.
>
> These slices don't seem to be visible to that Linux system, which I
> suspect is due to the kernel not being compiled with Solaris partition
> table support. If it were, other partitions (including the ZFS one)
> would show up (if my assumption is correct).
It actually is compiled with that... but then we didn't look at
/proc/partitions (the kernel's view) in the bug, but rather at the fdisk
output, which probably doesn't understand this at all.
> So from looking at the libblkid magic offsets, it seems that the ext3 magic
> value is stored between 1K and 2K, which means that it will fall into the
> boot slice, not in the ZFS slice. So AFAICS this bug doesn't have
> anything to do with ZFS, i.e., the same thing would happen if the root
> filesystem of the OpenSolaris installation was UFS.
ok, makes sense. (maybe blkid should recognize sda2 as being this
special sort of partition and stop there...)
> It'd be nice if OpenSolaris zeroed the boot slice when it is installed,
> but on the other hand, should the Anaconda installer fail just because
> it can't mount a (possibly corrupted/leftover) filesystem?
Generally, no; and anaconda now will not fail if the mount simply fails.
But if the mount attempt results in a kernel oops there's not much
anaconda can do... the filesystem that attempts the mount should be
robust enough to not oops as well, of course...
> I can file a bug for OpenSolaris if you feel strongly about this.
Well, it's always nice to be able to recognize a partition; blkid just
can't do this reliably if some tools leave old signatures lying around.
So zeroing it out would be the "polite" thing to do in any case. :)
Thanks,
-Eric
> Thanks,
> Ricardo
>
>
On Tue, 2009-04-07 at 00:40 -0700, Andreas Dilger wrote:
> > However, even though currently it's txg nr 4 that gets written first,
> > this is an implementation-specific detail that we cannot (or should not)
> > rely upon.
>
> So my proposal to check the 0th, 4th, and 8th überblock in both
> the first and second VDEV label should be pretty safe.
Yes, that should be relatively safe :)
Where should I be sending patches for this? e2fsprogs, util-linux-ng,
libvolume_id in udev, ... all of them?
> > If this is not done, then maybe leaving the ZFS labels intact could be
> > better, so that the user has a chance to recover (some/most) of it's
> > data, in case he made a mistake.
>
> Well, if ZFS is currently using this filesystem, then the kernel will
> have the block device open O_EXCL, which will prevent the mkfs from
> happening.
Not if the pool is exported :)
> Whether it will be a feature to "--force" mkfs to overwrite an ext3
> superblock or ZFS superblock is questionable. The problem with needing
> "--force" is that people tend to hard-code this into their scripts
> (so that their script always works) and then due to only having a single
> "--force" flag it also forces other, possibly more destructive, behaviour
> (e.g. --force is needed to mke2fs on a file so that it can be mounted
> via loopback, but will also force mke2fs on a filesystem that actually
> IS in use, etc).
What about '--force-overwrite' or something similar?
It wasn't scripts that I was so concerned about. I think this filesystem
detection would be more useful when running mkfs in a shell, where it is
more likely for the user to make mistakes (e.g. mistype /dev/sdd1
as /dev/sde1 or /dev/sda1 as /dev/sda2).
Thanks,
Ricardo
On Mon, Apr 13, 2009 at 08:29:22PM +0100, Ricardo M. Correia wrote:
> On Tue, 2009-04-07 at 00:40 -0700, Andreas Dilger wrote:
> > > However, even though currently it's txg nr 4 that gets written first,
> > > this is an implementation-specific detail that we cannot (or should not)
> > > rely upon.
> >
> > So my proposal to check the 0th, 4th, and 8th überblock in both
> > the first and second VDEV label should be pretty safe.
>
> Yes, that should be relatively safe :)
>
> Where should I be sending patches for this? e2fsprogs, util-linux-ng,
> libvolume_id in udev, ... all of them?
Please,
git://git.kernel.org/pub/scm/utils/util-linux-ng/util-linux-ng.git
currently the libs/blkid/src/probers/zfs.c file contains Andreas's
original probing code. Note that you needn't use hard-coded offsets
(see for example ufs.c).
I'll eventually port your patch to libvolume_id and e2fsprogs.
Thanks.
Karel
--
Karel Zak <[email protected]>