2009-09-17 22:22:46

by Eric Sandeen

[permalink] [raw]
Subject: [PATCH, RFC] mke2fs: get device topology values from blkid

This is just a rough cut, due to the blkid header selection
issues I mentioned earlier on the list. It'll also need
some config-fu to be sure we've got a blkid which has these
calls, but with it in place, we'll finally have automatic
selection of stride/stripe:

# misc/mke2fs -b 4096 /dev/md0
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=16 blocks, Stripe width=32 blocks
...

Signed-off-by: Eric Sandeen <[email protected]>
---

diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index 84c4361..8fd3cd8 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -49,6 +49,7 @@ extern int optind;
#include <sys/types.h>
#include <libgen.h>
#include <limits.h>
+#include <blkid/blkid.h>

#include "ext2fs/ext2_fs.h"
#include "et/com_err.h"
@@ -614,6 +615,8 @@ static void show_stats(ext2_filsys fs)
s->s_log_block_size);
printf(_("Fragment size=%u (log=%u)\n"), fs->fragsize,
s->s_log_frag_size);
+ printf(_("Stride=%u blocks, Stripe width=%u blocks\n"),
+ s->s_raid_stride, s->s_raid_stripe_width);
printf(_("%u inodes, %u blocks\n"), s->s_inodes_count,
s->s_blocks_count);
printf(_("%u blocks (%2.2f%%) reserved for the super user\n"),
@@ -1073,6 +1076,51 @@ static int get_bool_from_profile(char **fs_types, const char *opt, int def_val)
extern const char *mke2fs_default_profile;
static const char *default_files[] = { "<default>", 0 };

+/*
+ * Sets the geometry of a device (stripe/stride)
+ */
+static errcode_t ext2fs_get_device_geometry(const char *file,
+ struct ext2_super_block *fs_param)
+{
+ int fd;
+ int rc = 1;
+ int blocksize;
+ blkid_probe pr;
+ blkid_topology tp;
+ unsigned long min_io;
+ unsigned long opt_io;
+
+#ifdef HAVE_OPEN64
+ fd = open64(file, O_RDONLY);
+#else
+ fd = open(file, O_RDONLY);
+#endif
+ if (fd < 0)
+ return errno;
+
+ pr = blkid_new_probe();
+ if (!pr)
+ goto out;
+
+ rc = blkid_probe_set_device(pr, fd, 0, 0);
+ if (rc)
+ goto out;
+
+ tp = blkid_probe_get_topology(pr);
+ if (!tp)
+ goto out;
+
+ min_io = blkid_topology_get_minimum_io_size(tp);
+ opt_io = blkid_topology_get_optimal_io_size(tp);
+ blocksize = EXT2_BLOCK_SIZE(fs_param);
+
+ fs_param->s_raid_stride = min_io / blocksize;
+ fs_param->s_raid_stripe_width = opt_io / blocksize;
+ rc = 0;
+out:
+ close(fd);
+ return rc;
+}
static void PRS(int argc, char *argv[])
{
int b, c;
@@ -1633,6 +1681,8 @@ got_size:
fs_param.s_log_frag_size = fs_param.s_log_block_size =
int_log2(blocksize >> EXT2_MIN_BLOCK_LOG_SIZE);

+ ext2fs_get_device_geometry(device_name, &fs_param);
+
blocksize = EXT2_BLOCK_SIZE(&fs_param);

lazy_itable_init = get_bool_from_profile(fs_types,



2009-09-18 05:55:43

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH, RFC] mke2fs: get device topology values from blkid

On Sep 17, 2009 17:22 -0500, Eric Sandeen wrote:
> This is just a rough cut, due to the blkid header selection
> issues I mentioned earlier on the list. It'll also need
> some config-fu to be sure we've got a blkid which has these
> calls, but with it in place, we'll finally have automatic
> selection of stride/stripe:
>
> # misc/mke2fs -b 4096 /dev/md0
> mke2fs 1.41.9 (22-Aug-2009)
> Filesystem label=
> OS type: Linux
> Block size=4096 (log=2)
> Fragment size=4096 (log=2)
> Stride=16 blocks, Stripe width=32 blocks
> ...

Cool.

> Signed-off-by: Eric Sandeen <[email protected]>
> printf(_("Fragment size=%u (log=%u)\n"), fs->fragsize,
> s->s_log_frag_size);
> + printf(_("Stride=%u blocks, Stripe width=%u blocks\n"),
> + s->s_raid_stride, s->s_raid_stripe_width);

I was going to say we should only print these if non-zero, but for
Pete's sake we print the "Fragment size", which has never been useful,
so I don't see any harm in this. Maybe "RAID Stride ..."?

> +static errcode_t ext2fs_get_device_geometry(const char *file,
> + struct ext2_super_block *fs_param)
> +{
> + rc = blkid_probe_set_device(pr, fd, 0, 0);

Is this in the e2fsprogs blkid code yet? I'm guessing not, since you
are getting build problems, which means anyone building with something
other than latest & greatest util-linux will also get failures.

Either this needs to be configured in, or you need to upgrade the
blkid included with e2fsprogs to handle this.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-09-18 06:14:05

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH, RFC] mke2fs: get device topology values from blkid

>>>>> "Eric" == Eric Sandeen <[email protected]> writes:

Eric> This is just a rough cut, due to the blkid header selection issues
Eric> I mentioned earlier on the list. It'll also need some config-fu
Eric> to be sure we've got a blkid which has these calls, but with it in
Eric> place, we'll finally have automatic selection of stride/stripe:

What about alignment?

I know that in our friendly DM universe the volume will be aligned. But
what if the user does mkfs on /dev/sdX and the drive isn't naturally
aligned?

How flexible is the extN on-disk format? Can you pad and shift things
if need be?

Also, are you guys affected by the previously-acked-sectors-are-now-gone
problems with 512-byte logical/4KB physical drives?

--
Martin K. Petersen Oracle Linux Engineering

2009-09-18 14:04:39

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH, RFC] mke2fs: get device topology values from blkid

Andreas Dilger wrote:
> On Sep 17, 2009 17:22 -0500, Eric Sandeen wrote:
>> This is just a rough cut, due to the blkid header selection
>> issues I mentioned earlier on the list. It'll also need
>> some config-fu to be sure we've got a blkid which has these
>> calls, but with it in place, we'll finally have automatic
>> selection of stride/stripe:
>>
>> # misc/mke2fs -b 4096 /dev/md0
>> mke2fs 1.41.9 (22-Aug-2009)
>> Filesystem label=
>> OS type: Linux
>> Block size=4096 (log=2)
>> Fragment size=4096 (log=2)
>> Stride=16 blocks, Stripe width=32 blocks
>> ...
>
> Cool.
>
>> Signed-off-by: Eric Sandeen <[email protected]>
>> printf(_("Fragment size=%u (log=%u)\n"), fs->fragsize,
>> s->s_log_frag_size);
>> + printf(_("Stride=%u blocks, Stripe width=%u blocks\n"),
>> + s->s_raid_stride, s->s_raid_stripe_width);
>
> I was going to say we should only print these if non-zero, but for
> Pete's sake we print the "Fragment size", which has never been useful,
> so I don't see any harm in this. Maybe "RAID Stride ..."?
>
>> +static errcode_t ext2fs_get_device_geometry(const char *file,
>> + struct ext2_super_block *fs_param)
>> +{
>> + rc = blkid_probe_set_device(pr, fd, 0, 0);
>
> Is this in the e2fsprogs blkid code yet? I'm guessing not, since you

right, not yet. I don't know what the long-term plan is for e2fsprogs
blkid; keeping 2 trees in sync seems like a lot of work w/o much gain...

> are getting build problems, which means anyone building with something
> other than latest & greatest util-linux will also get failures.
>
> Either this needs to be configured in, or you need to upgrade the
> blkid included with e2fsprogs to handle this.

Yep, that's why I said "It'll also need some config-fu to be sure we've
got a blkid which has these calls...." :) and why I was asking about
moving the in-tree headers slightly out of the way.

-Eric

> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>


2009-09-18 14:18:08

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH, RFC] mke2fs: get device topology values from blkid

On Sep 18, 2009 02:13 -0400, Martin K. Petersen wrote:
> >>>>> "Eric" == Eric Sandeen <[email protected]> writes:
> Eric> This is just a rough cut, due to the blkid header selection issues
> Eric> I mentioned earlier on the list. It'll also need some config-fu
> Eric> to be sure we've got a blkid which has these calls, but with it in
> Eric> place, we'll finally have automatic selection of stride/stripe:
>
> What about alignment?

As yet we don't handle wacky alignment. For Lustre customers we tell them
not to create partition tables, but it would be nice to handle all of the
strange alignment issues internally.

> I know that in our friendly DM universe the volume will be aligned. But
> what if the user does mkfs on /dev/sdX and the drive isn't naturally
> aligned?
>
> How flexible is the extN on-disk format? Can you pad and shift things
> if need be?

Not in sub-block offsets, which means that partition tables and drive
geometry, etc. should at least align on multiples of the blocksize,
and ideally multiples of the "minimum" IO size.

The mballoc allocator CAN handle non-power-of-two allocations, if the
geometry tells it the minimum/optimum IO size needs it, but as yet it
doesn't have an "offset" parameter. It just assumes that block 0 is
aligned properly. I suspect it wouldn't be hard to add this, though
to make it efficient it would require munging the buddy bitmaps.

> Also, are you guys affected by the previously-acked-sectors-are-now-gone
> problems with 512-byte logical/4KB physical drives?

Not that I'm aware of. The ext4 journal commit block is below 512 bytes,
and virtually all ext4 filesystems are using 4kB blocks.

Maybe that was XFS?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-09-18 14:20:20

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH, RFC] mke2fs: get device topology values from blkid

On Sep 18, 2009 09:04 -0500, Eric Sandeen wrote:
> Andreas Dilger wrote:
>>> + rc = blkid_probe_set_device(pr, fd, 0, 0);
>>
>> Is this in the e2fsprogs blkid code yet? I'm guessing not, since you
>
> right, not yet. I don't know what the long-term plan is for e2fsprogs
> blkid; keeping 2 trees in sync seems like a lot of work w/o much gain...

I thought that was Ted's plan? We usually recommend people to update to
the latest e2fsprogs, yet not everyone will be able to upgrade to the
latest util-linux, especially if it is pulling in other libraries that
can cause package conflicts (e.g. with xfstools or similar).

>> Either this needs to be configured in, or you need to upgrade the
>> blkid included with e2fsprogs to handle this.
>
> Yep, that's why I said "It'll also need some config-fu to be sure we've
> got a blkid which has these calls...." :) and why I was asking about
> moving the in-tree headers slightly out of the way.

Sigh, it helps if I read the whole email, I guess... :-)

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-09-18 14:23:52

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH, RFC] mke2fs: get device topology values from blkid

Martin K. Petersen wrote:
>>>>>> "Eric" == Eric Sandeen <[email protected]> writes:
>
> Eric> This is just a rough cut, due to the blkid header selection issues
> Eric> I mentioned earlier on the list. It'll also need some config-fu
> Eric> to be sure we've got a blkid which has these calls, but with it in
> Eric> place, we'll finally have automatic selection of stride/stripe:
>
> What about alignment?
>
> I know that in our friendly DM universe the volume will be aligned. But
> what if the user does mkfs on /dev/sdX and the drive isn't naturally
> aligned?

heh, mkfs.ext* complains if you point it at a whole disk, though it will
proceed...

> How flexible is the extN on-disk format? Can you pad and shift things
> if need be?

shifting by 63 sectors would be tough I think; maybe a stern warning and
a "proceed y/N?" would be better.

I had sort of assumed that -we- could assume the lower layers
(partitions, volumes) got well-aligned before mkfs, but didn't think
about making the fs on the whole device.

> Also, are you guys affected by the previously-acked-sectors-are-now-gone
> problems with 512-byte logical/4KB physical drives?

previously-acked-sectors-are-now-gone? I guess I haven't been keeping
up. What do you mean by this?

-Eric

2009-09-18 14:30:28

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH, RFC] mke2fs: get device topology values from blkid

Andreas Dilger wrote:
> On Sep 18, 2009 09:04 -0500, Eric Sandeen wrote:
>> Andreas Dilger wrote:
>>>> + rc = blkid_probe_set_device(pr, fd, 0, 0);
>>> Is this in the e2fsprogs blkid code yet? I'm guessing not, since you
>> right, not yet. I don't know what the long-term plan is for e2fsprogs
>> blkid; keeping 2 trees in sync seems like a lot of work w/o much gain...
>
> I thought that was Ted's plan? We usually recommend people to update to
> the latest e2fsprogs, yet not everyone will be able to upgrade to the
> latest util-linux, especially if it is pulling in other libraries that
> can cause package conflicts (e.g. with xfstools or similar).

It doesn't much matter to me personally since my distro ;) seems pretty
well committed to using the blkid & libuuid now in util-linux-ng.

I guess I'm not thinking too much about a world where people are pulling
down tarballs & doing make install; I'm sure that still goes on but
those folks are smart enough to handle things moving around, or should be.

*shrug*

As long as we can teach the e2fsprogs build to optionally pull libblkid
& libuuid headers from the system rather than the e2fsprogs tree, it'll
give the flexibility we need.

-Eric

2009-09-18 16:43:12

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH, RFC] mke2fs: get device topology values from blkid

On Fri, Sep 18, 2009 at 08:20:24AM -0600, Andreas Dilger wrote:
> On Sep 18, 2009 09:04 -0500, Eric Sandeen wrote:
> > Andreas Dilger wrote:
> >>> + rc = blkid_probe_set_device(pr, fd, 0, 0);
> >>
> >> Is this in the e2fsprogs blkid code yet? I'm guessing not, since you
> >
> > right, not yet. I don't know what the long-term plan is for e2fsprogs
> > blkid; keeping 2 trees in sync seems like a lot of work w/o much gain...
>
> I thought that was Ted's plan? We usually recommend people to update to
> the latest e2fsprogs, yet not everyone will be able to upgrade to the
> latest util-linux, especially if it is pulling in other libraries that
> can cause package conflicts (e.g. with xfstools or similar).

I wasn't planning on keeping the blkid in e2fsprogs in sync with
util-linux-ng; however, I did want to allow people who wanted to
recompile the latest version of e2fsprogs (to pick up critical bug
fixes, for example) on existing distributions (you know, for those
silly Enterprise Linux Distro customers running RHEL 5 or SLES 11 that
pay some of our collective salaries :-) to be able to do so.

If there is an issue that not all versions of util-linux-ng have the
topology calls available, we'll need some autoconf magic, of course
--- especially if some community distro's have shipped or will have
shipped versions of util-linux-ng w/o the topology functions.

I don't particularly much care about screwing over a Linux From
Scratch or Gentoo user who refuses to upgrade. I care a tiny bit more
about supporting a community distro (it's nice if we can support a
given Community distro for at least a 12 months or so before we start
making life difficult for them). But to make life easier for
Enterprise Linux users, and the Level 3 help desk people at IBM,
Novell, and Red Hat who might need the latest version of e2fsprogs to
fix some customer's file system corruption problem, we should provide
an easy way for them to download and update that customer to the
latest version of e2fsprogs. The simplest way to do that is to stub
out the new blkid topology functions has null inline functions in
blkid.h.

Does that make sense to everyone?

- Ted

2009-09-18 16:57:14

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH, RFC] mke2fs: get device topology values from blkid

Theodore Tso wrote:
> On Fri, Sep 18, 2009 at 08:20:24AM -0600, Andreas Dilger wrote:
>> On Sep 18, 2009 09:04 -0500, Eric Sandeen wrote:
>>> Andreas Dilger wrote:
>>>>> + rc = blkid_probe_set_device(pr, fd, 0, 0);
>>>> Is this in the e2fsprogs blkid code yet? I'm guessing not, since you
>>> right, not yet. I don't know what the long-term plan is for e2fsprogs
>>> blkid; keeping 2 trees in sync seems like a lot of work w/o much gain...
>> I thought that was Ted's plan? We usually recommend people to update to
>> the latest e2fsprogs, yet not everyone will be able to upgrade to the
>> latest util-linux, especially if it is pulling in other libraries that
>> can cause package conflicts (e.g. with xfstools or similar).
>
> I wasn't planning on keeping the blkid in e2fsprogs in sync with
> util-linux-ng; however, I did want to allow people who wanted to
> recompile the latest version of e2fsprogs (to pick up critical bug
> fixes, for example) on existing distributions (you know, for those
> silly Enterprise Linux Distro customers running RHEL 5 or SLES 11 that
> pay some of our collective salaries :-) to be able to do so.
>
> If there is an issue that not all versions of util-linux-ng have the
> topology calls available, we'll need some autoconf magic, of course
> --- especially if some community distro's have shipped or will have
> shipped versions of util-linux-ng w/o the topology functions.

Yes, I'll do that for my changes in the next patch I send (Fedora is in
exactly this boat).

> I don't particularly much care about screwing over a Linux From
> Scratch or Gentoo user who refuses to upgrade. I care a tiny bit more
> about supporting a community distro (it's nice if we can support a
> given Community distro for at least a 12 months or so before we start
> making life difficult for them). But to make life easier for
> Enterprise Linux users, and the Level 3 help desk people at IBM,
> Novell, and Red Hat who might need the latest version of e2fsprogs to
> fix some customer's file system corruption problem, we should provide
> an easy way for them to download and update that customer to the
> latest version of e2fsprogs. The simplest way to do that is to stub
> out the new blkid topology functions has null inline functions in
> blkid.h.

I guess I'd still like to see lib/blkid move to lib/local-blkid/blkid or
something, so that when --disable-libblkid is passed, we will never find
local headers. Without the --disable-libblkid, we would add
-I./lib/local-blkid/ to the include search path. Ditto for libuuid.

That way you don't even have to keep up with the stubs, right? And
autoconf tests for new interfaces will just gracefully fail if building
everything against the in-tree headers.

Or if my build-fu is weak and there's a better way, I'm all ears. :)

-Eric

> Does that make sense to everyone?
>
> - Ted


2009-09-18 19:41:09

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH, RFC] mke2fs: get device topology values from blkid

>>>>> "Eric" == Eric Sandeen <[email protected]> writes:

>> Also, are you guys affected by the
>> previously-acked-sectors-are-now-gone problems with 512-byte
>> logical/4KB physical drives?

Eric> previously-acked-sectors-are-now-gone? I guess I haven't been
Eric> keeping up. What do you mean by this?

We already discussed this on irc. But in case anybody else are
wondering...

On a disk with 4KB physical blocks emulating 512-byte logical blocks the
drive firmware must resort to read-modify-write cycles and that opens up
a new error scenario:

4KB physical block: | 0 |
512b logical block: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
^ ERROR

In this case we have successfully written LBA 0 - 6. However, when the
drive attempts to write LBA 7 it gets an I/O error and the previous 7
logical blocks (that have previously been acknowledged as written) are
lost. IOW, the drive write atomicity is at the physical block level and
not the logical ditto.

See http://oss.oracle.com/~mkp/docs/ls09-topology.pdf page 19 for
prettier graphics.

--
Martin K. Petersen Oracle Linux Engineering

2009-09-18 20:28:10

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH, RFC] mke2fs: get device topology values from blkid

On Sep 18, 2009 15:40 -0400, Martin K. Petersen wrote:
> >>>>> "Eric" == Eric Sandeen <[email protected]> writes:
> >> Also, are you guys affected by the
> >> previously-acked-sectors-are-now-gone problems with 512-byte
> >> logical/4KB physical drives?
>
> Eric> previously-acked-sectors-are-now-gone? I guess I haven't been
> Eric> keeping up. What do you mean by this?
>
> We already discussed this on irc. But in case anybody else are
> wondering...
>
> On a disk with 4KB physical blocks emulating 512-byte logical blocks the
> drive firmware must resort to read-modify-write cycles and that opens up
> a new error scenario:
>
> 4KB physical block: | 0 |
> 512b logical block: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
> ^ ERROR
>
> In this case we have successfully written LBA 0 - 6. However, when the
> drive attempts to write LBA 7 it gets an I/O error and the previous 7
> logical blocks (that have previously been acknowledged as written) are
> lost. IOW, the drive write atomicity is at the physical block level and
> not the logical ditto.

In some sense, this is no worse than "real" sectors 0-6 going bad right
after they were written, or in fact even having the writes silently fail.

Yes, there is more chance that writing sector 7 (due to 4k sector r-m-w)
will cause collateral damage, but the truth even today is that disks
are not going to fail a single 512-byte sector at one time, but more
likely 64kB (or whatever the remapping unit size is), so this isn't
really introducing a new failure mode.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-09-18 23:59:50

by Karel Zak

[permalink] [raw]
Subject: Re: [PATCH, RFC] mke2fs: get device topology values from blkid

On Thu, Sep 17, 2009 at 05:22:49PM -0500, Eric Sandeen wrote:
> +static errcode_t ext2fs_get_device_geometry(const char *file,
> + struct ext2_super_block *fs_param)
> +{
> + int fd;
> + int rc = 1;
> + int blocksize;
> + blkid_probe pr;
> + blkid_topology tp;
> + unsigned long min_io;
> + unsigned long opt_io;
> +
> +#ifdef HAVE_OPEN64
> + fd = open64(file, O_RDONLY);
> +#else
> + fd = open(file, O_RDONLY);
> +#endif
> + if (fd < 0)
> + return errno;
> +
> + pr = blkid_new_probe();
> + if (!pr)
> + goto out;
> +
> + rc = blkid_probe_set_device(pr, fd, 0, 0);
> + if (rc)
> + goto out;
> +
> + tp = blkid_probe_get_topology(pr);
> + if (!tp)
> + goto out;
> +
> + min_io = blkid_topology_get_minimum_io_size(tp);
> + opt_io = blkid_topology_get_optimal_io_size(tp);
> + blocksize = EXT2_BLOCK_SIZE(fs_param);
> +
> + fs_param->s_raid_stride = min_io / blocksize;
> + fs_param->s_raid_stripe_width = opt_io / blocksize;
> + rc = 0;
> +out:

blkid_free_probe(pr);

just in case you want to keep valgrind mem-leak checkers silent :-)

> + close(fd);
> + return rc;
> +}

Karel
--
Karel Zak <[email protected]>

2009-09-19 03:03:52

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH, RFC] mke2fs: get device topology values from blkid

Karel Zak wrote:
> On Thu, Sep 17, 2009 at 05:22:49PM -0500, Eric Sandeen wrote:
>> +static errcode_t ext2fs_get_device_geometry(const char *file,
>> + struct ext2_super_block *fs_param)
>> +{

...

>> + fs_param->s_raid_stride = min_io / blocksize;
>> + fs_param->s_raid_stripe_width = opt_io / blocksize;
>> + rc = 0;
>> +out:
>
> blkid_free_probe(pr);
>
> just in case you want to keep valgrind mem-leak checkers silent :-)

ok :) If you haven't already updated your mkfs sample
(blkid/samples/mkfs.c), now's the time to do so; I'll blame my error on
cut and paste. :)

Thanks,
-Eric

>> + close(fd);
>> + return rc;
>> +}
>
> Karel


2009-09-20 20:47:07

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH, RFC] mke2fs: get device topology values from blkid

>>>>> "Andreas" == Andreas Dilger <[email protected]> writes:

Andreas> Yes, there is more chance that writing sector 7 (due to 4k
Andreas> sector r-m-w) will cause collateral damage, but the truth even
Andreas> today is that disks are not going to fail a single 512-byte
Andreas> sector at one time, but more likely 64kB (or whatever the
Andreas> remapping unit size is), so this isn't really introducing a new
Andreas> failure mode.

I keep hearing this disk drive "internal block size" of 32KB or 64KB
being mentioned. And yet none of the drive firmware engineers I talk to
on a regular basis have ever heard of such a thing...

--
Martin K. Petersen Oracle Linux Engineering

2009-09-21 17:06:01

by Eric Sandeen

[permalink] [raw]
Subject: [PATCH V2] mke2fs: get device topology values from blkid

Handle automatic selection of stride/stripe:

# misc/mke2fs /dev/md0
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=16 blocks, Stripe width=32 blocks
...

And warn on block device misalignment:

# misc/mke2fs /dev/sdc1
mke2fs 1.41.9 (22-Aug-2009)
/dev/sdc1 alignment is offset by 32256 bytes.
This may result in very poor performance, (re)-partitioning suggested.
Proceed anyway? (y,n)

V2:
Add blkid_free_probe() per kzak
Add alignment check and warning message for misalignment

Signed-off-by: Eric Sandeen <[email protected]>
---

(Note that this will cause the build to fail if the topo stuff is
found in the system's libblkid but the local lib/blkid headers don't
define it; either needs stubs or a library move first).









diff --git a/configure.in b/configure.in
index 4bb5b08..3319f70 100644
--- a/configure.in
+++ b/configure.in
@@ -802,7 +802,11 @@ AC_CHECK_MEMBER(struct sockaddr.sa_len,
[#include <sys/types.h>
#include <sys/socket.h>])
dnl
-AC_CHECK_FUNCS(chflags getrusage llseek lseek64 open64 fstat64 ftruncate64 getmntinfo strtoull strcasecmp srandom jrand48 fchown mallinfo fdatasync strnlen strptime strdup sysconf pathconf posix_memalign memalign valloc __secure_getenv prctl mmap utime setresuid setresgid usleep nanosleep getdtablesize getrlimit sync_file_range posix_fadvise fallocate)
+dnl This will add -lblkid to the AC_CHECK_FUNCS search
+dnl
+AC_SEARCH_LIBS([blkid_probe_all], [blkid])
+dnl
+AC_CHECK_FUNCS(chflags getrusage llseek lseek64 open64 fstat64 ftruncate64 getmntinfo strtoull strcasecmp srandom jrand48 fchown mallinfo fdatasync strnlen strptime strdup sysconf pathconf posix_memalign memalign valloc __secure_getenv prctl mmap utime setresuid setresgid usleep nanosleep getdtablesize getrlimit sync_file_range posix_fadvise fallocate blkid_probe_get_topology2)
dnl
dnl Check to see if -lsocket is required (solaris) to make something
dnl that uses socket() to compile; this is needed for the UUID library
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index 84c4361..a09f4d4 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -49,6 +49,7 @@ extern int optind;
#include <sys/types.h>
#include <libgen.h>
#include <limits.h>
+#include <blkid/blkid.h>

#include "ext2fs/ext2_fs.h"
#include "et/com_err.h"
@@ -614,6 +615,8 @@ static void show_stats(ext2_filsys fs)
s->s_log_block_size);
printf(_("Fragment size=%u (log=%u)\n"), fs->fragsize,
s->s_log_frag_size);
+ printf(_("Stride=%u blocks, Stripe width=%u blocks\n"),
+ s->s_raid_stride, s->s_raid_stripe_width);
printf(_("%u inodes, %u blocks\n"), s->s_inodes_count,
s->s_blocks_count);
printf(_("%u blocks (%2.2f%%) reserved for the super user\n"),
@@ -1073,6 +1076,57 @@ static int get_bool_from_profile(char **fs_types, const char *opt, int def_val)
extern const char *mke2fs_default_profile;
static const char *default_files[] = { "<default>", 0 };

+#ifdef HAVE_BLKID_PROBE_GET_TOPOLOGY
+/*
+ * Sets the geometry of a device (stripe/stride), and returns the
+ * device's alignment offset, if any, or a negative error.
+ */
+static int ext2fs_get_device_geometry(const char *file,
+ struct ext2_super_block *fs_param)
+{
+ int fd;
+ int rc = -1;
+ int blocksize;
+ blkid_probe pr;
+ blkid_topology tp;
+ unsigned long min_io;
+ unsigned long opt_io;
+
+#ifdef HAVE_OPEN64
+ fd = open64(file, O_RDONLY);
+#else
+ fd = open(file, O_RDONLY);
+#endif
+ if (fd < 0)
+ return rc;
+
+ pr = blkid_new_probe();
+ if (!pr)
+ goto out;
+
+ rc = blkid_probe_set_device(pr, fd, 0, 0);
+ if (rc)
+ goto out;
+
+ tp = blkid_probe_get_topology(pr);
+ if (!tp)
+ goto out;
+
+ min_io = blkid_topology_get_minimum_io_size(tp);
+ opt_io = blkid_topology_get_optimal_io_size(tp);
+ blocksize = EXT2_BLOCK_SIZE(fs_param);
+
+ fs_param->s_raid_stride = min_io / blocksize;
+ fs_param->s_raid_stripe_width = opt_io / blocksize;
+
+ rc = blkid_topology_get_alignment_offset(tp);
+out:
+ blkid_free_probe(pr);
+ close(fd);
+ return rc;
+}
+#endif
+
static void PRS(int argc, char *argv[])
{
int b, c;
@@ -1633,6 +1687,21 @@ got_size:
fs_param.s_log_frag_size = fs_param.s_log_block_size =
int_log2(blocksize >> EXT2_MIN_BLOCK_LOG_SIZE);

+#ifdef HAVE_BLKID_PROBE_GET_TOPOLOGY
+ retval = ext2fs_get_device_geometry(device_name, &fs_param);
+ if (retval < 0) {
+ fprintf(stderr,
+ _("warning: Unable to get device geometry for %s"),
+ device_name);
+ } else if (retval) {
+ printf(_("%s alignment is offset by %lu bytes.\n"),
+ device_name, retval);
+ printf(_("This may result in very poor performance, "
+ "(re)-partitioning suggested.\n"));
+ proceed_question();
+ }
+#endif
+
blocksize = EXT2_BLOCK_SIZE(&fs_param);

lazy_itable_init = get_bool_from_profile(fs_types,


2009-09-22 14:30:17

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH, RFC] mke2fs: get device topology values from blkid

On 09/20/2009 01:46 PM, Martin K. Petersen wrote:
>>>>>> "Andreas" == Andreas Dilger<[email protected]> writes:
>>>>>>
> Andreas> Yes, there is more chance that writing sector 7 (due to 4k
> Andreas> sector r-m-w) will cause collateral damage, but the truth even
> Andreas> today is that disks are not going to fail a single 512-byte
> Andreas> sector at one time, but more likely 64kB (or whatever the
> Andreas> remapping unit size is), so this isn't really introducing a new
> Andreas> failure mode.
>
> I keep hearing this disk drive "internal block size" of 32KB or 64KB
> being mentioned. And yet none of the drive firmware engineers I talk to
> on a regular basis have ever heard of such a thing...
>

I have not heard of drives with that kind of internal block size for
single drives.

What is common is that reads (often) are often done in much larger
chunks like this (basically, if you are reading, it normally makes sense
to read a whole track from the platter :-)

Array vendors definitely do have larger internal block sizes.

ric



2009-10-02 16:32:41

by Eric Sandeen

[permalink] [raw]
Subject: [PATCH V3] mke2fs: get device topology values from blkid

Handle automatic selection of stride/stripe:

# misc/mke2fs /dev/md0
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=16 blocks, Stripe width=32 blocks
...

And warn on block device misalignment:

# misc/mke2fs /dev/sdc1
mke2fs 1.41.9 (22-Aug-2009)
/dev/sdc1 alignment is offset by 32256 bytes.
This may result in very poor performance, (re)-partitioning suggested.
Proceed anyway? (y,n)

V2:
Add blkid_free_probe() per kzak
Add alignment check and warning message for misalignment

V3: use new blkid_new_probe_from_filename to drop some LOC

Signed-off-by: Eric Sandeen <[email protected]>
---

(Note that this will cause the build to fail if the topo stuff is
found in the system's libblkid but the local lib/blkid headers don't
define it; either needs stubs or a library move first).









diff --git a/configure.in b/configure.in
index 4bb5b08..3319f70 100644
--- a/configure.in
+++ b/configure.in
@@ -802,7 +802,11 @@ AC_CHECK_MEMBER(struct sockaddr.sa_len,
[#include <sys/types.h>
#include <sys/socket.h>])
dnl
-AC_CHECK_FUNCS(chflags getrusage llseek lseek64 open64 fstat64 ftruncate64 getmntinfo strtoull strcasecmp srandom jrand48 fchown mallinfo fdatasync strnlen strptime strdup sysconf pathconf posix_memalign memalign valloc __secure_getenv prctl mmap utime setresuid setresgid usleep nanosleep getdtablesize getrlimit sync_file_range posix_fadvise fallocate)
+dnl This will add -lblkid to the AC_CHECK_FUNCS search
+dnl
+AC_SEARCH_LIBS([blkid_probe_all], [blkid])
+dnl
+AC_CHECK_FUNCS(chflags getrusage llseek lseek64 open64 fstat64 ftruncate64 getmntinfo strtoull strcasecmp srandom jrand48 fchown mallinfo fdatasync strnlen strptime strdup sysconf pathconf posix_memalign memalign valloc __secure_getenv prctl mmap utime setresuid setresgid usleep nanosleep getdtablesize getrlimit sync_file_range posix_fadvise fallocate blkid_probe_get_topology2)
dnl
dnl Check to see if -lsocket is required (solaris) to make something
dnl that uses socket() to compile; this is needed for the UUID library
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index 84c4361..b8f3410 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -49,6 +49,7 @@ extern int optind;
#include <sys/types.h>
#include <libgen.h>
#include <limits.h>
+#include <blkid/blkid.h>

#include "ext2fs/ext2_fs.h"
#include "et/com_err.h"
@@ -614,6 +615,8 @@ static void show_stats(ext2_filsys fs)
s->s_log_block_size);
printf(_("Fragment size=%u (log=%u)\n"), fs->fragsize,
s->s_log_frag_size);
+ printf(_("Stride=%u blocks, Stripe width=%u blocks\n"),
+ s->s_raid_stride, s->s_raid_stripe_width);
printf(_("%u inodes, %u blocks\n"), s->s_inodes_count,
s->s_blocks_count);
printf(_("%u blocks (%2.2f%%) reserved for the super user\n"),
@@ -1073,6 +1076,43 @@ static int get_bool_from_profile(char **fs_types, const char *opt, int def_val)
extern const char *mke2fs_default_profile;
static const char *default_files[] = { "<default>", 0 };

+#ifdef HAVE_BLKID_PROBE_GET_TOPOLOGY
+/*
+ * Sets the geometry of a device (stripe/stride), and returns the
+ * device's alignment offset, if any, or a negative error.
+ */
+static int ext2fs_get_device_geometry(const char *file,
+ struct ext2_super_block *fs_param)
+{
+ int rc = -1;
+ int blocksize;
+ blkid_probe pr;
+ blkid_topology tp;
+ unsigned long min_io;
+ unsigned long opt_io;
+
+ pr = blkid_new_probe_from_filename(file);
+ if (!pr)
+ goto out;
+
+ tp = blkid_probe_get_topology(pr);
+ if (!tp)
+ goto out;
+
+ min_io = blkid_topology_get_minimum_io_size(tp);
+ opt_io = blkid_topology_get_optimal_io_size(tp);
+ blocksize = EXT2_BLOCK_SIZE(fs_param);
+
+ fs_param->s_raid_stride = min_io / blocksize;
+ fs_param->s_raid_stripe_width = opt_io / blocksize;
+
+ rc = blkid_topology_get_alignment_offset(tp);
+out:
+ blkid_free_probe(pr);
+ return rc;
+}
+#endif
+
static void PRS(int argc, char *argv[])
{
int b, c;
@@ -1633,6 +1673,21 @@ got_size:
fs_param.s_log_frag_size = fs_param.s_log_block_size =
int_log2(blocksize >> EXT2_MIN_BLOCK_LOG_SIZE);

+#ifdef HAVE_BLKID_PROBE_GET_TOPOLOGY
+ retval = ext2fs_get_device_geometry(device_name, &fs_param);
+ if (retval < 0) {
+ fprintf(stderr,
+ _("warning: Unable to get device geometry for %s"),
+ device_name);
+ } else if (retval) {
+ printf(_("%s alignment is offset by %lu bytes.\n"),
+ device_name, retval);
+ printf(_("This may result in very poor performance, "
+ "(re)-partitioning suggested.\n"));
+ proceed_question();
+ }
+#endif
+
blocksize = EXT2_BLOCK_SIZE(&fs_param);

lazy_itable_init = get_bool_from_profile(fs_types,


2009-10-04 19:17:37

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH V3] mke2fs: get device topology values from blkid

On Fri, Oct 02, 2009 at 11:32:42AM -0500, Eric Sandeen wrote:
> Handle automatic selection of stride/stripe:
>
> # misc/mke2fs /dev/md0 mke2fs 1.41.9 (22-Aug-2009)
> Filesystem label=
> OS type: Linux
> Block size=4096 (log=2)
> Fragment size=4096 (log=2)
> Stride=16 blocks, Stripe width=32 blocks
> ...
>
> And warn on block device misalignment:
>
> # misc/mke2fs /dev/sdc1 mke2fs 1.41.9 (22-Aug-2009)
> /dev/sdc1 alignment is offset by 32256 bytes.
> This may result in very poor performance, (re)-partitioning suggested.
> Proceed anyway? (y,n)
>
> V2:
> Add blkid_free_probe() per kzak
> Add alignment check and warning message for misalignment
>
> V3: use new blkid_new_probe_from_filename to drop some LOC
>
> Signed-off-by: Eric Sandeen <[email protected]>

Thanks, applied to the maint branch. I added (in the previous commit)
a fairly simple fix to the build system to make sure that we use the
system-provided header files for blkid or uuid if we are using the
system-provided blkid or uuid libraries. I've tested most of the
combinations except for with the latest bleeding edge blkidv2
libraries --- can you double check to make it works in that case.
>From what I can tell, it should.

- Ted