2010-09-27 23:15:57

by Mike Snitzer

Subject: Re: I/O topology fixes for big physical block size

On Mon, Sep 27 2010 at 6:36pm -0400,
Martin K. Petersen <[email protected]> wrote:

> >>>>> "Jens" == Jens Axboe <[email protected]> writes:
> Jens> Does mkfs do the right thing?
>
> Depends on which mkfs it is. Mike has tested things and can chip in
> here...

I haven't tested all mkfs.* but...

mkfs.xfs just works with 1M physical_block_size.

mkfs.ext4 won't by default but -F "fixes" that:

# mkfs.ext4 -b 4096 -F /dev/mapper/20017380023360006
mke2fs 1.41.12 (17-May-2010)
Warning: specified blocksize 4096 is less than device physical sectorsize 1048576, forced to continue
...

I'll check fdisk and parted tomorrow (I know lvm2 doesn't look at
physical_block_size).


2010-09-28 04:30:34

by Jens Axboe

Subject: Re: I/O topology fixes for big physical block size

On 2010-09-28 08:15, Mike Snitzer wrote:
> On Mon, Sep 27 2010 at 6:36pm -0400,
> Martin K. Petersen <[email protected]> wrote:
>
>>>>>>> "Jens" == Jens Axboe <[email protected]> writes:
>> Jens> Does mkfs do the right thing?
>>
>> Depends on which mkfs it is. Mike has tested things and can chip in
>> here...
>
> I haven't tested all mkfs.* but...
>
> mkfs.xfs just works with 1M physical_block_size.
>
> mkfs.ext4 won't by default but -F "fixes" that:
>
> # mkfs.ext4 -b 4096 -F /dev/mapper/20017380023360006
> mke2fs 1.41.12 (17-May-2010)
> Warning: specified blocksize 4096 is less than device physical sectorsize 1048576, forced to continue

OK, so that's not exactly doing the right thing, but at least you can
work around it with a parameter. So I'd say that is good enough.

> I'll check fdisk and parted tomorrow (I know lvm2 doesn't look at
> physical_block_size).

Thanks!

--
Jens Axboe


2010-09-28 05:20:26

by Eric Sandeen

Subject: Re: I/O topology fixes for big physical block size

Jens Axboe wrote:
> On 2010-09-28 08:15, Mike Snitzer wrote:
>> On Mon, Sep 27 2010 at 6:36pm -0400,
>> Martin K. Petersen <[email protected]> wrote:
>>
>>>>>>>> "Jens" == Jens Axboe <[email protected]> writes:
>>> Jens> Does mkfs do the right thing?
>>>
>>> Depends on which mkfs it is. Mike has tested things and can chip in
>>> here...
>> I haven't tested all mkfs.* but...
>>
>> mkfs.xfs just works with 1M physical_block_size.
>>
>> mkfs.ext4 won't by default but -F "fixes" that:
>>
>> # mkfs.ext4 -b 4096 -F /dev/mapper/20017380023360006
>> mke2fs 1.41.12 (17-May-2010)
>> Warning: specified blocksize 4096 is less than device physical sectorsize 1048576, forced to continue
>
> OK, so that's not exactly doing the right thing, but at least you can
> work around it with a parameter. So I'd say that is good enough.

Which part of it is the wrong thing...?

Today mkfs.ext4 refuses to create an fs blocksize which is smaller than logical
or physical by default, because one is suboptimal and the other is impossible.
-F (force) can override the suboptimal fs blocksize < logical blocksize case...

Should we change something?

Thanks,
-Eric

>> I'll check fdisk and parted tomorrow (I know lvm2 doesn't look at
>> physical_block_size).
>
> Thanks!
>


2010-09-28 14:15:45

by Mike Snitzer

Subject: Re: I/O topology fixes for big physical block size

On Tue, Sep 28 2010 at 1:20am -0400,
Eric Sandeen <[email protected]> wrote:

> Jens Axboe wrote:
> > On 2010-09-28 08:15, Mike Snitzer wrote:
> >> On Mon, Sep 27 2010 at 6:36pm -0400,
> >> Martin K. Petersen <[email protected]> wrote:
> >>
> >>>>>>>> "Jens" == Jens Axboe <[email protected]> writes:
> >>> Jens> Does mkfs do the right thing?
> >>>
> >>> Depends on which mkfs it is. Mike has tested things and can chip in
> >>> here...
> >> I haven't tested all mkfs.* but...
> >>
> >> mkfs.xfs just works with 1M physical_block_size.
> >>
> >> mkfs.ext4 won't by default but -F "fixes" that:
> >>
> >> # mkfs.ext4 -b 4096 -F /dev/mapper/20017380023360006
> >> mke2fs 1.41.12 (17-May-2010)
> >> Warning: specified blocksize 4096 is less than device physical sectorsize 1048576, forced to continue
> >
> > OK, so that's not exactly doing the right thing, but at least you can
> > work around it with a parameter. So I'd say that is good enough.
>
> Which part of it is the wrong thing...?
>
> Today mkfs.ext4 refuses to create an fs blocksize which is smaller than logical
> or physical by default, because one is suboptimal and the other is impossible.
> -F (force) can override the suboptimal fs blocksize < logical blocksize case...

Actually, -F allows one to override fs blocksize < physical_block_size.

In this instance we have the following:
# cat /sys/block/dm-2/queue/physical_block_size
1048576
# cat /sys/block/dm-2/queue/logical_block_size
512

> Should we change something?

Unclear. I could see maybe automatically capping the fs block size at
4096 if physical_block_size is larger and is a multiple of 4096?
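
A minimal sketch of the capping idea suggested above, not code from mke2fs. The function name pick_fs_blocksize and the FS_BLOCK_CAP constant are hypothetical, and starting from the physical sector size as the candidate block size is an assumption made only for illustration; other mke2fs block size rules are not modelled.

```c
/* Hypothetical cap from the suggestion above: 4096 bytes. */
#define FS_BLOCK_CAP 4096u

/*
 * Sketch only: pick a candidate filesystem block size in bytes.
 * Assumption: the candidate starts out equal to the physical sector size.
 */
unsigned int pick_fs_blocksize(unsigned int lsector_size,
                               unsigned int psector_size)
{
    unsigned int blocksize = psector_size;

    /* Cap at 4096 when the physical size is larger and a multiple of it. */
    if (psector_size > FS_BLOCK_CAP && psector_size % FS_BLOCK_CAP == 0)
        blocksize = FS_BLOCK_CAP;

    /* A block smaller than the logical sector size cannot be addressed. */
    if (blocksize < lsector_size)
        blocksize = lsector_size;

    return blocksize;
}
```

With the values reported for dm-2 above (logical 512, physical 1048576), this sketch would settle on 4096.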

> >> I'll check fdisk and parted tomorrow (I know lvm2 doesn't look at
> >> physical_block_size).

Both fdisk and parted look good (partitions are physical_block_size
aligned, will warn if you attempt to stray from that alignment). I'll
spare you the details of the creation steps...

Results of fdisk:
-----------------

# fdisk /dev/sdb
...
The device presents a logical sector size that is smaller than
the physical sector size. Aligning to a physical sector (or optimal
I/O) size boundary is recommended, or performance may be impacted.
...

# fdisk -l -u /dev/sdb

Disk /dev/sdb: 17.2 GB, 17179869184 bytes
255 heads, 63 sectors/track, 2088 cylinders, total 33554432 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 1048576 bytes
I/O size (minimum/optimal): 1048576 bytes / 1048576 bytes
Disk identifier: 0x0009bf46

Device Boot Start End Blocks Id System
/dev/sdb1 2048 16775167 8386560 83 Linux


Results of parted:
------------------
Also looks good, doesn't care about physical_block_size. Is more
concerned with {minimum,optimal}_io_size.

(parted) unit MiB
(parted) p
Model: XXXXXXXXXXXXX
Disk /dev/sdb: 16384MiB
Sector size (logical/physical): 512B/1048576B
Partition Table: msdos

Number Start End Size Type File system Flags
1 1.00MiB 8191MiB 8190MiB primary
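
The alignment reported above can be checked by arithmetic. The sketch below uses a hypothetical helper, sector_is_aligned, that converts a position in logical sectors to bytes and tests it against the physical block size; the numbers in the usage note are taken from the fdisk output.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch only: true if a position given in logical sectors falls on a
 * physical block boundary. Returns false for a zero physical size.
 */
bool sector_is_aligned(uint64_t sector, uint32_t logical_size,
                       uint32_t physical_size)
{
    if (physical_size == 0)
        return false;
    return (sector * logical_size) % physical_size == 0;
}
```

For /dev/sdb1, the start sector 2048 gives 2048 * 512 = 1048576 bytes, exactly one physical block. The first sector after the partition, 16775168, is 8191 * 2048, so the partition also ends on a 1 MiB boundary, which matches the 1.00MiB and 8191MiB figures from parted.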

2010-09-28 20:57:51

by Theodore Ts'o

Subject: Re: I/O topology fixes for big physical block size

On Tue, Sep 28, 2010 at 10:15:45AM -0400, Mike Snitzer wrote:
> Actually, -F allows one to override fs blocksize < physical_block_size.
>
> In this instance we have the following:
> # cat /sys/block/dm-2/queue/physical_block_size
> 1048576
> # cat /sys/block/dm-2/queue/logical_block_size
> 512
>
> > Should we change something?
>
> Unclear. I could see maybe automatically capping the fs block size at
> 4096 if physical_block_size is larger and is a multiple of 4096?

Can we decide soon what the right thing should be? I'm about to
release e2fsprogs 1.41.13, and if I should put in some sanity checking
code so mke2fs does something sane when it sees a 1M physical block
size, I can do that.

Or if the kernel is going to do that, it's fine too....

- Ted

2010-09-28 21:26:47

by Martin K. Petersen

Subject: Re: I/O topology fixes for big physical block size

>>>>> "Ted" == Ted Ts'o <[email protected]> writes:

Ted> Can we decide soon what the right thing should be? I'm about to
Ted> release e2fsprogs 1.41.13, and if I should put in some sanity
Ted> checking code so mke2fs does something sane when it sees a 1M
Ted> physical block size, I can do that.

I don't think it's entirely clear what the "right thing" would be.

Let's ignore the 1MB block size for now. That's clearly a fluke and a
buggy device. But there are SSDs that will advertise an 8KiB physical
block size. And apparently 16KiB devices are in the pipeline.

How do we want to handle these devices? Allowing blocks bigger than the
page size is going to be painful.

So the question is whether we can tweak the filesystem layout in a way
that would alleviate the pain without having to change the filesystem
block size in the traditional sense.

At least we're talking about SSDs and arrays here. I assume the partial
block write penalty for these devices would be smaller than it is for
rotating media.

--
Martin K. Petersen Oracle Linux Engineering

2010-09-28 21:36:52

by Eric Sandeen

Subject: Re: I/O topology fixes for big physical block size

Martin K. Petersen wrote:

> >>>>> "Ted" == Ted Ts'o <[email protected]> writes:
>
> Ted> Can we decide soon what the right thing should be? I'm about to
> Ted> release e2fsprogs 1.41.13, and if I should put in some sanity
> Ted> checking code so mke2fs does something sane when it sees a 1M
> Ted> physical block size, I can do that.
>
> I don't think it's entirely clear what the "right thing" would be.
>
> Let's ignore the 1MB block size for now. That's clearly a fluke and a
> buggy device. But there are SSDs that will advertise an 8KiB physical
> block size. And apparently 16KiB devices are in the pipeline.
>
Ok, then it sounds like mkfs.ext4's refusal to make fs blocksize less
than device physical sectorsize without -F is broken, and that should
be removed. I'd say issue a warning in the case but if there's a 16k
physical device maybe there's no point in warning either?

> How do we want to handle these devices? Allowing blocks bigger than the
> page size is going to be painful.
>
> So the question is whether we can tweak the filesystem layout in a way
> that would alleviate the pain without having to change the filesystem
> block size in the traditional sense.
>
> At least we're talking about SSDs and arrays here. I assume the partial
> block write penalty for these devices would be smaller than it is for
> rotating media.
>
>
I guess it must be.

Anyway here's a patch to remove the force requirement and just give the
user whatever they want, since apparently we can't avoid fs blocksize
less than physical sector size in general. It does still warn
that the fs blocksize is less than physical sectorsize, but *shrug*


diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index add7c0c..6010fc1 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -1634,17 +1634,15 @@ static void PRS(int argc, char *argv[])
ext2fs_blocks_count(&fs_param) /
(blocksize / 1024));
} else {
- if (blocksize < lsector_size || /* Impossible */
- (!force && (blocksize < psector_size))) { /* Suboptimal */
+ if (blocksize < lsector_size) { /* Impossible */
com_err(program_name, EINVAL,
_("while setting blocksize; too small "
"for device\n"));
exit(1);
- } else if (blocksize < psector_size) {
+ } else if (blocksize < psector_size) { /* Suboptimal */
fprintf(stderr, _("Warning: specified blocksize %d is "
- "less than device physical sectorsize %d, "
- "forced to continue\n"), blocksize,
- psector_size);
+ "less than device physical sectorsize %d\n"),
+ blocksize, psector_size);
}
}


-Eric
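
For readers who want the post-patch behaviour in isolation, here is a standalone sketch of the decision the patched branch makes. The enum and the function name check_blocksize are hypothetical; the real code lives inline in PRS() and calls com_err() or fprintf() instead of returning a value.

```c
/* Hypothetical result codes for the three outcomes in the patched branch. */
enum blocksize_verdict {
    BS_OK,               /* no message */
    BS_WARN_SUBOPTIMAL,  /* warning printed, mkfs continues */
    BS_ERR_IMPOSSIBLE    /* error printed, mkfs exits */
};

/* Sketch only: mirrors the order of the checks after the patch. */
enum blocksize_verdict check_blocksize(int blocksize, int lsector_size,
                                       int psector_size)
{
    if (blocksize < lsector_size)   /* Impossible */
        return BS_ERR_IMPOSSIBLE;
    if (blocksize < psector_size)   /* Suboptimal */
        return BS_WARN_SUBOPTIMAL;
    return BS_OK;
}
```

With the device from this thread (blocksize 4096, logical 512, physical 1048576) the result is a warning, and -F no longer plays a part in this check.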



2010-09-30 16:31:00

by Theodore Ts'o

Subject: Re: I/O topology fixes for big physical block size

On Tue, Sep 28, 2010 at 04:36:42PM -0500, Eric Sandeen wrote:
> Ok, then it sounds like mkfs.ext4's refusal to make fs blocksize less
> than device physical sectorsize without -F is broken, and that should
> be removed. I'd say issue a warning in the case but if there's a 16k
> physical device maybe there's no point in warning either?

If the device physical sectorsize is that big, should we perhaps use
that as a hint to align writes to blocks aligned with that
physical sectorsize? Right now we use the optimal I/O size, but if
the optimal I/O size is not specified and the physical sectorsize is,
say, 16k or 32k, maybe we should use it to calculate
s_raid_stripe_width?

- Ted
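
A rough sketch of the fallback Ted proposes, assuming the stripe width is expressed in filesystem blocks. The function name stripe_width_hint is hypothetical and this is not how mke2fs currently computes s_raid_stripe_width; it only illustrates using the physical sector size when no optimal I/O size is reported.

```c
/*
 * Sketch only: derive a stripe width hint in filesystem blocks.
 * All sizes are in bytes. Returns 0 when there is no useful hint.
 */
unsigned long stripe_width_hint(unsigned long optimal_io_size,
                                unsigned long psector_size,
                                unsigned long blocksize)
{
    unsigned long hint = optimal_io_size;

    if (blocksize == 0)
        return 0;

    /* Proposed fallback: a large physical sector stands in for the hint. */
    if (hint == 0 && psector_size > blocksize)
        hint = psector_size;

    return hint / blocksize;
}
```

A 16k physical sector with no optimal I/O size and a 4k block size would give a hint of 4 blocks under this sketch.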

2010-09-30 17:07:09

by Eric Sandeen

Subject: Re: I/O topology fixes for big physical block size

On 09/30/2010 11:30 AM, Ted Ts'o wrote:
> On Tue, Sep 28, 2010 at 04:36:42PM -0500, Eric Sandeen wrote:
>> Ok, then it sounds like mkfs.ext4's refusal to make fs blocksize less
>> than device physical sectorsize without -F is broken, and that should
>> be removed. I'd say issue a warning in the case but if there's a 16k
>> physical device maybe there's no point in warning either?
>
> If the device physical sectorsize is that big, should we perhaps use
> that as a hint to align writes to blocks aligned with that
> physical sectorsize? Right now we use the optimal I/O size, but if
> the optimal I/O size is not specified and the physical sectorsize is,

I can't keep track of all the parameters, is it ever true that optimal
I/O size isn't specified?

> say, 16k or 32k, maybe we should use it to calculate
> s_raid_stripe_width?

Perhaps, though really ext4 still doesn't do -that- much with the value,
anyway...

-Eric

2010-09-30 17:33:43

by Mike Snitzer

Subject: Re: I/O topology fixes for big physical block size

On Thu, Sep 30 2010 at 1:07pm -0400,
Eric Sandeen <[email protected]> wrote:

> On 09/30/2010 11:30 AM, Ted Ts'o wrote:
> > On Tue, Sep 28, 2010 at 04:36:42PM -0500, Eric Sandeen wrote:
> >> Ok, then it sounds like mkfs.ext4's refusal to make fs blocksize less
> >> than device physical sectorsize without -F is broken, and that should
> >> be removed. I'd say issue a warning in the case but if there's a 16k
> >> physical device maybe there's no point in warning either?
> >
> > If the device physical sectorsize is that big, should we perhaps use
> > that as a hint to align writes to blocks aligned with that
> > physical sectorsize? Right now we use the optimal I/O size, but if
> > the optimal I/O size is not specified and the physical sectorsize is,
>
> I can't keep track of all the parameters, is it ever true that optimal
> I/O size isn't specified?

Yes optimal_io_size may be 0. But minimum_io_size will always be scaled
up to at least match physical_block_size.

In any case: this 1MB physical_block_size device, which started this
thread, also has 1MB for both minimum_io_size and optimal_io_size.

Mike

2010-10-01 14:24:41

by Theodore Ts'o

Subject: Re: I/O topology fixes for big physical block size

On Thu, Sep 30, 2010 at 01:33:43PM -0400, Mike Snitzer wrote:
>
> Yes optimal_io_size may be 0. But minimum_io_size will always be scaled
> up to at least match physical_block_size.

Woah! Are we sure we want to do that? According to Jens, 8k physical
blocks are here already and 16k physical block sizes are right
around the corner. If we scale minimum_io_size up to the physical
block size, then even though these devices will have 512 or 4k logical
block sizes, minimum_io_size will be 16k? That sounds wrong,
given that the Linux VM can't handle file system block
sizes greater than page size. And if we scale the minimum_io_size to
the physical block size, mke2fs will refuse to create a 4k blocksize
filesystem --- since presumably "minimum io size" means we can't do
I/O's smaller than that.

Please tell me you meant to say __logical__ blocksize above?

Or am I misunderstanding what you meant?

- Ted

2010-10-01 22:19:21

by Martin K. Petersen

Subject: Re: I/O topology fixes for big physical block size

>>>>> "Ted" == Ted Ts'o <[email protected]> writes:

Ted> If we scale minimum_io_size up to the physical block size, then
Ted> even though these devices will have 512 or 4k logical block sizes,
Ted> minimum_io_size will be 16k? That sounds wrong, given that
Ted> the Linux VM can't handle file system block sizes
Ted> greater than page size. And if we scale the minimum_io_size to the
Ted> physical block size, mke2fs will refuse to create a 4k blocksize
Ted> filesystem --- since presumably "minimum io size" means we can't do
Ted> I/O's smaller than that.

logical <= physical <= minimum

logical is the smallest unit we can address. Usually 512 bytes.

physical is the allocation unit the device claims to use
internally. Typically 512 or 4096. 8 and 16 KiB coming.

minimum is the device's preferred minimum random I/O unit. This
is usually identical to the physical block size. Arrays might
report a multiple of the physical block size here (stripe chunk
size).

optimal (if provided) is the preferred sequential I/O unit and a
multiple of minimum (stripe width).

The logical and physical parameters are device protocol-centric values.
The minimum and optimal I/O sizes are the two "soft" values that
filesystems should be looking at for layout hints.

A filesystem should use minimum as a cue for block size and optimal as a
cue for stripe width. minimum may indeed be bigger than page size and
this discussion was started to figure out if there were things we could
do to accommodate these devices without actually changing the filesystem
block size in the traditional sense.

Since not all drives guarantee that a read-modify-write cycle on a 4 KiB
physical block won't clobber adjacent 512-byte logical blocks it may be
a good idea to look at physical block size if there are atomicity
concerns. I.e. filesystems that depend on atomic journal writes may
want to look at the reported physical block size.

--
Martin K. Petersen Oracle Linux Engineering
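
The ordering Martin describes (logical <= physical <= minimum, with optimal a multiple of minimum when it is reported) can be written down as a small consistency check. This is a sketch under assumptions: the struct io_topology and the function name are hypothetical, and the field names simply mirror the queue attributes discussed in this thread.

```c
#include <stdbool.h>

/* Hypothetical container for the four values, all in bytes. */
struct io_topology {
    unsigned int logical_block_size;
    unsigned int physical_block_size;
    unsigned int minimum_io_size;
    unsigned int optimal_io_size;   /* 0 means "not reported" */
};

/* Sketch only: check the relations stated in the message above. */
bool topology_is_consistent(const struct io_topology *t)
{
    if (t->logical_block_size == 0)
        return false;
    if (t->logical_block_size > t->physical_block_size)
        return false;
    if (t->physical_block_size > t->minimum_io_size)
        return false;
    /* optimal is optional, but when present it is a multiple of minimum. */
    if (t->optimal_io_size != 0 &&
        t->optimal_io_size % t->minimum_io_size != 0)
        return false;
    return true;
}
```

The 1MB device that started the thread (512, 1048576, 1048576, 1048576) passes this check, as does a plain 512-byte disk that reports no optimal I/O size.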

2010-10-02 02:31:23

by Theodore Ts'o

Subject: Re: I/O topology fixes for big physical block size

On Fri, Oct 01, 2010 at 06:19:21PM -0400, Martin K. Petersen wrote:
> Since not all drives guarantee that a read-modify-write cycle on a 4 KiB
> physical block won't clobber adjacent 512-byte logical blocks it may be
> a good idea to look at physical block size if there are atomicity
> concerns. I.e. filesystems that depend on atomic journal writes may
> want to look at the reported physical block size.

OK, but what do we do when we start seeing devices with 8k or 16k
physical block sizes? The VM doesn't deal well with block sizes >
page size.

- Ted

2010-10-02 03:15:38

by Daniel Taylor

Subject: RE: I/O topology fixes for big physical block size



> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Ted Ts'o
> Sent: Friday, October 01, 2010 7:31 PM
> To: Martin K. Petersen
> Cc: Mike Snitzer; Eric Sandeen; Jens Axboe;
> [email protected];
> [email protected]; [email protected]
> Subject: Re: I/O topology fixes for big physical block size
>
> On Fri, Oct 01, 2010 at 06:19:21PM -0400, Martin K. Petersen wrote:
> > Since not all drives guarantee that a read-modify-write cycle on a 4 KiB
> > physical block won't clobber adjacent 512-byte logical blocks it may be
> > a good idea to look at physical block size if there are atomicity
> > concerns. I.e. filesystems that depend on atomic journal writes may
> > want to look at the reported physical block size.
>
> OK, but what do we do when we start seeing devices with 8k or 16k
> physical block sizes? The VM doesn't deal well with block sizes >
> page size.

This is a very real concern.

Those drives already exist, in essence, in RAID configurations, and
we have had to do a workaround that complicates our production process
to handle file systems for embedded devices where the file system block
size is 64K (the kernel block size for the device is also 64K), but
there's no corresponding x86 block size available.

BTW, not all drives with 4096-byte physical blocks are reporting themselves
as such. Some of them report as 512-byte physical.

>
> - Ted

2010-10-04 19:49:11

by Martin K. Petersen

Subject: Re: I/O topology fixes for big physical block size

>>>>> "Ted" == Ted Ts'o <[email protected]> writes:

Ted,

Ted> OK, but what do we do when we start seeing devices with 8k or 16k
Ted> physical block sizes? The VM doesn't deal well with block sizes >
Ted> page size.

I don't think we're going to see devices reporting logical blocks bigger
than 4KiB anytime soon. Too much pain for everybody in the industry
(most other operating systems can't even deal with 4KiB logical blocks
yet). Eventually we will have to do the required page cache surgery to
support filesystem block sizes bigger than the page size. But I don't
think that's something we'll have to deal with in the immediate future.

In the meantime, however, the question is whether there is something we
can do in the allocators to mitigate effects of devices reporting
physical blocks bigger than PAGE_CACHE_SIZE. Obviously this would be in
the I/O hint/alignment category and not something which would guarantee
that all writes would be aligned multiples of that physical block size.

--
Martin K. Petersen Oracle Linux Engineering