2009-07-07 01:46:39

by NeilBrown

Subject: Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.

On Friday June 26, [email protected] wrote:
> >>>>> "Neil" == Neil Brown <[email protected]> writes:
>
> Neil> Providing the fields are clearly and unambiguously documented so
> Neil> that it I can use the documentation to verify the implementation
> Neil> (in md at least), I will be satisfied.
>
> The current sysfs documentation says:
>
> /sys/block/<disk>/queue/minimum_io_size:
> [...] For RAID arrays it is often the stripe chunk size.
>
> /sys/block/<disk>/queue/optimal_io_size:
> [...] For RAID devices it is usually the stripe width or the internal
> block size.
>
> The latter should be "internal track size". But in the context of MD I
> think those two definitions are crystal clear.

They might be "clear" but I'm not convinced that they are "correct".

>
>
> As far as making the application of these values more obvious I propose
> the following:
>
> What: /sys/block/<disk>/queue/minimum_io_size
> Date: April 2009
> Contact: Martin K. Petersen <[email protected]>
> Description:
> Storage devices may report a granularity or minimum I/O
> size which is the device's preferred unit of I/O.
> Requests smaller than this may incur a significant
> performance penalty.
>
> For disk drives this value corresponds to the physical
> block size. For RAID devices it is usually the stripe
> chunk size.

These two paragraphs are contradictory. There is no sense in which a
RAID chunk size is a preferred minimum I/O size.

To some degree it is actually a 'maximum' preferred size for random
IO. If you do random IO in blocks larger than the chunk size then you
risk causing more 'head contention' (at least with RAID0 - with RAID5
the tradeoff is more complex).
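
To make that concrete, here is a rough sketch (made-up numbers and
helper names, not kernel code) of how many RAID0 members a single
request ends up touching:

#include <stdio.h>

/* How many RAID0 members one request touches (illustrative only). */
static unsigned int members_touched(unsigned long long offset,
				    unsigned int len, unsigned int chunk,
				    unsigned int ndisks)
{
	unsigned long long first = offset / chunk;
	unsigned long long last = (offset + len - 1) / chunk;
	unsigned long long chunks = last - first + 1;

	return chunks >= ndisks ? ndisks : (unsigned int)chunks;
}

int main(void)
{
	unsigned int chunk = 64 * 1024;

	/* A chunk-aligned 64k read hits one member; 128k hits two. */
	printf("%u\n", members_touched(0, chunk, chunk, 4));	 /* 1 */
	printf("%u\n", members_touched(0, 2 * chunk, chunk, 4)); /* 2 */
	return 0;
}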

If you are talking about "alignment", then yes - the chunk size is an
appropriate size to align on. But so are the block size and the
stripe size and none is, in general, any better than any other.


Also, you say "may" report. If a device does not report, what happens
to this file? Is it not present, or empty, or does it contain a special
"undefined" value?
I think the answer is that "512" is reported. It might be good to
explicitly document that.


>
> A properly aligned multiple of minimum_io_size is the
> preferred request size for workloads where a high number
> of I/O operations is desired.
>
>
> What: /sys/block/<disk>/queue/optimal_io_size
> Date: April 2009
> Contact: Martin K. Petersen <[email protected]>
> Description:
> Storage devices may report an optimal transfer length or
> streaming I/O size which is the device's preferred unit
> of sustained I/O. This value is a multiple of the
> device's minimum_io_size.
>
> optimal_io_size is rarely reported for disk drives. For
> RAID devices it is usually the stripe width or the
> internal track size.
>
> A properly aligned multiple of optimal_io_size is the
> preferred request size for workloads where sustained
> throughput is desired.

In this case, if a device does not report an optimal size, the file
contains "0" - correct? Should that be explicit?

I'd really like to see an example of how you expect filesystems to use
this.
I can well imagine the VM or elevator using this to assemble IO
requests into properly aligned requests. But I cannot imagine how
e.g. mkfs would use it.
Or am I misunderstanding and this is for programs that use O_DIRECT on
the block device so they can optimise their request stream?
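
(To be concrete, I am imagining something like this sketch - names
made up, and assuming minimum_io_size is a power of two, as
posix_memalign requires:)

#include <stdio.h>
#include <stdlib.h>

/* Allocate one direct-I/O buffer: align to minimum_io_size, size it
 * to optimal_io_size when reported, else to minimum_io_size. */
static void *alloc_dio_buffer(unsigned long min_io, unsigned long opt_io,
			      size_t *len)
{
	unsigned long align = min_io ? min_io : 512;
	void *buf;

	*len = opt_io ? opt_io : align;
	if (posix_memalign(&buf, align, *len))
		return NULL;
	return buf;
}

int main(void)
{
	size_t len;
	/* e.g. 4k minimum, 192k stripe width as the optimal size */
	void *buf = alloc_dio_buffer(4096, 196608, &len);

	if (buf)
		printf("buffer of %zu bytes, ready for O_DIRECT\n", len);
	free(buf);
	return 0;
}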

Thanks,
NeilBrown


2009-07-07 05:31:19

by Martin K. Petersen

Subject: Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.

>>>>> "Neil" == Neil Brown <[email protected]> writes:

>> What: /sys/block/<disk>/queue/minimum_io_size
>> Date: April 2009
>> Contact: Martin K. Petersen <[email protected]>
>> Description:
>> Storage devices may report a granularity or minimum I/O size which is
>> the device's preferred unit of I/O. Requests smaller than this may
>> incur a significant performance penalty.
>>
>> For disk drives this value corresponds to the physical block
>> size. For RAID devices it is usually the stripe chunk size.

Neil> These two paragraphs are contradictory. There is no sense in
Neil> which a RAID chunk size is a preferred minimum I/O size.

Maybe not for MD. This is not just about MD.

This is a hint that says "Please don't send me random I/Os smaller than
this. And please align to a multiple of this value".
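
In code, the rule amounts to something like the following sketch (the
helper name is made up):

#include <stdbool.h>
#include <stdio.h>

/* The hint as stated: no smaller than minimum_io_size, and both the
 * offset and the length a multiple of it. */
static bool honours_min_io(unsigned long long offset,
			   unsigned long long len, unsigned long min_io)
{
	return len >= min_io && (offset % min_io) == 0 &&
	       (len % min_io) == 0;
}

int main(void)
{
	printf("%d\n", honours_min_io(0, 4096, 4096));	/* 1: aligned 4k */
	printf("%d\n", honours_min_io(512, 512, 4096));	/* 0: too small  */
	return 0;
}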

I agree that for MD devices the alignment portion of that is the
important one. However, putting a lower boundary on the size *is* quite
important for 4KB disk drives. There are also HW RAID devices that
choke on requests smaller than the chunk size.

I appreciate the difficulty in filling out these hints in a way that
makes sense for all the supported RAID levels in MD. However, I really
don't consider the hints particularly interesting in the isolated
context of MD. To me the hints are conduits for characteristics of the
physical storage. The question you should be asking yourself is: "What
do I put in these fields to help the filesystem so that we get the most
out of the underlying, slow hardware?".

I think it is futile to keep spending time coming up with terminology
that encompasses all current and future software and hardware storage
devices with 100% accuracy.


Neil> To some degree it is actually a 'maximum' preferred size for
Neil> random IO. If you do random IO in blocks larger than the chunk
Neil> size then you risk causing more 'head contention' (at least with
Neil> RAID0 - with RAID5 the tradeoff is more complex).

Please elaborate.


Neil> Also, you say "may" report. If a device does not report, what
Neil> happens to this file? Is it not present, or empty, or does it
Neil> contain a special "undefined" value? I think the answer is that
Neil> "512" is reported.

The answer is physical_block_size.
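
So userspace can just read the files and take what is there; something
like this sketch (illustrative only, device name hard-coded):

#include <stdio.h>

/* Read one queue attribute; 0 means the attribute is absent.
 * minimum_io_size falls back to the physical block size in-kernel,
 * optimal_io_size reads as 0 when the device reports nothing. */
static unsigned long read_queue_attr(const char *disk, const char *attr)
{
	char path[256];
	FILE *f;
	unsigned long val = 0;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", disk, attr);
	f = fopen(path, "r");
	if (!f)
		return 0;
	if (fscanf(f, "%lu", &val) != 1)
		val = 0;
	fclose(f);
	return val;
}

int main(void)
{
	unsigned long min_io = read_queue_attr("sda", "minimum_io_size");
	unsigned long opt_io = read_queue_attr("sda", "optimal_io_size");

	printf("min %lu opt %lu\n", min_io, opt_io);
	return 0;
}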


Neil> In this case, if a device does not report an optimal size, the
Neil> file contains "0" - correct? Should that be explicit?

Now documented.


Neil> I'd really like to see an example of how you expect filesystems to
Neil> use this. I can well imagine the VM or elevator using this to
Neil> assemble IO requests into properly aligned requests. But I
Neil> cannot imagine how e.g. mkfs would use it. Or am I
Neil> misunderstanding and this is for programs that use O_DIRECT on the
Neil> block device so they can optimise their request stream?

The way it has been working so far (with the manual ioctl pokage) is
that mkfs will align metadata as well as data on a minimum_io_size
boundary. And it will try to use the minimum_io_size as filesystem
block size. On Linux that's currently limited by the fact that we can't
have blocks bigger than a page. The filesystem can also report the
optimal I/O size in statfs. For XFS the stripe width also affects how
the realtime/GRIO allocators work.
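
As a rough sketch of that policy (helper names made up, not taken
from any actual mkfs):

#include <stdio.h>
#include <unistd.h>

/* Pick the filesystem block size: minimum_io_size, capped at the
 * page size since Linux can't have blocks bigger than a page. */
static unsigned long pick_fs_block_size(unsigned long min_io)
{
	unsigned long page = (unsigned long)sysconf(_SC_PAGESIZE);

	if (min_io == 0)
		min_io = 512;
	return min_io > page ? page : min_io;
}

/* Round the start of the data area up to an alignment boundary. */
static unsigned long long align_up(unsigned long long off, unsigned long a)
{
	return a ? (off + a - 1) / a * a : off;
}

int main(void)
{
	unsigned long min_io = 65536;	/* e.g. a 64k chunk */

	printf("block size %lu, data at %llu\n",
	       pick_fs_block_size(min_io), align_up(100000, min_io));
	return 0;
}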

--
Martin K. Petersen Oracle Linux Engineering

2009-07-07 22:08:22

by Bill Davidsen

Subject: Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.

Neil Brown wrote:
> On Friday June 26, [email protected] wrote:
>
>> As far as making the application of these values more obvious I propose
>> the following:
>>
>> What: /sys/block/<disk>/queue/minimum_io_size
>> Date: April 2009
>> Contact: Martin K. Petersen <[email protected]>
>> Description:
>> Storage devices may report a granularity or minimum I/O
>> size which is the device's preferred unit of I/O.
>> Requests smaller than this may incur a significant
>> performance penalty.
>>
>> For disk drives this value corresponds to the physical
>> block size. For RAID devices it is usually the stripe
>> chunk size.
>>
>
> These two paragraphs are contradictory. There is no sense in which a
> RAID chunk size is a preferred minimum I/O size.
>
> To some degree it is actually a 'maximum' preferred size for random
> IO. If you do random IO in blocks larger than the chunk size then you
> risk causing more 'head contention' (at least with RAID0 - with RAID5
> the tradeoff is more complex).
>
>
Actually this is the allocation unit, and the array can be assumed to be
a series of sets of contiguous bytes of this size. Given LBA addressing,
array members which are not simple whole devices, etc., this doesn't
(can't) promise much about the physical layout. And any read which
resides entirely within a chunk would not incur a performance penalty,
although a write might, if it were not a multiple of the sector size of
the array member(s) involved.

> If you are talking about "alignment", then yes - the chunk size is an
> appropriate size to align on. But so are the block size and the
> stripe size and none is, in general, any better than any other.
>
>
I would assume that a chunk, aligned on a chunk boundary, would be
allocated in a contiguous series of bytes on the underlying array
member. And that any I/O not aligned on a chunk boundary would be more
likely to access multiple array members.
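
In code, my assumption amounts to something like this (illustrative
only):

#include <stdio.h>

/* Which RAID0 member a byte offset lands on (illustrative only). */
static unsigned int raid0_member(unsigned long long offset,
				 unsigned int chunk, unsigned int ndisks)
{
	return (unsigned int)((offset / chunk) % ndisks);
}

int main(void)
{
	unsigned int chunk = 64 * 1024;

	/* A chunk-aligned, chunk-sized I/O stays on one member... */
	printf("%u %u\n", raid0_member(0, chunk, 4),
	       raid0_member(chunk - 1, chunk, 4));	/* 0 0 */
	/* ...while one that crosses the boundary touches the next. */
	printf("%u\n", raid0_member(chunk, chunk, 4));	/* 1 */
	return 0;
}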

Feel free to clarify my assumptions.

> Also, you say "may" report. If a device does not report, what happens
> to this file? Is it not present, or empty, or does it contain a special
> "undefined" value?
> I think the answer is that "512" is reported. It might be good to
> explicitly document that.
>
> I'd really like to see an example of how you expect filesystems to use
> this.
> I can well imagine the VM or elevator using this to assemble IO
> requests into properly aligned requests. But I cannot imagine how
> e.g. mkfs would use it.
> Or am I misunderstanding and this is for programs that use O_DIRECT on
> the block device so they can optimise their request stream?
>

--
Bill Davidsen <[email protected]>
Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one error occurs during
wildcard (glob) expansion.