2009-04-24 12:55:30

by Jeff Garzik

Subject: Re: [PATCH 2 of 9] block: Export I/O topology for block devices and partitions

Kay Sievers wrote:
> On Fri, Apr 24, 2009 at 07:32, Martin K. Petersen
> <[email protected]> wrote:
>> +What: /sys/block/<disk>/alignment
>> +What: /sys/block/<disk>/<partition>/alignment
>> +What: /sys/block/<disk>/queue/minimum_io_size
>> +What: /sys/block/<disk>/queue/optimal_io_size
>
> Wouldn't it be good to include "sector", like the queue files do? The
> alignment of a partition could mean many things.
> /sys/block/<disk>/sector_alignment
> /sys/block/<disk>/<partition>/sector_alignment
>
> And prefixing the io values might be easier to read when they show up
> in a group?
> /sys/block/<disk>/queue/io_minimum_size
> /sys/block/<disk>/queue/io_optimal_size
> /sys/block/<disk>/queue/io_...

Why do we need all this syscall overhead just to read individual data items?

Isn't it dumb to require 30 userland syscalls simply to input a
10-member data structure?
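
To make the arithmetic concrete: assuming each attribute costs one open, one read, and one close, a 10-member structure comes to 30 calls. The sketch below is only illustrative. `read_attrs` is a hypothetical helper, and the counter tallies the calls the code itself issues rather than measuring anything in the kernel.

```python
import os


def read_attrs(directory, names):
    """Read one small file per attribute, counting the calls we issue.

    Hypothetical helper: assumes each attribute is a separate file that
    fits in a single read(), as with typical sysfs attributes.
    """
    values = {}
    syscalls = 0
    for name in names:
        fd = os.open(os.path.join(directory, name), os.O_RDONLY)
        syscalls += 1  # open
        try:
            data = os.read(fd, 4096)
            syscalls += 1  # read
        finally:
            os.close(fd)
            syscalls += 1  # close
        values[name] = data.decode().strip()
    return values, syscalls
```

With ten attribute files in `directory`, the returned count is 30, which is the overhead being questioned here.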

netlink looks more and more attractive for anything non-trivial.

Jeff



2009-04-24 14:37:34

by Carl Henrik Lunde

Subject: Re: [PATCH 2 of 9] block: Export I/O topology for block devices and partitions

> On Fri, Apr 24, 2009 at 07:32, Martin K. Petersen <[email protected]> wrote:
> +What: /sys/block/<disk>/alignment
> +What: /sys/block/<disk>/<partition>/alignment
> +What: /sys/block/<disk>/queue/minimum_io_size
> +What: /sys/block/<disk>/queue/optimal_io_size

Would it also be possible and useful to include the number of
spindles/channels, i.e., how many requests the device can handle
concurrently? CFQ could for example serve two time slices
concurrently if you have sequential reads and the device reports two
spindles.

[sorry for replying in the middle of the thread, but I didn't get the
original email]
--
mvh
Carl Henrik

2009-04-24 14:47:35

by Matthew Wilcox

Subject: Re: [PATCH 2 of 9] block: Export I/O topology for block devices and partitions

On Fri, Apr 24, 2009 at 04:37:17PM +0200, Carl Henrik Lunde wrote:
> > On Fri, Apr 24, 2009 at 07:32, Martin K. Petersen <[email protected]> wrote:
> > +What: /sys/block/<disk>/alignment
> > +What: /sys/block/<disk>/<partition>/alignment
> > +What: /sys/block/<disk>/queue/minimum_io_size
> > +What: /sys/block/<disk>/queue/optimal_io_size
>
> Would it also be possible and useful to include the number of
> spindles/channels, i.e., how many requests the device can handle
> concurrently? CFQ could for example serve two time slices
> concurrently if you have sequential reads and the device reports two
> spindles.

This is what we call "creeping featurism" (or other names not as nice).
You'll then want to know which data are provided by which spindle.
Then you'll want to know how fast each spindle is. Then you'll find that
not all storage gives you that information (try asking an EMC Symmetrix
how many spindles it has and where data is mapped ...)

Let's just get something merged which gives us an improvement.
Then you're free to experiment with adding the spindles count yourself,
and if you can show a real advantage to it, come back and we can argue
over it with data.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2009-04-24 15:04:07

by Martin K. Petersen

Subject: Re: [PATCH 2 of 9] block: Export I/O topology for block devices and partitions

>>>>> "Jeff" == Jeff Garzik <[email protected]> writes:

Jeff> Why do we need all this syscall overhead just to read individual
Jeff> data items?

Jeff> Isn't it dumb to require 30 userland syscalls simply to input a
Jeff> 10-member data structure?

Jeff> netlink looks more and more attractive for anything non-trivial.

I think these three knobs are very trivial :)

I agree that traversing sysfs can be sucky. But for the mkfs-es of the
world I expect most of this to be handled by libdisk.

I also really wanted something that could be easily scripted for
installers to poke at.

If these values were in any kind of hot path I'd be inclined to agree
with the need for a different interface. But realistically these are
only ever going to be accessed when creating a filesystem, partition or
MD/DM device.

So I opted to keep things simple. Doesn't mean we can't add another
interface if there's a real need...
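
As a rough illustration of the scripted use case, an installer could read the three values along these lines. This is a sketch under assumptions: the attribute names are the ones proposed in this patch (they may differ in what eventually gets merged), and `read_topology` with its `sysfs_root` parameter is a hypothetical helper, not part of the patch set.

```python
import os

# Attribute paths as proposed in this thread, relative to /sys/block/<disk>.
TOPOLOGY_ATTRS = (
    "alignment",
    "queue/minimum_io_size",
    "queue/optimal_io_size",
)


def read_topology(disk, sysfs_root="/sys/block"):
    """Return the topology values for one disk as a dict of integers.

    An attribute that is missing or unparsable maps to None, so the
    caller can fall back to its own defaults.
    """
    result = {}
    for attr in TOPOLOGY_ATTRS:
        path = os.path.join(sysfs_root, disk, attr)
        try:
            with open(path) as f:
                result[attr] = int(f.read().strip())
        except (OSError, ValueError):
            result[attr] = None
    return result
```

For example, `read_topology("sda")` would return one entry per attribute; the `sysfs_root` argument exists only so the function can be pointed at a test directory.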

--
Martin K. Petersen Oracle Linux Engineering

2009-04-24 15:18:36

by Martin K. Petersen

Subject: Re: [PATCH 2 of 9] block: Export I/O topology for block devices and partitions

>>>>> "Carl" == Carl Henrik Lunde <[email protected]> writes:

Carl> Would it also be possible and useful to include the number of
Carl> spindles/channels, i.e., how many requests the device can handle
Carl> concurrently? CFQ could for example serve two time slices
Carl> concurrently if you have sequential reads and the device reports
Carl> two spindles.

We don't really have a way to get that information at this point.

The values exported in my patch set are what the storage vendors in T10
could agree on. I simply applied them to DM and MD devices as well.

We're talking to SSD vendors about having their devices export some
characteristics that would allow us to schedule I/O more intelligently.
The effort of defining what those values might be is a work in progress.
There have been murmurs about T13 adopting the T10 knobs, much like what
happened with form factor and media rotation rate.

For more traditional storage we don't really have a good way to
distinguish between a 10GB LUN on a million dollar array and a single
disk drive. Rotation rate can help and we already use and export that.

--
Martin K. Petersen Oracle Linux Engineering