2009-06-25 03:58:17

by NeilBrown

[permalink] [raw]
Subject: REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.


Hi,
I have (belatedly, I admit) been looking at the new 'topology'
metrics that were introduced for 2.6.31.
I have a few questions about them which I have been discussing with
Martin, but there is one issue that I feel fairly strongly about and
would like to see changed before it gets into a -final release.
Hence this email to try to garner understanding and consensus.

The various topology metrics are exposed to user-space through sysfs
attributes in the 'queue' subdirectory of the block device
(e.g. .../sda/queue).
I think this is a poor choice and would like to find a better
one.

To explain why, it probably helps to review the current situation.
Before 2.6.30, 'queue' contains:
hw_sector_size max_hw_sectors_kb nr_requests rq_affinity
iosched/ max_sectors_kb read_ahead_kb scheduler
iostats nomerges rotational

Of these:

max_hw_sectors_kb, nr_requests, rq_affinity, iosched/,
max_sectors_kb, scheduler, nomerges, rotational

are really only relevant to the elevator code and those devices that
use that code (ide, scsi, etc).
They are not relevant for dm or md (md has its own separate 'md'
directory, and before 2.6.30, the '/queue' subdirectory did not even
appear in dm or md devices).

Of the others:
hw_sector_size - is applicable to all block devices, and could
reasonably be placed one level up in the device
directory (alongside 'size').
read_ahead_kb - a duplicate of bdi/read_ahead_kb
iostats - is a switch to enable or disable accounting of
statistics that are reported in the 'stat'
file (one level up)

So most of '/queue' is specific to one class of devices (admittedly a
very large class). The others could be argued to be aberrations.

Adding a number of extra fields such as minimum_io_size,
optimal_io_size etc. to '/queue' seems to increase the number of
aberrations and forces md and dm devices to have a /queue which is
largely irrelevant.

One approach that we could take would be to hide all those fields
in 'queue' that are not relevant to the current device, and let
'queue' be a 'dumping ground' for each block device to place whatever
sysfs attributes they want (thus md would move all of md/* to
queue/*, and leave 'md' as a symlink to 'queue').

I don't like this approach because it does not make best use of the
name space. If 'md' and 'queue' have different directories, they are
each free to create new attributes without risk of collision between
different drivers - not that the collision would be a technical
problem but it could be confusing to users.

So, where to put these new fields?

They could go in the device directory, alongside 'size' and 'ro'.
Like those fields, the new ones give guidance to filesystems on how
to use the device. Whether or not this is a good thing depends a
bit on how many fields we are talking about. One or two might be
OK. 4 or more might look messy.
There are currently 4 fields: logical_block_size,
physical_block_size, minimum_io_size, optimal_io_size.
I have suggested to Martin that 2 are enough. While I don't
particularly want to debate that in this thread, it could be
relevant so I'll outline my idea below.

They could go in 'bdi' along with read_ahead_kb. read_ahead_kb
gives guidance on optimal read sizes. The topology fields give
guidance on optimal write sizes. There is a synergy there. And
these fields are undeniably info about a backing device.
NFS has its own per-filesystem bdi so we would not want to impose
fields on NFS that weren't at all relevant. NFS has 'rsize' and
'wsize' which are somewhat related. So I feel somewhat positive
about this possibility. My only concern is that 'read_ahead_kb' is
more about reading *files*, whereas the *_block_size and *_io_size
are about writing to the *device*. I'm not sure how important a
difference this is.

They could go in a new subdirectory of the block device, just like
the integrity fields, e.g. 'topology/' or 'metrics/'. This would
be my preferred approach if there do turn out to be the full 4
fields.
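
For reference, wherever these attributes end up, reading them from user
space is trivial. A minimal sketch (assuming the queue/ location they
have in 2.6.31; 'sda' is just an example device):

#include <stdio.h>

static long read_attr(const char *dev, const char *attr)
{
        char path[256];
        long val = -1;
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
        f = fopen(path, "r");
        if (f) {
                if (fscanf(f, "%ld", &val) != 1)
                        val = -1;
                fclose(f);
        }
        return val;
}

int main(void)
{
        const char *attrs[] = { "logical_block_size", "physical_block_size",
                                "minimum_io_size", "optimal_io_size" };
        int i;

        for (i = 0; i < 4; i++)
                printf("%s = %ld\n", attrs[i], read_attr("sda", attrs[i]));
        return 0;
}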

Thanks for your attention. Comments most welcome.

NeilBrown

----------------------
Alternate implementation with only two fields.
According to Documentation/ABI/testing/sysfs-block, both
physical_block_size and minimum_io_size are the smallest unit of IO
that doesn't require read-modify-write. The first is thought to
relate to drives with 4K blocks. The second to RAID5 arrays.
But that doesn't make sense as it stands: you cannot have two things
that are both the smallest.

Also, minimum_io_size and optimal_io_size are both described as
"preferred" sizes for IO - presumably writes, not reads. Again, we cannot
have two values that are both preferred. There is again some
suggestion that one is for disks and the other is for RAID, but I
cannot see how a mkfs would choose between them.

My conclusion is that there are two issues of importance.
1/ avoiding read-modify-write as that can affect correctness (when a
write error happens, you can lose unrelated data).
2/ throughput.

For each of these issues, there are a number of sizes that are
relevant.
e.g. as you increase the request size, performance can increase,
but there are key points where a small increase in size can give a big
increase in performance. These sizes might include block size, chunk
size, stripe size, and cache size.

So I suggested two fields, each of which can store multiple values:

safe_write_size: 512 4096 327680
preferred_write_size: 4096 65536 327680 10485760

The guidance for using these is simple:
When choosing a size where atomicity of writes is important, choose
the largest size from safe_write_size which is practical (or a
multiple thereof).

When choosing a size which doesn't require atomicity, but where
throughput is important, choose a multiple of the largest size from
preferred_write_size which is practical.

The smallest safe_write_size would be taken as the logical_block_size.

If we just have these two fields, I would put them in the top level
directory for the block device.
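
To make that guidance concrete, here is a hypothetical helper (these
attributes exist nowhere yet; the values are the ones from the example
above): pick the largest listed size that is still practical for the
data at hand, and pad to a multiple of it.

#include <stdio.h>

/* 'sizes' is the parsed attribute, sorted ascending; 'budget' is the
 * largest padded size the caller considers practical. */
static unsigned long pad_to(unsigned long len, unsigned long budget,
                            const unsigned long *sizes, int n)
{
        unsigned long chosen = sizes[0];
        int i;

        for (i = 0; i < n; i++)
                if (sizes[i] <= budget)
                        chosen = sizes[i];
        return ((len + chosen - 1) / chosen) * chosen;
}

int main(void)
{
        unsigned long safe[] = { 512, 4096, 327680 };
        unsigned long preferred[] = { 4096, 65536, 327680, 10485760 };

        /* 512 bytes where atomicity matters, pad budget 64K: gives 4096 */
        printf("%lu\n", pad_to(512, 65536, safe, 3));
        /* 32Meg where throughput matters, pad budget 64Meg: gives 40Meg */
        printf("%lu\n", pad_to(32UL << 20, 64UL << 20, preferred, 4));
        return 0;
}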


2009-06-25 08:02:29

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.

>>>>> "Neil" == Neil Brown <[email protected]> writes:

Neil,

Neil> Of these:

Neil> max_hw_sectors_kb, nr_requests, rq_affinity, iosched/,
Neil> max_sectors_kb scheduler nomerges rotational

Neil> are really only relevant to the elevator code and those devices
Neil> that used that code (ide, scsi, etc).

I'm not sure I completely agree with putting rotational in that bucket.
It affects the choice of allocation policy in btrfs, for instance.



Neil> Of the others:
Neil> hw_sector_size - is applicable to all block devices, and could
Neil> reasonably be placed one level up in the
Neil> device directory (along side 'size').

hw_sector_size is deprecated. It's now split into logical and
physical_block_size.


Neil> Adding a number of extra fields such as minimum_io_size,
Neil> optimal_io_size etc to '/queue' seems to increase the number of
Neil> aberrations and enforces md and dm device to have a /queue which
Neil> is largely irrelevant.

You seem to be hung up on the fact that you don't queue things. I think
that's beside the point. You *do* have a request_queue thanks to
calling blk_queue_make_request() in md.c. And there is more to
request_queue than the values you brought up. Like the callback
functions. I'm not saying that all the values in request_queue apply to
MD, but I really don't understand what all the fuss is about. Other
than the presence of the string "queue" in the choice of naming.
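
Roughly the setup that gives a bio-based driver like MD its
request_queue (a sketch from memory, not a quote of md.c):

#include <linux/blkdev.h>
#include <linux/bio.h>

/* fail everything until a personality takes over, as md does early on */
static int example_make_request(struct request_queue *q, struct bio *bio)
{
        bio_endio(bio, -EIO);
        return 0;
}

static struct request_queue *example_alloc_queue(void)
{
        struct request_queue *q = blk_alloc_queue(GFP_KERNEL);

        if (q)
                blk_queue_make_request(q, example_make_request);
        return q;
}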

Anyway. If you look at the request_queue in the current tree you'll see
that the very limits we are discussing are contained in a separate
struct. We can easily move that somewhere else at a later date if that
is deemed the right thing to do.


Neil> I have suggested to Martin that 2 are enough.

I think I have covered this in a separate mail. You are mixing up
hardware limitations and I/O hints on the grounds that they went in as
part of the same patch set and live in the same place.

fdisk/mdadm/dmsetup need to use physical_block_size and alignment_offset
to prevent us from misaligning when setting up partitions and virtual
block devices. Also, when stacking devices I need to know these values
to ensure that the I/O hints set by MD/DM don't conflict with the
underlying hardware limitations. There are also special cases like
shared disk setups and filesystem journal padding that may need to know
details of the hardware atomicity.
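
As a concrete illustration of the alignment arithmetic (on my reading
of the attributes, alignment_offset is the byte offset of the first
naturally aligned block; the drive parameters below are made up):

#include <stdio.h>

/* first 512-byte sector at or after 'start' that falls on a physical
 * block boundary of the device */
static unsigned long long first_aligned_sector(unsigned long long start,
                                               unsigned int phys_block_size,
                                               unsigned int alignment_offset)
{
        unsigned long long bytes = start * 512ULL;
        unsigned long long rem = (bytes + phys_block_size - alignment_offset)
                                 % phys_block_size;

        if (rem)
                bytes += phys_block_size - rem;
        return bytes / 512;
}

int main(void)
{
        /* 4KB physical blocks, "offset by one" drive: sector 63 is aligned */
        printf("%llu\n", first_aligned_sector(63, 4096, 7 * 512));
        /* same drive without the offset: first aligned sector >= 63 is 64 */
        printf("%llu\n", first_aligned_sector(63, 4096, 0));
        return 0;
}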

mkfs.* can leverage minimum_io_size and optimal_io_size hints to choose
block sizes and to lay out data structures on stripe boundaries. Just
like we're doing today except using a common interface for all block
devices instead of poking at MD and LVM internals.


logical_block_size, physical_block_size and alignment_offset are
hardware limits that need to be honored when creating a (virtual) block
device or partition.

The minimum/optimal write sizes are hints to the *user* of the block
device about how to lay out things. If you look at my MD patch you'll
see that I only set the I/O hints. The hardware settings are off limits
for MD.


I don't particularly care whether we store the values in queue/,
topology/, metrics/, limits/ or in the device root. Nor whether we call
it minimum_write_size instead of minimum_io_size. I'll be happy to roll
up a renaming patch...

--
Martin K. Petersen Oracle Linux Engineering

2009-06-25 11:08:05

by NeilBrown

[permalink] [raw]
Subject: Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.

On Thu, June 25, 2009 6:00 pm, Martin K. Petersen wrote:
>>>>>> "Neil" == Neil Brown <[email protected]> writes:
>
> Neil,
>
> Neil> Of these:
>
> Neil> max_hw_sectors_kb, nr_requests, rq_affinity, iosched/,
> Neil> max_sectors_kb scheduler nomerges rotational
>
> Neil> are really only relevant to the elevator code and those devices
> Neil> that used that code (ide, scsi, etc).
>
> I'm not sure I completely agree with putting rotational in that bucket.
> It affects the choice of allocation policy in btrfs, for instance.
>

I confess to having some uncertainty about this. There is, as far as
I can tell, no documentation for it.

So I asked git why it was added, and it pointed to
commit 1308835ffffe6d61ad1f48c5c381c9cc47f683ec

which suggests that it was added so that user space could tell the
kernel whether the device was rotational, rather than the other way
around.

But yes, maybe 'rotational' doesn't belong with the others.

>
> Neil> Adding a number of extra fields such as minimum_io_size,
> Neil> optimal_io_size etc to '/queue' seems to increase the number of
> Neil> aberrations and enforces md and dm device to have a /queue which
> Neil> is largely irrelevant.
>
> You seem to be hung up on the fact that you don't queue things. I think
> that's beside the point. You *do* have a request_queue thanks to
> calling blk_queue_make_request() in md.c. And there is more to
> request_queue than the values you brought up. Like the callback
> functions. I'm not saying that all the values in request_queue apply to
> MD, but I really don't understand what all the fuss is about. Other
> than the presence of the string "queue" in the choice of naming.
>

Well names are very important. And as I said later we could possibly keep
them in 'queue' and make 'queue' a more generic directory. I don't like
that but it is probably better than the current situation.

As you say, I do currently have a request_queue, but that is an internal
detail, not part of the externally visible interface, and it is something
that is very much in my sights as something I want to change. I'm
still working out the details so I'm a fair way from a concrete proposal
and a long way from some code. That change certainly doesn't have
to happen in any rush. But we should get the externally visible
names "right" if we can.

> Anyway. If you look at the request_queue in the current tree you'll see
> that the very limits we are discussing are contained in a separate
> struct. We can easily move that somewhere else at a later date if that
> is deemed the right thing to do.

Agreed. But not relevant at the moment. The view from user-space is what
is important.

>
>
> Neil> I have suggested to Martin that 2 are enough.
>
> I think I have covered this in a separate mail. You are mixing up
> hardware limitations and I/O hints on the grounds that they went in as
> part of the same patch set and live in the same place.

I think I'm actually mixing them up because they look very much the
same. Both say "try to make write requests at least this big" and
I cannot see the difference being very big to the filesystem.

I tend to mix them up a bit with read_ahead_kb as they are in some
ways just the 'write' version of that. But it didn't arrive in the
same patch set :-)

Also, I think you seem to be treating the read-modify-write behaviour
of a 4K-sector hard drive as different-in-kind to the read-modify-write
behaviour of raid5. I cannot see that. In both cases an error can cause
unexpected corruption and in both cases getting the alignment right helps
throughput a lot. So the only difference between these two values
is the size.
If one is 4K and one is 40Meg and you have 512 bytes of data that you
want to write as safely as possible, you might pad it to 4K, but you
won't pad it to 40Meg.
If you have 32Meg of data that you want to write as safely as you can,
you may well pad it to 40Meg, rather than say "it is a multiple of
4K, that is enough for me".
So: the difference is only in the size.


>
> fdisk/mdadm/dmsetup need to use physical_block_size and alignment_offset
> to prevent us from misaligning when setting up partitions and virtual
> block devices. Also, when stacking devices I need to know these values
> to ensure that the I/O hints set by MD/DM don't conflict with the
> underlying hardware limitations. There are also special cases like
> shared disk setups and filesystem journal padding that may need to know
> details of the hardware atomicity.

"... of the *device* atomicity." It hardly matters whether the device is
hardware or software, it can still have atomicity issues.

>
> mkfs.* can leverage minimum_io_size and optimal_io_size hints to choose
> block sizes and to lay out data structures on stripe boundaries. Just
> like we're doing today except using a common interface for all block
> devices instead of poking at MD and LVM internals.

As we are using generic terms like "minimum" and "optimal", let's stick
with those terms when describing filesystem behaviour and not mention
stripes.
"mkfs.* can leverage these values to choose the most appropriate sizes
and alignments for various data structures".
What is the most appropriate size? It depends on how much data we have
to store, and how much wastage we can afford. So if there are a range
of reasonably optimal sizes, the fs can choose the best fit, without
needing to pretend to understand why one is more optimal than another.

And the "Just like we're doing today ..." is actually a bit sad.
If you look at mkfs.ext3, it requires 3 values:
- block size
- stride-size
- stripe-width

block size needs to be smallish and a power of 2 and at most
PAGE_SIZE (I think). It would be ideal if this could be stripe-size on
a raid5, but the stripe is usually too large so the next best is
physical_block_size (so, a size-based decision).

stripe-width is only really needed on raid4/5/6 as it is aimed at
avoiding read-modify-write, so it would be the stripe size, which would
be minimum_io_size. Though on a SCSI device it should probably
be "OPTIMAL TRANSFER LENGTH GRANULARITY" which I thought would be
optimal_io_size... so maybe we use that and make minimum_io_size and
optimal_io_size the same on raid5? Or just provide a list of useful
sizes and let the fs choose whatever is in the right ball-park.

stride-size is used for raid0 or raid4, or the "Asymmetric" raid5 layouts.
It allows ext3 to stagger which drive certain 'hot' data structures are
on. The current metrics don't allow for that at all.
I'm not saying they should, and I'm not sure how they could. But it
is sad.
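
For concreteness, the arithmetic behind those two values as I
understand mke2fs's -E stride=...,stripe-width=... options (the array
parameters below are made up):

#include <stdio.h>

/* Example: RAID5, 6 drives, 64KB chunk, 4KB filesystem blocks. */
int main(void)
{
        unsigned int chunk_kb = 64, block_kb = 4;
        unsigned int disks = 6, parity = 1;             /* raid5 */

        unsigned int stride = chunk_kb / block_kb;      /* chunk in fs blocks */
        unsigned int stripe_width = stride * (disks - parity);

        /* prints: stride=16 stripe-width=80 */
        printf("stride=%u stripe-width=%u\n", stride, stripe_width);
        return 0;
}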

>
> logical_block_size, physical_block_size and alignment_offset are
> hardware limits that need to be honored when creating a (virtual) block
> device or partition.

logical_block_size is a firmware limit. The others are just hints.
Strong hints I agree. Hints we should follow if at all possible.
But still hints.

>
> The minimum/optimal write sizes are hints to the *user* of the block
> device about how to lay out things. If you look at my MD patch you'll
> see that I only set the I/O hints. The hardware settings are off limits
> for MD.
>
>
> I don't particularly care whether we store the values in queue/,
> topology/, metrics/, limits/ or in the device root. Nor whether we call
> it minimum_write_size instead of minimum_io_size. I'll be happy to roll
> up a renaming patch...

Good, thanks. I do strongly care that they don't go in queue/, and
that they have a name that is as close as possible to what they mean.
Let's see if anyone else has an opinion.

Thanks,
NeilBrown

2009-06-25 12:02:18

by John Robinson

[permalink] [raw]
Subject: Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.

On 25/06/2009 12:07, NeilBrown wrote:
[...]
> stripe-width is only really needed on raid4/5/6 as it is aimed at
> avoiding read-modify-write, so it would be the stripe size, which would
> be minimum_io_size.
[...]
> stride-size is used for raid0 or raid4, or the "Asymmetric" raid5 layouts.
> It allows ext3 to stagger which drive certain 'hot' data structures are
> on. The current metrics don't allow for that at all.
> I'm not saying they should, and I'm not sure how they could. But it
> is sad.

Even sadder, when a raid 0/4/5/6 is reshaped over more discs (and
probably other scenarios outwith md), both stripe-width and stride-size
change. Is there any prospect this new stacking could give us the
opportunity to tell our client (LVM, filesystem, whatever) about the
change, or that they'll be able to take advantage of it?

Cheers,

John.

2009-06-25 12:19:20

by berthiaume_wayne

[permalink] [raw]
Subject: RE: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.

Just my 2 cents for what it's worth, but I'd have to agree with Neil that
the location of these attributes should be somewhere logical where they
can be easily found, and the naming convention of the attributes such
that they make sense to the common man. After all, not all of us are
kernel developers. =;^)

2009-06-25 17:39:28

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.

>>>>> "Neil" == NeilBrown <[email protected]> writes:

[rotational flag]

Neil> So I asked git why it was added, and it pointed to
Neil> commit 1308835ffffe6d61ad1f48c5c381c9cc47f683ec

Neil> which suggests that it was added so that user space could tell the
Neil> kernel whether the device was rotational, rather than the other
Neil> way around.

There's an option to do it via udev for broken devices that don't report
it. But both SCSI and ATA have a setting that gets queried and the
queue flag set accordingly.


Neil> Also, I think you seem to be treating the read-modify-write
Neil> behaviour of a 4K-sector hard drive as different-in-kind to the
Neil> read-modify-write behaviour of raid5. I cannot see that. In both
Neil> cases an error can cause unexpected corruption and in both cases
Neil> getting the alignment right helps throughput a lot.

If you get a write error on a RAID5 component you are able to
reconstruct and remap the stripe given the cache and the remaining
drives.

If you get a write error on a 4KB phys/512 byte logical drive the result
is undefined. In a single machine setting you can treat the 4KB block
as suspect. In a clustered setting, however, the other machines will
unknowingly be reading garbage.

I realize this is moot in the context of MD given that it doesn't
support shared storage. But MD is not the only virtual block device
driver that I need to support with the topology bits.


Neil> So the only difference between these two values is the size. If
Neil> one is 4K and one is 40Meg and you have 512 bytes of data that you
Neil> want to write as safely as possible, you might pad it to 4K, but
Neil> you won't pad it to 40Meg. If you have 32Meg of data that you want
Neil> to write as safely as you can, you may well pad it to 40Meg,
Neil> rather than say "it is a multiple of 4K, that is enough for me".
Neil> So: the difference is only in the size.

Yep. I call the lower boundary minimum_io_size and the upper boundary
optimal_io_size.

People have been putting filesystems and databases on top of RAID
devices for ages. And generally the best practice has been to align and
write in multiples of the chunk size and try to write full stripe
widths.

Given the requirement for read-modify-write on RAID[456] I can
understand your predisposition to set minimum_io_size to the stripe
width. However, I'm not really sure that's what the user wants. Given
the stripe cache I'm also not convinced the performance impact of the MD
RAID[456] RMW cycle is as bad as that of the disk drive. So I set
minimum_io_size to the chunk size in my patch.
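
In code terms this is roughly what the MD patch does for raid5 (a
sketch, not the actual patch; the names are illustrative):

#include <linux/blkdev.h>

static void example_set_raid5_hints(struct request_queue *q,
                                    unsigned int chunk_bytes,
                                    unsigned int data_disks)
{
        /* minimum_io_size: one chunk; optimal_io_size: a full stripe */
        blk_queue_io_min(q, chunk_bytes);
        blk_queue_io_opt(q, chunk_bytes * data_disks);
}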

If you can come up with better names for minimum and optimal then that's
ok with me. SCSI uses the term granularity. I used that for a while in
my patches but most people thought that was really weird. Minimum and
optimal seemed easier to grasp. Maximum also exists in the storage
device context but is literally the largest I/O the device can receive.

And just to make it clear: I completely agree with your argument that
which knob to choose is I/O size dependent. My beef with your proposal
is that I believe the length of the list should be 2.

How we do report this stuff is really something I'd like the FS guys to
comment on, though. The knobs we have now correspond to what's
currently used by XFS (libdisk) and indirectly by ext2+.

--
Martin K. Petersen Oracle Linux Engineering

2009-06-25 17:44:20

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.

>>>>> "John" == John Robinson <[email protected]> writes:

John> Even sadder, when a raid 0/4/5/6 is reshaped over more discs (and
John> probably other scenarios outwith md),

I'm not sure I agree with the whole "sad" sentiment. Just because
ext[234] doesn't query devices automatically doesn't mean the technology
doesn't exist. I wrote the MD/LVM extraction code for XFS in 2001.


John> both stripe-width and stride-size change. Is there any prospect
John> this new stacking could give us the opportunity to tell our client
John> (LVM, filesystem, whatever) about the change, or that they'll be
John> able to take advantage of it?

When you resize the fs it would have to compare the new topology to the
old one and adjust accordingly.

--
Martin K. Petersen Oracle Linux Engineering

2009-06-25 17:47:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.



On Thu, 25 Jun 2009, Martin K. Petersen wrote:
>
> Neil> So I asked git why it was added, and it pointed to
> Neil> commit 1308835ffffe6d61ad1f48c5c381c9cc47f683ec
>
> Neil> which suggests that it was added so that user space could tell the
> Neil> kernel whether the device was rotational, rather than the other
> Neil> way around.
>
> There's an option to do it via udev for broken devices that don't report
> it. But both SCSI and ATA have a setting that gets queried and the
> queue flag set accordingly.

.. except few devices actually set it.

That flag is _definitely_ all about the user being able to override it.

Linus

2009-06-25 19:34:21

by Jens Axboe

[permalink] [raw]
Subject: Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.

On Thu, Jun 25 2009, Linus Torvalds wrote:
>
>
> On Thu, 25 Jun 2009, Martin K. Petersen wrote:
> >
> > Neil> So I asked git why it was added, and it pointed to
> > Neil> commit 1308835ffffe6d61ad1f48c5c381c9cc47f683ec
> >
> > Neil> which suggests that it was added so that user space could tell the
> > Neil> kernel whether the device was rotational, rather than the other
> > Neil> way around.
> >
> > There's an option to do it via udev for broken devices that don't report
> > it. But both SCSI and ATA have a setting that gets queried and the
> > queue flag set accordingly.
>
> .. except few devices actually set it.
>
> That flag is _definitely_ all about the user being able to override it.

Most certainly, the idea was to add udev rules to set it for drives.
Fortunately newer drives do work right without a need for such rules,
but it should be handy for the ones released last year and earlier.

Most user space will not care what the setting is, it's mostly for
internal use. CFQ uses it, as does btrfs to decide allocation policy.

--
Jens Axboe

2009-06-25 19:40:31

by Jens Axboe

[permalink] [raw]
Subject: Re: [dm-devel] REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.

On Thu, Jun 25 2009, NeilBrown wrote:
> > You seem to be hung up on the fact that you don't queue things. I think
> > that's beside the point. You *do* have a request_queue thanks to
> > calling blk_queue_make_request() in md.c. And there is more to
> > request_queue than the values you brought up. Like the callback
> > functions. I'm not saying that all the values in request_queue apply to
> > MD, but I really don't understand what all the fuss is about. Other
> > than the presence of the string "queue" in the choice of naming.
> >
>
> Well names are very important. And as I said later we could possibly keep
> them in 'queue' and make 'queue' a more generic directory. I don't like
> that but it is probably better than the current situation.

Sorry to ask the obvious question, but what would the point of all this
pain be? The existing values can't go anywhere else, so you'd have to
add symlinks back into queue/ anyway.

> As you say, I do currently have a request_queue, but that is an internal
> detail, not part of the externally visible interface, and it is something
> that is very much in my sights as something I want to change. I'm
> still working out the details so I'm a fair way from a concrete proposal
> and a long way from some code. That change certainly doesn't have
> to happen in any rush. But we should get the externally visible
> names "right" if we can.

What crack are you smoking? :-)

A block device must have a request_queue; that's pretty much spread
throughout the kernel. The fact that md/dm is only using a subset of the
functionality is not a very good reason for re-writing large parts of
that design. We could save some space, but whether the queue is 200
bytes or 1400 bytes doesn't really make a whole lot of real-world
difference. It's not like we allocate/deallocate these all the time,
they are mostly static structures.

--
Jens Axboe