Linux Device Mapper (DM) and Software RAID (MD) device drivers can be
used to arbitrarily combine devices with different "I/O Limits". The
kernel's block layer goes to great lengths to reasonably combine the
"I/O Limits" of the individual devices. The kernel will not prevent
combining heterogeneous devices, but the user should be aware of the risk
associated with doing so.
For instance, a 512 byte device and a 4K device may be combined into a
single logical DM device; the resulting DM device would have a
logical_block_size of 4K. Filesystems layered on such a hybrid device
assume that 4K will be written atomically but in reality that 4K will be
split into 8 512 byte IOs when issued to the 512 byte device. Using a
4K logical_block_size for the higher-level DM device increases potential
for a partial write to the 512b device if there is a system crash.
If combining multiple devices' "I/O Limits" results in a conflict, the
block layer will warn that the device is more susceptible to partial
writes and mark it as misaligned. [NOTE: setting "misaligned" for this
warning is somewhat awkward but blk_stack_limits() return of -1 can be
viewed as there was an "alignment inconsistency". Would it be better to
return -1 but avoid setting t->misaligned?]
Signed-off-by: Mike Snitzer <[email protected]>
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 5eeb9e0..33bebe7 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -566,8 +566,16 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		}
 	}
 
+	top = t->logical_block_size;
 	t->logical_block_size = max(t->logical_block_size,
 				    b->logical_block_size);
+	if (top && top < t->logical_block_size) {
+		printk(KERN_NOTICE "Warning: changing logical_block_size of top device "
+		       "(from %u to %u) increases potential for partial writes\n",
+		       top, t->logical_block_size);
+		t->misaligned = 1;
+		ret = -1;
+	}
 
 	t->physical_block_size = max(t->physical_block_size,
 				     b->physical_block_size);
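For anyone who wants to experiment with this behaviour outside the kernel, the hunk above can be modelled in userspace roughly as follows. This is only a sketch: `struct limits_model` and `stack_logical_block_size()` are made-up names, not kernel API, only the two fields touched by the patch are modelled, and `fprintf()` stands in for `printk()`.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the two queue_limits fields the patch touches. */
struct limits_model {
	unsigned int logical_block_size;	/* bytes; 0 means "not set yet" */
	int misaligned;
};

/*
 * Userspace model of the patched stacking step: the top device takes the
 * larger of the two logical block sizes, and raising an already-set top
 * value is reported. Returns 0 normally, -1 if the top value was raised.
 */
int stack_logical_block_size(struct limits_model *t,
			     const struct limits_model *b)
{
	unsigned int top;
	int ret = 0;

	assert(t && b);

	/* Remember the top value before it is overwritten. */
	top = t->logical_block_size;
	if (b->logical_block_size > t->logical_block_size)
		t->logical_block_size = b->logical_block_size;

	if (top && top < t->logical_block_size) {
		fprintf(stderr,
			"Warning: changing logical_block_size of top device "
			"(from %u to %u) increases potential for partial writes\n",
			top, t->logical_block_size);
		t->misaligned = 1;
		ret = -1;
	}
	return ret;
}
```

In this model, stacking a 4096-byte bottom device under a top device already set to 512 returns -1 and leaves the top at 4096. Stacking a 512-byte bottom device under a top already at 4096 returns 0, because the top value does not change, so the result depends on the order in which the devices are stacked.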
>>>>> "Mike" == Mike Snitzer <[email protected]> writes:
Mike> For instance, a 512 byte device and a 4K device may be combined
Mike> into a single logical DM device; the resulting DM device would
Mike> have a logical_block_size of 4K. Filesystems layered on such a
Mike> hybrid device assume that 4K will be written atomically but in
Mike> reality that 4K will be split into 8 512 byte IOs when issued to
Mike> the 512 byte device.
Not really. It'll be issued as one I/O with a smaller LBA count but an
identical data payload.
Mike> Using a 4K logical_block_size for the higher-level DM device
Mike> increases potential for a partial write to the 512b device if
Mike> there is a system crash.
That's a definite maybe :)
Mike> [NOTE: setting "misaligned" for this warning is somewhat awkward
Mike> but blk_stack_limits() return of -1 can be viewed as there was an
Mike> "alignment inconsistency". Would it be better to return -1 but
Mike> avoid setting t->misaligned?]
I don't have a problem with printing a warning but I don't think this
qualifies as misalignment on the grounds that the error scenario is in
the hypothetical bucket and not a deterministic thing.
--
Martin K. Petersen Oracle Linux Engineering
On Tue, Feb 23 2010 at 12:10pm -0500,
Martin K. Petersen <[email protected]> wrote:
> >>>>> "Mike" == Mike Snitzer <[email protected]> writes:
>
> Mike> For instance, a 512 byte device and a 4K device may be combined
> Mike> into a single logical DM device; the resulting DM device would
> Mike> have a logical_block_size of 4K. Filesystems layered on such a
> Mike> hybrid device assume that 4K will be written atomically but in
> Mike> reality that 4K will be split into 8 512 byte IOs when issued to
> Mike> the 512 byte device.
>
> Not really. It'll be issued as one I/O with a smaller LBA count but an
> identical data payload.
Can you expand on that a bit? How does a smaller LBA count relate to
this? On a 512b device the 4K data payload would touch more LBAs.
In any case, a 4K write to a 512b device is not atomic.
> Mike> Using a 4K logical_block_size for the higher-level DM device
> Mike> increases potential for a partial write to the 512b device if
> Mike> there is a system crash.
>
> That's a definite maybe :)
If you think what I've raised here is overblown then I'd like to
understand why in more detail.
> Mike> [NOTE: setting "misaligned" for this warning is somewhat awkward
> Mike> but blk_stack_limits() return of -1 can be viewed as there was an
> Mike> "alignment inconsistency". Would it be better to return -1 but
> Mike> avoid setting t->misaligned?]
>
> I don't have a problem with printing a warning but I don't think this
> qualifies as misalignment on the grounds that the error scenario is in
> the hypothetical bucket and not a deterministic thing.
OK, I was relying on returning -1 so the blk_stack_limits() caller could
provide additional context (via existing warnings) for which device
"increases potential for partial writes" when it gets stacked.
Otherwise all you have is a largely generic warning (as blk_stack_limits
knows nothing about which devices the provided limits belong to).
Mike
>>>>> "Mike" == Mike Snitzer <[email protected]> writes:
>> Not really. It'll be issued as one I/O with a smaller LBA count but
>> an identical data payload.
Mike> Can you expand on that a bit? How does a smaller LBA count relate
Mike> to this? On a 512b device the 4K data payload would touch more
Mike> LBAs.
Sorry, I had my head stuck in the 4KB case. More blocks. My point
being that regardless of the logical block size we'll be issuing a
single command. The only difference between the two cases is the LBA
count. I.e. the protocol encoding of how much data to transfer.
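To make the "only the LBA count differs" point concrete, here is a minimal sketch that encodes the same 4096-byte payload as a SCSI WRITE(10) command for a 512-byte and for a 4096-byte logical block size. The helper name `build_write10()` is made up for illustration and is not taken from the kernel's SCSI code; the CDB layout used (opcode 0x2A, big-endian LBA in bytes 2-5, transfer length in bytes 7-8) is the standard WRITE(10) one.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Illustrative helper (not kernel code): fill a 10-byte WRITE(10) CDB for a
 * payload of 'bytes' bytes starting at byte offset 'offset' on a device
 * with the given logical block size. Returns the number of logical blocks
 * encoded in the transfer length field.
 */
unsigned int build_write10(uint8_t cdb[10], uint64_t offset,
			   unsigned int bytes, unsigned int logical_block_size)
{
	uint32_t lba;
	uint16_t nblocks;

	/* The request must be block aligned and fit the WRITE(10) fields. */
	assert(logical_block_size != 0);
	assert(offset % logical_block_size == 0);
	assert(bytes % logical_block_size == 0);
	assert(offset / logical_block_size <= UINT32_MAX);
	assert(bytes / logical_block_size <= UINT16_MAX);

	lba = (uint32_t)(offset / logical_block_size);
	nblocks = (uint16_t)(bytes / logical_block_size);

	memset(cdb, 0, 10);
	cdb[0] = 0x2A;				/* WRITE(10) opcode */
	cdb[2] = (uint8_t)(lba >> 24);		/* LBA, big-endian */
	cdb[3] = (uint8_t)(lba >> 16);
	cdb[4] = (uint8_t)(lba >> 8);
	cdb[5] = (uint8_t)lba;
	cdb[7] = (uint8_t)(nblocks >> 8);	/* transfer length in blocks */
	cdb[8] = (uint8_t)nblocks;

	return nblocks;
}
```

For a 4096-byte write at byte offset 8192, the 512-byte device gets one command with LBA 16 and a transfer length of 8 blocks, while the 4K device gets one command with LBA 2 and a transfer length of 1 block. It is a single command with the same data payload in both cases.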
Mike> In any case, a 4K write to a 512b device is not atomic.
Mike> If you think what I've raised here is overblown then I'd like to
Mike> understand why in more detail.
I'm just playing devil's advocate here. We have no guarantees that a
512-byte write to a 512-byte device is atomic either. None. We've been
trying very hard to get any guarantees out of storage vendors for years
without any luck. I know that a lot of our stuff operates on the
assumption that sector writes are atomic. But in a lot of cases they
are not.
Mike> Otherwise all you have is a largely generic warning (as
Mike> blk_stack_limits knows nothing about which devices the provided
Mike> limits belong to).
Yeah, I had a patch at some point that distinguished between the various
error conditions instead of returning -1. I'll see if I can dig that
up...
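One way such a distinction could look, purely as a hypothetical sketch: return a bitmask instead of -1, so that the caller, which knows the device names, can say exactly what went wrong. This is not the patch mentioned above. The names `struct limits_sketch`, `STACK_MISALIGNED`, `STACK_LBS_GREW`, `stack_limits_flags()` and `warn_on_stack_result()` are invented here, and the alignment test is a deliberately simplified stand-in for the real one.

```c
#include <assert.h>
#include <stdio.h>

/* Invented flag names; the kernel's blk_stack_limits() returns 0 or -1. */
enum stack_result {
	STACK_OK	 = 0,
	STACK_MISALIGNED = 1 << 0,	/* alignment offsets do not match */
	STACK_LBS_GREW	 = 1 << 1	/* top logical_block_size was raised */
};

/* Hypothetical, cut-down model of struct queue_limits. */
struct limits_sketch {
	unsigned int logical_block_size;	/* bytes; 0 means unset */
	unsigned int alignment_offset;		/* bytes */
};

/* Stack b under t and report each condition as a separate flag. */
unsigned int stack_limits_flags(struct limits_sketch *t,
				const struct limits_sketch *b)
{
	unsigned int flags = STACK_OK;
	unsigned int top;

	assert(t && b);

	/* Simplified stand-in for the real alignment check. */
	if (t->alignment_offset != b->alignment_offset)
		flags |= STACK_MISALIGNED;

	top = t->logical_block_size;
	if (b->logical_block_size > top)
		t->logical_block_size = b->logical_block_size;
	if (top && top < t->logical_block_size)
		flags |= STACK_LBS_GREW;

	return flags;
}

/* The caller knows which device the bottom limits belong to. */
int warn_on_stack_result(const char *bottom_name, unsigned int flags)
{
	int warnings = 0;

	if (flags & STACK_MISALIGNED) {
		fprintf(stderr, "%s: alignment inconsistency\n", bottom_name);
		warnings++;
	}
	if (flags & STACK_LBS_GREW) {
		fprintf(stderr, "%s: increases potential for partial writes\n",
			bottom_name);
		warnings++;
	}
	return warnings;
}
```

With separate flags, the logical_block_size case no longer needs to be folded into "misaligned" just to get a non-zero return value back to the caller.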
--
Martin K. Petersen Oracle Linux Engineering