2010-02-22 20:49:37

by Mike Snitzer

Subject: [RFC PATCH] block: warn if blk_stack_limits() undermines atomicity

Linux Device Mapper (DM) and Software RAID (MD) device drivers can be
used to arbitrarily combine devices with different "I/O Limits". The
kernel's block layer goes to great lengths to reasonably combine the
"I/O Limits" of the individual devices. The kernel will not prevent
combining heterogeneous devices but the user should be aware of the risk
associated with doing so.

For instance, a 512 byte device and a 4K device may be combined into a
single logical DM device; the resulting DM device would have a
logical_block_size of 4K. Filesystems layered on such a hybrid device
assume that 4K will be written atomically but in reality that 4K will be
split into 8 512 byte IOs when issued to the 512 byte device. Using a
4K logical_block_size for the higher-level DM device increases potential
for a partial write to the 512b device if there is a system crash.
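
To make the case above concrete, here is a minimal userspace sketch (not
kernel code) of the stacking rule and of the condition the proposed warning
checks. The names struct limits_sketch and stack_logical_block_size() are
hypothetical and exist only for illustration; the real logic lives in
blk_stack_limits() and operates on struct queue_limits.

```c
#include <stdio.h>

/* Hypothetical stand-in for the single queue_limits field used here. */
struct limits_sketch {
	unsigned int logical_block_size;	/* in bytes */
};

/*
 * Stack the bottom device's limit "b" onto the top device's limit "t",
 * taking the larger logical_block_size of the two.
 *
 * Returns 1 if the top device already had a non-zero logical_block_size
 * and stacking increased it, 0 otherwise.
 */
static int stack_logical_block_size(struct limits_sketch *t,
				    const struct limits_sketch *b)
{
	unsigned int top = t->logical_block_size;

	if (b->logical_block_size > t->logical_block_size)
		t->logical_block_size = b->logical_block_size;

	if (top && top < t->logical_block_size) {
		fprintf(stderr,
			"logical_block_size of top device grew from %u to %u\n",
			top, t->logical_block_size);
		return 1;
	}
	return 0;
}
```

With a top limit of 512 and a bottom limit of 4096, the function raises the
top limit to 4096 and returns 1; one 4096-byte logical block then covers
4096 / 512 = 8 sectors on the 512 byte device.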

If combining multiple devices' "I/O Limits" results in a conflict, the
block layer will report a warning that the device is more susceptible to
partial writes and flag it as misaligned. [NOTE: setting "misaligned" for this
warning is somewhat awkward but blk_stack_limits() return of -1 can be
viewed as there was an "alignment inconsistency". Would it be better to
return -1 but avoid setting t->misaligned?]

Signed-off-by: Mike Snitzer <[email protected]>

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 5eeb9e0..33bebe7 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -566,8 +566,16 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		}
 	}

+	top = t->logical_block_size;
 	t->logical_block_size = max(t->logical_block_size,
 				    b->logical_block_size);
+	if (top && top < t->logical_block_size) {
+		printk(KERN_NOTICE "Warning: changing logical_block_size of top device "
+		       "(from %u to %u) increases potential for partial writes\n",
+		       top, t->logical_block_size);
+		t->misaligned = 1;
+		ret = -1;
+	}

 	t->physical_block_size = max(t->physical_block_size,
 				     b->physical_block_size);


2010-02-23 17:11:46

by Martin K. Petersen

Subject: Re: [RFC PATCH] block: warn if blk_stack_limits() undermines atomicity

>>>>> "Mike" == Mike Snitzer <[email protected]> writes:

Mike> For instance, a 512 byte device and a 4K device may be combined
Mike> into a single logical DM device; the resulting DM device would
Mike> have a logical_block_size of 4K. Filesystems layered on such a
Mike> hybrid device assume that 4K will be written atomically but in
Mike> reality that 4K will be split into 8 512 byte IOs when issued to
Mike> the 512 byte device.

Not really. It'll be issued as one I/O with a smaller LBA count but an
identical data payload.


Mike> Using a 4K logical_block_size for the higher-level DM device
Mike> increases potential for a partial write to the 512b device if
Mike> there is a system crash.

That's a definite maybe :)


Mike> [NOTE: setting "misaligned" for this warning is somewhat awkward
Mike> but blk_stack_limits() return of -1 can be viewed as there was an
Mike> "alignment inconsistency". Would it be better to return -1 but
Mike> avoid setting t->misaligned?]

I don't have a problem with printing a warning but I don't think this
qualifies as misalignment on the grounds that the error scenario is in
the hypothetical bucket and not a deterministic thing.

--
Martin K. Petersen Oracle Linux Engineering

2010-02-23 19:32:53

by Mike Snitzer

Subject: Re: [RFC PATCH] block: warn if blk_stack_limits() undermines atomicity

On Tue, Feb 23 2010 at 12:10pm -0500,
Martin K. Petersen <[email protected]> wrote:

> >>>>> "Mike" == Mike Snitzer <[email protected]> writes:
>
> Mike> For instance, a 512 byte device and a 4K device may be combined
> Mike> into a single logical DM device; the resulting DM device would
> Mike> have a logical_block_size of 4K. Filesystems layered on such a
> Mike> hybrid device assume that 4K will be written atomically but in
> Mike> reality that 4K will be split into 8 512 byte IOs when issued to
> Mike> the 512 byte device.
>
> Not really. It'll be issued as one I/O with a smaller LBA count but an
> identical data payload.

Can you expand on that a bit? How does a smaller LBA count relate to
this? On a 512b device the 4K data payload would touch more LBAs.

In any case, a 4K write to a 512b device is not atomic.

> Mike> Using a 4K logical_block_size for the higher-level DM device
> Mike> increases potential for a partial write to the 512b device if
> Mike> there is a system crash.
>
> That's a definite maybe :)

If you think what I've raised here is overblown then I'd like to
understand why in more detail.

> Mike> [NOTE: setting "misaligned" for this warning is somewhat awkward
> Mike> but blk_stack_limits() return of -1 can be viewed as there was an
> Mike> "alignment inconsistency". Would it be better to return -1 but
> Mike> avoid setting t->misaligned?]
>
> I don't have a problem with printing a warning but I don't think this
> qualifies as misalignment on the grounds that the error scenario is in
> the hypothetical bucket and not a deterministic thing.

OK, I was relying on returning -1 so the blk_stack_limits() caller could
provide additional context (via existing warnings) for which device
"increases potential for partial writes" when it gets stacked.

Otherwise all you have is a largely generic warning (as blk_stack_limits
knows nothing about which devices the provided limits belong to).

Mike

2010-02-24 00:13:25

by Martin K. Petersen

Subject: Re: [RFC PATCH] block: warn if blk_stack_limits() undermines atomicity

>>>>> "Mike" == Mike Snitzer <[email protected]> writes:

>> Not really. It'll be issued as one I/O with a smaller LBA count but
>> an identical data payload.

Mike> Can you expand on that a bit? How does a smaller LBA count relate
Mike> to this? On a 512b device the 4K data payload would touch more
Mike> LBAs.

Sorry, I had my head stuck in the 4KB case. More blocks. My point
being that regardless of the logical block size we'll be issuing a
single command. The only difference between the two cases is the LBA
count. I.e. the protocol encoding of how much data to transfer.
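
As a small illustration of that point, the same 4K payload only changes the
block count encoded in the command. The helper below is a hypothetical
userspace sketch, not a kernel or SCSI API, and it assumes the payload is a
non-zero multiple of the block size.

```c
#include <stdint.h>

/*
 * Number of logical blocks a payload spans on a device.
 * Assumes block_size is non-zero and payload_bytes is a multiple of it.
 */
static uint32_t transfer_length_blocks(uint32_t payload_bytes,
				       uint32_t block_size)
{
	return payload_bytes / block_size;
}
```

For a 4096-byte payload this gives 8 blocks on a 512 byte device and 1 block
on a 4K device, in a single command either way.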


Mike> In any case, a 4K write to a 512b device is not atomic.

Mike> If you think what I've raised here is overblown then I'd like to
Mike> understand why in more detail.

I'm just playing devil's advocate here. We have no guarantees that a
512-byte write to a 512-byte device is atomic either. None. We've been
trying very hard to get any guarantees out of storage vendors for years
without any luck. I know that a lot of our stuff operates on the
assumption that sector writes are atomic. But in a lot of cases they
are not.


Mike> Otherwise all you have is a largely generic warning (as
Mike> blk_stack_limits knows nothing about which devices the provided
Mike> limits belong to).

Yeah, I had a patch at some point that distinguished between the various
error conditions instead of returning -1. I'll see if I can dig that
up...
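
One possible shape for such a distinction, purely as a sketch: the stacking
code returns a bitmask of conditions and the caller, which knows the device
name, turns it into specific warnings. The enum values and
report_stack_issues() below are hypothetical and are not taken from that
patch or from any kernel tree.

```c
#include <stdio.h>

/* Hypothetical condition flags a stacking function could return. */
enum stack_issue {
	STACK_ISSUE_NONE       = 0,
	STACK_ISSUE_MISALIGNED = 1 << 0,	/* alignment inconsistency */
	STACK_ISSUE_LBS_GREW   = 1 << 1,	/* logical_block_size increased */
};

/*
 * Caller-side reporting: the caller knows which device was stacked,
 * so it can attach the name to each condition.
 * Returns the number of warnings printed.
 */
static int report_stack_issues(const char *dev, unsigned int issues)
{
	int reported = 0;

	if (issues & STACK_ISSUE_MISALIGNED) {
		fprintf(stderr, "%s: alignment inconsistency\n", dev);
		reported++;
	}
	if (issues & STACK_ISSUE_LBS_GREW) {
		fprintf(stderr, "%s: increases potential for partial writes\n",
			dev);
		reported++;
	}
	return reported;
}
```

A caller could keep treating only STACK_ISSUE_MISALIGNED as misalignment
while still printing the partial-write notice for the specific device.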

--
Martin K. Petersen Oracle Linux Engineering