While experimenting with the data integrity support in the Linux
kernel, i found that the block layer integrity code can send integrity
data segments for a request that do not adhere to the queue limits.
The integrity data segment can be larger than queue_max_segment_size
and the segment does not adhere to the queue_segment_boundary.
It appears to me that the right way would be to apply the same
restrictions that are in place for data segments also to integrity
data segments. The patch works for my experiments and applies on top
of the current Linux tree (2.6.35-rc5).
Christof
>>>>> "Christof" == Christof Schmitt <[email protected]> writes:
Christof> While experimenting with the data integrity support in the
Christof> Linux kernel, i found that the block layer integrity code can
Christof> send integrity data segments for a request that do not adhere
Christof> to the queue limits. The integrity data segment can be larger
Christof> than queue_max_segment_size and the segment does not adhere to
Christof> the queue_segment_boundary.
Correct. That was a deliberate design decision.
Modern HBAs allow essentially indefinite chaining and our block layer
segmentation controls are to some extent legacy baggage. I did not want
to put in a set of constraints on the DI scatterlist because I was
afraid it would encourage vendors to actually them.
Christof> It appears to me that the right way would be to apply the same
Christof> restrictions that are in place for data segments also to
Christof> integrity data segments. The patch works for my experiments
Christof> and applies on top of the current Linux tree (2.6.35-rc5).
Who says constraints on the integrity scatterlist are the same as on the
data ditto? In my experience they are not. If you must do this, then
the DI constraints should be separate from the data segmentation ones.
But I'm interested in what motivated this change to begin with.
Your change also has repercussions when merging requests and bios. We'd
need to honor the DI segmentation constraints when merging. Otherwise
we may end up going beyond the controller limits when mapping the sgl.
--
Martin K. Petersen Oracle Linux Engineering
On 07/15/2010 10:03 AM, Martin K. Petersen wrote:
>>>>>> "Christof" == Christof Schmitt <[email protected]> writes:
>
> Christof> While experimenting with the data integrity support in the
> Christof> Linux kernel, i found that the block layer integrity code can
> Christof> send integrity data segments for a request that do not adhere
> Christof> to the queue limits. The integrity data segment can be larger
> Christof> than queue_max_segment_size and the segment does not adhere to
> Christof> the queue_segment_boundary.
>
> Correct. That was a deliberate design decision.
>
> Modern HBAs allow essentially indefinite chaining and our block layer
> segmentation controls are to some extent legacy baggage. I did not want
> to put in a set of constraints on the DI scatterlist because I was
> afraid it would encourage vendors to actually them.
That sounds like a very batch design decision. Either the limits are
explicitly given and different, or if not we have to assume that they
are the same as the data limits at least.
So do they have limits or not? Essentially indefinite, what does that
mean? If they are limited, then we must have settings to cope and map
around those. If these limits are not the same as the regular data
limits, then those limits could be separate.
> Christof> It appears to me that the right way would be to apply the same
> Christof> restrictions that are in place for data segments also to
> Christof> integrity data segments. The patch works for my experiments
> Christof> and applies on top of the current Linux tree (2.6.35-rc5).
>
> Who says constraints on the integrity scatterlist are the same as on the
> data ditto? In my experience they are not. If you must do this, then
> the DI constraints should be separate from the data segmentation ones.
> But I'm interested in what motivated this change to begin with.
Yes me too, what bug triggered this patch?
--
Jens Axboe
>>>>> "Jens" == Jens Axboe <[email protected]> writes:
Jens> That sounds like a very batch design decision. Either the limits
Jens> are explicitly given and different, or if not we have to assume
Jens> that they are the same as the data limits at least.
Imagine a controller that has a 4KB segment, 1 entry limit. If we
capped the DI sgl the same way as the data we'd only be able to issue
512-byte requests unless the DI entries happened to be contiguous in
memory.
For several types of I/O the DI sgl is much longer than the data sgl.
Especially if the submitter is using buffer_heads to map 512-byte
blocks.
And consequently we require vendors to be able to handle the
pathological case in which any data scatterlist honoring the
segmentation constraints given by the driver can be matched with an
integrity scatterlist in which there is a separate entry for each
logical block. No vendor has had any problems with this. Therefore
there are no block layer data integrity queue limits.
If a device appears that does in fact have constraints I have no
problems intruducing a set of suitable knobs.
--
Martin K. Petersen Oracle Linux Engineering
On 07/15/2010 10:35 AM, Martin K. Petersen wrote:
>>>>>> "Jens" == Jens Axboe <[email protected]> writes:
>
> Jens> That sounds like a very batch design decision. Either the limits
> Jens> are explicitly given and different, or if not we have to assume
> Jens> that they are the same as the data limits at least.
>
> Imagine a controller that has a 4KB segment, 1 entry limit. If we
> capped the DI sgl the same way as the data we'd only be able to issue
> 512-byte requests unless the DI entries happened to be contiguous in
> memory.
>
> For several types of I/O the DI sgl is much longer than the data sgl.
> Especially if the submitter is using buffer_heads to map 512-byte
> blocks.
>
> And consequently we require vendors to be able to handle the
> pathological case in which any data scatterlist honoring the
> segmentation constraints given by the driver can be matched with an
> integrity scatterlist in which there is a separate entry for each
> logical block. No vendor has had any problems with this. Therefore
> there are no block layer data integrity queue limits.
>
> If a device appears that does in fact have constraints I have no
> problems intruducing a set of suitable knobs.
OK, lets wait and hear what problem that Christof ran into here.
Is it ensuring that a segment doesn't corss eg the 4GB boundary?
--
Jens Axboe
On Thu, Jul 15, 2010 at 12:03:24PM -0400, Martin K. Petersen wrote:
> >>>>> "Christof" == Christof Schmitt <[email protected]> writes:
>
> Christof> While experimenting with the data integrity support in the
> Christof> Linux kernel, i found that the block layer integrity code can
> Christof> send integrity data segments for a request that do not adhere
> Christof> to the queue limits. The integrity data segment can be larger
> Christof> than queue_max_segment_size and the segment does not adhere to
> Christof> the queue_segment_boundary.
>
> Correct. That was a deliberate design decision.
>
> Modern HBAs allow essentially indefinite chaining and our block layer
> segmentation controls are to some extent legacy baggage. I did not want
> to put in a set of constraints on the DI scatterlist because I was
> afraid it would encourage vendors to actually them.
>
>
> Christof> It appears to me that the right way would be to apply the same
> Christof> restrictions that are in place for data segments also to
> Christof> integrity data segments. The patch works for my experiments
> Christof> and applies on top of the current Linux tree (2.6.35-rc5).
>
> Who says constraints on the integrity scatterlist are the same as on the
> data ditto? In my experience they are not. If you must do this, then
> the DI constraints should be separate from the data segmentation ones.
> But I'm interested in what motivated this change to begin with.
The motivation stems from research how the integrity data can be
mapped to the hardware interface used by the zfcp driver. When passing
data segments to the zfcp hardware controller, there is the constraint
that each data segment has a maximum size of 4k and a segment must not
cross a 4k boundary. Right now, this is done by reporting the
maximum segment size and segment boundary accordingly from zfcp. When
issuing a request, zfcp simply walks the sg list and passes the
segments to the hardware controller, no mapping or readjustment is
necessary in the driver.
Adding integrity data to this interface implies that the integrity
data segments are passed the same way and with the same restrictions.
integrity data segments. In fact, i am planning to post an
experimental patch for zfcp for upstream inclusion. While this is
still research, it does not affect non-integrity I/O and it will ease
future work on the integrity data mapping for zfcp.
Maybe my thinking is too much with the zfcp hardware interface where
it is obvious to have the same constraints for everything passed along
to the hardware. Another motivation is that i do not want to have the
need in the driver to re-map or copy data segments, when the block
layer already has a generic method of doing this.
What would be the right approach for hardware that has specific
constraints for integrity data? Add an interface for reporting
integrity data constraints independently of constraints for regular
data?
> Your change also has repercussions when merging requests and bios. We'd
> need to honor the DI segmentation constraints when merging. Otherwise
> we may end up going beyond the controller limits when mapping the sgl.
Meaning the integrity data sg list would have more entries than
max_segments? I have not seen this during my experiments, but then i
likely have not hit every case of a possible request layout.
Christof
>>>>> "Christof" == Christof Schmitt <[email protected]> writes:
Christof,
Christof> The motivation stems from research how the integrity data can
Christof> be mapped to the hardware interface used by the zfcp
Christof> driver. When passing data segments to the zfcp hardware
Christof> controller, there is the constraint that each data segment has
Christof> a maximum size of 4k and a segment must not cross a 4k
Christof> boundary.
Ok.
Christof> Right now, this is done by reporting the maximum segment size
Christof> and segment boundary accordingly from zfcp. When issuing a
Christof> request, zfcp simply walks the sg list and passes the segments
Christof> to the hardware controller, no mapping or readjustment is
Christof> necessary in the driver.
In that case I don't really have a problem with adhering to the queue
segment size and segment boundary for the integrity metadata. As long
as we avoid using the max number of data segments to cap the integrity
scatterlist because that'll definitely break a lot of stuff.
Does the zfcp hardware have scatterlist length constraints?
>> Your change also has repercussions when merging requests and bios.
>> We'd need to honor the DI segmentation constraints when merging.
>> Otherwise we may end up going beyond the controller limits when
>> mapping the sgl.
Christof> Meaning the integrity data sg list would have more entries
Christof> than max_segments? I have not seen this during my experiments,
Christof> but then i likely have not hit every case of a possible
Christof> request layout.
It happens all the time.
--
Martin K. Petersen Oracle Linux Engineering
On Tue, Jul 20, 2010 at 12:45:49AM -0400, Martin K. Petersen wrote:
> >>>>> "Christof" == Christof Schmitt <[email protected]> writes:
>
> Christof,
>
> Christof> The motivation stems from research how the integrity data can
> Christof> be mapped to the hardware interface used by the zfcp
> Christof> driver. When passing data segments to the zfcp hardware
> Christof> controller, there is the constraint that each data segment has
> Christof> a maximum size of 4k and a segment must not cross a 4k
> Christof> boundary.
>
> Ok.
>
>
> Christof> Right now, this is done by reporting the maximum segment size
> Christof> and segment boundary accordingly from zfcp. When issuing a
> Christof> request, zfcp simply walks the sg list and passes the segments
> Christof> to the hardware controller, no mapping or readjustment is
> Christof> necessary in the driver.
>
> In that case I don't really have a problem with adhering to the queue
> segment size and segment boundary for the integrity metadata. As long
> as we avoid using the max number of data segments to cap the integrity
> scatterlist because that'll definitely break a lot of stuff.
>
> Does the zfcp hardware have scatterlist length constraints?
Yes. There is a hardware limit for the maximum number of scatterlist
entries the driver can send to the hardware at the same time. For
adding integrity data, we have to come up with a way to map both,
integrity data and user data in the same hardware request.
This is the experimental zfcp integrity data patch i sent for upstream
inclusion: http://marc.info/?l=linux-scsi&m=127928781200392&w=2 The
approach is that the driver has to adhere to the hardware constraint
of sum of all data segments (user plus integrity data). To have a
simple approach that covers the case with one integrity data segment
per user data segment, we only report half the size for the
scatterlist length when running DIX. This guarantees that the other
half can be used for integrity data.
> >> Your change also has repercussions when merging requests and bios.
> >> We'd need to honor the DI segmentation constraints when merging.
> >> Otherwise we may end up going beyond the controller limits when
> >> mapping the sgl.
>
> Christof> Meaning the integrity data sg list would have more entries
> Christof> than max_segments? I have not seen this during my experiments,
> Christof> but then i likely have not hit every case of a possible
> Christof> request layout.
>
> It happens all the time.
Ok, i have to look into that as well. It would be an issue with the
approach we are looking at now: If there are max_segments data
segments, and more than max_segments integrity data segments, we will
overrun the hardware constraint.
Christof
>>>>> "Christof" == Christof Schmitt <[email protected]> writes:
Christof> To have a simple approach that covers the case with one
Christof> integrity data segment per user data segment, we only report
Christof> half the size for the scatterlist length when running
Christof> DIX. This guarantees that the other half can be used for
Christof> integrity data.
Yup, a few of our partners did something similar.
My concern is the scenario where we submit lots of 512-byte writes that
get merged into (in your case) 4 KB segments. Each of those 512-byte
writes could come with an 8-byte integrity metadata tuple. And so you'd
need 8 DI scatterlist elements per data element.
Christof> Meaning the integrity data sg list would have more entries
Christof> than max_segments? I have not seen this during my experiments,
Christof> but then i likely have not hit every case of a possible
Christof> request layout.
dd to the block device is usually a good way to issue long scatterlists.
Christof> Ok, i have to look into that as well. It would be an issue
Christof> with the approach we are looking at now: If there are
Christof> max_segments data segments, and more than max_segments
Christof> integrity data segments, we will overrun the hardware
Christof> constraint.
Ok.
--
Martin K. Petersen Oracle Linux Engineering
On Wed, Jul 21, 2010 at 12:20:01AM -0400, Martin K. Petersen wrote:
> >>>>> "Christof" == Christof Schmitt <[email protected]> writes:
>
> Christof> To have a simple approach that covers the case with one
> Christof> integrity data segment per user data segment, we only report
> Christof> half the size for the scatterlist length when running
> Christof> DIX. This guarantees that the other half can be used for
> Christof> integrity data.
>
> Yup, a few of our partners did something similar.
>
> My concern is the scenario where we submit lots of 512-byte writes that
> get merged into (in your case) 4 KB segments. Each of those 512-byte
> writes could come with an 8-byte integrity metadata tuple. And so you'd
> need 8 DI scatterlist elements per data element.
>
>
> Christof> Meaning the integrity data sg list would have more entries
> Christof> than max_segments? I have not seen this during my experiments,
> Christof> but then i likely have not hit every case of a possible
> Christof> request layout.
>
> dd to the block device is usually a good way to issue long scatterlists.
>
>
> Christof> Ok, i have to look into that as well. It would be an issue
> Christof> with the approach we are looking at now: If there are
> Christof> max_segments data segments, and more than max_segments
> Christof> integrity data segments, we will overrun the hardware
> Christof> constraint.
>
> Ok.
After looking at the given facts, this seems to be the missing part:
The zfcp hardware interface has an maximum number of data segments
that can be part of one request. In the experimental zfcp DIF/DIX
patch (now in scsi-misc), zfcp reserves half of the data segments
for integrity data. But if small bios have been merged until hitting
queue_max_segments, there may be more integrity data segments.
To summarize the limits i see in the zfcp hardware:
- Maximum size of 4k per segment
- The segments must not cross page boundaries
- The number of segments per request is limited
My preferred approach would be to set the limits on the queue, so that
the request adheres to the hardware limitations and can be passed on
directly to the hardware. I would like to avoid splitting large
segments in the driver code, and i especially want to avoid having to
copy the integrity data to new buffers to adhere to the hardware
constraints.
Looking at the block layer, the number of integrity data segments
could be limited with an additional check in ll_new_hw_segment.
What would be the preferred approach for handling the integrity data
limits in the block layer? Introduce new queue limits for integrity
data, or assume that the limits for integrity data are the same as for
user data? I can update my patch accordingly and include a check for
the maximum number of segments.
Christof
>>>>> "Christof" == Christof Schmitt <[email protected]> writes:
Christof> To summarize the limits i see in the zfcp hardware:
Christof> - Maximum size of 4k per segment
Christof> - The segments must not cross page boundaries
Christof> - The number of segments per request is limited
The interesting thing here is that your hw has a limit for the total
number of segments whereas other DIX HBAs have separate limits for data
and integrity scatterlists.
Christof> What would be the preferred approach for handling the
Christof> integrity data limits in the block layer? Introduce new queue
Christof> limits for integrity data, or assume that the limits for
Christof> integrity data are the same as for user data? I can update my
Christof> patch accordingly and include a check for the maximum number
Christof> of segments.
I've been messing with this tonight. It's not entirely trivial because
of the housekeeping involved, having to accomodate different types of
hardware, having to avoid breaking existing setups, and having to work
with integrity compiled and without.
My first attempt at this got quite messy. I think I have found a way
but it's bedtime here. Give me a day or two to get back to you with
something that'll hopefully work for everyone.
--
Martin K. Petersen Oracle Linux Engineering
On Tue, Aug 03, 2010 at 12:44:50AM -0400, Martin K. Petersen wrote:
> >>>>> "Christof" == Christof Schmitt <[email protected]> writes:
>
> Christof> To summarize the limits i see in the zfcp hardware:
> Christof> - Maximum size of 4k per segment
> Christof> - The segments must not cross page boundaries
> Christof> - The number of segments per request is limited
>
> The interesting thing here is that your hw has a limit for the total
> number of segments whereas other DIX HBAs have separate limits for data
> and integrity scatterlists.
Yes, the hw interface only has a limit for the total number. The best
solution would be an interface that allows reporting this limit to the
block layer. If this is not possible, or deemed too exotic, reporting
seperate limits for integrity segments and data segments would also be
fine with me.
> Christof> What would be the preferred approach for handling the
> Christof> integrity data limits in the block layer? Introduce new queue
> Christof> limits for integrity data, or assume that the limits for
> Christof> integrity data are the same as for user data? I can update my
> Christof> patch accordingly and include a check for the maximum number
> Christof> of segments.
>
> I've been messing with this tonight. It's not entirely trivial because
> of the housekeeping involved, having to accomodate different types of
> hardware, having to avoid breaking existing setups, and having to work
> with integrity compiled and without.
>
> My first attempt at this got quite messy. I think I have found a way
> but it's bedtime here. Give me a day or two to get back to you with
> something that'll hopefully work for everyone.
Ok, when you have something, i can have a look at it and see if it
matches the requirements here.
Thanks,
Christof