2007-10-29 06:32:18

by Vasily Averin

Subject: dm: bounce_pfn limit added

Device mapper uses its own bounce_pfn, which may differ from the one on the
underlying device. As a result, dm can build incorrect requests containing sg
elements that exceed what the underlying device is able to handle.

This is the cause of slab corruption in the i2o layer, observed on the i386
arch when very long direct IO requests are issued to a dm-over-i2o device.

Signed-off-by: Vasily Averin <[email protected]>

--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -102,6 +102,8 @@ static void combine_restrictions_low(struct io_restrictions
lhs->seg_boundary_mask =
min_not_zero(lhs->seg_boundary_mask, rhs->seg_boundary_mask);

+ lhs->bounce_pfn = min_not_zero(lhs->bounce_pfn, rhs->bounce_pfn);
+
lhs->no_cluster |= rhs->no_cluster;
}

@@ -566,6 +568,8 @@ void dm_set_device_limits(struct dm_target *ti, struct
min_not_zero(rs->seg_boundary_mask,
q->seg_boundary_mask);

+ rs->bounce_pfn = min_not_zero(rs->bounce_pfn, q->bounce_pfn);
+
rs->no_cluster |= !test_bit(QUEUE_FLAG_CLUSTER, &q->queue_flags);
}
EXPORT_SYMBOL_GPL(dm_set_device_limits);
@@ -707,6 +711,8 @@ static void check_for_valid_limits(struct io_restrictions
rs->max_segment_size = MAX_SEGMENT_SIZE;
if (!rs->seg_boundary_mask)
rs->seg_boundary_mask = -1;
+ if (!rs->bounce_pfn)
+ rs->bounce_pfn = -1;
}

int dm_table_add_target(struct dm_table *t, const char *type,
@@ -891,6 +897,7 @@ void dm_table_set_restrictions(struct dm_table *t, struct
q->hardsect_size = t->limits.hardsect_size;
q->max_segment_size = t->limits.max_segment_size;
q->seg_boundary_mask = t->limits.seg_boundary_mask;
+ q->bounce_pfn = t->limits.bounce_pfn;
if (t->limits.no_cluster)
q->queue_flags &= ~(1 << QUEUE_FLAG_CLUSTER);
else
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -111,6 +111,7 @@ struct target_type {

struct io_restrictions {
unsigned long seg_boundary_mask;
+ unsigned long bounce_pfn;
unsigned int max_sectors;
unsigned int max_segment_size;
unsigned short max_phys_segments;


2007-10-30 20:12:59

by Andrew Morton

Subject: Re: dm: bounce_pfn limit added

On Mon, 29 Oct 2007 09:31:39 +0300
Vasily Averin <[email protected]> wrote:

> Device mapper uses its own bounce_pfn, which may differ from the one on the
> underlying device. As a result, dm can build incorrect requests containing sg
> elements that exceed what the underlying device is able to handle.
>
> This is the cause of slab corruption in the i2o layer, observed on the i386
> arch when very long direct IO requests are issued to a dm-over-i2o device.
>
> Signed-off-by: Vasily Averin <[email protected]>
>
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -102,6 +102,8 @@ static void combine_restrictions_low(struct io_restrictions
> lhs->seg_boundary_mask =
> min_not_zero(lhs->seg_boundary_mask, rhs->seg_boundary_mask);
>
> + lhs->bounce_pfn = min_not_zero(lhs->bounce_pfn, rhs->bounce_pfn);
> +
> lhs->no_cluster |= rhs->no_cluster;
> }
>
> @@ -566,6 +568,8 @@ void dm_set_device_limits(struct dm_target *ti, struct
> min_not_zero(rs->seg_boundary_mask,
> q->seg_boundary_mask);
>
> + rs->bounce_pfn = min_not_zero(rs->bounce_pfn, q->bounce_pfn);
> +
> rs->no_cluster |= !test_bit(QUEUE_FLAG_CLUSTER, &q->queue_flags);
> }
> EXPORT_SYMBOL_GPL(dm_set_device_limits);
> @@ -707,6 +711,8 @@ static void check_for_valid_limits(struct io_restrictions
> rs->max_segment_size = MAX_SEGMENT_SIZE;
> if (!rs->seg_boundary_mask)
> rs->seg_boundary_mask = -1;
> + if (!rs->bounce_pfn)
> + rs->bounce_pfn = -1;
> }
>
> int dm_table_add_target(struct dm_table *t, const char *type,
> @@ -891,6 +897,7 @@ void dm_table_set_restrictions(struct dm_table *t, struct
> q->hardsect_size = t->limits.hardsect_size;
> q->max_segment_size = t->limits.max_segment_size;
> q->seg_boundary_mask = t->limits.seg_boundary_mask;
> + q->bounce_pfn = t->limits.bounce_pfn;
> if (t->limits.no_cluster)
> q->queue_flags &= ~(1 << QUEUE_FLAG_CLUSTER);
> else
> --- a/include/linux/device-mapper.h
> +++ b/include/linux/device-mapper.h
> @@ -111,6 +111,7 @@ struct target_type {
>
> struct io_restrictions {
> unsigned long seg_boundary_mask;
> + unsigned long bounce_pfn;
> unsigned int max_sectors;
> unsigned int max_segment_size;
> unsigned short max_phys_segments;

Well that's a rather grave sounding bug. Two days and nobody from DM land
has commented? Hello?

I'll tag this as needed in 2.6.23.x as well.

I'll duck the "dm: struct io_restriction reordered" patch. People have
been changing things around in there and I had to fix a reject in "dm:
bounce_pfn limit added" to make it apply - let's not complicate life.

However it is a good change and hopefully the DM people will pick it up.



2007-10-30 23:28:34

by Alasdair G Kergon

Subject: Re: [dm-devel] Re: dm: bounce_pfn limit added

On Tue, Oct 30, 2007 at 01:11:38PM -0700, Andrew Morton wrote:
> On Mon, 29 Oct 2007 09:31:39 +0300
> Vasily Averin <[email protected]> wrote:
> > Device mapper uses its own bounce_pfn, which may differ from the one on the
> > underlying device. As a result, dm can build incorrect requests containing
> > sg elements that exceed what the underlying device is able to handle.
> > This is the cause of slab corruption in the i2o layer, observed on the i386
> > arch when very long direct IO requests are issued to a dm-over-i2o device.

Interesting. When this patch was originally proposed a couple of years ago, we
did this instead:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=daef265f1590cf3e6de989d074041a280c82d58b;hp=00d6da9b4d6707b808481372537adb0fb38f99b3

) DM doesn't need to bounce bio's on its own, but the block layer defaults
) to that in blk_queue_make_request(). The lower level drivers should
) bounce ios themselves, that is what they need to do if not layered below
) dm anyways.

Alasdair
--
[email protected]

2007-10-31 02:02:54

by Alasdair G Kergon

Subject: Re: [dm-devel] Re: dm: bounce_pfn limit added

On Tue, Oct 30, 2007 at 11:26:17PM +0000, Alasdair G Kergon wrote:
> ) DM doesn't need to bounce bio's on its own, but the block layer defaults
> ) to that in blk_queue_make_request(). The lower level drivers should
> ) bounce ios themselves, that is what they need to do if not layered below
> ) dm anyways.

So currently we treat bounce_pfn as a property that does not need to be
propagated through the stack.

But is that the right approach?
- Is there a blk_queue_bounce() missing either from dm or elsewhere?
(And BTW can the bio_alloc() that lurks within lead to deadlock?)

Firstly, what's going wrong?
- What is the dm table you are using? (output of 'dmsetup table')
- Which dm targets and with how many underlying devices?
- Which underlying driver?
- Is this direct I/O to the block device from userspace, or via some
filesystem or what?

Presumably you've shown the problem goes away when the dm device is
removed from the stack.
What if you swap in alternative dm targets, e.g. if it's linear,
try multipath (round-robin, one path)?

Alasdair
--
[email protected]

2007-10-31 02:12:29

by Alasdair G Kergon

Subject: Re: [dm-devel] Re: dm: bounce_pfn limit added

On Wed, Oct 31, 2007 at 02:01:33AM +0000, Alasdair G Kergon wrote:
> What if you swap in alternative dm targets, e.g. if it's linear,
> try multipath (round-robin, one path)?

And try using md instead of dm - does that also show the problem?
(md takes a similar stance to dm on this I believe.)

Alasdair
--
[email protected]

2007-10-31 07:13:56

by Vasily Averin

Subject: Re: [dm-devel] Re: dm: bounce_pfn limit added

Alasdair G Kergon wrote:
> So currently we treat bounce_pfn as a property that does not need to be
> propagated through the stack.
>
> But is that the right approach?
> - Is there a blk_queue_bounce() missing either from dm or elsewhere?
> (And BTW can the bio_alloc() that lurks within lead to deadlock?)
>
> Firstly, what's going wrong?
> - What is the dm table you are using? (output of 'dmsetup table')
> - Which dm targets and with how many underlying devices?
> - Which underlying driver?
> - Is this direct I/O to the block device from userspace, or via some
> filesystem or what?

On my testnode I have 6 Gb memory (1Gb normal zone for i386 kernels),
i2o hardware and lvm over i2o.

[root@ts10 ~]# dmsetup table
vzvg-vz: 0 10289152 linear 80:5 384
vzvg-vzt: 0 263127040 linear 80:5 10289536
[root@ts10 ~]# cat /proc/partitions
major minor #blocks name

80 0 143374336 i2o/hda
80 1 514048 i2o/hda1
80 2 4096575 i2o/hda2
80 3 2040255 i2o/hda3
80 4 1 i2o/hda4
80 5 136721151 i2o/hda5
253 0 5144576 dm-0
253 1 131563520 dm-1

The diotest program from the LTP test suite, run with a ~1MB buffer size on
files on dm-over-i2o partitions, corrupts the i2o_iop0_msg_inpool slab.

The i2o hardware on this node is able to handle requests with at most 38
segments. Device mapper correctly creates such requests and, as you know, it
uses bounce_pfn=BLK_BOUNCE_ANY. When the request is passed down to the
underlying device, dm clones the bio and clears the BIO_SEG_VALID flag.

The underlying device therefore calls blk_recalc_rq_segments() to recount the
number of segments. However, blk_recalc_rq_segments() uses the
bounce_pfn=BLK_BOUNCE_HIGH taken from the underlying device, so the recounted
number of segments can exceed the max_hw_segments limit.

Unfortunately there are no checks for this, and when the i2o driver handles
such a malformed request it writes memory outside the i2o_iop0_msg_inpool slab.

Thank you,
Vasily Averin

2007-10-31 07:36:20

by Hannes Reinecke

Subject: Re: [dm-devel] Re: dm: bounce_pfn limit added

Vasily Averin wrote:
> Alasdair G Kergon wrote:
>> So currently we treat bounce_pfn as a property that does not need to be
>> propagated through the stack.
>>
>> But is that the right approach?
>> - Is there a blk_queue_bounce() missing either from dm or elsewhere?
>> (And BTW can the bio_alloc() that lurks within lead to deadlock?)
>>
>> Firstly, what's going wrong?
>> - What is the dm table you are using? (output of 'dmsetup table')
>> - Which dm targets and with how many underlying devices?
>> - Which underlying driver?
>> - Is this direct I/O to the block device from userspace, or via some
>> filesystem or what?
>
> On my testnode I have 6 Gb memory (1Gb normal zone for i386 kernels),
> i2o hardware and lvm over i2o.
>
> [root@ts10 ~]# dmsetup table
> vzvg-vz: 0 10289152 linear 80:5 384
> vzvg-vzt: 0 263127040 linear 80:5 10289536
> [root@ts10 ~]# cat /proc/partitions
> major minor #blocks name
>
> 80 0 143374336 i2o/hda
> 80 1 514048 i2o/hda1
> 80 2 4096575 i2o/hda2
> 80 3 2040255 i2o/hda3
> 80 4 1 i2o/hda4
> 80 5 136721151 i2o/hda5
> 253 0 5144576 dm-0
> 253 1 131563520 dm-1
>
> The diotest program from the LTP test suite, run with a ~1MB buffer size on
> files on dm-over-i2o partitions, corrupts the i2o_iop0_msg_inpool slab.
>
> The i2o hardware on this node is able to handle requests with at most 38
> segments. Device mapper correctly creates such requests and, as you know, it
> uses bounce_pfn=BLK_BOUNCE_ANY. When the request is passed down to the
> underlying device, dm clones the bio and clears the BIO_SEG_VALID flag.
>
> The underlying device therefore calls blk_recalc_rq_segments() to recount
> the number of segments. However, blk_recalc_rq_segments() uses the
> bounce_pfn=BLK_BOUNCE_HIGH taken from the underlying device, so the
> recounted number of segments can exceed the max_hw_segments limit.
>
> Unfortunately there are no checks for this, and when the i2o driver handles
> such a malformed request it writes memory outside the i2o_iop0_msg_inpool
> slab.
>
We actually had a similar issue with some raid drivers (gdth iirc), and Neil Brown
did a similar patch for it. These were his comments on it:
>
> dm handles max_hw_segments by using an 'io_restrictions' structure
> that keeps the most restrictive values from all component devices.
>
> So it should not allow more than max_hw_segments.
>
> However I just noticed that it does not preserve bounce_pfn as a restriction.
> So when the request gets down to the driver, it may be split up in to more
> segments than was expected up at the dm level.
>
So I guess we should take this.

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

2007-10-31 21:01:42

by Kiyoshi Ueda

Subject: Re: [dm-devel] Re: dm: bounce_pfn limit added

Hi,

On Wed, 31 Oct 2007 08:36:01 +0100, Hannes Reinecke <[email protected]> wrote:
> Vasily Averin wrote:
> > Alasdair G Kergon wrote:
> >> So currently we treat bounce_pfn as a property that does not need to be
> >> propagated through the stack.
> >>
> >> But is that the right approach?
> >> - Is there a blk_queue_bounce() missing either from dm or elsewhere?
> >> (And BTW can the bio_alloc() that lurks within lead to deadlock?)
> >>
> >> Firstly, what's going wrong?
> >> - What is the dm table you are using? (output of 'dmsetup table')
> >> - Which dm targets and with how many underlying devices?
> >> - Which underlying driver?
> >> - Is this direct I/O to the block device from userspace, or via some
> >> filesystem or what?
> >
> > On my testnode I have 6 Gb memory (1Gb normal zone for i386 kernels),
> > i2o hardware and lvm over i2o.
> >
> > [root@ts10 ~]# dmsetup table
> > vzvg-vz: 0 10289152 linear 80:5 384
> > vzvg-vzt: 0 263127040 linear 80:5 10289536
> > [root@ts10 ~]# cat /proc/partitions
> > major minor #blocks name
> >
> > 80 0 143374336 i2o/hda
> > 80 1 514048 i2o/hda1
> > 80 2 4096575 i2o/hda2
> > 80 3 2040255 i2o/hda3
> > 80 4 1 i2o/hda4
> > 80 5 136721151 i2o/hda5
> > 253 0 5144576 dm-0
> > 253 1 131563520 dm-1
> >
> > The diotest program from the LTP test suite, run with a ~1MB buffer size
> > on files on dm-over-i2o partitions, corrupts the i2o_iop0_msg_inpool slab.
> >
> > The i2o hardware on this node is able to handle requests with at most 38
> > segments. Device mapper correctly creates such requests and, as you know,
> > it uses bounce_pfn=BLK_BOUNCE_ANY. When the request is passed down to the
> > underlying device, dm clones the bio and clears the BIO_SEG_VALID flag.
> >
> > The underlying device therefore calls blk_recalc_rq_segments() to recount
> > the number of segments. However, blk_recalc_rq_segments() uses the
> > bounce_pfn=BLK_BOUNCE_HIGH taken from the underlying device, so the
> > recounted number of segments can exceed the max_hw_segments limit.
> >
> > Unfortunately there are no checks for this, and when the i2o driver
> > handles such a malformed request it writes memory outside the
> > i2o_iop0_msg_inpool slab.
> >
> We actually had a similar issue with some raid drivers (gdth iirc), and Neil Brown
> did a similar patch for it. These were his comments on it:
> >
> > dm handles max_hw_segments by using an 'io_restrictions' structure
> > that keeps the most restrictive values from all component devices.
> >
> > So it should not allow more than max_hw_segments.
> >
> > However I just noticed that it does not preserve bounce_pfn as a restriction.
> > So when the request gets down to the driver, it may be split up in to more
> > segments than was expected up at the dm level.
> >
> So I guess we should take this.

What about the case where another dm device is stacked on the dm device?
(e.g. dm-linear over dm-multipath over i2o with bounce_pfn=64GB, and
the multipath table is then changed to an i2o device with bounce_pfn=1GB.)

With this example, the patch propagates the restriction of i2o
to dm-multipath but not to dm-linear, so I guess the same problem happens.
Although it may sound like a corner case, such a situation could occur
with LVM2's pvmove, for example.
I think we should take care of it too, so that the system won't be corrupted.
Refusing to load such a table would at least prevent the problem.

Thanks,
Kiyoshi Ueda

2007-11-01 00:01:04

by Alasdair G Kergon

Subject: Re: [dm-devel] Re: dm: bounce_pfn limit added

On Wed, Oct 31, 2007 at 05:00:16PM -0500, Kiyoshi Ueda wrote:
> What about the case where another dm device is stacked on the dm device?
> (e.g. dm-linear over dm-multipath over i2o with bounce_pfn=64GB, and
> the multipath table is then changed to an i2o device with bounce_pfn=1GB.)

Let's not broaden the problem out in that direction yet - that's a
known flaw in the way all these device restrictions are handled.
(Which would, it happens, also be resolved by the dm architectural
changes I'm contemplating.)

Yes, we could certainly take this patch - it won't do much harm (just
hit performance in some configurations). But I am not yet convinced
that there isn't some further underlying problem with the way the
responsibility for this bouncing is divided up between the various
layers: I still don't feel I completely understand this problem yet.

- How does the bio_alloc() in blk_queue_bounce() guarantee never to
lead to a deadlock (in the device-mapper context)?
- Are some functions failing to take account of the hw_segments
(and perhaps other) restrictions?
- Are things actually simpler if the bouncing is dealt with just once
prior to entering the device stack (even though that may involve
bouncing some data that does not need it) or is it better to endeavour
to keep the bouncing as close to the final layer as possible?

Alasdair
--
[email protected]