LinuxLists.cc - [PATCH] BUG: nr_phys_segments cannot be less than nr_hw

2008-10-02 14:30:49

Subject: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments

This is a follow-up to my earlier mail http://lkml.org/lkml/2008/9/23/294
([PATCH] BUG: ll_merge_requests_fn() updates req->nr_phys_segments wrongly)

It is possible for the merging code to create lesser no of phys segments than
hw segments, but every hw segment needs atleast one new phys segment. This
triggers the BUG() on scsi_init_sgtable() as blk_rq_map_sg() returns more no
of segments than rq->nr_phys_segments

The following blktrace shows a sequence of bio's to trigger such condition on
my machine with max_sectors_kb=512 & max_hw_sectors_kb=32767.

8,0 0 12 1.943680621 2261 Q R 6184075 + 55 [bash]
8,0 0 13 1.943684671 2261 G R 6184075 + 55 [bash]
8,0 0 14 1.943690189 2261 P N [bash]
8,0 0 15 1.943692075 2261 I R 6184075 + 55 [bash]
8,0 0 16 1.943712119 2261 D R 6184075 + 55 [bash]
8,0 0 17 1.943718684 2261 Q R 6184130 + 55 [bash]
8,0 0 18 1.943721897 2261 G R 6184130 + 55 [bash]
8,0 0 19 1.943726576 2261 P N [bash]
8,0 0 20 1.943727763 2261 I R 6184130 + 55 [bash]
8,0 0 21 1.943731675 2261 Q R 6184241 + 56 [bash]
8,0 0 22 1.943735167 2261 G R 6184241 + 56 [bash]
8,0 0 23 1.943739497 2261 I R 6184241 + 56 [bash]
8,0 0 24 1.943742081 2261 Q R 6184185 + 56 [bash]
8,0 0 25 1.943744875 2261 M R 6184185 + 56 [bash]
8,0 0 26 1.943753535 2261 Q R 6184352 + 55 [bash]
8,0 0 27 1.943756538 2261 G R 6184352 + 55 [bash]
8,0 0 28 1.943760868 2261 I R 6184352 + 55 [bash]
8,0 0 29 1.943763522 2261 Q R 6184407 + 55 [bash]
8,0 0 30 1.943766316 2261 M R 6184407 + 55 [bash]
8,0 0 31 1.943770506 2261 Q R 6184297 + 55 [bash]
8,0 0 32 1.943772951 2261 F R 6184297 + 55 [bash]
8,0 0 33 1.943780843 2261 Q R 6184462 + 55 [bash]
8,0 0 34 1.943783776 2261 M R 6184462 + 55 [bash]
8,0 0 35 1.944444684 0 UT N [swapper] 2
8,0 0 36 1.944453624 10 U N [kblockd/0] 2
8,0 0 37 1.944470595 10 D R 6184130 + 387 [kblockd/0]
8,0 0 38 1.970340853 0 C R 6184075 + 55 [0]
8,0 0 39 1.973576739 0 C R 6184130 + 387 [0]

Patches against Jens git to fix this.

Signed-off-by: Nikanth Karthikesan <[email protected]>
---
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5efc9e7..44cc1d7 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -395,9 +395,6 @@ static int ll_merge_requests_fn(struct request_queue *q,
struct request *req,
if (blk_phys_contig_segment(q, req->biotail, next->bio))
total_phys_segments--;

- if (total_phys_segments > q->max_phys_segments)
- return 0;
-
total_hw_segments = req->nr_hw_segments + next->nr_hw_segments;
if (blk_hw_contig_segment(q, req->biotail, next->bio)) {
int len = req->biotail->bi_hw_back_size +
@@ -415,6 +412,15 @@ static int ll_merge_requests_fn(struct request_queue *q,
struct request *req,
if (total_hw_segments > q->max_hw_segments)
return 0;

+ /*
+ * FIXME: if 2 phys segments is used for a single hw segment?
+ */
+ if (total_phys_segments < total_hw_segments)
+ total_phys_segments = total_hw_segments;
+
+ if (total_phys_segments > q->max_phys_segments)
+ return 0;
+
/* Merge is OK... */
req->nr_phys_segments = total_phys_segments;
req->nr_hw_segments = total_hw_segments;

But this may not be the complete fix? I am not sure whether the condition in
FIXME comment can be triggered. The following patch may be a better fix.

Signed-off-by: Nikanth Karthikesan <[email protected]>
---
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5efc9e7..b2c37f4 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -467,6 +467,8 @@ static int attempt_merge(struct request_queue *q, struct
request *req,

req->nr_sectors = req->hard_nr_sectors += next->hard_nr_sectors;

+ blk_recalc_rq_segments(req);
+
elv_merge_requests(q, req, next);

if (req->rq_disk) {

But still wouldn't it be incomplete fix as blk_recalc_rq_segments() can
produce nr_phys_segments > q->max_phys_segments? Do we need something like
this.

Signed-off-by: Nikanth Karthikesan <[email protected]>
---
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5efc9e7..e4431db 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -427,6 +427,8 @@ static int ll_merge_requests_fn(struct request_queue *q,
struct request *req,
static int attempt_merge(struct request_queue *q, struct request *req,
struct request *next)
{
+ struct bio *req_biotail, *req_biotail_bi_next;
+
if (!rq_mergeable(req) || !rq_mergeable(next))
return 0;

@@ -462,11 +464,21 @@ static int attempt_merge(struct request_queue *q, struct
request *req,
if (time_after(req->start_time, next->start_time))
req->start_time = next->start_time;

+ req_biotail = req->biotail;
+ req_biotail_bi_next = req->biotail->bi_next;
req->biotail->bi_next = next->bio;
req->biotail = next->biotail;

req->nr_sectors = req->hard_nr_sectors += next->hard_nr_sectors;

+ blk_recalc_rq_segments(req);
+ if (req->nr_phys_segments > q->max_phys_segments) {
+ req->biotail = req_biotail;
+ req->biotail->bi_next = req_biotail_bi_next;
+ blk_recalc_rq_segments(req);
+ return 0;
+ }
+
elv_merge_requests(q, req, next);

if (req->rq_disk) {

But blk_recalc_rq_segments() cannot increase nr_phys_segments by more than
one. So checking for one lesser limit earlier should also work? But we would
be mostly merging 1 lesser segment.

Signed-off-by: Nikanth Karthikesan <[email protected]>
---
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5efc9e7..d9b5289 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -256,7 +256,7 @@ static inline int ll_new_mergeable(struct request_queue
*q,
{
int nr_phys_segs = bio_phys_segments(q, bio);

- if (req->nr_phys_segments + nr_phys_segs > q->max_phys_segments) {
+ if (req->nr_phys_segments + nr_phys_segs > q->max_phys_segments - 1) {
req->cmd_flags |= REQ_NOMERGE;
if (req == q->last_merge)
q->last_merge = NULL;
@@ -395,7 +395,7 @@ static int ll_merge_requests_fn(struct request_queue *q,
struct request *req,
if (blk_phys_contig_segment(q, req->biotail, next->bio))
total_phys_segments--;

- if (total_phys_segments > q->max_phys_segments)
+ if (total_phys_segments > q->max_phys_segments - 1)
return 0;

total_hw_segments = req->nr_hw_segments + next->nr_hw_segments;
@@ -467,6 +467,8 @@ static int attempt_merge(struct request_queue *q, struct
request *req,

req->nr_sectors = req->hard_nr_sectors += next->hard_nr_sectors;

+ blk_recalc_rq_segments(req);
+
elv_merge_requests(q, req, next);

if (req->rq_disk) {

Thanks
Nikanth Karthikesan

2008-10-02 15:04:36

by James Bottomley

[permalink] [raw]

Subject: Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments

On Thu, 2008-10-02 at 19:59 +0530, Nikanth Karthikesan wrote:
> This is a follow-up to my earlier mail http://lkml.org/lkml/2008/9/23/294
> ([PATCH] BUG: ll_merge_requests_fn() updates req->nr_phys_segments wrongly)
>
> It is possible for the merging code to create lesser no of phys segments than
> hw segments, but every hw segment needs atleast one new phys segment. This
> triggers the BUG() on scsi_init_sgtable() as blk_rq_map_sg() returns more no
> of segments than rq->nr_phys_segments
>
> The following blktrace shows a sequence of bio's to trigger such condition on
> my machine with max_sectors_kb=512 & max_hw_sectors_kb=32767.

Um, don't you mean this the other way around? I can see this problem
occurring if the block layer gets tricked into doing a physical merge
where sector limits forbid a virtual merge.

The bug would appear to be that we sometimes only look at q->max_sectors
when deciding on mergability. Either we have to insist on max_sectors
<= hw_max_sectors, or we have to start using min(q->max_sectors,
q->max_hw_sectors) for this.

James

2008-10-02 15:06:20

by Nikanth Karthikesan

[permalink] [raw]

Subject: Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments

On Thursday 02 October 2008 19:59:33 Nikanth Karthikesan wrote:
> This is a follow-up to my earlier mail http://lkml.org/lkml/2008/9/23/294
> ([PATCH] BUG: ll_merge_requests_fn() updates req->nr_phys_segments wrongly)
>
> It is possible for the merging code to create lesser no of phys segments
> than hw segments, but every hw segment needs atleast one new phys segment.
> This triggers the BUG() on scsi_init_sgtable() as blk_rq_map_sg() returns
> more no of segments than rq->nr_phys_segments
>
> The following blktrace shows a sequence of bio's to trigger such condition
> on my machine with max_sectors_kb=512 & max_hw_sectors_kb=32767.
>
> 8,0 0 12 1.943680621 2261 Q R 6184075 + 55 [bash]
> 8,0 0 13 1.943684671 2261 G R 6184075 + 55 [bash]
> 8,0 0 14 1.943690189 2261 P N [bash]
> 8,0 0 15 1.943692075 2261 I R 6184075 + 55 [bash]
> 8,0 0 16 1.943712119 2261 D R 6184075 + 55 [bash]
> 8,0 0 17 1.943718684 2261 Q R 6184130 + 55 [bash]
> 8,0 0 18 1.943721897 2261 G R 6184130 + 55 [bash]
> 8,0 0 19 1.943726576 2261 P N [bash]
> 8,0 0 20 1.943727763 2261 I R 6184130 + 55 [bash]
> 8,0 0 21 1.943731675 2261 Q R 6184241 + 56 [bash]
> 8,0 0 22 1.943735167 2261 G R 6184241 + 56 [bash]
> 8,0 0 23 1.943739497 2261 I R 6184241 + 56 [bash]
> 8,0 0 24 1.943742081 2261 Q R 6184185 + 56 [bash]
> 8,0 0 25 1.943744875 2261 M R 6184185 + 56 [bash]
> 8,0 0 26 1.943753535 2261 Q R 6184352 + 55 [bash]
> 8,0 0 27 1.943756538 2261 G R 6184352 + 55 [bash]
> 8,0 0 28 1.943760868 2261 I R 6184352 + 55 [bash]
> 8,0 0 29 1.943763522 2261 Q R 6184407 + 55 [bash]
> 8,0 0 30 1.943766316 2261 M R 6184407 + 55 [bash]
> 8,0 0 31 1.943770506 2261 Q R 6184297 + 55 [bash]
> 8,0 0 32 1.943772951 2261 F R 6184297 + 55 [bash]
> 8,0 0 33 1.943780843 2261 Q R 6184462 + 55 [bash]
> 8,0 0 34 1.943783776 2261 M R 6184462 + 55 [bash]
> 8,0 0 35 1.944444684 0 UT N [swapper] 2
> 8,0 0 36 1.944453624 10 U N [kblockd/0] 2
> 8,0 0 37 1.944470595 10 D R 6184130 + 387 [kblockd/0]
> 8,0 0 38 1.970340853 0 C R 6184075 + 55 [0]
> 8,0 0 39 1.973576739 0 C R 6184130 + 387 [0]
>
>
Oops, that was incomplete information to reproduce the issue.

typedef struct {
int IoSize;
char *Mem;
} DB_MEM_LIST, *pDB_MEM_LIST;

DB_MEM_LIST dbMemList[NUM_BIO] = {
{ 0x6e00, NULL },
{ 0x6e00, NULL },
{ 0x7000, NULL },
{ 0x7000, NULL },
{ 0x6e00, NULL },
{ 0x6e00, NULL },
{ 0x6e00, NULL },
{ 0x6e00, NULL },
};

// Allocate Memory.
char *Mem0 = kmalloc(dbMemList[0].IoSize, GFP_KERNEL);
char *Mem1 = kmalloc(dbMemList[1].IoSize, GFP_KERNEL);
char *Mem2 = kmalloc((dbMemList[2].IoSize + dbMemList[3].IoSize +
dbMemList[4].IoSize), GFP_KERNEL);
char *Mem5 = kmalloc(dbMemList[5].IoSize, GFP_KERNEL);
char *Mem6 = kmalloc(dbMemList[6].IoSize, GFP_KERNEL);
char *Mem7 = kmalloc(dbMemList[7].IoSize, GFP_KERNEL);

dbMemList[0].Mem = Mem0;
dbMemList[1].Mem = Mem1;
dbMemList[2].Mem = Mem2;
dbMemList[3].Mem = dbMemList[2].Mem + dbMemList[2].IoSize;
dbMemList[4].Mem = dbMemList[3].Mem + dbMemList[3].IoSize;
dbMemList[5].Mem = Mem5;
dbMemList[6].Mem = Mem6;
dbMemList[7].Mem = Mem7;

int Loop;
struct bio *BioList[NUM_BIO] = { NULL };
struct bio *Bio = NULL;
long Sector;

/*
** Allocate BIOs.
*/
for(Loop = 0; Loop < NUM_BIO; Loop++)
BioList[Loop] = bio_alloc(GFP_KERNEL, NUMBER_OF_VECS);

for(Loop = 0; Loop < NUM_BIO; Loop++) {
Bio = BioList[Loop];
Bio->bi_sector = Sector;
Bio->bi_bdev = Bdev;
Bio->bi_end_io = db_end_io;
if (Loop == 0) {
if (!bio_add_page(BioList[Loop],
virt_to_page(dbMemList[Loop].Mem),
0x3000,
offset_in_page(dbMemList[Loop].Mem))) {
printk("NIK: bio add page failed- %dA\n", Loop);
}
if (!bio_add_page(BioList[Loop],
virt_to_page(dbMemList[7].Mem),
0x3e00,
offset_in_page(dbMemList[7].Mem))) {
printk("NIK: bio add page failed- %dB\n", Loop);
}
} else if (Loop == 7) {
if (!bio_add_page(BioList[Loop],
virt_to_page(dbMemList[Loop].Mem),
0x2000,
offset_in_page(dbMemList[Loop].Mem))) {
printk("NIK: bio add page failed- %dA\n", Loop);
}
if (!bio_add_page(BioList[Loop],
virt_to_page(dbMemList[7].Mem),
0x4e00,
offset_in_page(dbMemList[7].Mem))) {
printk("NIK: bio add page failed- %dB\n", Loop);
}
} else if (!bio_add_page(BioList[Loop],
virt_to_page(dbMemList[Loop].Mem),
dbMemList[Loop].IoSize,
offset_in_page(dbMemList[Loop].Mem))) {
printk("NIK: bio add page failed %d\n", Loop);
}
Sector += (Bio->bi_size / 512);
}

submit_bio(READ, BioList[0]); // 1
submit_bio(READ, BioList[1]); // 2
submit_bio(READ, BioList[3]); // 4
submit_bio(READ, BioList[2]); // 3
submit_bio(READ, BioList[5]); // 6
submit_bio(READ, BioList[6]); // 7
submit_bio(READ, BioList[4]); // 5
submit_bio(READ, BioList[7]); // 8

Thanks
Nikanth Karthikesan

2008-10-02 17:00:48

by Jens Axboe

[permalink] [raw]

Subject: Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments

On Thu, Oct 02 2008, James Bottomley wrote:
> The bug would appear to be that we sometimes only look at q->max_sectors
> when deciding on mergability. Either we have to insist on max_sectors
> <= hw_max_sectors, or we have to start using min(q->max_sectors,
> q->max_hw_sectors) for this.

q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
be sending down requests that are too large for the device to handle. So
that condition would be a big bug. The sysfs interface checks for this,
and blk_queue_max_sectors() makes sure that is true as well.

The fixes proposed still look weird. There is no phys vs hw segment
constraints, the request must adhere to the limits set by both. It's
mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
anyway.

--
Jens Axboe

2008-10-02 17:13:25

by James Bottomley

[permalink] [raw]

Subject: Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments

On Thu, 2008-10-02 at 18:58 +0200, Jens Axboe wrote:
> On Thu, Oct 02 2008, James Bottomley wrote:
> > The bug would appear to be that we sometimes only look at q->max_sectors
> > when deciding on mergability. Either we have to insist on max_sectors
> > <= hw_max_sectors, or we have to start using min(q->max_sectors,
> > q->max_hw_sectors) for this.
>
> q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
> be sending down requests that are too large for the device to handle. So
> that condition would be a big bug. The sysfs interface checks for this,
> and blk_queue_max_sectors() makes sure that is true as well.

Yes, that seems always to be enforced. Perhaps there are other ways of
tripping this problem ... I'm still sure if it occurs it's because we do
a physical merge where a virtual merge is forbidden.

> The fixes proposed still look weird. There is no phys vs hw segment
> constraints, the request must adhere to the limits set by both. It's
> mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
> anyway.

Agree all three proposed fixes look wrong. However, if it's what I
think, just getting rid of hw accounting won't fix the problem because
we'll still have the case where a physical merge is forbidden by iommu
constraints ... this still needs to be accounted for.

What we really need is a demonstration of what actually is going
wrong ...

James

2008-10-02 17:14:27

by Jens Axboe

[permalink] [raw]

Subject: Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments

On Thu, Oct 02 2008, James Bottomley wrote:
> On Thu, 2008-10-02 at 18:58 +0200, Jens Axboe wrote:
> > On Thu, Oct 02 2008, James Bottomley wrote:
> > > The bug would appear to be that we sometimes only look at q->max_sectors
> > > when deciding on mergability. Either we have to insist on max_sectors
> > > <= hw_max_sectors, or we have to start using min(q->max_sectors,
> > > q->max_hw_sectors) for this.
> >
> > q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
> > be sending down requests that are too large for the device to handle. So
> > that condition would be a big bug. The sysfs interface checks for this,
> > and blk_queue_max_sectors() makes sure that is true as well.
>
> Yes, that seems always to be enforced. Perhaps there are other ways of
> tripping this problem ... I'm still sure if it occurs it's because we do
> a physical merge where a virtual merge is forbidden.
>
> > The fixes proposed still look weird. There is no phys vs hw segment
> > constraints, the request must adhere to the limits set by both. It's
> > mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
> > anyway.
>
> Agree all three proposed fixes look wrong. However, if it's what I
> think, just getting rid of hw accounting won't fix the problem because
> we'll still have the case where a physical merge is forbidden by iommu
> constraints ... this still needs to be accounted for.
>
> What we really need is a demonstration of what actually is going
> wrong ...

Yep, IIRC we both asked for that the last time as well... Nikanth?

--
Jens Axboe

2008-10-03 05:29:59

by Nikanth Karthikesan

[permalink] [raw]

Subject: Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments

On Thursday 02 October 2008 22:43:57 Jens Axboe wrote:
> On Thu, Oct 02 2008, James Bottomley wrote:
> > On Thu, 2008-10-02 at 18:58 +0200, Jens Axboe wrote:
> > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > The bug would appear to be that we sometimes only look at
> > > > q->max_sectors when deciding on mergability. Either we have to
> > > > insist on max_sectors <= hw_max_sectors, or we have to start using
> > > > min(q->max_sectors, q->max_hw_sectors) for this.
> > >
> > > q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
> > > be sending down requests that are too large for the device to handle.
> > > So that condition would be a big bug. The sysfs interface checks for
> > > this, and blk_queue_max_sectors() makes sure that is true as well.
> >
> > Yes, that seems always to be enforced. Perhaps there are other ways of
> > tripping this problem ... I'm still sure if it occurs it's because we do
> > a physical merge where a virtual merge is forbidden.
> >
> > > The fixes proposed still look weird. There is no phys vs hw segment
> > > constraints, the request must adhere to the limits set by both. It's
> > > mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
> > > anyway.
> >
> > Agree all three proposed fixes look wrong. However, if it's what I
> > think, just getting rid of hw accounting won't fix the problem because
> > we'll still have the case where a physical merge is forbidden by iommu
> > constraints ... this still needs to be accounted for.
> >

Yes. This patch fixes it.

Signed-off-by: Nikanth Karthikesan <[email protected]>
---
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5efc9e7..9c2fe49 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -392,7 +392,11 @@ static int ll_merge_requests_fn(struct request_queue *q,
struct request *req,
return 0;

total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
- if (blk_phys_contig_segment(q, req->biotail, next->bio))
+ if (blk_phys_contig_segment(q, req->biotail, next->bio) &&
+ BIOVEC_VIRT_MERGEABLE(__BVEC_END(req->biotail),
+ __BVEC_START(next->bio)) &&
+ !BIOVEC_VIRT_OVERSIZE(req->biotail->bi_hw_back_size
+ + next->bio->bi_hw_front_size))
total_phys_segments--;

if (total_phys_segments > q->max_phys_segments)

> > What we really need is a demonstration of what actually is going
> > wrong ...
>
> Yep, IIRC we both asked for that the last time as well... Nikanth?

Sorry, I had already sent the blktrace and some code snippet that would
reproduce the condition. Here is the module and user-space program that can
trigger this condition.

diskbiomod.c
---

#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/blkdev.h>
#include <linux/miscdevice.h>
/*
** Shared between the driver module and application.
*/
#define MOD_NODE "/dev/DiskBio"
#define MOD_NAME "DiskBio"

/*
** IOCTLs
*/
#define DB_IOCTL_DISK 0x00000001
#define DB_IOCTL_GENDISK 0x00000002

/*
** MODULE Defines.
*/
MODULE_DESCRIPTION("DiskBio - Initiate Disk I/O Traffic");
MODULE_LICENSE("GPL");

/*
** Module Defines.
*/
#define MOD_VERSION "09-11-2008-1"
#define NUM_BIO 8
#define NUMBER_OF_VECS 16
#define START_LBA 0x5e5c8b

/*
** Module Globals.
*/
static struct gendisk *(*get_GenDisk)(dev_t, int *) = NULL;
static int ErrorsOccurred = 0;

typedef struct {
int IoSize;
char *Mem;
} DB_MEM_LIST, *pDB_MEM_LIST;

DB_MEM_LIST dbMemList[NUM_BIO] = {
{ 0x6e00, NULL },
{ 0x6e00, NULL },
{ 0x7000, NULL },
{ 0x7000, NULL },
{ 0x6e00, NULL },
{ 0x6e00, NULL },
{ 0x6e00, NULL },
{ 0x6e00, NULL },
};

/*
** Wait Queue.
*/
DECLARE_WAIT_QUEUE_HEAD(db_wait_queue);

/*
** db_end_io()
*/
static void/*int*/
db_end_io(struct bio *Bio,/* unsigned int Bytes_Done,*/ int Error)
{
static int Completions = 0;

if(Error) {
++ErrorsOccurred;
printk("%s: End_Io Error=0x%x\n", MOD_NAME, Error);
}
if(Bio->bi_size) {
++ErrorsOccurred;
printk("%s: End_Io bi_size=0x%x\n", MOD_NAME, Bio->bi_size);
}

if(++Completions < NUM_BIO) {
return;//(0);
}
Completions = 0;

wake_up(&db_wait_queue);
return;//(0);
} /* end - db_end_io() */

/*
** db_do_io()
*/
static int
db_do_io(struct gendisk *GD)
{
int Loop;
int RC = 0;
struct bio *BioList[NUM_BIO] = { NULL };
struct bio *Bio = NULL;
struct block_device *Bdev;
long Sector;

/*
** Get block device structure.
*/
Bdev = bdget_disk(GD, 0);
if(!Bdev) {
RC = -ENOMEM;
goto OutOfHere;
}

/*
** Allocate BIOs.
*/
for(Loop = 0; Loop < NUM_BIO; Loop++) {
BioList[Loop] = bio_alloc(GFP_KERNEL, NUMBER_OF_VECS);
if(!BioList[Loop]) {
RC = -ENOMEM;
goto OutOfHere;
}
}

/*
** Initialize the I/O.
*/
Sector = START_LBA;
for(Loop = 0; Loop < NUM_BIO; Loop++) {
Bio = BioList[Loop];
Bio->bi_sector = Sector;
Bio->bi_bdev = Bdev;
Bio->bi_end_io = db_end_io;
if (Loop == 0) {
if (!bio_add_page(BioList[Loop],
virt_to_page(dbMemList[Loop].Mem),
0x3000,
offset_in_page(dbMemList[Loop].Mem))) {
printk("bio add page failed- %dA\n", Loop);
}
if (!bio_add_page(BioList[Loop],
virt_to_page(dbMemList[7].Mem),
0x3e00,
offset_in_page(dbMemList[7].Mem))) {
printk("bio add page failed- %dB\n", Loop);
}
} else if (Loop == 7) {
if (!bio_add_page(BioList[Loop],
virt_to_page(dbMemList[Loop].Mem),
0x2000,
offset_in_page(dbMemList[Loop].Mem))) {
printk("bio add page failed- %dA\n", Loop);
}
if (!bio_add_page(BioList[Loop],
virt_to_page(dbMemList[7].Mem),
0x4e00,
offset_in_page(dbMemList[7].Mem))) {
printk("bio add page failed- %dB\n", Loop);
}
} else if (!bio_add_page(BioList[Loop],
virt_to_page(dbMemList[Loop].Mem),
dbMemList[Loop].IoSize,
offset_in_page(dbMemList[Loop].Mem))) {
printk("bio add page failed %d\n", Loop);
}
Sector += (Bio->bi_size / 512);
}
ErrorsOccurred = 0;

/*
** Issue the I/Os.
*/
submit_bio(READ, BioList[0]); // 1
submit_bio(READ, BioList[1]); // 2
submit_bio(READ, BioList[3]); // 4
submit_bio(READ, BioList[2]); // 3
submit_bio(READ, BioList[5]); // 6
submit_bio(READ, BioList[6]); // 7
submit_bio(READ, BioList[4]); // 5
submit_bio(READ, BioList[7]); // 8

/*
** Wait for the completion.
*/
sleep_on(&db_wait_queue);

/*
** Release the resources.
*/
OutOfHere:
for(Loop = 0; Loop < NUM_BIO; Loop++) {
if(BioList[Loop]) bio_put(BioList[Loop]);
}
if(ErrorsOccurred)
RC = -EIO;

return(RC);
} /* end - db_do_io() */

/*
** db_ioctl
*/
static int
db_ioctl(struct inode *inode, struct file *file, uint Command, ulong Arg)
{
int Status = 0;
struct gendisk *GenDisk;
int Part;
char *Mem0 = NULL;
char *Mem1 = NULL;
char *Mem2 = NULL;
char *Mem5 = NULL;
char *Mem6 = NULL;
char *Mem7 = NULL;

switch(Command) {
case DB_IOCTL_GENDISK:
get_GenDisk = (void *)Arg;
break;

case DB_IOCTL_DISK:
Arg = new_decode_dev(Arg);
if(!get_GenDisk) {
Status = -EADDRNOTAVAIL;
break;
}
GenDisk = get_GenDisk(Arg, &Part);
if(!GenDisk) {
Status = -ENODEV;
break;
}

/*
** Allocate Memory.
*/
Mem0 = kmalloc(dbMemList[0].IoSize, GFP_KERNEL);
Mem1 = kmalloc(dbMemList[1].IoSize, GFP_KERNEL);
Mem2 = kmalloc((dbMemList[2].IoSize + dbMemList[3].IoSize +
dbMemList[4].IoSize), GFP_KERNEL);
Mem5 = kmalloc(dbMemList[5].IoSize, GFP_KERNEL);
Mem6 = kmalloc(dbMemList[6].IoSize, GFP_KERNEL);
Mem7 = kmalloc(dbMemList[7].IoSize, GFP_KERNEL);

if(!Mem0 || !Mem1 || !Mem2 || !Mem5 || !Mem6 || !Mem7) {
Status = -ENOMEM;
goto OutOfHere2;
}

dbMemList[0].Mem = Mem0;
dbMemList[1].Mem = Mem1;
dbMemList[2].Mem = Mem2;
dbMemList[3].Mem = dbMemList[2].Mem + dbMemList[2].IoSize;
dbMemList[4].Mem = dbMemList[3].Mem + dbMemList[3].IoSize;
dbMemList[5].Mem = Mem5;
dbMemList[6].Mem = Mem6;
dbMemList[7].Mem = Mem7;

Status = db_do_io(GenDisk);

OutOfHere2:
if(Mem0) kfree(Mem0);
if(Mem1) kfree(Mem1);
if(Mem2) kfree(Mem2);
if(Mem5) kfree(Mem5);
if(Mem6) kfree(Mem6);
if(Mem7) kfree(Mem7);
break;

default:
printk("%s: Unknown Ioctl Command (0x%x)\n.", MOD_NAME, Command);
Status = -EINVAL;
break;
}

return(Status);
} /* end db_ioctl() */

/*
** File operations structure.
*/
static struct file_operations db_fops = {
.owner = THIS_MODULE,
.ioctl = db_ioctl,
};

/*
** Misc Device structure.
*/
static struct miscdevice db_miscdev = {
MISC_DYNAMIC_MINOR,
MOD_NAME,
&db_fops,
};

/*
** db_init()
*/
int __init
db_init(void)
{
int Error = 0;

printk("%s: Init: ENTER.\n", MOD_NAME);
printk("%s: Version %s.\n", MOD_NAME, MOD_VERSION);

Error = misc_register(&db_miscdev);
if(Error)
printk("%s: Misc Register Failure!\n", MOD_NAME);

printk("%s: Init: DONE.\n", MOD_NAME);
return(Error);
} /* end - db_init() */

/*
** db_exit()
*/
void __exit
db_exit(void)
{
printk("%s: Exit: ENTER.\n", MOD_NAME);

misc_deregister(&db_miscdev);

printk("%s: Exit: DONE.\n", MOD_NAME);
} /* end - db_exit() */

/*
** Module Entry Points.
*/
module_init(db_init);
module_exit(db_exit);

---
The Userspace utility to send ioctl to the module.
diskbio.c
---

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <libgen.h>
/*
** Shared between the driver module and application.
*/
#define MOD_NODE "/dev/DiskBio"
#define MOD_NAME "DiskBio"

/*
** IOCTLs
*/
#define DB_IOCTL_DISK 0x00000001
#define DB_IOCTL_GENDISK 0x00000002

#define DEVICE_BUF_SIZE 100
#define GET_GENDISK "get_gendisk"

int
OpenNode(void)
{
int IoPtr;
FILE *Fd;
char Buf[DEVICE_BUF_SIZE];
unsigned long Symbol_get_gendisk = 0;

/*
** Get the symbols we need so they can be sent to the driver.
*/
if((Fd = fopen("/proc/kallsyms", "r")) == NULL) {
perror("open");
fprintf(stderr, "Can Not Get Symbol Values!\n");
return(-1);
}

/*
** Scan the /proc/kallsyms file for the symbols and addresses.
*/
while(fgets(Buf, DEVICE_BUF_SIZE, Fd)) {
unsigned long SymbolAddress;
char Tag[DEVICE_BUF_SIZE];
char Name[DEVICE_BUF_SIZE];

if(sscanf(Buf, "%lx %s %s", &SymbolAddress, Tag, Name) != 3)
continue;

/*
** Check for a symbol match.
*/
if(!strcmp(Name, GET_GENDISK)) {
Symbol_get_gendisk = SymbolAddress;
break;
}
}

/*
** Done with this file.
*/
fclose(Fd);

/*
** Make sure we found all of the symbols.
*/
if(!Symbol_get_gendisk) {
fprintf(stderr, "%s Symbol NOT Found!\n", GET_GENDISK);
return(-1);
}

/*
** Device node should be present. Try and open.
*/
if((IoPtr = open(MOD_NODE, O_RDWR)) == -1) {
perror("open");
fprintf(stderr, "Can Not Open Module Device '%s'!\n", MOD_NODE);
fprintf(stderr, "Unload and Reload the Module.\n");
return(-1);
}

/*
** Send the symbol addresses to the module.
*/
if(ioctl(IoPtr, DB_IOCTL_GENDISK, Symbol_get_gendisk)) {
perror("ioctl");
fprintf(stderr, "Unable to Send GENDISK Symbol to Module!\n");
close(IoPtr);
return(-1);
}

return(IoPtr);
}

int
main(int argc, char *argv[])
{
int RC = 0;
int IoPtr = 0;
int DevPtr = 0;
unsigned long DeviceNumber;
struct stat StatBuf;

/*
** Check Arguments.
*/
if(argc < 2) {
fprintf(stderr, "Usage: %s /dev/sdX\n", basename(argv[0]));
exit(1);
}

/*
** Open the device node to the driver interface.
*/
if((IoPtr = OpenNode()) == -1) {
exit(2);
}

/*
** Open the device.
*/
DevPtr = open(argv[1], O_RDWR);
if(DevPtr == -1) {
perror("open");
fprintf(stderr, "Open Failed for Device '%s'\n", argv[1]);
RC = 3;
goto out;
}

/*
** Make sure the device is a block device.
*/
if(fstat(DevPtr, &StatBuf)) {
perror("fstat");
fprintf(stderr, "Unable to stat device '%s'.\n", argv[1]);
RC=4;
goto out;
}
if(!S_ISBLK(StatBuf.st_mode)) {
fprintf(stderr, "Device '%s' not a block device.\n", argv[1]);
RC = 5;
goto out;
}

/*
** Get Device Number.
*/
DeviceNumber = StatBuf.st_rdev;

/*
** Send the device number to the driver
** so it can do an I/O for us.
*/
RC = ioctl(IoPtr, DB_IOCTL_DISK, DeviceNumber);
if(RC) {
perror("ioctl");
fprintf(stderr, "Ioctl Failed (%d).\n", RC);
}

out:
if(DevPtr > -1) close(DevPtr);
if(IoPtr > -1) close(IoPtr);
exit(RC);
}

Thanks
Nikanth Karthikesan

2008-10-06 17:25:13

by FUJITA Tomonori

[permalink] [raw]

Subject: Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments

On Thu, 2 Oct 2008 19:13:57 +0200
Jens Axboe <[email protected]> wrote:

> On Thu, Oct 02 2008, James Bottomley wrote:
> > On Thu, 2008-10-02 at 18:58 +0200, Jens Axboe wrote:
> > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > The bug would appear to be that we sometimes only look at q->max_sectors
> > > > when deciding on mergability. Either we have to insist on max_sectors
> > > > <= hw_max_sectors, or we have to start using min(q->max_sectors,
> > > > q->max_hw_sectors) for this.
> > >
> > > q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
> > > be sending down requests that are too large for the device to handle. So
> > > that condition would be a big bug. The sysfs interface checks for this,
> > > and blk_queue_max_sectors() makes sure that is true as well.
> >
> > Yes, that seems always to be enforced. Perhaps there are other ways of
> > tripping this problem ... I'm still sure if it occurs it's because we do
> > a physical merge where a virtual merge is forbidden.
> >
> > > The fixes proposed still look weird. There is no phys vs hw segment
> > > constraints, the request must adhere to the limits set by both. It's
> > > mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
> > > anyway.
> >
> > Agree all three proposed fixes look wrong. However, if it's what I
> > think, just getting rid of hw accounting won't fix the problem because
> > we'll still have the case where a physical merge is forbidden by iommu
> > constraints ... this still needs to be accounted for.
> >
> > What we really need is a demonstration of what actually is going
> > wrong ...
>
> Yep, IIRC we both asked for that the last time as well... Nikanth?

Possibly, blk_phys_contig_segment might miscalculate
q->max_segment_size?

blk_phys_contig_segment does:

req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size;

But it's possible that req->biotail and the previous bio are supposed
be merged into one segment? Then we could create too large segment
here.

2008-10-10 12:04:35

by Jens Axboe

[permalink] [raw]

Subject: Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments

On Tue, Oct 07 2008, FUJITA Tomonori wrote:
> On Thu, 2 Oct 2008 19:13:57 +0200
> Jens Axboe <[email protected]> wrote:
>
> > On Thu, Oct 02 2008, James Bottomley wrote:
> > > On Thu, 2008-10-02 at 18:58 +0200, Jens Axboe wrote:
> > > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > > The bug would appear to be that we sometimes only look at q->max_sectors
> > > > > when deciding on mergability. Either we have to insist on max_sectors
> > > > > <= hw_max_sectors, or we have to start using min(q->max_sectors,
> > > > > q->max_hw_sectors) for this.
> > > >
> > > > q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
> > > > be sending down requests that are too large for the device to handle. So
> > > > that condition would be a big bug. The sysfs interface checks for this,
> > > > and blk_queue_max_sectors() makes sure that is true as well.
> > >
> > > Yes, that seems always to be enforced. Perhaps there are other ways of
> > > tripping this problem ... I'm still sure if it occurs it's because we do
> > > a physical merge where a virtual merge is forbidden.
> > >
> > > > The fixes proposed still look weird. There is no phys vs hw segment
> > > > constraints, the request must adhere to the limits set by both. It's
> > > > mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
> > > > anyway.
> > >
> > > Agree all three proposed fixes look wrong. However, if it's what I
> > > think, just getting rid of hw accounting won't fix the problem because
> > > we'll still have the case where a physical merge is forbidden by iommu
> > > constraints ... this still needs to be accounted for.
> > >
> > > What we really need is a demonstration of what actually is going
> > > wrong ...
> >
> > Yep, IIRC we both asked for that the last time as well... Nikanth?
>
> Possibly, blk_phys_contig_segment might miscalculate
> q->max_segment_size?
>
> blk_phys_contig_segment does:
>
> req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size;
>
> But it's possible that req->biotail and the previous bio are supposed
> be merged into one segment? Then we could create too large segment
> here.

Hmm yes, that looks like it could indeed be a problem! We could fix this
with similar logic to what we used to do for the hw merging, keep track
of the current segment size that this bio belongs to, so it would end up
ala

if (blk_phys_contig_segment(q, req->biotail, next->bio) &&
rq->biotail->bi_seg_size + next->bio->bi_size <= q->max_segment_size) {
total_phys_segments--;
next->bio->bi_seg_size = rq->biotail->bi_seg_size + next->bio->bi_size;
}

for the merge part.

--
Jens Axboe

2008-10-10 12:38:13

by Jens Axboe

[permalink] [raw]

Subject: Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments

On Fri, Oct 10 2008, FUJITA Tomonori wrote:
> On Fri, 10 Oct 2008 14:03:44 +0200
> Jens Axboe <[email protected]> wrote:
>
> > On Tue, Oct 07 2008, FUJITA Tomonori wrote:
> > > On Thu, 2 Oct 2008 19:13:57 +0200
> > > Jens Axboe <[email protected]> wrote:
> > >
> > > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > > On Thu, 2008-10-02 at 18:58 +0200, Jens Axboe wrote:
> > > > > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > > > > The bug would appear to be that we sometimes only look at q->max_sectors
> > > > > > > when deciding on mergability. Either we have to insist on max_sectors
> > > > > > > <= hw_max_sectors, or we have to start using min(q->max_sectors,
> > > > > > > q->max_hw_sectors) for this.
> > > > > >
> > > > > > q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
> > > > > > be sending down requests that are too large for the device to handle. So
> > > > > > that condition would be a big bug. The sysfs interface checks for this,
> > > > > > and blk_queue_max_sectors() makes sure that is true as well.
> > > > >
> > > > > Yes, that seems always to be enforced. Perhaps there are other ways of
> > > > > tripping this problem ... I'm still sure if it occurs it's because we do
> > > > > a physical merge where a virtual merge is forbidden.
> > > > >
> > > > > > The fixes proposed still look weird. There is no phys vs hw segment
> > > > > > constraints, the request must adhere to the limits set by both. It's
> > > > > > mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
> > > > > > anyway.
> > > > >
> > > > > Agree all three proposed fixes look wrong. However, if it's what I
> > > > > think, just getting rid of hw accounting won't fix the problem because
> > > > > we'll still have the case where a physical merge is forbidden by iommu
> > > > > constraints ... this still needs to be accounted for.
> > > > >
> > > > > What we really need is a demonstration of what actually is going
> > > > > wrong ...
> > > >
> > > > Yep, IIRC we both asked for that the last time as well... Nikanth?
> > >
> > > Possibly, blk_phys_contig_segment might miscalculate
> > > q->max_segment_size?
> > >
> > > blk_phys_contig_segment does:
> > >
> > > req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size;
> > >
> > > But it's possible that req->biotail and the previous bio are supposed
> > > be merged into one segment? Then we could create too large segment
> > > here.
> >
> > Hmm yes, that looks like it could indeed be a problem!
>
> I think so.
>
>
> > We could fix this
> > with similar logic to what we used to do for the hw merging, keep track
> > of the current segment size that this bio belongs to, so it would end up
> > ala
>
> Yeah, exactly.
>
> You want a fix for this 2.6.28? Or disable this feature for 2.6.28?

Lets fix it. It wont be part of the initial merge, since it'll need some
dedicated testing, but we can get it there for 2.6.28. Shall I interpret
your message as willingness to write up the fix? :)

--
Jens Axboe

2008-10-10 12:41:39

by FUJITA Tomonori

[permalink] [raw]

Subject: Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments

On Fri, 10 Oct 2008 14:03:44 +0200
Jens Axboe <[email protected]> wrote:

> On Tue, Oct 07 2008, FUJITA Tomonori wrote:
> > On Thu, 2 Oct 2008 19:13:57 +0200
> > Jens Axboe <[email protected]> wrote:
> >
> > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > On Thu, 2008-10-02 at 18:58 +0200, Jens Axboe wrote:
> > > > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > > > The bug would appear to be that we sometimes only look at q->max_sectors
> > > > > > when deciding on mergability. Either we have to insist on max_sectors
> > > > > > <= hw_max_sectors, or we have to start using min(q->max_sectors,
> > > > > > q->max_hw_sectors) for this.
> > > > >
> > > > > q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
> > > > > be sending down requests that are too large for the device to handle. So
> > > > > that condition would be a big bug. The sysfs interface checks for this,
> > > > > and blk_queue_max_sectors() makes sure that is true as well.
> > > >
> > > > Yes, that seems always to be enforced. Perhaps there are other ways of
> > > > tripping this problem ... I'm still sure if it occurs it's because we do
> > > > a physical merge where a virtual merge is forbidden.
> > > >
> > > > > The fixes proposed still look weird. There is no phys vs hw segment
> > > > > constraints, the request must adhere to the limits set by both. It's
> > > > > mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
> > > > > anyway.
> > > >
> > > > Agree all three proposed fixes look wrong. However, if it's what I
> > > > think, just getting rid of hw accounting won't fix the problem because
> > > > we'll still have the case where a physical merge is forbidden by iommu
> > > > constraints ... this still needs to be accounted for.
> > > >
> > > > What we really need is a demonstration of what actually is going
> > > > wrong ...
> > >
> > > Yep, IIRC we both asked for that the last time as well... Nikanth?
> >
> > Possibly, blk_phys_contig_segment might miscalculate
> > q->max_segment_size?
> >
> > blk_phys_contig_segment does:
> >
> > req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size;
> >
> > But it's possible that req->biotail and the previous bio are supposed
> > be merged into one segment? Then we could create too large segment
> > here.
>
> Hmm yes, that looks like it could indeed be a problem!

I think so.

> We could fix this
> with similar logic to what we used to do for the hw merging, keep track
> of the current segment size that this bio belongs to, so it would end up
> ala

Yeah, exactly.

You want a fix for this 2.6.28? Or disable this feature for 2.6.28?

> if (blk_phys_contig_segment(q, req->biotail, next->bio) &&
> rq->biotail->bi_seg_size + next->bio->bi_size <= q->max_segment_size) {
> total_phys_segments--;
> next->bio->bi_seg_size = rq->biotail->bi_seg_size + next->bio->bi_size;
> }
>
> for the merge part.

2008-10-10 12:57:39

by FUJITA Tomonori

[permalink] [raw]

Subject: Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments

On Fri, 10 Oct 2008 14:37:19 +0200
Jens Axboe <[email protected]> wrote:

> On Fri, Oct 10 2008, FUJITA Tomonori wrote:
> > On Fri, 10 Oct 2008 14:03:44 +0200
> > Jens Axboe <[email protected]> wrote:
> >
> > > On Tue, Oct 07 2008, FUJITA Tomonori wrote:
> > > > On Thu, 2 Oct 2008 19:13:57 +0200
> > > > Jens Axboe <[email protected]> wrote:
> > > >
> > > > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > > > On Thu, 2008-10-02 at 18:58 +0200, Jens Axboe wrote:
> > > > > > > On Thu, Oct 02 2008, James Bottomley wrote:
> > > > > > > > The bug would appear to be that we sometimes only look at q->max_sectors
> > > > > > > > when deciding on mergability. Either we have to insist on max_sectors
> > > > > > > > <= hw_max_sectors, or we have to start using min(q->max_sectors,
> > > > > > > > q->max_hw_sectors) for this.
> > > > > > >
> > > > > > > q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
> > > > > > > be sending down requests that are too large for the device to handle. So
> > > > > > > that condition would be a big bug. The sysfs interface checks for this,
> > > > > > > and blk_queue_max_sectors() makes sure that is true as well.
> > > > > >
> > > > > > Yes, that seems always to be enforced. Perhaps there are other ways of
> > > > > > tripping this problem ... I'm still sure if it occurs it's because we do
> > > > > > a physical merge where a virtual merge is forbidden.
> > > > > >
> > > > > > > The fixes proposed still look weird. There is no phys vs hw segment
> > > > > > > constraints, the request must adhere to the limits set by both. It's
> > > > > > > mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
> > > > > > > anyway.
> > > > > >
> > > > > > Agree all three proposed fixes look wrong. However, if it's what I
> > > > > > think, just getting rid of hw accounting won't fix the problem because
> > > > > > we'll still have the case where a physical merge is forbidden by iommu
> > > > > > constraints ... this still needs to be accounted for.
> > > > > >
> > > > > > What we really need is a demonstration of what actually is going
> > > > > > wrong ...
> > > > >
> > > > > Yep, IIRC we both asked for that the last time as well... Nikanth?
> > > >
> > > > Possibly, blk_phys_contig_segment might miscalculate
> > > > q->max_segment_size?
> > > >
> > > > blk_phys_contig_segment does:
> > > >
> > > > req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size;
> > > >
> > > > But it's possible that req->biotail and the previous bio are supposed
> > > > be merged into one segment? Then we could create too large segment
> > > > here.
> > >
> > > Hmm yes, that looks like it could indeed be a problem!
> >
> > I think so.
> >
> >
> > > We could fix this
> > > with similar logic to what we used to do for the hw merging, keep track
> > > of the current segment size that this bio belongs to, so it would end up
> > > ala
> >
> > Yeah, exactly.
> >
> > You want a fix for this 2.6.28? Or disable this feature for 2.6.28?
>
> Lets fix it. It wont be part of the initial merge, since it'll need some
> dedicated testing, but we can get it there for 2.6.28. Shall I interpret
> your message as willingness to write up the fix? :)

Yeah, it's on this weekend todo list. :) I want to look at the code
again and make sure I correctly understand the problem.