2002-06-05 10:44:36

by Adam J. Richter

Subject: Patch??: linux-2.5.20/fs/bio.c - ll_rw_kio could generate bio's bigger than queue could handle

I may very well be misunderstanding something, but it looks
to me like ll_rw_kio can generate bio's that have more segments or
more sectors than their queues declare that they can handle.

As an extreme example, I could write a driver for a
block device that has max_hw_segments == 1, max_phys_segments == 1,
max_sectors == 1. Right?

The intermediate bio merging code in drivers/block/ll_rw_blk.c
obeys these constraints when merging requests that are already
within them, but it does not break up bio's that are too big for
the underlying queue. As far as I can tell, it is the
responsibility of anything that calls submit_bio or
generic_make_request to ensure that the submitted bio is within
the request_queue_t's limits, if the driver that pulls requests
off of the queue is to be able to rely on max_hw_segments,
max_phys_segments and max_sectors as guarantees. Right?

If I got both of those statements correct, I would
appreciate comments on the following untested patch (I have
to go to sleep now, and I think it's more likely that I'm just
misunderstanding something).

Adam J. Richter __ ______________ 575 Oroville Road
[email protected] \ / Milpitas, California 95035
+1 408 309-6081 | g g d r a s i l United States of America
"Free Software For The Rest Of Us."


--- linux-2.5.20/fs/bio.c 2002-06-02 18:44:40.000000000 -0700
+++ linux/fs/bio.c 2002-06-05 03:27:19.000000000 -0700
@@ -339,12 +339,14 @@
void ll_rw_kio(int rw, struct kiobuf *kio, struct block_device *bdev, sector_t sector)
{
int i, offset, size, err, map_i, total_nr_pages, nr_pages;
struct bio_vec *bvec;
struct bio *bio;
kdev_t dev = to_kdev_t(bdev->bd_dev);
+ request_queue_t *q = bdev->bd_queue;
+ int max_pages;

err = 0;
if ((rw & WRITE) && is_read_only(dev)) {
printk("ll_rw_bio: WRITE to ro device %s\n", kdevname(dev));
err = -EPERM;
goto out;
@@ -364,14 +366,20 @@
size = kio->length;

atomic_set(&kio->io_count, 1);

map_i = 0;

+ max_pages = q->max_sectors >> (PAGE_SHIFT - 9);
+ if (max_pages > q->max_phys_segments)
+ max_pages = q->max_phys_segments;
+ if (max_pages > q->max_hw_segments)
+ max_pages = q->max_hw_segments;
+
next_chunk:
- nr_pages = BIO_MAX_SECTORS >> (PAGE_SHIFT - 9);
+ nr_pages = max_pages;
if (nr_pages > total_nr_pages)
nr_pages = total_nr_pages;

atomic_inc(&kio->io_count);

/*


2002-06-06 01:22:44

by Adam J. Richter

Subject: Re: Patch??: linux-2.5.20/fs/bio.c - ll_rw_kio could generate bio's bigger than queue could handle

[I am cc'ing lkml, Andrew and Jens on this one, because I believe it
is probably a real world example of the oversized bio problem that I
was describing, although from fs/mpage.c, as Andrew Moreton
suggested.]

>Adam,
> I have been experiencing kernel panics due to the BUG_ON that checks the
>bio size vs the queue size in generic_make_request (ll_rw_blk.c). I
>tried your patch, but it didn't fix the panic. Below is the output of
>putting:
> printk("***generic_make_request: bio_sectors(bio) = %d, q->max_sectors =
>%d\n", bio_sectors(bio), q->max_sectors);

>the line before the BUG_ON in generic_make_request. I've also attached
>the output of ksymoops on the oops. I don't know if you're interested
>in pursuing it or not, but let me know if you want some more information;
>otherwise, I'll be attempting to slog through this on my own.

>Cheers!

[...]
>***generic_make_request: bio_sectors(bio) = 8, q->max_sectors = 64
>***generic_make_request: bio_sectors(bio) = 96, q->max_sectors = 64
>kernel BUG at ll_rw_blk.c:1602!
[...]
>>>EIP; c01bb604 <generic_make_request+d4/130> <=====
>Trace; c01bb6dc <submit_bio+5c/70>
>Trace; c0152aa1 <mpage_bio_submit+31/40>
>Trace; c0152dee <do_mpage_readpage+2ee/340>
>Trace; c019ae75 <radix_tree_insert+15/30>
>Trace; c012809e <__add_to_page_cache+1e/a0>
>Trace; c01281bd <add_to_page_cache_unique+3d/60>
>Trace; c0152eaa <mpage_readpages+6a/b0>
>Trace; c01620a0 <ext2_get_block+0/400>
[...]

I think your kernel panic is probably related to the
same problem that I addressed in ll_rw_kio, but in fs/mpage.c,
as Andrew Moreton suggested:

| Does this not also need to be done in fs/mpage.c? [...]


Adam J. Richter __ ______________ 575 Oroville Road
[email protected] \ / Milpitas, California 95035
+1 408 309-6081 | g g d r a s i l United States of America
"Free Software For The Rest Of Us."

2002-06-06 01:44:11

by Andrew Morton

Subject: Re: Patch??: linux-2.5.20/fs/bio.c - ll_rw_kio could generate bio's bigger than queue could handle

"Adam J. Richter" wrote:
>
> ...
> >***generic_make_request: bio_sectors(bio) = 8, q->max_sectors = 64
> >***generic_make_request: bio_sectors(bio) = 96, q->max_sectors = 64
> >kernel BUG at ll_rw_blk.c:1602!
> [...]
> >>>EIP; c01bb604 <generic_make_request+d4/130> <=====
> >Trace; c01bb6dc <submit_bio+5c/70>
> >Trace; c0152aa1 <mpage_bio_submit+31/40>
> >Trace; c0152dee <do_mpage_readpage+2ee/340>
> >Trace; c019ae75 <radix_tree_insert+15/30>
> >Trace; c012809e <__add_to_page_cache+1e/a0>
> >Trace; c01281bd <add_to_page_cache_unique+3d/60>
> >Trace; c0152eaa <mpage_readpages+6a/b0>
> >Trace; c01620a0 <ext2_get_block+0/400>
> [...]
>
> I think your kernel panic is probably related to the
> same problem that I addressed in ll_rw_kio, but in fs/mpage.c,
> as Andrew Moreton suggested:

Yes, same thing.

It looks like BIO_MAX_FOO needs to become an API function.
Question is: what should it return? Number of sectors, number
of bytes or number of pages?

For my purposes, I'd prefer number of pages. ie: the vector
count which gets passed into bio_alloc:

unsigned bio_max_iovecs(struct block_device *bdev);

nr_iovecs = bio_max_iovecs(bdev);
bio = bio_alloc(GFP_KERNEL, nr_iovecs);

would suit.

And if, via this, we can submit BIOs which are larger than 64k
for the common "it's just a disk" case then that is icing.

-

2002-06-06 08:50:08

by Adam J. Richter

Subject: Re: Patch??: linux-2.5.20/fs/bio.c - ll_rw_kio could generate bio's bigger than queue could handle

Andrew Moreton wrote:
>It looks like BIO_MAX_FOO needs to become an API function.
>Question is: what should it return? Number of sectors, number
>of bytes or number of pages?
>
>For my purposes, I'd prefer number of pages. ie: the vector
>count which gets passed into bio_alloc:
>
> unsigned bio_max_iovecs(struct block_device *bdev);
>
> nr_iovecs = bio_max_iovecs(bdev);
> bio = bio_alloc(GFP_KERNEL, nr_iovecs);
>
>would suit.
>
>And if, via this, we can submit BIOs which are larger than 64k
>for the common "it's just a disk" case then that is icing.


Here is my attempt at implementing your idea. I am composing
this email on a machine that has this code compiled into the kernel,
but I do not know if any of the affected code paths have ever been
executed.

The interface to the routine is:

int bio_max_iovecs(request_queue_t *q, int *iovec_size)

iovec_size is both an input and output variable. You give it
the desired number of bytes you want to stuff in each iovec, and the
number may come back chopped if it exceeds q->max_sectors * 512. It
returns the number of iovecs of that size that you can safely stuff
into a single bio for the underlying block device driver.

Notes on these changes:

1. BIO_MAX_SECTORS et al are gone. Nothing uses them
anymore.

2. I changed do_mpage_readpage and mpage_writepage to
precompute bdev = inode->i_sb->s_bdev, rather than
repeatedly getting it from data returned by
the get_block function. This allows bio_max_iovecs()
to be called once in each routine, rather than repeatedly
within a loop. If bdev really could have some other
value, then my optimization needs to be undone.

3. I assume that "goto confused" in these routines causes
a safer but slower approach to be used. I added jumps
to these labels for the case where the underlying queue
could not handle enough sectors in a single bio to cover
even one page.


Adam J. Richter __ ______________ 575 Oroville Road
[email protected] \ / Milpitas, California 95035
+1 408 309-6081 | g g d r a s i l United States of America
"Free Software For The Rest Of Us."


--- linux-2.5.20/include/linux/bio.h 2002-06-02 18:44:52.000000000 -0700
+++ linux/include/linux/bio.h 2002-06-05 20:09:34.000000000 -0700
@@ -34,9 +34,6 @@
#define BIO_BUG_ON
#endif

-#define BIO_MAX_SECTORS 128
-#define BIO_MAX_SIZE (BIO_MAX_SECTORS << 9)
-
/*
* was unsigned short, but we might as well be ready for > 64kB I/O pages
*/
--- linux-2.5.20/include/linux/blkdev.h 2002-06-02 18:44:44.000000000 -0700
+++ linux/include/linux/blkdev.h 2002-06-05 23:05:37.000000000 -0700
@@ -191,9 +191,9 @@
* queue settings
*/
unsigned short max_sectors;
- unsigned short max_phys_segments;
- unsigned short max_hw_segments;
+ unsigned short max_phys_segments; /* <= max_sectors */
+ unsigned short max_hw_segments; /* <= max_sectors */
unsigned short hardsect_size;
unsigned int max_segment_size;

@@ -418,5 +428,7 @@
page_cache_release(p.v);
}

+extern int bio_max_iovecs(request_queue_t *q, int *iovec_size);
+
#endif
--- linux-2.5.20/fs/bio.c 2002-06-02 18:44:40.000000000 -0700
+++ linux/fs/bio.c 2002-06-06 00:56:37.000000000 -0700
@@ -316,6 +316,23 @@
return NULL;
}

+int bio_max_iovecs(request_queue_t *q, int *iovec_size)
+{
+ const int max_bytes = q->max_sectors << 9;
+ int max_iovecs;
+
+ if (*iovec_size > max_bytes) {
+ *iovec_size = max_bytes;
+ return 1;
+ }
+ max_iovecs = max_bytes / *iovec_size;
+ if (max_iovecs > q->max_phys_segments)
+ max_iovecs = q->max_phys_segments;
+ if (max_iovecs > q->max_hw_segments)
+ max_iovecs = q->max_hw_segments;
+ return max_iovecs;
+}
+
static void bio_end_io_kio(struct bio *bio)
{
struct kiobuf *kio = (struct kiobuf *) bio->bi_private;
@@ -338,10 +355,11 @@
**/
void ll_rw_kio(int rw, struct kiobuf *kio, struct block_device *bdev, sector_t sector)
{
- int i, offset, size, err, map_i, total_nr_pages, nr_pages;
+ int offset, size, err, map_i, total_nr_pages, nr_bvecs;
struct bio_vec *bvec;
struct bio *bio;
kdev_t dev = to_kdev_t(bdev->bd_dev);
+ int bytes_per_iovec, max_pages;

err = 0;
if ((rw & WRITE) && is_read_only(dev)) {
@@ -367,17 +385,20 @@

map_i = 0;

+ bytes_per_iovec = PAGE_SIZE;
+ max_pages = bio_max_iovecs(bdev->bd_queue, &bytes_per_iovec);
+
next_chunk:
- nr_pages = BIO_MAX_SECTORS >> (PAGE_SHIFT - 9);
- if (nr_pages > total_nr_pages)
- nr_pages = total_nr_pages;
+ nr_bvecs = max_pages;
+ if (nr_bvecs > total_nr_pages)
+ nr_bvecs = total_nr_pages;

atomic_inc(&kio->io_count);

/*
* allocate bio and do initial setup
*/
- if ((bio = bio_alloc(GFP_NOIO, nr_pages)) == NULL) {
+ if ((bio = bio_alloc(GFP_NOIO, nr_bvecs)) == NULL) {
err = -ENOMEM;
goto out;
}
@@ -389,15 +410,17 @@
bio->bi_private = kio;

bvec = bio->bi_io_vec;
- for (i = 0; i < nr_pages; i++, bvec++, map_i++) {
+ for (;total_nr_pages > 0; bvec++, nr_bvecs--) {
int nbytes = PAGE_SIZE - offset;

if (nbytes > size)
nbytes = size;
+ if (nbytes > bytes_per_iovec)
+ nbytes = bytes_per_iovec;

BUG_ON(kio->maplist[map_i] == NULL);

- if (bio->bi_size + nbytes > (BIO_MAX_SECTORS << 9))
+ if (nr_bvecs == 0)
goto queue_io;

bio->bi_vcnt++;
@@ -407,14 +430,14 @@
bvec->bv_len = nbytes;
bvec->bv_offset = offset;

- /*
- * kiobuf only has an offset into the first page
- */
- offset = 0;
+ offset = (offset + nbytes) & PAGE_MASK;
+ if (offset == 0) {
+ total_nr_pages--;
+ map_i++;
+ }

sector += nbytes >> 9;
size -= nbytes;
- total_nr_pages--;
kio->offset += nbytes;
}

--- linux-2.5.20/fs/mpage.c 2002-06-02 18:44:44.000000000 -0700
+++ linux/fs/mpage.c 2002-06-06 00:56:39.000000000 -0700
@@ -21,12 +21,6 @@
#include <linux/mpage.h>

/*
- * The largest-sized BIO which this code will assemble, in bytes. Set this
- * to PAGE_CACHE_SIZE if your drivers are broken.
- */
-#define MPAGE_BIO_MAX_SIZE BIO_MAX_SIZE
-
-/*
* I/O completion handler for multipage BIOs.
*
* The mpage code never puts partial pages into a BIO (except for end-of-file).
@@ -167,8 +161,13 @@
sector_t blocks[MAX_BUF_PER_PAGE];
unsigned page_block;
unsigned first_hole = blocks_per_page;
- struct block_device *bdev = NULL;
+ struct block_device *bdev = inode->i_sb->s_bdev;
struct buffer_head bh;
+ int bvec_size = PAGE_SIZE;
+ const unsigned max_bvecs = bio_max_iovecs(bdev->bd_queue, &bvec_size);
+
+ if (bvec_size != PAGE_SIZE)
+ goto confused;

if (page_has_buffers(page))
goto confused;
@@ -197,7 +196,6 @@
if (page_block && blocks[page_block-1] != bh.b_blocknr-1)
goto confused;
blocks[page_block] = bh.b_blocknr;
- bdev = bh.b_bdev;
}

if (first_hole != blocks_per_page) {
@@ -220,7 +218,7 @@
bio = mpage_bio_submit(READ, bio);

if (bio == NULL) {
- unsigned nr_bvecs = MPAGE_BIO_MAX_SIZE / PAGE_CACHE_SIZE;
+ unsigned int nr_bvecs = max_bvecs;

if (nr_bvecs > nr_pages)
nr_bvecs = nr_pages;
@@ -321,8 +319,13 @@
sector_t blocks[MAX_BUF_PER_PAGE];
unsigned page_block;
unsigned first_unmapped = blocks_per_page;
- struct block_device *bdev = NULL;
+ struct block_device *bdev = inode->i_sb->s_bdev;
int boundary = 0;
+ int bvec_size = PAGE_SIZE;
+ const unsigned max_bvecs = bio_max_iovecs(bdev->bd_queue, &bvec_size);
+
+ if (bvec_size != PAGE_SIZE)
+ goto confused;

if (page_has_buffers(page)) {
struct buffer_head *head = page_buffers(page);
@@ -355,7 +358,6 @@
}
blocks[page_block++] = bh->b_blocknr;
boundary = buffer_boundary(bh);
- bdev = bh->b_bdev;
} while ((bh = bh->b_this_page) != head);

if (first_unmapped)
@@ -391,7 +393,6 @@
}
blocks[page_block++] = map_bh.b_blocknr;
boundary = buffer_boundary(&map_bh);
- bdev = map_bh.b_bdev;
if (block_in_file == last_block)
break;
block_in_file++;
@@ -422,10 +423,8 @@
bio = mpage_bio_submit(WRITE, bio);

if (bio == NULL) {
- unsigned nr_bvecs = MPAGE_BIO_MAX_SIZE / PAGE_CACHE_SIZE;
-
bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
- nr_bvecs, GFP_NOFS);
+ max_bvecs, GFP_NOFS);
if (bio == NULL)
goto confused;
}

2002-06-06 09:14:34

by Andrew Morton

[permalink] [raw]
Subject: Re: Patch??: linux-2.5.20/fs/bio.c - ll_rw_kio could generate bio's bigger than queue could handle

"Adam J. Richter" wrote:
>
> Andrew Moreton wrote:
^ hey.

> >It looks like BIO_MAX_FOO needs to become an API function.
> >Question is: what should it return? Number of sectors, number
> >of bytes or number of pages?
> >
> >For my purposes, I'd prefer number of pages. ie: the vector
> >count which gets passed into bio_alloc:
> >
> > unsigned bio_max_iovecs(struct block_device *bdev);
> >
> > nr_iovecs = bio_max_iovecs(bdev);
> > bio = bio_alloc(GFP_KERNEL, nr_iovecs);
> >
> >would suit.
> >
> >And if, via this, we can submit BIOs which are larger than 64k
> >for the common "it's just a disk" case then that is icing.
>
> Here is my attempt at implementing your idea. I am composing
> this email on a machine that has this code compiled into the kernel,
> but I do not know if any of the affected code paths have ever been
> executed.

If you're using ext2 or ext3 they have.

> The interface to the routine is:
>
> int bio_max_iovecs(request_queue_t *q, int *iovec_size)

hm. OK. But why not just a block_device, rather than
plucking out bd_queue at each call site?

> iovec_size is both an input and output variable. You give it
> the desired number of bytes you want to stuff in each iovec, and the
> number may come back chopped if it exceeds q->max_sectors * 512. It
> returns the number of iovecs of that size that you can safely stuff
> into a single bio for the underlying block device driver.
>
> Notes on these changes:
>
> 1. BIO_MAX_SECTORS et al are gone. Nothing uses them
> anymore.

Have to check that with Jens. I was basing the bvec count on
q->max_sectors a while back and it had a problem.

> 2. I changed do_mpage_readpage and mpage_writepage to
> precompute bdev = inode->i_sb->s_bdev, rather than
> repeatedly getting it from data returned by
> the get_block function. This allows bio_max_iovecs()
> to be called once in each routine, rather than repeatedly
> within a loop. If bdev really could have some other
> value, then my optimization needs to be undone.

That won't work for reads from blockdevs. `cat /dev/hda1' will
probably go splat. The blockdev superblock is shared between
all blockdevs. We have to stick with the bdev which get_block()
gave us.

> 3. I assume that "goto confused" in these routines causes
> a safer but slower approach to be used. I added jumps
> to these labels for the case where the underlying queue
> could not handle enough sectors in a single bio to cover
> even one page.

That looks fine.

-

2002-06-06 13:32:07

by Adam J. Richter

Subject: Re: Patch??: linux-2.5.20/fs/bio.c - ll_rw_kio could generate bio's bigger than queue could handle

Here is version 3 of my changes to ll_rw_kio in fs/bio.c and fs/mpage.c.
The differences from the previous version are that do_mpage_readpage
and mpage_writepage now precompute bdev differently, and I have
tried to factor out a little bit of common code between bio.c and
mpage.c in the form of a routine bio_append().

I now know that I am actually running the mpage.c code, because I
screwed up one of my changes and got to see the resulting kernel
panic. On the other hand, I do not think that I have caused ll_rw_kio
to be executed. So, I believe my ll_rw_kio changes are completely
untested.

For what it's worth, I am composing this email on a machine running
the patch that I have attached below.

Adam J. Richter __ ______________ 575 Oroville Road
[email protected] \ / Milpitas, California 95035
+1 408 309-6081 | g g d r a s i l United States of America
"Free Software For The Rest Of Us."


--- linux-2.5.20/include/linux/bio.h 2002-06-02 18:44:52.000000000 -0700
+++ linux/include/linux/bio.h 2002-06-06 06:30:04.000000000 -0700
@@ -34,9 +34,6 @@
#define BIO_BUG_ON
#endif

-#define BIO_MAX_SECTORS 128
-#define BIO_MAX_SIZE (BIO_MAX_SECTORS << 9)
-
/*
* was unsigned short, but we might as well be ready for > 64kB I/O pages
*/
@@ -201,6 +198,8 @@
extern struct bio *bio_clone(struct bio *, int);
extern struct bio *bio_copy(struct bio *, int, int);

+extern int bio_append(struct bio **bio_p,
+ struct page *page, int len, int offset);
extern inline void bio_init(struct bio *);

extern int bio_ioctl(kdev_t, unsigned int, unsigned long);
--- linux-2.5.20/include/linux/blkdev.h 2002-06-02 18:44:44.000000000 -0700
+++ linux/include/linux/blkdev.h 2002-06-06 04:40:25.000000000 -0700
@@ -191,9 +191,9 @@
* queue settings
*/
unsigned short max_sectors;
- unsigned short max_phys_segments;
- unsigned short max_hw_segments;
+ unsigned short max_phys_segments; /* <= max_sectors */
+ unsigned short max_hw_segments; /* <= max_sectors */
unsigned short hardsect_size;
unsigned int max_segment_size;

@@ -418,5 +428,7 @@
page_cache_release(p.v);
}

+extern int bio_max_iovecs(request_queue_t *q, int *iovec_size);
+
#endif
--- linux-2.5.20/fs/bio.c 2002-06-02 18:44:40.000000000 -0700
+++ linux/fs/bio.c 2002-06-06 06:13:52.000000000 -0700
@@ -316,6 +316,64 @@
return NULL;
}

+int bio_max_iovecs(request_queue_t *q, int *iovec_size)
+{
+ const int max_bytes = q->max_sectors << 9;
+ int max_iovecs;
+
+ if (*iovec_size > max_bytes) {
+ *iovec_size = max_bytes;
+ return 1;
+ }
+ max_iovecs = max_bytes / *iovec_size;
+ if (max_iovecs > q->max_phys_segments)
+ max_iovecs = q->max_phys_segments;
+ if (max_iovecs > q->max_hw_segments)
+ max_iovecs = q->max_hw_segments;
+ return max_iovecs;
+}
+
+/* bio_append appends an IO vector to a bio. bio_append expects to be
+ called with bio->bi_idx indicating the maximum number of IO vectors
+ that this bio can hold, and with bio->bi_vcnt indicating the number
+ of IO vectors already loaded. If the provided bio is already full,
+ bio_append will submit the current bio and allocate a new one. */
+
+int bio_append(struct bio **bio_p, struct page *page, int len, int offset)
+{
+ struct bio *bio = *bio_p;
+ struct bio_vec *vec;
+ if (bio->bi_idx == bio->bi_vcnt) {
+ struct bio *old = bio;
+ *bio_p = bio = bio_alloc(GFP_KERNEL, old->bi_vcnt);
+ if (bio != NULL) {
+ bio->bi_sector =
+ old->bi_sector + (old->bi_size >> 9);
+
+#define COPY(field) bio->bi_ ## field = old->bi_ ## field
+ COPY(bdev);
+ COPY(flags);
+ COPY(idx);
+ COPY(rw);
+ COPY(end_io);
+ COPY(private);
+#undef COPY
+ }
+ old->bi_idx = 0;
+ submit_bio(old->bi_rw, old);
+ if (bio == NULL)
+ return -ENOMEM;
+ }
+ vec = &bio->bi_io_vec[bio->bi_vcnt++];
+ vec->bv_page = page;
+ vec->bv_len = len;
+ vec->bv_offset = offset;
+
+ bio->bi_size += len;
+ return 0;
+}
+
+
static void bio_end_io_kio(struct bio *bio)
{
struct kiobuf *kio = (struct kiobuf *) bio->bi_private;
@@ -338,10 +396,10 @@
**/
void ll_rw_kio(int rw, struct kiobuf *kio, struct block_device *bdev, sector_t sector)
{
- int i, offset, size, err, map_i, total_nr_pages, nr_pages;
- struct bio_vec *bvec;
+ int offset, size, err, map_i, total_nr_pages, nr_bvecs;
struct bio *bio;
kdev_t dev = to_kdev_t(bdev->bd_dev);
+ int bytes_per_iovec, max_pages;

err = 0;
if ((rw & WRITE) && is_read_only(dev)) {
@@ -367,63 +425,58 @@

map_i = 0;

-next_chunk:
- nr_pages = BIO_MAX_SECTORS >> (PAGE_SHIFT - 9);
- if (nr_pages > total_nr_pages)
- nr_pages = total_nr_pages;
+ bytes_per_iovec = PAGE_SIZE;
+ max_pages = bio_max_iovecs(bdev->bd_queue, &bytes_per_iovec);
+
+ nr_bvecs = max_pages;
+ if (nr_bvecs > total_nr_pages)
+ nr_bvecs = total_nr_pages;

atomic_inc(&kio->io_count);

/*
* allocate bio and do initial setup
*/
- if ((bio = bio_alloc(GFP_NOIO, nr_pages)) == NULL) {
+ if ((bio = bio_alloc(GFP_NOIO, nr_bvecs)) == NULL) {
err = -ENOMEM;
goto out;
}

+ bio->bi_idx = nr_bvecs; /* for bio_append */
+ bio->bi_rw = rw;
+
bio->bi_sector = sector;
bio->bi_bdev = bdev;
- bio->bi_idx = 0;
bio->bi_end_io = bio_end_io_kio;
bio->bi_private = kio;

- bvec = bio->bi_io_vec;
- for (i = 0; i < nr_pages; i++, bvec++, map_i++) {
+ while (total_nr_pages > 0) {
int nbytes = PAGE_SIZE - offset;

if (nbytes > size)
nbytes = size;
+ if (nbytes > bytes_per_iovec)
+ nbytes = bytes_per_iovec;

BUG_ON(kio->maplist[map_i] == NULL);

- if (bio->bi_size + nbytes > (BIO_MAX_SECTORS << 9))
- goto queue_io;
-
- bio->bi_vcnt++;
- bio->bi_size += nbytes;
+ err = bio_append(&bio, kio->maplist[map_i], nbytes, offset);
+ if (err)
+ goto out;
+
+ offset = (offset + nbytes) & PAGE_MASK;
+ if (offset == 0) {
+ total_nr_pages--;
+ map_i++;
+ }

- bvec->bv_page = kio->maplist[map_i];
- bvec->bv_len = nbytes;
- bvec->bv_offset = offset;
-
- /*
- * kiobuf only has an offset into the first page
- */
- offset = 0;
-
- sector += nbytes >> 9;
size -= nbytes;
- total_nr_pages--;
kio->offset += nbytes;
}

-queue_io:
+ bio->bi_idx = 0;
submit_bio(rw, bio);

- if (total_nr_pages)
- goto next_chunk;
-
if (size) {
printk("ll_rw_kio: size %d left (kio %d)\n", size, kio->length);
BUG();
--- linux-2.5.20/fs/mpage.c 2002-06-02 18:44:44.000000000 -0700
+++ linux/fs/mpage.c 2002-06-06 06:14:13.000000000 -0700
@@ -21,12 +21,6 @@
#include <linux/mpage.h>

/*
- * The largest-sized BIO which this code will assemble, in bytes. Set this
- * to PAGE_CACHE_SIZE if your drivers are broken.
- */
-#define MPAGE_BIO_MAX_SIZE BIO_MAX_SIZE
-
-/*
* I/O completion handler for multipage BIOs.
*
* The mpage code never puts partial pages into a BIO (except for end-of-file).
@@ -78,19 +72,15 @@
bio_put(bio);
}

-struct bio *mpage_bio_submit(int rw, struct bio *bio)
+struct bio *mpage_bio_submit(struct bio *bio)
{
- bio->bi_vcnt = bio->bi_idx;
bio->bi_idx = 0;
- bio->bi_end_io = mpage_end_io_read;
- if (rw == WRITE)
- bio->bi_end_io = mpage_end_io_write;
- submit_bio(rw, bio);
+ submit_bio(bio->bi_rw, bio);
return NULL;
}

static struct bio *
-mpage_alloc(struct block_device *bdev,
+mpage_alloc(struct block_device *bdev, int rw,
sector_t first_sector, int nr_vecs, int gfp_flags)
{
struct bio *bio;
@@ -98,11 +88,14 @@
bio = bio_alloc(gfp_flags, nr_vecs);
if (bio) {
bio->bi_bdev = bdev;
- bio->bi_vcnt = nr_vecs;
- bio->bi_idx = 0;
+ bio->bi_vcnt = 0;
+ bio->bi_idx = nr_vecs;
bio->bi_size = 0;
bio->bi_sector = first_sector;
- bio->bi_io_vec[0].bv_page = NULL;
+ bio->bi_rw = rw;
+ bio->bi_end_io = mpage_end_io_read;
+ if (rw == WRITE)
+ bio->bi_end_io = mpage_end_io_write;
}
return bio;
}
@@ -161,14 +154,19 @@
const unsigned blkbits = inode->i_blkbits;
const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
const unsigned blocksize = 1 << blkbits;
- struct bio_vec *bvec;
sector_t block_in_file;
sector_t last_block;
sector_t blocks[MAX_BUF_PER_PAGE];
unsigned page_block;
unsigned first_hole = blocks_per_page;
- struct block_device *bdev = NULL;
+ struct block_device *bdev =
+ S_ISBLK(inode->i_mode) ? inode->i_bdev : inode->i_sb->s_bdev;
struct buffer_head bh;
+ int bvec_size = PAGE_SIZE;
+ const unsigned max_bvecs = bio_max_iovecs(bdev->bd_queue, &bvec_size);
+
+ if (bvec_size != PAGE_SIZE)
+ goto confused;

if (page_has_buffers(page))
goto confused;
@@ -197,7 +195,6 @@
if (page_block && blocks[page_block-1] != bh.b_blocknr-1)
goto confused;
blocks[page_block] = bh.b_blocknr;
- bdev = bh.b_bdev;
}

if (first_hole != blocks_per_page) {
@@ -215,28 +212,28 @@
/*
* This page will go to BIO. Do we need to send this BIO off first?
*/
- if (bio && (bio->bi_idx == bio->bi_vcnt ||
- *last_block_in_bio != blocks[0] - 1))
- bio = mpage_bio_submit(READ, bio);
-
if (bio == NULL) {
- unsigned nr_bvecs = MPAGE_BIO_MAX_SIZE / PAGE_CACHE_SIZE;
+ unsigned int nr_bvecs = max_bvecs;

if (nr_bvecs > nr_pages)
nr_bvecs = nr_pages;
- bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
+ bio = mpage_alloc(bdev, READ, blocks[0] << (blkbits - 9),
nr_bvecs, GFP_KERNEL);
if (bio == NULL)
goto confused;
}
+ else if (*last_block_in_bio != blocks[0] - 1)
+ bio->bi_idx = bio->bi_vcnt; /* force bio_append to
+ allocate a new bio. */
+
+ if (bio_append(&bio, page, first_hole << blkbits, 0) != 0)
+ goto confused;
+
+ if (bio->bi_vcnt == 1)
+ bio->bi_sector = blocks[0] << (blkbits - 9);

- bvec = &bio->bi_io_vec[bio->bi_idx++];
- bvec->bv_page = page;
- bvec->bv_len = (first_hole << blkbits);
- bvec->bv_offset = 0;
- bio->bi_size += bvec->bv_len;
if (buffer_boundary(&bh) || (first_hole != blocks_per_page))
- bio = mpage_bio_submit(READ, bio);
+ bio->bi_idx = bio->bi_vcnt;
else
*last_block_in_bio = blocks[blocks_per_page - 1];
out:
@@ -244,7 +241,7 @@

confused:
if (bio)
- bio = mpage_bio_submit(READ, bio);
+ bio = mpage_bio_submit(bio);
block_read_full_page(page, get_block);
goto out;
}
@@ -270,7 +267,7 @@
}
BUG_ON(!list_empty(pages));
if (bio)
- mpage_bio_submit(READ, bio);
+ mpage_bio_submit(bio);
return 0;
}
EXPORT_SYMBOL(mpage_readpages);
@@ -286,7 +283,7 @@
bio = do_mpage_readpage(bio, page, 1,
&last_block_in_bio, get_block);
if (bio)
- mpage_bio_submit(READ, bio);
+ mpage_bio_submit(bio);
return 0;
}
EXPORT_SYMBOL(mpage_readpage);
@@ -315,14 +312,19 @@
const unsigned blkbits = inode->i_blkbits;
unsigned long end_index;
const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
- struct bio_vec *bvec;
sector_t last_block;
sector_t block_in_file;
sector_t blocks[MAX_BUF_PER_PAGE];
unsigned page_block;
unsigned first_unmapped = blocks_per_page;
- struct block_device *bdev = NULL;
+ struct block_device *bdev =
+ S_ISBLK(inode->i_mode) ? inode->i_bdev : inode->i_sb->s_bdev;
int boundary = 0;
+ int bvec_size = PAGE_SIZE;
+ const unsigned max_bvecs = bio_max_iovecs(bdev->bd_queue, &bvec_size);
+
+ if (bvec_size != PAGE_SIZE)
+ goto confused;

if (page_has_buffers(page)) {
struct buffer_head *head = page_buffers(page);
@@ -355,7 +357,6 @@
}
blocks[page_block++] = bh->b_blocknr;
boundary = buffer_boundary(bh);
- bdev = bh->b_bdev;
} while ((bh = bh->b_this_page) != head);

if (first_unmapped)
@@ -391,7 +392,6 @@
}
blocks[page_block++] = map_bh.b_blocknr;
boundary = buffer_boundary(&map_bh);
- bdev = map_bh.b_bdev;
if (block_in_file == last_block)
break;
block_in_file++;
@@ -417,18 +417,14 @@
/*
* This page will go to BIO. Do we need to send this BIO off first?
*/
- if (bio && (bio->bi_idx == bio->bi_vcnt ||
- *last_block_in_bio != blocks[0] - 1))
- bio = mpage_bio_submit(WRITE, bio);
-
if (bio == NULL) {
- unsigned nr_bvecs = MPAGE_BIO_MAX_SIZE / PAGE_CACHE_SIZE;
-
- bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
- nr_bvecs, GFP_NOFS);
+ bio = mpage_alloc(bdev, WRITE, blocks[0] << (blkbits - 9),
+ max_bvecs, GFP_NOFS);
if (bio == NULL)
goto confused;
}
+ else if (*last_block_in_bio != blocks[0] - 1)
+ bio->bi_idx = bio->bi_vcnt;

/*
* OK, we have our BIO, so we can now mark the buffers clean. Make
@@ -447,23 +443,21 @@
} while (bh != head);
}

- bvec = &bio->bi_io_vec[bio->bi_idx++];
- bvec->bv_page = page;
- bvec->bv_len = (first_unmapped << blkbits);
- bvec->bv_offset = 0;
- bio->bi_size += bvec->bv_len;
+ bio_append(&bio, page, first_unmapped << blkbits, 0);
+ if (bio->bi_vcnt == 1)
+ bio->bi_sector = blocks[0] << (blkbits - 9);
BUG_ON(PageWriteback(page));
SetPageWriteback(page);
unlock_page(page);
if (boundary || (first_unmapped != blocks_per_page))
- bio = mpage_bio_submit(WRITE, bio);
+ bio = mpage_bio_submit(bio);
else
*last_block_in_bio = blocks[blocks_per_page - 1];
goto out;

confused:
if (bio)
- bio = mpage_bio_submit(WRITE, bio);
+ bio = mpage_bio_submit(bio);
*ret = block_write_full_page(page, get_block);
out:
return bio;
@@ -541,7 +535,7 @@
}
write_unlock(&mapping->page_lock);
if (bio)
- mpage_bio_submit(WRITE, bio);
+ mpage_bio_submit(bio);
return ret;
}
EXPORT_SYMBOL(mpage_writepages);

2002-06-06 19:35:29

by Andrew Morton

Subject: Re: Patch??: linux-2.5.20/fs/bio.c - ll_rw_kio could generate bio's bigger than queue could handle

"Adam J. Richter" wrote:
>
> ...
> +int bio_append(struct bio **bio_p, struct page *page, int len, int offset)
> +{
> + struct bio *bio = *bio_p;
> + struct bio_vec *vec;
> + if (bio->bi_idx == bio->bi_vcnt) {
> + struct bio *old = bio;
> + *bio_p = bio = bio_alloc(GFP_KERNEL, old->bi_vcnt);

GFP_KERNEL is OK on the readpages() path, but on the writepages()
path it can go infinitely recursive - it needs to be GFP_NOFS.
So another arg is needed here. I suspect this function ends up adding
more complexity than it takes away, frankly.

> + if (bio != NULL) {
> + bio->bi_sector =
> + old->bi_sector + (old->bi_size >> 9);
> +
> +#define COPY(field) bio->bi_ ## field = old->bi_ ## field
> + COPY(bdev);
> + COPY(flags);
> + COPY(idx);
> + COPY(rw);
> + COPY(end_io);
> + COPY(private);
> +#undef COPY
> + }
> + old->bi_idx = 0;
> + submit_bio(old->bi_rw, old);

Here we're allocating a new BIO before submitting the old one.
This increases the risk of oom deadlocks on the vm_writeback()
path.

I'd prefer that the bio alloc/submit code paths be left as-is...


> + struct block_device *bdev =
> + S_ISBLK(inode->i_mode) ? inode->i_bdev : inode->i_sb->s_bdev;

I believe it's cleaner to allow get_block() to decide which
blockdevice belongs to this mapping, really.

-

2002-06-06 22:09:42

by Matthew Dobson

Subject: Re: Patch??: linux-2.5.20/fs/bio.c - ll_rw_kio could generate bio's bigger than queue could handle

Adam,
Your patch version 3 did the trick for me. It booted past the panic, but
it's now hung right after init loads. After init spits out "INIT: " it
locks up.. I've put in kdb, and all the cpus seem to be spinning in the
TIF_NEED_RESCHED loop in poll_idle. I dunno if this is in any way
related to the block stuff or not...

But the panic stopped happening!

Cheers!

-Matt


Adam J. Richter wrote:
> Here is version 3 of my changes to ll_rw_kio in fs/bio.c and fs/mpage.c.
> The differences from the previous version are that do_mpage_readpage
> and mpage_writepage now precompute bdev differently, and I have
> tried to factor out a little bit of common code between bio.c and
> mpage.c in the form of a routine bio_append().
>
> I now know that I am actually running the mpage.c code, because I
> screwed up one of my changes and got to see the resulting kernel
> panic. On the other hand, I do not think that I have caused ll_rw_kio
> to be executed. So, I believe my ll_rw_kio changes are completely
> untested.
>
> For what it's worth, I am composing this email on a machine running
> the patch that I have attached below.
>
> Adam J. Richter __ ______________ 575 Oroville Road
> [email protected] \ / Milpitas, California 95035
> +1 408 309-6081 | g g d r a s i l United States of America
> "Free Software For The Rest Of Us."
>
>
> --- linux-2.5.20/include/linux/bio.h 2002-06-02 18:44:52.000000000 -0700
> +++ linux/include/linux/bio.h 2002-06-06 06:30:04.000000000 -0700
> @@ -34,9 +34,6 @@
> #define BIO_BUG_ON
> #endif
>
> -#define BIO_MAX_SECTORS 128
> -#define BIO_MAX_SIZE (BIO_MAX_SECTORS << 9)
> -
> /*
> * was unsigned short, but we might as well be ready for > 64kB I/O pages
> */
> @@ -201,6 +198,8 @@
> extern struct bio *bio_clone(struct bio *, int);
> extern struct bio *bio_copy(struct bio *, int, int);
>
> +extern int bio_append(struct bio **bio_p,
> + struct page *page, int len, int offset);
> extern inline void bio_init(struct bio *);
>
> extern int bio_ioctl(kdev_t, unsigned int, unsigned long);
> --- linux-2.5.20/include/linux/blkdev.h 2002-06-02 18:44:44.000000000 -0700
> +++ linux/include/linux/blkdev.h 2002-06-06 04:40:25.000000000 -0700
> @@ -191,9 +191,9 @@
> * queue settings
> */
> unsigned short max_sectors;
> - unsigned short max_phys_segments;
> - unsigned short max_hw_segments;
> + unsigned short max_phys_segments; /* <= max_sectors */
> + unsigned short max_hw_segments; /* <= max_sectors */
> unsigned short hardsect_size;
> unsigned int max_segment_size;
>
> @@ -418,5 +428,7 @@
> page_cache_release(p.v);
> }
>
> +extern int bio_max_iovecs(request_queue_t *q, int *iovec_size);
> +
> #endif
> --- linux-2.5.20/fs/bio.c 2002-06-02 18:44:40.000000000 -0700
> +++ linux/fs/bio.c 2002-06-06 06:13:52.000000000 -0700
> @@ -316,6 +316,64 @@
> return NULL;
> }
>
> +int bio_max_iovecs(request_queue_t *q, int *iovec_size)
> +{
> + const int max_bytes = q->max_sectors << 9;
> + int max_iovecs;
> +
> + if (*iovec_size > max_bytes) {
> + *iovec_size = max_bytes;
> + return 1;
> + }
> + max_iovecs = max_bytes / *iovec_size;
> + if (max_iovecs > q->max_phys_segments)
> + max_iovecs = q->max_phys_segments;
> + if (max_iovecs > q->max_hw_segments)
> + max_iovecs = q->max_hw_segments;
> + return max_iovecs;
> +}
> +
> +/* bio_append appends an IO vector to a bio. bio_append expects to be
> + called with bio->bi_idx indicating the maximum number of IO vectors
> + that this bio can hold, and with bio->bi_vcnt indicating the number
> + of IO vectors already loaded. If the provided bio is already full,
> + bio_append will submit the current bio and allocate a new one. */
> +
> +int bio_append(struct bio **bio_p, struct page *page, int len, int offset)
> +{
> + struct bio *bio = *bio_p;
> + struct bio_vec *vec;
> + if (bio->bi_idx == bio->bi_vcnt) {
> + struct bio *old = bio;
> + *bio_p = bio = bio_alloc(GFP_KERNEL, old->bi_vcnt);
> + if (bio != NULL) {
> + bio->bi_sector =
> + old->bi_sector + (old->bi_size >> 9);
> +
> +#define COPY(field) bio->bi_ ## field = old->bi_ ## field
> + COPY(bdev);
> + COPY(flags);
> + COPY(idx);
> + COPY(rw);
> + COPY(end_io);
> + COPY(private);
> +#undef COPY
> + }
> + old->bi_idx = 0;
> + submit_bio(old->bi_rw, old);
> + if (bio == NULL)
> + return -ENOMEM;
> + }
> + vec = &bio->bi_io_vec[bio->bi_vcnt++];
> + vec->bv_page = page;
> + vec->bv_len = len;
> + vec->bv_offset = offset;
> +
> + bio->bi_size += len;
> + return 0;
> +}
> +
> +
> static void bio_end_io_kio(struct bio *bio)
> {
> struct kiobuf *kio = (struct kiobuf *) bio->bi_private;
> @@ -338,10 +396,10 @@
> **/
> void ll_rw_kio(int rw, struct kiobuf *kio, struct block_device *bdev, sector_t sector)
> {
> - int i, offset, size, err, map_i, total_nr_pages, nr_pages;
> - struct bio_vec *bvec;
> + int offset, size, err, map_i, total_nr_pages, nr_bvecs;
> struct bio *bio;
> kdev_t dev = to_kdev_t(bdev->bd_dev);
> + int bytes_per_iovec, max_pages;
>
> err = 0;
> if ((rw & WRITE) && is_read_only(dev)) {
> @@ -367,63 +425,58 @@
>
> map_i = 0;
>
> -next_chunk:
> - nr_pages = BIO_MAX_SECTORS >> (PAGE_SHIFT - 9);
> - if (nr_pages > total_nr_pages)
> - nr_pages = total_nr_pages;
> + bytes_per_iovec = PAGE_SIZE;
> + max_pages = bio_max_iovecs(bdev->bd_queue, &bytes_per_iovec);
> +
> + nr_bvecs = max_pages;
> + if (nr_bvecs > total_nr_pages)
> + nr_bvecs = total_nr_pages;
>
> atomic_inc(&kio->io_count);
>
> /*
> * allocate bio and do initial setup
> */
> - if ((bio = bio_alloc(GFP_NOIO, nr_pages)) == NULL) {
> + if ((bio = bio_alloc(GFP_NOIO, nr_bvecs)) == NULL) {
> err = -ENOMEM;
> goto out;
> }
>
> + bio->bi_idx = nr_bvecs; /* for bio_append */
> + bio->bi_rw = rw;
> +
> bio->bi_sector = sector;
> bio->bi_bdev = bdev;
> - bio->bi_idx = 0;
> bio->bi_end_io = bio_end_io_kio;
> bio->bi_private = kio;
>
> - bvec = bio->bi_io_vec;
> - for (i = 0; i < nr_pages; i++, bvec++, map_i++) {
> + while (total_nr_pages > 0) {
> int nbytes = PAGE_SIZE - offset;
>
> if (nbytes > size)
> nbytes = size;
> + if (nbytes > bytes_per_iovec)
> + nbytes = bytes_per_iovec;
>
> BUG_ON(kio->maplist[map_i] == NULL);
>
> - if (bio->bi_size + nbytes > (BIO_MAX_SECTORS << 9))
> - goto queue_io;
> -
> - bio->bi_vcnt++;
> - bio->bi_size += nbytes;
> + err = bio_append(&bio, kio->maplist[map_i], nbytes, offset);
> + if (err)
> + goto out;
> +
> + offset = (offset + nbytes) & PAGE_MASK;
> + if (offset == 0) {
> + total_nr_pages--;
> + map_i++;
> + }
>
> - bvec->bv_page = kio->maplist[map_i];
> - bvec->bv_len = nbytes;
> - bvec->bv_offset = offset;
> -
> - /*
> - * kiobuf only has an offset into the first page
> - */
> - offset = 0;
> -
> - sector += nbytes >> 9;
> size -= nbytes;
> - total_nr_pages--;
> kio->offset += nbytes;
> }
>
> -queue_io:
> + bio->bi_idx = 0;
> submit_bio(rw, bio);
>
> - if (total_nr_pages)
> - goto next_chunk;
> -
> if (size) {
> printk("ll_rw_kio: size %d left (kio %d)\n", size, kio->length);
> BUG();
> --- linux-2.5.20/fs/mpage.c 2002-06-02 18:44:44.000000000 -0700
> +++ linux/fs/mpage.c 2002-06-06 06:14:13.000000000 -0700
> @@ -21,12 +21,6 @@
> #include <linux/mpage.h>
>
> /*
> - * The largest-sized BIO which this code will assemble, in bytes. Set this
> - * to PAGE_CACHE_SIZE if your drivers are broken.
> - */
> -#define MPAGE_BIO_MAX_SIZE BIO_MAX_SIZE
> -
> -/*
> * I/O completion handler for multipage BIOs.
> *
> * The mpage code never puts partial pages into a BIO (except for end-of-file).
> @@ -78,19 +72,15 @@
> bio_put(bio);
> }
>
> -struct bio *mpage_bio_submit(int rw, struct bio *bio)
> +struct bio *mpage_bio_submit(struct bio *bio)
> {
> - bio->bi_vcnt = bio->bi_idx;
> bio->bi_idx = 0;
> - bio->bi_end_io = mpage_end_io_read;
> - if (rw == WRITE)
> - bio->bi_end_io = mpage_end_io_write;
> - submit_bio(rw, bio);
> + submit_bio(bio->bi_rw, bio);
> return NULL;
> }
>
> static struct bio *
> -mpage_alloc(struct block_device *bdev,
> +mpage_alloc(struct block_device *bdev, int rw,
> sector_t first_sector, int nr_vecs, int gfp_flags)
> {
> struct bio *bio;
> @@ -98,11 +88,14 @@
> bio = bio_alloc(gfp_flags, nr_vecs);
> if (bio) {
> bio->bi_bdev = bdev;
> - bio->bi_vcnt = nr_vecs;
> - bio->bi_idx = 0;
> + bio->bi_vcnt = 0;
> + bio->bi_idx = nr_vecs;
> bio->bi_size = 0;
> bio->bi_sector = first_sector;
> - bio->bi_io_vec[0].bv_page = NULL;
> + bio->bi_rw = rw;
> + bio->bi_end_io = mpage_end_io_read;
> + if (rw == WRITE)
> + bio->bi_end_io = mpage_end_io_write;
> }
> return bio;
> }
> @@ -161,14 +154,19 @@
> const unsigned blkbits = inode->i_blkbits;
> const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
> const unsigned blocksize = 1 << blkbits;
> - struct bio_vec *bvec;
> sector_t block_in_file;
> sector_t last_block;
> sector_t blocks[MAX_BUF_PER_PAGE];
> unsigned page_block;
> unsigned first_hole = blocks_per_page;
> - struct block_device *bdev = NULL;
> + struct block_device *bdev =
> + S_ISBLK(inode->i_mode) ? inode->i_bdev : inode->i_sb->s_bdev;
> struct buffer_head bh;
> + int bvec_size = PAGE_SIZE;
> + const unsigned max_bvecs = bio_max_iovecs(bdev->bd_queue, &bvec_size);
> +
> + if (bvec_size != PAGE_SIZE)
> + goto confused;
>
> if (page_has_buffers(page))
> goto confused;
> @@ -197,7 +195,6 @@
> if (page_block && blocks[page_block-1] != bh.b_blocknr-1)
> goto confused;
> blocks[page_block] = bh.b_blocknr;
> - bdev = bh.b_bdev;
> }
>
> if (first_hole != blocks_per_page) {
> @@ -215,28 +212,28 @@
> /*
> * This page will go to BIO. Do we need to send this BIO off first?
> */
> - if (bio && (bio->bi_idx == bio->bi_vcnt ||
> - *last_block_in_bio != blocks[0] - 1))
> - bio = mpage_bio_submit(READ, bio);
> -
> if (bio == NULL) {
> - unsigned nr_bvecs = MPAGE_BIO_MAX_SIZE / PAGE_CACHE_SIZE;
> + unsigned int nr_bvecs = max_bvecs;
>
> if (nr_bvecs > nr_pages)
> nr_bvecs = nr_pages;
> - bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> + bio = mpage_alloc(bdev, READ, blocks[0] << (blkbits - 9),
> nr_bvecs, GFP_KERNEL);
> if (bio == NULL)
> goto confused;
> }
> + else if (*last_block_in_bio != blocks[0] - 1)
> + bio->bi_idx = bio->bi_vcnt; /* force bio_append to
> + allocate a new bio. */
> +
> + if (bio_append(&bio, page, first_hole << blkbits, 0) != 0)
> + goto confused;
> +
> + if (bio->bi_vcnt == 1)
> + bio->bi_sector = blocks[0] << (blkbits - 9);
>
> - bvec = &bio->bi_io_vec[bio->bi_idx++];
> - bvec->bv_page = page;
> - bvec->bv_len = (first_hole << blkbits);
> - bvec->bv_offset = 0;
> - bio->bi_size += bvec->bv_len;
> if (buffer_boundary(&bh) || (first_hole != blocks_per_page))
> - bio = mpage_bio_submit(READ, bio);
> + bio->bi_idx = bio->bi_vcnt;
> else
> *last_block_in_bio = blocks[blocks_per_page - 1];
> out:
> @@ -244,7 +241,7 @@
>
> confused:
> if (bio)
> - bio = mpage_bio_submit(READ, bio);
> + bio = mpage_bio_submit(bio);
> block_read_full_page(page, get_block);
> goto out;
> }
> @@ -270,7 +267,7 @@
> }
> BUG_ON(!list_empty(pages));
> if (bio)
> - mpage_bio_submit(READ, bio);
> + mpage_bio_submit(bio);
> return 0;
> }
> EXPORT_SYMBOL(mpage_readpages);
> @@ -286,7 +283,7 @@
> bio = do_mpage_readpage(bio, page, 1,
> &last_block_in_bio, get_block);
> if (bio)
> - mpage_bio_submit(READ, bio);
> + mpage_bio_submit(bio);
> return 0;
> }
> EXPORT_SYMBOL(mpage_readpage);
> @@ -315,14 +312,19 @@
> const unsigned blkbits = inode->i_blkbits;
> unsigned long end_index;
> const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
> - struct bio_vec *bvec;
> sector_t last_block;
> sector_t block_in_file;
> sector_t blocks[MAX_BUF_PER_PAGE];
> unsigned page_block;
> unsigned first_unmapped = blocks_per_page;
> - struct block_device *bdev = NULL;
> + struct block_device *bdev =
> + S_ISBLK(inode->i_mode) ? inode->i_bdev : inode->i_sb->s_bdev;
> int boundary = 0;
> + int bvec_size = PAGE_SIZE;
> + const unsigned max_bvecs = bio_max_iovecs(bdev->bd_queue, &bvec_size);
> +
> + if (bvec_size != PAGE_SIZE)
> + goto confused;
>
> if (page_has_buffers(page)) {
> struct buffer_head *head = page_buffers(page);
> @@ -355,7 +357,6 @@
> }
> blocks[page_block++] = bh->b_blocknr;
> boundary = buffer_boundary(bh);
> - bdev = bh->b_bdev;
> } while ((bh = bh->b_this_page) != head);
>
> if (first_unmapped)
> @@ -391,7 +392,6 @@
> }
> blocks[page_block++] = map_bh.b_blocknr;
> boundary = buffer_boundary(&map_bh);
> - bdev = map_bh.b_bdev;
> if (block_in_file == last_block)
> break;
> block_in_file++;
> @@ -417,18 +417,14 @@
> /*
> * This page will go to BIO. Do we need to send this BIO off first?
> */
> - if (bio && (bio->bi_idx == bio->bi_vcnt ||
> - *last_block_in_bio != blocks[0] - 1))
> - bio = mpage_bio_submit(WRITE, bio);
> -
> if (bio == NULL) {
> - unsigned nr_bvecs = MPAGE_BIO_MAX_SIZE / PAGE_CACHE_SIZE;
> -
> - bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
> - nr_bvecs, GFP_NOFS);
> + bio = mpage_alloc(bdev, WRITE, blocks[0] << (blkbits - 9),
> + max_bvecs, GFP_NOFS);
> if (bio == NULL)
> goto confused;
> }
> + else if (*last_block_in_bio != blocks[0] - 1)
> + bio->bi_idx = bio->bi_vcnt;
>
> /*
> * OK, we have our BIO, so we can now mark the buffers clean. Make
> @@ -447,23 +443,21 @@
> } while (bh != head);
> }
>
> - bvec = &bio->bi_io_vec[bio->bi_idx++];
> - bvec->bv_page = page;
> - bvec->bv_len = (first_unmapped << blkbits);
> - bvec->bv_offset = 0;
> - bio->bi_size += bvec->bv_len;
> + bio_append(&bio, page, first_unmapped << blkbits, 0);
> + if (bio->bi_vcnt == 1)
> + bio->bi_sector = blocks[0] << (blkbits - 9);
> BUG_ON(PageWriteback(page));
> SetPageWriteback(page);
> unlock_page(page);
> if (boundary || (first_unmapped != blocks_per_page))
> - bio = mpage_bio_submit(WRITE, bio);
> + bio = mpage_bio_submit(bio);
> else
> *last_block_in_bio = blocks[blocks_per_page - 1];
> goto out;
>
> confused:
> if (bio)
> - bio = mpage_bio_submit(WRITE, bio);
> + bio = mpage_bio_submit(bio);
> *ret = block_write_full_page(page, get_block);
> out:
> return bio;
> @@ -541,7 +535,7 @@
> }
> write_unlock(&mapping->page_lock);
> if (bio)
> - mpage_bio_submit(WRITE, bio);
> + mpage_bio_submit(bio);
> return ret;
> }
> EXPORT_SYMBOL(mpage_writepages);
>



2002-06-06 23:26:11

by Adam J. Richter

Subject: Re: Patch??: linux-2.5.20/fs/bio.c - ll_rw_kio could generate bio's bigger than queue could handle

Here is version 4 of my changes to fs/bio.c (ll_rw_kio) and
fs/mpage.c. Andrew, I believe this accommodates all of your
requests. Please let me know if I missed anything. If it looks
good, I would appreciate your recommendation on how to proceed with
getting this into Linus's tree.

The one new behavior in this version is that bio_max_iovecs()
will treat q->max_sectors == 0 as meaning that there is no limit.
This should accommodate any drivers that set max_sectors == 0.

We might want to keep the max_sectors == 0 behavior anyhow.
It is useful to have a way for drivers to indicate that they
don't care about the total number of sectors, rather than require
drivers that have no inherent limit to pick something arbitrarily.



--- linux-2.5.20/include/linux/bio.h 2002-06-02 18:44:52.000000000 -0700
+++ linux/include/linux/bio.h 2002-06-06 15:52:29.000000000 -0700
@@ -34,11 +34,8 @@
#define BIO_BUG_ON
#endif

-#define BIO_MAX_SECTORS 128
-#define BIO_MAX_SIZE (BIO_MAX_SECTORS << 9)
-
/*
* was unsigned short, but we might as well be ready for > 64kB I/O pages
*/
--- linux-2.5.20/include/linux/blkdev.h 2002-06-02 18:44:44.000000000 -0700
+++ linux/include/linux/blkdev.h 2002-06-06 15:52:50.000000000 -0700
@@ -418,5 +428,7 @@
page_cache_release(p.v);
}

+extern int bio_max_iovecs(request_queue_t *q, int *iovec_size);
+
#endif
--- linux-2.5.20/fs/bio.c 2002-06-02 18:44:40.000000000 -0700
+++ linux/fs/bio.c 2002-06-06 16:17:03.000000000 -0700
@@ -50,8 +50,6 @@
};
#undef BV

-#define BIO_MAX_PAGES (bvec_array[BIOVEC_NR_POOLS - 1].size)
-
static void *slab_pool_alloc(int gfp_mask, void *data)
{
return kmem_cache_alloc(data, gfp_mask);
@@ -316,6 +314,22 @@
return NULL;
}

+int bio_max_iovecs(request_queue_t *q, int *iovec_size)
+{
+ unsigned max_iovecs = min(q->max_phys_segments, q->max_hw_segments);
+
+ if (q->max_sectors != 0) {
+ unsigned int max_bytes = q->max_sectors << 9;
+ if (*iovec_size > max_bytes) {
+ *iovec_size = max_bytes;
+ return 1;
+ }
+ max_iovecs = min(max_iovecs, max_bytes / *iovec_size);
+ }
+
+ return max_iovecs;
+}
+
static void bio_end_io_kio(struct bio *bio)
{
struct kiobuf *kio = (struct kiobuf *) bio->bi_private;
@@ -338,10 +352,11 @@
**/
void ll_rw_kio(int rw, struct kiobuf *kio, struct block_device *bdev, sector_t sector)
{
- int i, offset, size, err, map_i, total_nr_pages, nr_pages;
+ int offset, size, err, map_i, total_nr_pages, nr_bvecs;
struct bio_vec *bvec;
struct bio *bio;
kdev_t dev = to_kdev_t(bdev->bd_dev);
+ int bytes_per_iovec, max_bvecs;

err = 0;
if ((rw & WRITE) && is_read_only(dev)) {
@@ -367,17 +382,20 @@

map_i = 0;

+ bytes_per_iovec = PAGE_SIZE;
+ max_bvecs = bio_max_iovecs(bdev->bd_queue, &bytes_per_iovec);
+
next_chunk:
- nr_pages = BIO_MAX_SECTORS >> (PAGE_SHIFT - 9);
- if (nr_pages > total_nr_pages)
- nr_pages = total_nr_pages;
+ nr_bvecs = max_bvecs;
+ if (nr_bvecs > total_nr_pages)
+ nr_bvecs = total_nr_pages;

atomic_inc(&kio->io_count);

/*
* allocate bio and do initial setup
*/
- if ((bio = bio_alloc(GFP_NOIO, nr_pages)) == NULL) {
+ if ((bio = bio_alloc(GFP_NOIO, nr_bvecs)) == NULL) {
err = -ENOMEM;
goto out;
}
@@ -389,15 +407,17 @@
bio->bi_private = kio;

bvec = bio->bi_io_vec;
- for (i = 0; i < nr_pages; i++, bvec++, map_i++) {
+ for (;total_nr_pages > 0; bvec++, nr_bvecs--) {
int nbytes = PAGE_SIZE - offset;

if (nbytes > size)
nbytes = size;
+ if (nbytes > bytes_per_iovec)
+ nbytes = bytes_per_iovec;

BUG_ON(kio->maplist[map_i] == NULL);

- if (bio->bi_size + nbytes > (BIO_MAX_SECTORS << 9))
+ if (nr_bvecs == 0)
goto queue_io;

bio->bi_vcnt++;
@@ -407,14 +427,14 @@
bvec->bv_len = nbytes;
bvec->bv_offset = offset;

- /*
- * kiobuf only has an offset into the first page
- */
- offset = 0;
+ offset = (offset + nbytes) & PAGE_MASK;
+ if (offset == 0) {
+ total_nr_pages--;
+ map_i++;
+ }

sector += nbytes >> 9;
size -= nbytes;
- total_nr_pages--;
kio->offset += nbytes;
}

--- linux-2.5.20/fs/mpage.c 2002-06-06 12:30:14.000000000 -0700
+++ linux/fs/mpage.c 2002-06-06 15:44:24.000000000 -0700
@@ -21,12 +21,6 @@
#include <linux/mpage.h>

/*
- * The largest-sized BIO which this code will assemble, in bytes. Set this
- * to PAGE_CACHE_SIZE if your drivers are broken.
- */
-#define MPAGE_BIO_MAX_SIZE BIO_MAX_SIZE
-
-/*
* I/O completion handler for multipage BIOs.
*
* The mpage code never puts partial pages into a BIO (except for end-of-file).
@@ -220,9 +214,13 @@
bio = mpage_bio_submit(READ, bio);

if (bio == NULL) {
- unsigned nr_bvecs = MPAGE_BIO_MAX_SIZE / PAGE_CACHE_SIZE;
+ int bvec_size = PAGE_SIZE;
+ unsigned nr_bvecs = bio_max_iovecs(bdev->bd_queue,&bvec_size);

- if (nr_bvecs > nr_pages)
+ if (bvec_size != PAGE_SIZE)
+ goto confused;
+
+ if (nr_bvecs > nr_pages)
nr_bvecs = nr_pages;
bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
nr_bvecs, GFP_KERNEL);
@@ -422,7 +420,11 @@
bio = mpage_bio_submit(WRITE, bio);

if (bio == NULL) {
- unsigned nr_bvecs = MPAGE_BIO_MAX_SIZE / PAGE_CACHE_SIZE;
+ int bvec_size = PAGE_SIZE;
+ unsigned nr_bvecs = bio_max_iovecs(bdev->bd_queue,&bvec_size);
+
+ if (bvec_size != PAGE_SIZE)
+ goto confused;

bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
nr_bvecs, GFP_NOFS);

2002-06-07 02:11:14

by Andrew Morton

Subject: Re: Patch??: linux-2.5.20/fs/bio.c - ll_rw_kio could generate bio's bigger than queue could handle

"Adam J. Richter" wrote:
>
> Here is version 4 of my changes to fs/bio.c (ll_rw_kio) and
> fs/mpage.c. Andrew, I believe this accommodates all of your
> requests. Please let me know if I missed anything. If it looks
> good, I would appreciate your recommendation on how to proceed with
> getting this into Linus's tree.

mpage stuff looks good.

We need to wake Jens up before going any further. (The test
for ->max_sectors != 0 looks funny).

The main issue is that this code will now permit really
big BIOs. Up to a megabyte. That'll work OK on just-a-disk,
but apparently ->max_sectors isn't worth squat for stacked
devices.

Back in March I was told "That's why BIO_MAX_SIZE exists. It's the
size it is due to typical hardware restrictions :-/". Well,
we see here that your Smart2 RAID controller has just blown that
idea out of the water.

Also this:

:> Actually: what about the case where we're not using any device
:> stacking at all? Where we're just talking to a plain old
:> disk? That's a common case, and it's worth optimising for.
:
:That would be doable. Just require stacking drivers to mark the queue as
:such -- so md etc would do something ala blk_queue_stacked(q) and that
:would just set the QUEUE_FLAG_STACKED flag. Then we know that
:q->max_sectors indeed can be trusted completely.

Now this idea would let us assemble large BIOs nicely against
"just a disk". But we'd still need BIO_MAX_SIZE, and your
RAID controller will still explode when stacked upon.

What you can do short-term is (yech) teach bio_max_iovecs() to
not return a value greater than 64 kbytes.

Longer term, we really need a way of propagating the "max request size"
info up the stack. At a minimum we need to do this statically.

Ideally we do it dynamically, passing in the starting sector, so
we can calculate the maximum-sized BIO correctly each time and
we never need to get into BIO-splitting horrors.

-

2002-06-07 12:46:12

by Adam J. Richter

Subject: Re: Patch??: linux-2.5.20/fs/bio.c - ll_rw_kio could generate bio's bigger than queue could handle

Andrew Morton <[email protected]> wrote:
>mpage stuff looks good.

Great! (Actually, I think I should change one PAGE_SIZE to
PAGE_CACHE_SIZE in my patch before a final version goes to Linus.)

Regarding q->max_sectors == 0, it looks like some real
hardware devices use it. blk_init_queue does not set it, and some
drivers that call blk_init_queue do not set it, although I assume that
they initialize their queue structures to zero. Examples of this
include the proprietary interface CDROM drives (I mention this example
because I was looking at that code, not because people actually use
those drives anymore).

Regarding "stacked" devices like /dev/loop and raid,
blk_queue_make_request sets q->max_sectors to 255 (MAX_SECTORS in
include/linux/blkdev.h, only used here and as a readahead setting in
md.c). This limits a single bio on a stacked device that keeps this
default to 31 4kB pages, as opposed to the previous limit of 128
sectors == 16 4kB pages. So the only new problem should be with
making a loop or raid device out of a device that could handle a
request of 16 4kB pages but could not handle one of 31 4kB pages.
The only driver that I could find that would be affected is a SCSI
disk on an ips SCSI controller (see table below), although my search
was incomplete.

I have verified that, under my patch, I was able to make
a loop device from an IDE disk partition, make a ext2 file system on
it, mount it, copy the shell to it, and run that shell binary.

I could change MAX_SECTORS to 128, which will give you the
previous limit, which should support everything that I've checked
except a SCSI disk on a QLogic controller, which would not have worked
before either. To support QLogic SCSI, it looks like q->max_sectors
might have to be 32 or less (because I think its max_hw_segments
could be as low as four).

The correct solution for /dev/loop is to create separate
queues for /dev/loop/0, /dev/loop/1, etc., and set the various queue
parameters when each queue is bound to its underlying device or file.
I think I can readily implement this and remove references to minor
device numbers from loop.c in the process, but I don't want to create
a big dependency chain for getting my current patch into the kernel.

Likewise, the device mapper (or the specialized raid, logical
volume management and disk partition code which the device mapper
deprecates) should also do something like the following whenever an
underlying device is added or removed:

int max_phys = parent->childqueue[0].max_phys_segments;
int max_hw = parent->childqueue[0].max_hw_segments;
int max_sec = parent->childqueue[0].max_sectors;
for (i = 1; i < parent->num_children; i++) {
max_phys = min(max_phys, parent->childqueue[0].max_phys_segments);
max_hw = min(max_hw, parent->childqueue[0].max_hw_segments);
max_sec = min(max_sec, parent->childqueue[0].max_sectors);
}
blk_queue_max_phys_segments(&parent->queue, max_phys);
blk_queue_max_hw_segments(&parent->queue, max_hw);
blk_queue_max_sectors(&parent->queue, max_sec);


Anyhow, what I would like to do is to first proceed with my
patch without these moderately involved changes to loop.c and md.c
(the only other stacked device that compiles under 2.5 I believe), but
with the simple change to loop.c and md.c of setting their
q->max_sectors to 16. I'll cc the maintainers of loop.c and md.c, and
I'll also email whoever is developing the device mapper that they
should add a similar line to their initialization. Does that sound
good to you, or would you recommend proceeding in some other way?


>We need to wake Jens up before going any further. (The test
>for ->max_sectors != 0 look funny).

Jens has been cc'ed on all of this. I think I should proceed
as I described in the above paragraph. After that, I'll email Jens
directly and see if that triggers a response. Beyond that, I don't
know what else to do. So, if we don't hear from him within a day or
two after that, I think we should take silence as at least begrudging
consent and submit the change to Linus.




P.S. I've attached a table of the request_queue_t settings I noted
from a few greps through the kernel sources. Some of the scsi settings
are inferred from #define names rather than from walking through to the
code that actually sets the values.

device max_sectors max_hw_segments max_phys_segments
cciss 512 31 31
ps2esdi 128 (128) (128)
xd 1 or 64 (128) (128)
ide disk 256 256* 256*
pdc4030 IDE 127 256* 256*
IOmega zip 100 atapi 64 256* 256*
blk_init_queue 0 128 128
blk_queue_make_request 255 128 128
i2o_block does not compile, weird formula
scsi (just from grepping files):
3w-xxxx 256 62 128
aic7xxx_old 512
atari_scsi 255 128
cpqfcTS 360 128
eata 64 or 252 128
gdth 32 128
ips 128 17 128
mac_scsi 0 (?) 128
osst (tape drive?) 2 128
qla1280 1024
qlogicisp 64 4 + ? 128
qlogicpti 64
tmscsim 16 128
ultrastor 14F 16 128
ultrastor 24F 33 128
scsi disk 1024
scsi tape 16 128
scsi default 1024

[*] PRD_ENTRIES = PAGE_SIZE/16 = 256 using 4kB pages


2002-06-07 16:52:31

by Thunder from the hill

Subject: Re: Patch??: linux-2.5.20/fs/bio.c - ll_rw_kio could generate bio's bigger than queue could handle

Hi,

On Fri, 7 Jun 2002, Adam J. Richter wrote:
> int max_phys = parent->childqueue[0].max_phys_segments;
> int max_hw = parent->childqueue[0].max_hw_segments;
> int max_sec = parent->childqueue[0].max_sectors;

> for (i = 1; i < parent->num_children; i++) {
> max_phys = min(max_phys, parent->childqueue[0].max_phys_segments);
> max_hw = min(max_hw, parent->childqueue[0].max_hw_segments);
> max_sec = min(max_sec, parent->childqueue[0].max_sectors);
> }

Just a question: Did you mean parent->childqueue[0] here, or rather
parent->childqueue[i]?

> blk_queue_max_phys_segments(&parent->queue, max_phys);
> blk_queue_max_hw_segments(&parent->queue, max_hw);
> blk_queue_max_sectors(&parent->queue, max_sec);

Regards,
Thunder
--
ship is leaving right on time | Thunder from the hill at ngforever
empty harbour, wave goodbye |
evacuation of the isle | free inhabitant not directly
caveman's paintings drowning | belonging anywhere

2002-06-07 20:20:28

by Andrew Morton

Subject: Re: Patch??: linux-2.5.20/fs/bio.c - ll_rw_kio could generate bio's bigger than queue could handle

"Adam J. Richter" wrote:
>
> ..
> Jens has been cc'ed on all of this. I think I should proceed
> as I described in the above paragraph.

I'd be more comfortable waiting for an ack from Jens, really.

+int bio_max_iovecs(request_queue_t *q, int *iovec_size)
+{
+ unsigned max_iovecs = min(q->max_phys_segments, q->max_hw_segments);

It seems that this test will significantly reduce the max BIO
size for some devices. What's the thinking here?

-

2002-06-07 22:12:34

by Adam J. Richter

Subject: Re: Patch??: linux-2.5.20/fs/bio.c - ll_rw_kio could generate bio's bigger than queue could handle

Andrew Morton wrote:
>+int bio_max_iovecs(request_queue_t *q, int *iovec_size)
>+{
>+ unsigned max_iovecs = min(q->max_phys_segments, q->max_hw_segments);

>It seems that this test will significantly reduce the max BIO
>size for some devices. What's the thinking here?

When you submit a bio with n iovecs, there is no guarantee
that any of those will have contiguous underlying physical addresses,
or that any of those addresses can be made contiguous from the
view of a DMA device (most architectures lack an IOMMU anyhow,
there is no guarantee that there would be enough entries available in
the IOMMU to do this, and I vaguely recall that existing IOMMUs have a
small number of entries, like 8 or 16). This is true even when each
of the iovecs covers exactly one full page.

Since there is no guarantee that any of these bios can be
merged in the general case, bio submitters have to ensure that
the number of IO vectors in each bio does not exceed q->max_phys_segments
*and* does not exceed q->max_hw_segments. Otherwise, the request that
finally gets to the device driver may exceed the driver's capabilities.

Of course, in cases where the bios happen to point to
physically contiguous memory, the request merging code will notice and
remerge the bios. Since the bios generated from a big mpage or
ll_rw_kio call are already going to be pretty big, the overhead of the
bio merging code will be very small in proportion to the amount of
data being transferred.

By the way, if you know something more about the block of
memory that you are writing, then you may be able to build bigger
bios. For example, if you are writing a single physically contiguous
range, the binding limit becomes q->max_segment_size (something that
I should make bio_max_iovecs also consider).


2002-06-07 22:46:27

by Adam J. Richter

Subject: Re: Patch??: linux-2.5.20/fs/bio.c - ll_rw_kio could generate bio's bigger than queue could handle

"Thunder from the hill" <[email protected]> wrote:
[...]
>> for (i = 1; i < parent->num_children; i++) {
>> max_phys = min(max_phys, parent->childqueue[0].max_phys_segments);
>> max_hw = min(max_hw, parent->childqueue[0].max_hw_segments);
>> max_sec = min(max_sec, parent->childqueue[0].max_sectors);
>> }

>Just a question: Did you mean parent->childqueue[0] here, or rather
>parent->childqueue[i]?

parent->childqueue[i]. Thanks for correcting it.
