2006-12-04 19:28:29

by Chen, Kenneth W

Subject: [patch] speed up single bio_vec allocation

On a 64-bit arch like x86_64, struct bio is 104 bytes. Since the bio slab is
created with the SLAB_HWCACHE_ALIGN flag, there is usually spare memory
available at the end of the bio. I think we can utilize that memory for the
bio_vec allocation. The purpose is not so much saving memory consumption for
the bio_vec; instead, I'm attempting to optimize away a call to bvec_alloc_bs.

So here is a patch to do just that for a 1-segment bio_vec (we currently only
have space for 1 on 2.6.19). The detection of whether there is spare space
available is calculated automatically at compile time. If there is no space
available, there is no run time cost at all because gcc simply optimizes away
all the code added in this patch. If there is space available, the only run
time check is to see what the size of the iovec is, and we do the appropriate
assignment to bio->bi_io_vec etc. The cost is minimal and we gain a whole lot
back from not calling the bvec_alloc_bs() function.

I tried to use cache_line_size() to find out the alignment of struct bio, but
stumbled on the fact that it is a runtime function for x86_64. So instead I
made bio hint to the slab allocator to align on 32 bytes (slab will use the
larger of the hw cache line size and the caller's "align" hint). I think that
is a sane number for the majority of the CPUs out in the world.
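
For illustration, the check boils down to the following arithmetic. This is a
standalone sketch, not kernel code: the 104-byte bio and 16-byte bio_vec are
the x86_64 sizes quoted above, and ALIGN() is written out as the usual
round-up macro.

#include <stdio.h>

#define ALIGN(x, a)     (((x) + (a) - 1) & ~((a) - 1))
#define BIO_ALIGN       32      /* alignment hint used by the patch */
#define BIO_SIZE        104     /* sizeof(struct bio) on x86_64, per above */
#define BVEC_SIZE       16      /* sizeof(struct bio_vec) on x86_64 */

#define BIOVEC_FITS \
        (ALIGN(BIO_SIZE, BIO_ALIGN) == ALIGN(BIO_SIZE + BVEC_SIZE, BIO_ALIGN))

int main(void)
{
        /*
         * ALIGN(104, 32) == 128 and ALIGN(104 + 16, 32) == 128, so the
         * rounded-up slab object already has room for one bio_vec and
         * BIOVEC_FITS is a compile-time 1. If there were no room it
         * would be a compile-time 0 and gcc would drop the embedding
         * code entirely.
         */
        printf("bio rounds to %d, bio+bvec rounds to %d, fits: %d\n",
               (int)ALIGN(BIO_SIZE, BIO_ALIGN),
               (int)ALIGN(BIO_SIZE + BVEC_SIZE, BIO_ALIGN),
               (int)BIOVEC_FITS);
        return 0;
}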


Signed-off-by: Ken Chen <[email protected]>


--- ./fs/bio.c.orig 2006-12-03 17:20:36.000000000 -0800
+++ ./fs/bio.c 2006-12-03 21:29:20.000000000 -0800
@@ -29,11 +29,14 @@
#include <scsi/sg.h> /* for struct sg_iovec */

#define BIO_POOL_SIZE 256
-
+#define BIO_ALIGN 32 /* minimal bio structure alignment */
static kmem_cache_t *bio_slab __read_mostly;

#define BIOVEC_NR_POOLS 6

+#define BIOVEC_FIT_INSIDE_BIO_CACHE_LINE \
+ (ALIGN(sizeof(struct bio), BIO_ALIGN) == \
+ ALIGN(sizeof(struct bio) + sizeof(struct bio_vec), BIO_ALIGN))
/*
* a small number of entries is fine, not going to be performance critical.
* basically we just need to survive
@@ -113,7 +116,8 @@ void bio_free(struct bio *bio, struct bi

BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS);

- mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
+ if (!BIOVEC_FIT_INSIDE_BIO_CACHE_LINE || pool_idx)
+ mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
mempool_free(bio, bio_set->bio_pool);
}

@@ -166,7 +170,15 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
struct bio_vec *bvl = NULL;

bio_init(bio);
- if (likely(nr_iovecs)) {
+
+ /*
+ * if bio_vec can fit into remaining cache line of struct
+ * bio, go ahead use it and skip mempool allocation.
+ */
+ if (nr_iovecs == 1 && BIOVEC_FIT_INSIDE_BIO_CACHE_LINE) {
+ bvl = (struct bio_vec*) (bio + 1);
+ bio->bi_max_vecs = 1;
+ } else if (likely(nr_iovecs)) {
unsigned long idx = 0; /* shut up gcc */

bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
@@ -1204,7 +1216,7 @@ static void __init biovec_init_slabs(voi
struct biovec_slab *bvs = bvec_slabs + i;

size = bvs->nr_vecs * sizeof(struct bio_vec);
- bvs->slab = kmem_cache_create(bvs->name, size, 0,
+ bvs->slab = kmem_cache_create(bvs->name, size, BIO_ALIGN,
SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
}
}


2006-12-04 20:06:08

by Jens Axboe

Subject: Re: [patch] speed up single bio_vec allocation

On Mon, Dec 04 2006, Chen, Kenneth W wrote:
> On 64-bit arch like x86_64, struct bio is 104 byte. Since bio slab is
> created with SLAB_HWCACHE_ALIGN flag, there are usually spare memory
> available at the end of bio. I think we can utilize that memory for
> bio_vec allocation. The purpose is not so much on saving memory consumption
> for bio_vec, instead, I'm attempting to optimize away a call to bvec_alloc_bs.
>
> So here is a patch to do just that for 1 segment bio_vec (we currently only
> have space for 1 on 2.6.19). And the detection whether there are spare space
> available is dynamically calculated at compile time. If there are no space
> available, there will be no run time cost at all because gcc simply optimize
> away all the code added in this patch. If there are space available, the only
> run time check is to see what the size of iovec is and we do appropriate
> assignment to bio->bi_io_Vec etc. The cost is minimal and we gain a whole
> lot back from not calling bvec_alloc_bs() function.
>
> I tried to use cache_line_size() to find out the alignment of struct bio, but
> stumbled on that it is a runtime function for x86_64. So instead I made bio
> to hint to the slab allocator to align on 32 byte (slab will use the larger
> value of hw cache line and caller hints of "align"). I think it is a sane
> number for majority of the CPUs out in the world.

Any benchmarks for this one?

> --- ./fs/bio.c.orig 2006-12-03 17:20:36.000000000 -0800
> +++ ./fs/bio.c 2006-12-03 21:29:20.000000000 -0800
> @@ -29,11 +29,14 @@
> #include <scsi/sg.h> /* for struct sg_iovec */
>
> #define BIO_POOL_SIZE 256
> -
> +#define BIO_ALIGN 32 /* minimal bio structure alignment */
> static kmem_cache_t *bio_slab __read_mostly;
>
> #define BIOVEC_NR_POOLS 6
>
> +#define BIOVEC_FIT_INSIDE_BIO_CACHE_LINE \
> + (ALIGN(sizeof(struct bio), BIO_ALIGN) == \
> + ALIGN(sizeof(struct bio) + sizeof(struct bio_vec), BIO_ALIGN))
> /*
> * a small number of entries is fine, not going to be performance critical.
> * basically we just need to survive
> @@ -113,7 +116,8 @@ void bio_free(struct bio *bio, struct bi
>
> BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS);
>
> - mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
> + if (!BIOVEC_FIT_INSIDE_BIO_CACHE_LINE || pool_idx)
> + mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);

I think you should use a flag for this instead.

> mempool_free(bio, bio_set->bio_pool);
> }
>
> @@ -166,7 +170,15 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
> struct bio_vec *bvl = NULL;
>
> bio_init(bio);
> - if (likely(nr_iovecs)) {
> +
> + /*
> + * if bio_vec can fit into remaining cache line of struct
> + * bio, go ahead use it and skip mempool allocation.
> + */
> + if (nr_iovecs == 1 && BIOVEC_FIT_INSIDE_BIO_CACHE_LINE) {
> + bvl = (struct bio_vec*) (bio + 1);
> + bio->bi_max_vecs = 1;
> + } else if (likely(nr_iovecs)) {
> unsigned long idx = 0; /* shut up gcc */

Either move this logic to bvec_alloc_bs(), or remember to clear bvl as
is done there.

> @@ -1204,7 +1216,7 @@ static void __init biovec_init_slabs(voi
> struct biovec_slab *bvs = bvec_slabs + i;
>
> size = bvs->nr_vecs * sizeof(struct bio_vec);
> - bvs->slab = kmem_cache_create(bvs->name, size, 0,
> + bvs->slab = kmem_cache_create(bvs->name, size, BIO_ALIGN,
> SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
> }
> }

Another idea would be to kill SLAB_HWCACHE_ALIGN (it's pretty pointless,
I bet), and always alloc sizeof(*bio) + sizeof(*bvl) in one go when a
bio is allocated. It doesn't add a lot of overhead even for the case
where we do > 1 page bios, and it gets rid of the dual allocation for
the 1 page bio.

--
Jens Axboe

2006-12-04 20:37:13

by Chen, Kenneth W

Subject: RE: [patch] speed up single bio_vec allocation

Jens Axboe wrote on Monday, December 04, 2006 12:07 PM
> On Mon, Dec 04 2006, Chen, Kenneth W wrote:
> > On 64-bit arch like x86_64, struct bio is 104 byte. Since bio slab is
> > created with SLAB_HWCACHE_ALIGN flag, there are usually spare memory
> > available at the end of bio. I think we can utilize that memory for
> > bio_vec allocation. The purpose is not so much on saving memory consumption
> > for bio_vec, instead, I'm attempting to optimize away a call to bvec_alloc_bs.
> >
> > So here is a patch to do just that for 1 segment bio_vec (we currently only
> > have space for 1 on 2.6.19). And the detection whether there are spare space
> > available is dynamically calculated at compile time. If there are no space
> > available, there will be no run time cost at all because gcc simply optimize
> > away all the code added in this patch. If there are space available, the only
> > run time check is to see what the size of iovec is and we do appropriate
> > assignment to bio->bi_io_Vec etc. The cost is minimal and we gain a whole
> > lot back from not calling bvec_alloc_bs() function.
> >
> > I tried to use cache_line_size() to find out the alignment of struct bio, but
> > stumbled on that it is a runtime function for x86_64. So instead I made bio
> > to hint to the slab allocator to align on 32 byte (slab will use the larger
> > value of hw cache line and caller hints of "align"). I think it is a sane
> > number for majority of the CPUs out in the world.
>
> Any benchmarks for this one?

About 0.2% on a database transaction processing benchmark. It was done a while
back on top of a major Linux vendor kernel. I will retest it for 2.6.19.


> [...]
>
> Another idea would be to kill SLAB_HWCACHE_ALIGN (it's pretty pointless,
> I bet), and always alloc sizeof(*bio) + sizeof(*bvl) in one go when a
> bio is allocated. It doesn't add a lot of overhead even for the case
> where we do > 1 page bios, and it gets rid of the dual allocation for
> the 1 page bio.

I will try that too. I'm a bit touchy about sharing a cache line between
different bios. But given that we are currently pushing 200,000 I/Os per second
through the kernel, the chances of two CPUs working on two bios that sit in the
same cache line are pretty small.

- Ken

2006-12-04 20:43:11

by Jens Axboe

Subject: Re: [patch] speed up single bio_vec allocation

On Mon, Dec 04 2006, Chen, Kenneth W wrote:
> > [...]
> >
> > Another idea would be to kill SLAB_HWCACHE_ALIGN (it's pretty pointless,
> > I bet), and always alloc sizeof(*bio) + sizeof(*bvl) in one go when a
> > bio is allocated. It doesn't add a lot of overhead even for the case
> > where we do > 1 page bios, and it gets rid of the dual allocation for
> > the 1 page bio.
>
> I will try that too. I'm a bit touchy about sharing a cache line for
> different bio. But given that there are 200,000 I/O per second we are
> currently pushing the kernel, the chances of two cpu working on two
> bio that sits in the same cache line are pretty small.

Yep I really think so. Besides, it's not like we are repeatedly writing
to these objects in the first place.

--
Jens Axboe

2006-12-06 10:08:02

by Jens Axboe

Subject: Re: [patch] speed up single bio_vec allocation

On Mon, Dec 04 2006, Jens Axboe wrote:
> On Mon, Dec 04 2006, Chen, Kenneth W wrote:
> > > [...]
> > >
> > > Another idea would be to kill SLAB_HWCACHE_ALIGN (it's pretty pointless,
> > > I bet), and always alloc sizeof(*bio) + sizeof(*bvl) in one go when a
> > > bio is allocated. It doesn't add a lot of overhead even for the case
> > > where we do > 1 page bios, and it gets rid of the dual allocation for
> > > the 1 page bio.
> >
> > I will try that too. I'm a bit touchy about sharing a cache line for
> > different bio. But given that there are 200,000 I/O per second we are
> > currently pushing the kernel, the chances of two cpu working on two
> > bio that sits in the same cache line are pretty small.
>
> Yep I really think so. Besides, it's not like we are repeatedly writing
> to these objects in the first place.

This is what I had in mind, in case it wasn't completely clear. Not
tested, other than it compiles. Basically it eliminates the small
bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
12-bytes on 32-bit archs instead and uses the room at the end for the
bio_vec structure.

I still can't help but think we can do better than this, and that this
is nothing more than optimizing for a benchmark. For high performance
I/O, you will be doing > 1 page bio's anyway and this patch won't help
you at all. Perhaps we can just kill bio_vec slabs completely, and
create bio slabs instead with differing sizes. So instead of having 1
bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
for the bio_vec list at the end. That would always eliminate the extra
allocation, at the cost of blowing the 256-page case into an order 1 page
allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
which is something I've always tried to avoid.
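
(Spelling out that arithmetic with the sizes assumed elsewhere in this thread:
256 * 16 = 4096 bytes of bio_vec alone already fills a 4KB page, so a combined
object of 4096 + 104 = 4200 bytes spills past a single page and pushes the
slab into order-1, i.e. two-page, allocations.)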


diff --git a/fs/bio.c b/fs/bio.c
index aa4d09b..94b5e05 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -32,7 +32,7 @@

static kmem_cache_t *bio_slab __read_mostly;

-#define BIOVEC_NR_POOLS 6
+#define BIOVEC_NR_POOLS 5

/*
* a small number of entries is fine, not going to be performance critical.
@@ -55,7 +55,7 @@ struct biovec_slab {

#define BV(x) { .nr_vecs = x, .name = "biovec-"__stringify(x) }
static struct biovec_slab bvec_slabs[BIOVEC_NR_POOLS] __read_mostly = {
- BV(1), BV(4), BV(16), BV(64), BV(128), BV(BIO_MAX_PAGES),
+ BV(4), BV(16), BV(64), BV(128), BV(BIO_MAX_PAGES),
};
#undef BV

@@ -76,7 +76,8 @@ struct bio_set {
*/
static struct bio_set *fs_bio_set;

-static inline struct bio_vec *bvec_alloc_bs(gfp_t gfp_mask, int nr, unsigned long *idx, struct bio_set *bs)
+static struct bio_vec *bvec_alloc_bs(gfp_t gfp_mask, int nr, unsigned long *idx,
+ struct bio_set *bs)
{
struct bio_vec *bvl;

@@ -84,15 +85,25 @@ static inline struct bio_vec *bvec_alloc
* see comment near bvec_array define!
*/
switch (nr) {
- case 1 : *idx = 0; break;
- case 2 ... 4: *idx = 1; break;
- case 5 ... 16: *idx = 2; break;
- case 17 ... 64: *idx = 3; break;
- case 65 ... 128: *idx = 4; break;
- case 129 ... BIO_MAX_PAGES: *idx = 5; break;
- default:
- return NULL;
+ case 2 ... 4:
+ *idx = 0;
+ break;
+ case 5 ... 16:
+ *idx = 1;
+ break;
+ case 17 ... 64:
+ *idx = 2;
+ break;
+ case 65 ... 128:
+ *idx = 3;
+ break;
+ case 129 ... BIO_MAX_PAGES:
+ *idx = 4;
+ break;
+ default:
+ return NULL;
}
+
/*
* idx now points to the pool we want to allocate from
*/
@@ -109,11 +120,13 @@ static inline struct bio_vec *bvec_alloc

void bio_free(struct bio *bio, struct bio_set *bio_set)
{
- const int pool_idx = BIO_POOL_IDX(bio);
+ if ((bio->bi_flags & (1 << BIO_BVL_EMBED)) == 0) {
+ const int pool_idx = BIO_POOL_IDX(bio);

- BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS);
+ BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS);
+ mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
+ }

- mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
mempool_free(bio, bio_set->bio_pool);
}

@@ -166,7 +179,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
struct bio_vec *bvl = NULL;

bio_init(bio);
- if (likely(nr_iovecs)) {
+
+ if (nr_iovecs == 1) {
+ bvl = (void *) bio + sizeof(*bio);
+ memset(bvl, 0, sizeof(*bvl));
+ bio->bi_flags |= 1 << BIO_BVL_EMBED;
+ bio->bi_max_vecs = 1;
+ } else if (nr_iovecs) {
unsigned long idx = 0; /* shut up gcc */

bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
@@ -1189,8 +1208,8 @@ static void __init biovec_init_slabs(voi
struct biovec_slab *bvs = bvec_slabs + i;

size = bvs->nr_vecs * sizeof(struct bio_vec);
- bvs->slab = kmem_cache_create(bvs->name, size, 0,
- SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
+ bvs->slab = kmem_cache_create(bvs->name, size, 0, SLAB_PANIC,
+ NULL, NULL);
}
}

@@ -1199,8 +1218,7 @@ static int __init init_bio(void)
int megabytes, bvec_pool_entries;
int scale = BIOVEC_NR_POOLS;

- bio_slab = kmem_cache_create("bio", sizeof(struct bio), 0,
- SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
+ bio_slab = kmem_cache_create("bio", sizeof(struct bio) + sizeof(struct bio_vec), 0, SLAB_PANIC, NULL, NULL);

biovec_init_slabs();

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 092dbd0..7cf7c5c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -125,6 +125,7 @@ struct bio {
#define BIO_BOUNCED 5 /* bio is a bounce bio */
#define BIO_USER_MAPPED 6 /* contains user pages */
#define BIO_EOPNOTSUPP 7 /* not supported */
+#define BIO_BVL_EMBED 8 /* bvl embedded in bio */
#define bio_flagged(bio, flag) ((bio)->bi_flags & (1 << (flag)))

/*

--
Jens Axboe

2006-12-06 10:55:42

by Jens Axboe

Subject: Re: [patch] speed up single bio_vec allocation

On Wed, Dec 06 2006, Jens Axboe wrote:
> I still can't help but think we can do better than this, and that this
> is nothing more than optimizing for a benchmark. For high performance
> I/O, you will be doing > 1 page bio's anyway and this patch wont help
> you at all. Perhaps we can just kill bio_vec slabs completely, and
> create bio slabs instead with differing sizes. So instead of having 1
> bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
> for the bio_vec list at the end. That would always eliminate the extra
> allocation, at the cost of blowing the 256-page case into a order 1 page
> allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
> which is something I've always tried to avoid.

That would look something like this. You'd probably want to re-tweak the
slab sizes now, so that we get the most out of the slab page. If we
accept the 2^1 order allocation for the largest bio, we can make it
larger than 256 pages at no extra cost. Probably around 500 pages would
still fit inside the two pages.
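
(With the same assumed sizes, the estimate works out as
(2 * 4096 - 104) / 16 = 505, so roughly 500 embedded page vectors would fit
in a two-page, order-1 slab object.)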

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 08a40f4..b4bf3b3 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -813,7 +813,7 @@ static int crypt_ctr(struct dm_target *t
goto bad4;
}

- cc->bs = bioset_create(MIN_IOS, MIN_IOS, 4);
+ cc->bs = bioset_create(MIN_IOS, 4);
if (!cc->bs) {
ti->error = "Cannot allocate crypt bioset";
goto bad_bs;
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index da663d2..581c24a 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -60,7 +60,7 @@ static int resize_pool(unsigned int new_
if (!_io_pool)
return -ENOMEM;

- _bios = bioset_create(16, 16, 4);
+ _bios = bioset_create(16, 4);
if (!_bios) {
mempool_destroy(_io_pool);
_io_pool = NULL;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index fc4f743..98f6768 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -973,7 +973,7 @@ static struct mapped_device *alloc_dev(i
if (!md->tio_pool)
goto bad3;

- md->bs = bioset_create(16, 16, 4);
+ md->bs = bioset_create(16, 4);
if (!md->bs)
goto bad_no_bioset;

diff --git a/fs/bio.c b/fs/bio.c
index aa4d09b..452e8f7 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -30,10 +30,6 @@

#define BIO_POOL_SIZE 256

-static kmem_cache_t *bio_slab __read_mostly;
-
-#define BIOVEC_NR_POOLS 6
-
/*
* a small number of entries is fine, not going to be performance critical.
* basically we just need to survive
@@ -41,23 +37,23 @@ static kmem_cache_t *bio_slab __read_mos
#define BIO_SPLIT_ENTRIES 8
mempool_t *bio_split_pool __read_mostly;

-struct biovec_slab {
+struct bio_slab {
int nr_vecs;
- char *name;
+ const char *name;
kmem_cache_t *slab;
};

-/*
- * if you change this list, also change bvec_alloc or things will
- * break badly! cannot be bigger than what you can fit into an
- * unsigned short
- */
-
-#define BV(x) { .nr_vecs = x, .name = "biovec-"__stringify(x) }
-static struct biovec_slab bvec_slabs[BIOVEC_NR_POOLS] __read_mostly = {
- BV(1), BV(4), BV(16), BV(64), BV(128), BV(BIO_MAX_PAGES),
+#define BIOSLAB_NR_POOLS 6
+#define BIOSLAB(x) { .nr_vecs = x, .name = "bio-"__stringify(x) }
+static struct bio_slab bio_slabs[BIOSLAB_NR_POOLS] __read_mostly = {
+ BIOSLAB(0),
+ BIOSLAB(4),
+ BIOSLAB(16),
+ BIOSLAB(64),
+ BIOSLAB(128),
+ BIOSLAB(BIO_MAX_PAGES),
};
-#undef BV
+#undef BIOSLAB

/*
* bio_set is used to allow other portions of the IO system to
@@ -66,8 +62,7 @@ static struct biovec_slab bvec_slabs[BIO
* and the bvec_slabs[].
*/
struct bio_set {
- mempool_t *bio_pool;
- mempool_t *bvec_pools[BIOVEC_NR_POOLS];
+ mempool_t *bio_pools[BIOSLAB_NR_POOLS];
};

/*
@@ -76,45 +71,12 @@ struct bio_set {
*/
static struct bio_set *fs_bio_set;

-static inline struct bio_vec *bvec_alloc_bs(gfp_t gfp_mask, int nr, unsigned long *idx, struct bio_set *bs)
-{
- struct bio_vec *bvl;
-
- /*
- * see comment near bvec_array define!
- */
- switch (nr) {
- case 1 : *idx = 0; break;
- case 2 ... 4: *idx = 1; break;
- case 5 ... 16: *idx = 2; break;
- case 17 ... 64: *idx = 3; break;
- case 65 ... 128: *idx = 4; break;
- case 129 ... BIO_MAX_PAGES: *idx = 5; break;
- default:
- return NULL;
- }
- /*
- * idx now points to the pool we want to allocate from
- */
-
- bvl = mempool_alloc(bs->bvec_pools[*idx], gfp_mask);
- if (bvl) {
- struct biovec_slab *bp = bvec_slabs + *idx;
-
- memset(bvl, 0, bp->nr_vecs * sizeof(struct bio_vec));
- }
-
- return bvl;
-}
-
void bio_free(struct bio *bio, struct bio_set *bio_set)
{
const int pool_idx = BIO_POOL_IDX(bio);

- BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS);
-
- mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
- mempool_free(bio, bio_set->bio_pool);
+ BIO_BUG_ON(pool_idx >= BIOSLAB_NR_POOLS);
+ mempool_free(bio, bio_set->bio_pools[pool_idx]);
}

/*
@@ -160,26 +122,51 @@ void bio_init(struct bio *bio)
**/
struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
{
- struct bio *bio = mempool_alloc(bs->bio_pool, gfp_mask);
+ struct bio *bio = NULL;
+ long idx;
+
+ switch (nr_iovecs) {
+ case 0:
+ idx = 0;
+ break;
+ case 1 ... 4:
+ idx = 1;
+ break;
+ case 5 ... 16:
+ idx = 2;
+ break;
+ case 17 ... 64:
+ idx = 3;
+ break;
+ case 65 ... 128:
+ idx = 4;
+ break;
+ case 129 ... BIO_MAX_PAGES:
+ idx = 5;
+ break;
+ default:
+ goto out;
+ }
+
+ bio = mempool_alloc(bs->bio_pools[idx], gfp_mask);

if (likely(bio)) {
struct bio_vec *bvl = NULL;

bio_init(bio);
- if (likely(nr_iovecs)) {
- unsigned long idx = 0; /* shut up gcc */
-
- bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
- if (unlikely(!bvl)) {
- mempool_free(bio, bs->bio_pool);
- bio = NULL;
- goto out;
- }
+
+ if (nr_iovecs) {
+ struct bio_slab *bslab = &bio_slabs[idx];
+
+ bvl = (void *) bio + sizeof(*bio);
+ memset(bvl, 0, bslab->nr_vecs * sizeof(struct bio_vec));
+
bio->bi_flags |= idx << BIO_POOL_OFFSET;
- bio->bi_max_vecs = bvec_slabs[idx].nr_vecs;
+ bio->bi_max_vecs = bslab->nr_vecs;
}
bio->bi_io_vec = bvl;
}
+
out:
return bio;
}
@@ -1120,89 +1107,78 @@ struct bio_pair *bio_split(struct bio *b
* create memory pools for biovec's in a bio_set.
* use the global biovec slabs created for general use.
*/
-static int biovec_create_pools(struct bio_set *bs, int pool_entries, int scale)
+static int bio_create_pools(struct bio_set *bs, int pool_entries, int scale)
{
int i;

- for (i = 0; i < BIOVEC_NR_POOLS; i++) {
- struct biovec_slab *bp = bvec_slabs + i;
- mempool_t **bvp = bs->bvec_pools + i;
+ for (i = 0; i < BIOSLAB_NR_POOLS; i++) {
+ struct bio_slab *bslab = bio_slabs + i;
+ mempool_t **bvp = bs->bio_pools + i;

if (pool_entries > 1 && i >= scale)
pool_entries >>= 1;

- *bvp = mempool_create_slab_pool(pool_entries, bp->slab);
+ *bvp = mempool_create_slab_pool(pool_entries, bslab->slab);
if (!*bvp)
return -ENOMEM;
}
return 0;
}

-static void biovec_free_pools(struct bio_set *bs)
+static void bio_free_pools(struct bio_set *bs)
{
int i;

- for (i = 0; i < BIOVEC_NR_POOLS; i++) {
- mempool_t *bvp = bs->bvec_pools[i];
+ for (i = 0; i < BIOSLAB_NR_POOLS; i++) {
+ mempool_t *bp = bs->bio_pools[i];

- if (bvp)
- mempool_destroy(bvp);
+ if (bp)
+ mempool_destroy(bp);
}

}

void bioset_free(struct bio_set *bs)
{
- if (bs->bio_pool)
- mempool_destroy(bs->bio_pool);
-
- biovec_free_pools(bs);
+ bio_free_pools(bs);

kfree(bs);
}

-struct bio_set *bioset_create(int bio_pool_size, int bvec_pool_size, int scale)
+struct bio_set *bioset_create(int bio_pool_size, int scale)
{
struct bio_set *bs = kzalloc(sizeof(*bs), GFP_KERNEL);

if (!bs)
return NULL;

- bs->bio_pool = mempool_create_slab_pool(bio_pool_size, bio_slab);
- if (!bs->bio_pool)
- goto bad;
-
- if (!biovec_create_pools(bs, bvec_pool_size, scale))
+ if (!bio_create_pools(bs, bio_pool_size, scale))
return bs;

-bad:
bioset_free(bs);
return NULL;
}

-static void __init biovec_init_slabs(void)
+static void __init bio_init_slabs(void)
{
int i;

- for (i = 0; i < BIOVEC_NR_POOLS; i++) {
+ for (i = 0; i < BIOSLAB_NR_POOLS; i++) {
int size;
- struct biovec_slab *bvs = bvec_slabs + i;
+ struct bio_slab *bs = bio_slabs + i;

- size = bvs->nr_vecs * sizeof(struct bio_vec);
- bvs->slab = kmem_cache_create(bvs->name, size, 0,
- SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
+ size = bs->nr_vecs * sizeof(struct bio_vec) + sizeof(struct bio);
+ bs->slab = kmem_cache_create(bs->name, size, 0, SLAB_PANIC,
+ NULL, NULL);
}
}

static int __init init_bio(void)
{
- int megabytes, bvec_pool_entries;
- int scale = BIOVEC_NR_POOLS;
-
- bio_slab = kmem_cache_create("bio", sizeof(struct bio), 0,
- SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
+ int scale = BIOSLAB_NR_POOLS;
+ int megabytes;

- biovec_init_slabs();
+ bio_init_slabs();

megabytes = nr_free_pages() >> (20 - PAGE_SHIFT);

@@ -1225,9 +1201,8 @@ static int __init init_bio(void)
* the system is completely unable to allocate memory, so we only
* need enough to make progress.
*/
- bvec_pool_entries = 1 + scale;

- fs_bio_set = bioset_create(BIO_POOL_SIZE, bvec_pool_entries, scale);
+ fs_bio_set = bioset_create(BIO_POOL_SIZE, scale);
if (!fs_bio_set)
panic("bio: can't allocate bios\n");

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 092dbd0..d6d0047 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -276,7 +276,7 @@ extern struct bio_pair *bio_split(struct
extern mempool_t *bio_split_pool;
extern void bio_pair_release(struct bio_pair *dbio);

-extern struct bio_set *bioset_create(int, int, int);
+extern struct bio_set *bioset_create(int, int);
extern void bioset_free(struct bio_set *);

extern struct bio *bio_alloc(gfp_t, int);

--
Jens Axboe

2006-12-06 18:46:13

by Chen, Kenneth W

Subject: RE: [patch] speed up single bio_vec allocation

Jens Axboe wrote on Wednesday, December 06, 2006 2:09 AM
> > > I will try that too. I'm a bit touchy about sharing a cache line for
> > > different bio. But given that there are 200,000 I/O per second we are
> > > currently pushing the kernel, the chances of two cpu working on two
> > > bio that sits in the same cache line are pretty small.
> >
> > Yep I really think so. Besides, it's not like we are repeatedly writing
> > to these objects in the first place.
>
> This is what I had in mind, in case it wasn't completely clear. Not
> tested, other than it compiles. Basically it eliminates the small
> bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
> 12-bytes on 32-bit archs instead and uses the room at the end for the
> bio_vec structure.

Yeah, I had a very similar patch queued internally for the large benchmark
measurement. I will post the result as soon as I get it.


> I still can't help but think we can do better than this, and that this
> is nothing more than optimizing for a benchmark. For high performance
> I/O, you will be doing > 1 page bio's anyway and this patch wont help
> you at all. Perhaps we can just kill bio_vec slabs completely, and
> create bio slabs instead with differing sizes. So instead of having 1
> bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
> for the bio_vec list at the end. That would always eliminate the extra
> allocation, at the cost of blowing the 256-page case into a order 1 page
> allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
> which is something I've always tried to avoid.

I took a quick look at the biovec-* slab stats on various production machines;
the majority of the allocations are for 1 and 4 segments, and usage falls off
quickly at 16 or more. 256-segment biovec allocations are really rare. I think
it makes sense to heavily bias towards smaller biovec allocations and have a
separate biovec allocation for the really large ones.

- Ken

2006-12-07 22:33:48

by Nate Diller

Subject: Re: [patch] speed up single bio_vec allocation

On 12/7/06, Chen, Kenneth W <[email protected]> wrote:
> Nate Diller wrote on Thursday, December 07, 2006 1:46 PM
> > the current code is straightforward and obviously correct. you want
> > to make the alloc/dealloc paths more complex, by special-casing for an
> > arbitrary limit of "small" I/O, AFAICT. of *course* you can expect
> > less overhead when you're doing one large I/O vs. two small ones,
> > that's the whole reason we have all this code to try to coalesce
> > contiguous I/O, do readahead, swap page clustering, etc. we *want*
> > more complexity if it will get us bigger I/Os. I don't see why we
> > want more complexity to reduce the *inherent* penalty of doing smaller
> > ones.
>
> You should check out the latest proposal from Jens Axboe which treats
> all biovec size the same and stuff it along with struct bio. I think
> it is a better approach than my first cut of special casing 1 segment
> biovec. His patch will speed up all sized I/O.

i rather agree with his reservations on that, since we'd be making the
allocator's job harder by requesting order 1 pages for all allocations
on x86_64 large I/O patterns. but it reduces complexity instead of
increasing it ... can you produce some benchmarks not just for your
workload but for one that triggers the order 1 case? biovec-(256)
transfers are more common than you seem to think, and if the allocator
can't do it, that forces the bio code to fall back to 2 x biovec-128,
which, as you indicated above, would show a real penalty.

NATE

2006-12-08 02:27:57

by Andi Kleen

Subject: Re: [patch] speed up single bio_vec allocation

"Chen, Kenneth W" <[email protected]> writes:
>
> I tried to use cache_line_size() to find out the alignment of struct bio, but
> stumbled on that it is a runtime function for x86_64.

It's a single global variable access:

#define cache_line_size() (boot_cpu_data.x86_cache_alignment)

Or do you mean it caused cache misses? boot_cpu_data is cache aligned
and practically read only, so there shouldn't be any false sharing at least.

-Andi

2006-12-08 04:23:10

by Chen, Kenneth W

Subject: RE: [patch] speed up single bio_vec allocation

Andi Kleen wrote on Thursday, December 07, 2006 6:28 PM
> "Chen, Kenneth W" <[email protected]> writes:
> > I tried to use cache_line_size() to find out the alignment of struct bio, but
> > stumbled on that it is a runtime function for x86_64.
>
> It's a single global variable access:
>
> #define cache_line_size() (boot_cpu_data.x86_cache_alignment)
>
> Or do you mean it caused cache misses? boot_cpu_data is cache aligned
> and practically read only, so there shouldn't be any false sharing at least.

No, I was looking for a generic constant that describes the cache line size.
I needed a constant in order to avoid a runtime check and to rely on the
compiler to optimize the conditional check away.

2006-12-07 19:36:35

by Chen, Kenneth W

Subject: RE: [patch] speed up single bio_vec allocation

Nate Diller wrote on Thursday, December 07, 2006 11:22 AM
> > > I still can't help but think we can do better than this, and that this
> > > is nothing more than optimizing for a benchmark. For high performance
> > > I/O, you will be doing > 1 page bio's anyway and this patch wont help
> > > you at all. Perhaps we can just kill bio_vec slabs completely, and
> > > create bio slabs instead with differing sizes. So instead of having 1
> > > bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
> > > for the bio_vec list at the end. That would always eliminate the extra
> > > allocation, at the cost of blowing the 256-page case into a order 1 page
> > > allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
> > > which is something I've always tried to avoid.
> >
> > I took a quick query of biovec-* slab stats on various production machines,
> > majority of the allocation is on 1 and 4 segments, usages falls off quickly
> > on 16 or more. 256 segment biovec allocation is really rare. I think it
> > makes sense to heavily bias towards smaller biovec allocation and have
> > separate biovec allocation for really large ones.
>
> what file system? have you tested with more than one? have you
> tested with file systems that build their own bio's instead of using
> get_block() calls? have you tested with large files or streaming
> workloads? how about direct I/O?
>
> i think that a "heavy bias" toward small biovecs is FS and workload
> dependent, and that it's irresponsible to make such unjustified
> changes just to show improvement on your particular benchmark.

There is no doubt that the above data is just a quick estimate of one
usage model. There are tons of other usages in the world. After all,
any algorithm in the kernel has to be generic and self-tune to the
specific environment.

On very large I/O, the relative overhead of allocating the biovec will
decrease, because larger I/O needs more code to do the setup, more code
to perform I/O completion, more code in the device driver, etc. So the
time spent on one mempool alloc will amortize over the size of the I/O.
On a smaller I/O, the overhead is more visible, and thus it makes
sense to me to cut down that relative overhead.

In fact, large I/O already has an unfair advantage. If you do 1MB of
I/O, only one mempool call is made to get a 256-segment bio. However,
if you do two 512K I/Os, two mempool calls are made. So in some sense,
the current scheme is unfair to small I/O.

- Ken

2006-12-08 08:00:27

by Jens Axboe

Subject: Re: [patch] speed up single bio_vec allocation

On Thu, Dec 07 2006, Nate Diller wrote:
> On 12/7/06, Chen, Kenneth W <[email protected]> wrote:
> >Nate Diller wrote on Thursday, December 07, 2006 1:46 PM
> >> the current code is straightforward and obviously correct. you want
> >> to make the alloc/dealloc paths more complex, by special-casing for an
> >> arbitrary limit of "small" I/O, AFAICT. of *course* you can expect
> >> less overhead when you're doing one large I/O vs. two small ones,
> >> that's the whole reason we have all this code to try to coalesce
> >> contiguous I/O, do readahead, swap page clustering, etc. we *want*
> >> more complexity if it will get us bigger I/Os. I don't see why we
> >> want more complexity to reduce the *inherent* penalty of doing smaller
> >> ones.
> >
> >You should check out the latest proposal from Jens Axboe which treats
> >all biovec size the same and stuff it along with struct bio. I think
> >it is a better approach than my first cut of special casing 1 segment
> >biovec. His patch will speed up all sized I/O.
>
> i rather agree with his reservations on that, since we'd be making the
> allocator's job harder by requesting order 1 pages for all allocations
> on x86_64 large I/O patterns. but it reduces complexity instead of
> increasing it ... can you produce some benchmarks not just for your
> workload but for one that triggers the order 1 case? biovec-(256)
> transfers are more common than you seem to think, and if the allocator
> can't do it, that forces the bio code to fall back to 2 x biovec-128,
> which, as you indicated above, would show a real penalty.

The question is if the slab allocator is only doing 2^0 order
allocations for the 256-page bio_vec currently - it's at 4096 bytes, so
potentially (I suspect) the worst size it could be.

On the 1 vs many page bio_vec patterns, I agree with Nate. I do see lots
of larger bio_vecs here. > 1 page bio_vec usage is also becoming more
prevalent, not less. So optimizing for a benchmark case that
predominately uses 1 page bio's is indeed a silly thing.

--
Jens Axboe

2006-12-07 19:22:22

by Nate Diller

Subject: Re: [patch] speed up single bio_vec allocation

On 12/6/06, Chen, Kenneth W <[email protected]> wrote:
> Jens Axboe wrote on Wednesday, December 06, 2006 2:09 AM
> > > > I will try that too. I'm a bit touchy about sharing a cache line for
> > > > different bio. But given that there are 200,000 I/O per second we are
> > > > currently pushing the kernel, the chances of two cpu working on two
> > > > bio that sits in the same cache line are pretty small.
> > >
> > > Yep I really think so. Besides, it's not like we are repeatedly writing
> > > to these objects in the first place.
> >
> > This is what I had in mind, in case it wasn't completely clear. Not
> > tested, other than it compiles. Basically it eliminates the small
> > bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
> > 12-bytes on 32-bit archs instead and uses the room at the end for the
> > bio_vec structure.
>
> Yeah, I had a very similar patch queued internally for the large benchmark
> measurement. I will post the result as soon as I get it.
>
>
> > I still can't help but think we can do better than this, and that this
> > is nothing more than optimizing for a benchmark. For high performance
> > I/O, you will be doing > 1 page bio's anyway and this patch wont help
> > you at all. Perhaps we can just kill bio_vec slabs completely, and
> > create bio slabs instead with differing sizes. So instead of having 1
> > bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
> > for the bio_vec list at the end. That would always eliminate the extra
> > allocation, at the cost of blowing the 256-page case into a order 1 page
> > allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
> > which is something I've always tried to avoid.
>
> I took a quick query of biovec-* slab stats on various production machines,
> majority of the allocation is on 1 and 4 segments, usages falls off quickly
> on 16 or more. 256 segment biovec allocation is really rare. I think it
> makes sense to heavily bias towards smaller biovec allocation and have
> separate biovec allocation for really large ones.

what file system? have you tested with more than one? have you
tested with file systems that build their own bio's instead of using
get_block() calls? have you tested with large files or streaming
workloads? how about direct I/O?

i think that a "heavy bias" toward small biovecs is FS and workload
dependent, and that it's irresponsible to make such unjustified
changes just to show improvement on your particular benchmark.

i do however agree with killing SLAB_HWCACHE_ALIGN for biovecs,
pending reasonable regression benchmarks.

NATE

2006-12-07 21:53:02

by Chen, Kenneth W

Subject: RE: [patch] speed up single bio_vec allocation

Nate Diller wrote on Thursday, December 07, 2006 1:46 PM
> the current code is straightforward and obviously correct. you want
> to make the alloc/dealloc paths more complex, by special-casing for an
> arbitrary limit of "small" I/O, AFAICT. of *course* you can expect
> less overhead when you're doing one large I/O vs. two small ones,
> that's the whole reason we have all this code to try to coalesce
> contiguous I/O, do readahead, swap page clustering, etc. we *want*
> more complexity if it will get us bigger I/Os. I don't see why we
> want more complexity to reduce the *inherent* penalty of doing smaller
> ones.

You should check out the latest proposal from Jens Axboe, which treats
all biovec sizes the same and stuffs the biovec along with struct bio. I
think it is a better approach than my first cut of special-casing the
1-segment biovec. His patch will speed up I/O of all sizes.

- Ken

2006-12-08 04:37:18

by Andi Kleen

Subject: Re: [patch] speed up single bio_vec allocation

On Friday 08 December 2006 05:23, Chen, Kenneth W wrote:
> Andi Kleen wrote on Thursday, December 07, 2006 6:28 PM
> > "Chen, Kenneth W" <[email protected]> writes:
> > > I tried to use cache_line_size() to find out the alignment of struct bio, but
> > > stumbled on that it is a runtime function for x86_64.
> >
> > It's a single global variable access:
> >
> > #define cache_line_size() (boot_cpu_data.x86_cache_alignment)
> >
> > Or do you mean it caused cache misses? boot_cpu_data is cache aligned
> > and practically read only, so there shouldn't be any false sharing at least.
>
> No, I was looking for a generic constant that describes cache line size.

The same kernel binary runs on CPUs with different cache line sizes. For
example, P4 has 128 bytes, Core2 64 bytes.

However, there is a worst-case alignment that is used for static
alignments, which is L1_CACHE_BYTES. It would normally be 128 bytes
on an x86 kernel, unless it is specifically compiled for a CPU with
a smaller cache line size.
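
As a sketch of what that could look like for the bio case (hypothetical, not
something posted in this thread), the static worst case can stand in for the
32-byte guess while keeping the check a constant expression. L1_CACHE_BYTES
comes from <linux/cache.h> and is fixed by the kernel config, not by the CPU
the kernel boots on:

#include <linux/cache.h>        /* L1_CACHE_BYTES, fixed at build time */
#include <linux/kernel.h>       /* ALIGN() */
#include <linux/bio.h>          /* struct bio, struct bio_vec */

/*
 * Hypothetical variant of the patch's constant: use the static worst-case
 * line size instead of a hard-coded 32. The fit check below is still an
 * integer constant expression, so it still folds away at compile time --
 * it just pads the slab object to the config's worst-case line size, which
 * can waste space on CPUs with smaller lines.
 */
#define BIO_ALIGN       L1_CACHE_BYTES

#define BIOVEC_FIT_INSIDE_BIO_CACHE_LINE \
        (ALIGN(sizeof(struct bio), BIO_ALIGN) == \
         ALIGN(sizeof(struct bio) + sizeof(struct bio_vec), BIO_ALIGN))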

-Andi

2006-12-07 21:46:08

by Nate Diller

Subject: Re: [patch] speed up single bio_vec allocation

On 12/7/06, Chen, Kenneth W <[email protected]> wrote:
> Nate Diller wrote on Thursday, December 07, 2006 11:22 AM
> > > > I still can't help but think we can do better than this, and that this
> > > > is nothing more than optimizing for a benchmark. For high performance
> > > > I/O, you will be doing > 1 page bio's anyway and this patch wont help
> > > > you at all. Perhaps we can just kill bio_vec slabs completely, and
> > > > create bio slabs instead with differing sizes. So instead of having 1
> > > > bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
> > > > for the bio_vec list at the end. That would always eliminate the extra
> > > > allocation, at the cost of blowing the 256-page case into a order 1 page
> > > > allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
> > > > which is something I've always tried to avoid.
> > >
> > > I took a quick query of biovec-* slab stats on various production machines,
> > > majority of the allocation is on 1 and 4 segments, usages falls off quickly
> > > on 16 or more. 256 segment biovec allocation is really rare. I think it
> > > makes sense to heavily bias towards smaller biovec allocation and have
> > > separate biovec allocation for really large ones.
> >
> > what file system? have you tested with more than one? have you
> > tested with file systems that build their own bio's instead of using
> > get_block() calls? have you tested with large files or streaming
> > workloads? how about direct I/O?
> >
> > i think that a "heavy bias" toward small biovecs is FS and workload
> > dependent, and that it's irresponsible to make such unjustified
> > changes just to show improvement on your particular benchmark.
>
> It is no doubt that the above data is just a quick estimate on one
> usage model. There are tons of other usage in the world. After all,
> any algorithm in the kernel has to be generic and self tune to
> specific environment.
>
> On very large I/O, the relative overhead in allocating biovec will
> decrease because larger I/O needs more code to do setup, more code
> to perform I/O completion, more code in the device driver etc. So
> time spent on one mempool alloc will amortize over the size of I/O.
> On a smaller I/O size, the overhead is more visible and thus makes
> sense to me to cut down that relative overhead.
>
> In fact, the large I/O already have unfair advantage. If you do 1MB
> I/O, only 1 call to mempool to get a 256 segment bio. However if you
> do two 512K I/O, two calls to mempool is made. So in some sense,
> current scheme is unfair to small I/O.

so that's a fancy way of saying "no, i didn't do any benchmarks", yes?

the current code is straightforward and obviously correct. you want
to make the alloc/dealloc paths more complex, by special-casing for an
arbitrary limit of "small" I/O, AFAICT. of *course* you can expect
less overhead when you're doing one large I/O vs. two small ones,
that's the whole reason we have all this code to try to coalesce
contiguous I/O, do readahead, swap page clustering, etc. we *want*
more complexity if it will get us bigger I/Os. I don't see why we
want more complexity to reduce the *inherent* penalty of doing smaller
ones.

btw, i am happy to see that you are working on performance, especially
to the degree you reduce overhead by cleaning things up and removing
cruft and unneeded complexity...

NATE

2006-12-08 22:15:00

by Chen, Kenneth W

Subject: RE: [patch] speed up single bio_vec allocation

> Chen, Kenneth wrote on Wednesday, December 06, 2006 10:20 AM
> > Jens Axboe wrote on Wednesday, December 06, 2006 2:09 AM
> > This is what I had in mind, in case it wasn't completely clear. Not
> > tested, other than it compiles. Basically it eliminates the small
> > bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
> > 12-bytes on 32-bit archs instead and uses the room at the end for the
> > bio_vec structure.
>
> Yeah, I had a very similar patch queued internally for the large benchmark
> measurement. I will post the result as soon as I get it.


Jens, this improves 0.25% on our db transaction processing benchmark setup.
The patch tested is (on top of 2.6.19):
http://marc.theaimsgroup.com/?l=linux-kernel&m=116539972229021&w=2

- Ken

2006-12-14 20:22:20

by Jens Axboe

Subject: Re: [patch] speed up single bio_vec allocation

On Fri, Dec 08 2006, Chen, Kenneth W wrote:
> > Chen, Kenneth wrote on Wednesday, December 06, 2006 10:20 AM
> > > Jens Axboe wrote on Wednesday, December 06, 2006 2:09 AM
> > > This is what I had in mind, in case it wasn't completely clear. Not
> > > tested, other than it compiles. Basically it eliminates the small
> > > bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
> > > 12-bytes on 32-bit archs instead and uses the room at the end for the
> > > bio_vec structure.
> >
> > Yeah, I had a very similar patch queued internally for the large benchmark
> > measurement. I will post the result as soon as I get it.
>
>
> Jens, this improves 0.25% on our db transaction processing benchmark setup.
> The patch tested is (on top of 2.6.19):
> http://marc.theaimsgroup.com/?l=linux-kernel&m=116539972229021&w=2

Ok, well it's not much but if it's significant it's not too bad either
:-)

Some tests I ran locally showed it being _slower_, which is a little
odd. They were basically hammering requests through the block layer with
a null end.

--
Jens Axboe