2005-03-29 02:38:39

by Chen, Kenneth W

[permalink] [raw]
Subject: [patch] optimization: defer bio_vec deallocation

Kernel needs at least one bio and one bio_vec structure to process one I/O.
For every I/O, kernel also does two pairs of mempool_alloc/free, one for
bio and one for bio_vec. It is not exactly cheap in setup/tear down bio_vec
structure. bio_alloc_bs() does more things in that function other than the
minimally required mempool_alloc().

One optimization we are proposing is to defer the deallocation of bio_vec
at the next bio allocation. Let bio hang on to the bio_vec for the last
bio_put. And at next bio alloc time, check whether it has the same iovec
requirement. If it is, bingo! bio already has it. If not, then we free
previous iovec and allocate a new one. So in steady state, When I/O size
does not change much, we benefit from the deferred free, saving one pair
of alloc/free. If I/O request has random size or in dynamic state, then
we fall back and do the normal two pairs of alloc/free. We have measured
that the following patch give measurable performance gain for industry
standard db benchmark. Comments?


Signed-off-by: Ken Chen <[email protected]>

--- linux-2.6.12-rc1/fs/bio.c.orig 2005-03-28 13:49:37.000000000 -0800
+++ linux-2.6.12-rc1/fs/bio.c 2005-03-28 15:59:57.000000000 -0800
@@ -109,19 +109,15 @@ static inline struct bio_vec *bvec_alloc
*/
static void bio_destructor(struct bio *bio)
{
- const int pool_idx = BIO_POOL_IDX(bio);
struct bio_set *bs = bio->bi_set;
-
- BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS);
-
- mempool_free(bio->bi_io_vec, bs->bvec_pools[pool_idx]);
mempool_free(bio, bs->bio_pool);
}

inline void bio_init(struct bio *bio)
{
bio->bi_next = NULL;
- bio->bi_flags = 1 << BIO_UPTODATE;
+ bio->bi_flags &= ~(BIO_POOL_MASK - 1);
+ bio->bi_flags |= 1 << BIO_UPTODATE;
bio->bi_rw = 0;
bio->bi_vcnt = 0;
bio->bi_idx = 0;
@@ -130,7 +126,6 @@ inline void bio_init(struct bio *bio)
bio->bi_hw_front_size = 0;
bio->bi_hw_back_size = 0;
bio->bi_size = 0;
- bio->bi_max_vecs = 0;
bio->bi_end_io = NULL;
atomic_set(&bio->bi_cnt, 1);
bio->bi_private = NULL;
@@ -158,20 +153,37 @@ struct bio *bio_alloc_bioset(int gfp_mas

bio_init(bio);
if (likely(nr_iovecs)) {
- unsigned long idx;
-
- bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
- if (unlikely(!bvl)) {
- mempool_free(bio, bs->bio_pool);
- bio = NULL;
- goto out;
+ if (unlikely(nr_iovecs != bio->bi_max_vecs)) {
+ unsigned long idx;
+ if (bio->bi_max_vecs) {
+ struct bio_set *_bs = bio->bi_set;
+ idx = BIO_POOL_IDX(bio);
+ mempool_free(bio->bi_io_vec, _bs->bvec_pools[idx]);
+ bio->bi_max_vecs = 0;
+ bio->bi_flags &= BIO_POOL_MASK - 1;
+ }
+ bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
+ if (unlikely(!bvl)) {
+ mempool_free(bio, bs->bio_pool);
+ bio = NULL;
+ goto out;
+ }
+ bio->bi_flags |= idx << BIO_POOL_OFFSET;
+ bio->bi_max_vecs = bvec_slabs[idx].nr_vecs;
+ bio->bi_io_vec = bvl;
+ bio->bi_set = bs;
+ }
+ } else {
+ /* a zero io_vec allocation, need to free io_vec */
+ if (bio->bi_max_vecs) {
+ unsigned long idx = BIO_POOL_IDX(bio);
+ struct bio_set *_bs = bio->bi_set;
+ mempool_free(bio->bi_io_vec, _bs->bvec_pools[idx]);
+ bio->bi_io_vec = NULL;
+ bio->bi_max_vecs = 0;
+ bio->bi_set = bs;
}
- bio->bi_flags |= idx << BIO_POOL_OFFSET;
- bio->bi_max_vecs = bvec_slabs[idx].nr_vecs;
}
- bio->bi_io_vec = bvl;
- bio->bi_destructor = bio_destructor;
- bio->bi_set = bs;
}
out:
return bio;
@@ -1013,6 +1025,24 @@ bad:
return NULL;
}

+static void bio_ctor(void *p, kmem_cache_t *cachep, unsigned long flags)
+{
+ struct bio * bio = (struct bio*) p;
+ memset(bio, 0, sizeof(*bio));
+ bio->bi_destructor = bio_destructor;
+}
+
+static void bio_dtor(void *p, kmem_cache_t *cachep, unsigned long flags)
+{
+ struct bio * bio = (struct bio*) p;
+
+ if (bio->bi_max_vecs) {
+ unsigned long idx = BIO_POOL_IDX(bio);
+ struct bio_set *bs = bio->bi_set;
+ mempool_free(bio->bi_io_vec, bs->bvec_pools[idx]);
+ }
+}
+
static void __init biovec_init_slabs(void)
{
int i;
@@ -1033,7 +1063,8 @@ static int __init init_bio(void)
int scale = BIOVEC_NR_POOLS;

bio_slab = kmem_cache_create("bio", sizeof(struct bio), 0,
- SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC,
+ bio_ctor, bio_dtor);

biovec_init_slabs();





2005-03-29 02:59:52

by Dave Jones

[permalink] [raw]
Subject: Re: [patch] optimization: defer bio_vec deallocation

On Mon, Mar 28, 2005 at 06:38:23PM -0800, Chen, Kenneth W wrote:
> We have measured that the following patch give measurable performance gain
> for industry standard db benchmark. Comments?

If you can't publish results from that certain benchmark due its stupid
restrictions, could you also try running an alternative benchmark that
you can show results from ?

These nebulous claims of 'measurable gains' could mean anything.
I'm assuming you see a substantial increase in throughput, but
how much is it worth in exchange for complicating the code?

Dave


2005-03-29 03:07:48

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [patch] optimization: defer bio_vec deallocation

On Mon, Mar 28, 2005 at 06:38:23PM -0800, Chen, Kenneth W wrote:
> We have measured that the following patch give measurable performance gain
> for industry standard db benchmark. Comments?

Dave Jones wrote on Monday, March 28, 2005 7:00 PM
> If you can't publish results from that certain benchmark due its stupid
> restrictions, could you also try running an alternative benchmark that
> you can show results from ?
>
> These nebulous claims of 'measurable gains' could mean anything.
> I'm assuming you see a substantial increase in throughput, but
> how much is it worth in exchange for complicating the code?

Are you asking for micro-benchmark result? I had a tough time last time
around when I presented micro-benchmark result on LKML. I got kicked in
the butt for lack of evidence with performance data running real bench on
real hardware.

I guess either way, I'm bruised one way or the other.


2005-03-29 08:33:22

by Jens Axboe

[permalink] [raw]
Subject: Re: [patch] optimization: defer bio_vec deallocation

On Mon, Mar 28 2005, Chen, Kenneth W wrote:
> On Mon, Mar 28, 2005 at 06:38:23PM -0800, Chen, Kenneth W wrote:
> > We have measured that the following patch give measurable performance gain
> > for industry standard db benchmark. Comments?
>
> Dave Jones wrote on Monday, March 28, 2005 7:00 PM
> > If you can't publish results from that certain benchmark due its stupid
> > restrictions, could you also try running an alternative benchmark that
> > you can show results from ?
> >
> > These nebulous claims of 'measurable gains' could mean anything.
> > I'm assuming you see a substantial increase in throughput, but
> > how much is it worth in exchange for complicating the code?
>
> Are you asking for micro-benchmark result? I had a tough time last time
> around when I presented micro-benchmark result on LKML. I got kicked in
> the butt for lack of evidence with performance data running real bench on
> real hardware.
>
> I guess either way, I'm bruised one way or the other.

Just _some_ results would be nice, Dave is right in that 'measurable
gains' doesn't really say anything at all. Personally I would like to
see a profile diff, for instance. And at least something like 'we get 1%
gain bla bla'.

Now, about the patch. I cannot convince myself that it is not deadlock
prone, if someone waits for a bvec to be freed. Will slab reclaim always
prune the bio slab and push the bvecs back into the mempool, or can
there be cases where this doesn't happen?

--
Jens Axboe

2005-03-29 09:16:41

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] optimization: defer bio_vec deallocation

"Chen, Kenneth W" <[email protected]> wrote:
>
> On Mon, Mar 28, 2005 at 06:38:23PM -0800, Chen, Kenneth W wrote:
> > We have measured that the following patch give measurable performance gain
> > for industry standard db benchmark. Comments?
>
> Dave Jones wrote on Monday, March 28, 2005 7:00 PM
> > If you can't publish results from that certain benchmark due its stupid
> > restrictions, could you also try running an alternative benchmark that
> > you can show results from ?
> >
> > These nebulous claims of 'measurable gains' could mean anything.
> > I'm assuming you see a substantial increase in throughput, but
> > how much is it worth in exchange for complicating the code?
>
> Are you asking for micro-benchmark result?

There are a number of test tools floating about. reaim, orasim, osdb, others.

A number of them are fairly crufty and toy-like, but still more useful than
microbenchmarks, and they permit others to evaluate patches and to perform
further optimisation.

It's in everyone's interest that we get away from a test which has such
dopey restrictions.

2005-03-29 18:45:31

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [patch] optimization: defer bio_vec deallocation

> Dave Jones wrote on Monday, March 28, 2005 7:00 PM
> > If you can't publish results from that certain benchmark due its stupid
> > restrictions,

Forgot to thank Dave earlier for his understanding. I can't even mention the
4 letter acronym for the benchmark. Sorry, I did not make the rule nor have
the power to change the rule.


Jens Axboe wrote on Tuesday, March 29, 2005 12:13 AM
> Just _some_ results would be nice, Dave is right in that 'measurable
> gains' doesn't really say anything at all. Personally I would like to
> see a profile diff, for instance. And at least something like 'we get 1%
> gain bla bla'.

OK, performance gain for this industry db benchmark is 0.3%.


> Now, about the patch. I cannot convince myself that it is not deadlock
> prone, if someone waits for a bvec to be freed. Will slab reclaim always
> prune the bio slab and push the bvecs back into the mempool, or can
> there be cases where this doesn't happen?

So on allocation, I should always get memory from slab first, if fail then
get from mempool. Mark the bvec appropriately where the memory came from.
On deallocating bio, check bvec flag and return memory if they came from
mempool. Would that address your concern?


2005-03-29 18:49:07

by Dave Jones

[permalink] [raw]
Subject: Re: [patch] optimization: defer bio_vec deallocation

On Tue, Mar 29, 2005 at 10:44:53AM -0800, Chen, Kenneth W wrote:
> > Dave Jones wrote on Monday, March 28, 2005 7:00 PM
> > > If you can't publish results from that certain benchmark due its stupid
> > > restrictions,
>
> Forgot to thank Dave earlier for his understanding. I can't even mention the
> 4 letter acronym for the benchmark. Sorry, I did not make the rule nor have
> the power to change the rule.

That's ok, I substituted its name for some four letter words of my own. :)

Dave

2005-03-30 10:30:47

by Jens Axboe

[permalink] [raw]
Subject: Re: [patch] optimization: defer bio_vec deallocation

On Tue, Mar 29 2005, Chen, Kenneth W wrote:
> Jens Axboe wrote on Tuesday, March 29, 2005 12:13 AM
> > Just _some_ results would be nice, Dave is right in that 'measurable
> > gains' doesn't really say anything at all. Personally I would like to
> > see a profile diff, for instance. And at least something like 'we get 1%
> > gain bla bla'.
>
> OK, performance gain for this industry db benchmark is 0.3%.

OK.

> > Now, about the patch. I cannot convince myself that it is not deadlock
> > prone, if someone waits for a bvec to be freed. Will slab reclaim always
> > prune the bio slab and push the bvecs back into the mempool, or can
> > there be cases where this doesn't happen?
>
> So on allocation, I should always get memory from slab first, if fail then
> get from mempool. Mark the bvec appropriately where the memory came
> from. On deallocating bio, check bvec flag and return memory if they
> came from mempool. Would that address your concern?

Hmmm no, I don't think we are talking about the same thing (what you
describe is what currently happens anyways, this is how mempools work).
Am I guarenteed to get a bio with a bvec already assigned when doing a
bio allocation, if one exists?

--
Jens Axboe