2008-03-11 10:53:03

by Daniel Phillips

[permalink] [raw]
Subject: [RFC] Stacking bio support

Or: The Head of the Axe

With a nice shiny new interface paradigm in hand that can be easily
adapted to a variety of alternate back ends[1], let me now supply an
important building block for such a back end. This three patch series
improves both the existing block layer and device mapper without
disrupting anything, and lays the groundwork for generalizing the block
layer to be much more flexible than it is now. The ultimate goal of
this effort is a complete merge of device mapper capabilities into the
generic block layer, which could equally well be called lvm3 or bio2.
While admittedly an ambitious undertaking, I should not need to explain
why it is necessary, or why one cannot get to the end of a long road
without taking the first step.

This patch set adds support for block device stacking to the block io
layer. Due to tightening up some code and removing unnecessary slab
objects, the patch set as a whole reduces core kernel size slightly.
Further code size reductions can be expected as things progress, and
modest memory savings as well.

1: bio.single.alloc-2.6.23.12

This patch eliminates 50% of the slab allocations in the bio fast path,
tightens up some code, and turns struct bio into a variable sized
object. It replaces the bio slab by an array of bio slabs and removes
the existing biovec slab array. A small amount of system memory is
saved and some code is removed.

diffstat patch/1/bio.single.alloc-2.6.23.12
drivers/md/dm.c | 2
fs/bio.c | 187
++++++++++++++--------------------------------------
include/linux/bio.h | 2
3 files changed, 54 insertions(+), 137 deletions(-)

2: bio.hide.endio-2.6.23.12

In preparation for moving the bio endio function pointer from a fixed
location in the bio to a location on the internal stack, direct
structure references to the bio endio field are disallowed. This patch
renames the endio field, introduces wrappers to get/put field contents,
and edits all existing references to the field to go through the
wrappers.

This is a large patch, consisting entirely of trivial changes. May I
have some volunteer proofreading on this, please? With this many edits
there are sure to be a few slips. Not all of the affected files have
even been compiled. I would appreciate very much if interested
observers would volunteer.

diffstat patch/2/bio.hide.endio-2.6.23.12
block/ll_rw_blk.c | 6 ++---
drivers/block/floppy.c | 2 -
drivers/block/pktcdvd.c | 8 +++----
drivers/md/dm-crypt.c | 2 -
drivers/md/dm-emc.c | 2 -
drivers/md/dm-io.c | 2 -
drivers/md/dm.c | 2 -
drivers/md/faulty.c | 4 +--
drivers/md/md.c | 6 ++---
drivers/md/multipath.c | 4 +--
drivers/md/raid1.c | 44
+++++++++++++++++++++----------------------
drivers/md/raid10.c | 24 +++++++++++------------
drivers/md/raid5.c | 18 ++++++++---------
drivers/s390/block/xpram.c | 2 -
drivers/scsi/scsi_lib.c | 2 -
fs/bio.c | 12 +++++------
fs/block_dev.c | 2 -
fs/buffer.c | 2 -
fs/direct-io.c | 4 +--
fs/gfs2/super.c | 3 --
fs/jfs/jfs_logmgr.c | 4 +--
fs/jfs/jfs_metapage.c | 4 +--
fs/mpage.c | 4 +--
fs/ocfs2/cluster/heartbeat.c | 2 -
fs/xfs/linux-2.6/xfs_aops.c | 6 ++---
fs/xfs/linux-2.6/xfs_buf.c | 4 +--
include/linux/bio.h | 12 ++++++++++-
kernel/power/swap.c | 2 -
mm/bounce.c | 8 +++----
mm/page_io.c | 2 -
30 files changed, 104 insertions(+), 95 deletions(-)

3: bio.stack-2.6.23.12

This adds an internal stack to each struct bio and introduces two new
bio operations:

data = bio_push(bio, worksize, endio);
data = bio_pop(bio);

The first is used before submitting a bio and the second is used in the
endio handler, which gives the driver a nice way to share context
between the two events. If the requested amount stack space is not
available in the bio, the bio stack is automatically extended.

Currently the stack size in the bio is set to zero and is always
extended on the first bio_push. An upcoming revision will add a
mechanism for a block driver to specify the initial amount of stack
space it knows will always be needed. Some time further in the future,
a mechanism for discovering the stack requirements of several block
devices in a stack may be added, so that a typical bio submission is
able to traverse the entire stack with only a single bio allocation.
(I learned from Andreas Dilger last month at FAST that Lustre already
implements a mechanism along these lines.)

In addition to stacking context data, the bio endio handler is also
placed on the stack. This reflects the fact that the only valid
purpose for putting data on the bio stack is to communicate it to a
stacked endio handler, so it makes sense to set the endio handler at
the same time as allocating stack space.

It is not as clear whether the bio target device should be moved to the
stack as well. If it were, the generic endio handler could easily
account congestion data etc for the device to which the bio was
submitted. For now, this field just stays as it was.

This patch changes the wrapper functions introduced in the previous
patch to access the endio function through the stack instead of a
direct field lookup.

As currently implemented, an empty bio stack adds eight bytes to a bio
on a 32 bit machine. The expectation is that considerably more memory
in stacking drivers will be saved in return. (As it happens, this size
increase disappears in the noise of slab alignment effects, but all the
same...) A bio variant could be created without any stack for the
common case of direct bio submission to a nonstacking driver, though
the expected memory saving is arguably not worth the effort.

The fast path of the bio stacking code looks like this:

void *bio_push(struct bio *bio, unsigned size, bio_end_io_t *endio)
{
struct bioframe *frame = bio->bi_stack;
size += sizeof(struct bioframe);
size += -size & (BIOSTACK_ALIGN - 1);
bio->bi_stack += size;
*(struct bioframe *)bio->bi_stack = (struct bioframe){
.stacksize = frame->stacksize - size,
.framesize = size, .endio = endio };
return frame->space;
}

void *bio_pop(struct bio *bio)
{
struct bioframe *frame = bio->bi_stack;
frame = bio->bi_stack -= frame->framesize;
return frame->space;
}

For the push, just an 8 byte structure assignment, add to memory and
some arithmetic. The pop is just an add to memory. Both are lockless
and cacheline efficient. The slab allocations that will be replaced by
this mechanism are not lockless.

diffstat patch/3/bio.stack-2.6.23.12
fs/bio.c | 48
++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/bio.h | 11 +++++++++--
2 files changed, 57 insertions(+), 2 deletions(-)

Results

Booting a typical workstation to a graphical desktop with a vanilla
2.6.23.13 kernel leaves the slabs looking like this:

cat /proc/slabinfo | grep bio | cut -b-38
biovec-256 2 2 3072
biovec-128 2 5 1536
biovec-64 4 10 768
biovec-16 5 15 256
biovec-4 11 59 64
biovec-1 59 203 16
bio 96 150 128

Ater bio.bvec.combine, thinks look a little different:

cat /proc/slabinfo | grep bio | cut -b-38
bio-256 2 2 3200
bio-128 2 4 1664
bio-64 16 16 896
bio-16 20 20 384
bio-4 30 30 128
bio-1 90 150 128

The original bio slab is gone and the biovec slabs have been replaced by
bio slabs with slightly larger object size. (The slab object size
distribution will eventually be retuned so that the smallest slabs are
not same size when rounded to a hardware cache line boundary.)

After bio.stack, the slabs look like:

cat /proc/slabinfo | grep bio | cut -b-38
biospace 0 0 128
bio-256 2 2 3200
bio-128 2 4 1664
bio-64 5 16 896
bio-16 9 20 384
bio-4 6 30 128
bio-1 120 150 128

A new biospace slab has appeared to hold bio stack extension, otherwise
things look much the same.

4: dm.reduce.allocs-2.6.23.12

Each bio submitted to a device mapper device currently requires a
minimum of six slab allocations:

* Submitter creates the original bio (2 slab allocations)

* device mapper makes a clone of the bio (2 slab allocations)

* device mapper allocates a dm_io and a dm_target_io structure,
attaches the dm_target_io to the dm_io and attaches the dm_io
to the private pointer of the clone (2 slab allocations)

Each additional layer in a stacked device adds at least four more
allocations. Further allocations are typically done inside the target
driver, for example, ddsnap does at least one additional allocation for
all bio transfers except origin reads.

Thus, for a simple stack consisting of a ddsnap driver on top of an lvm
partition a minimum of 11 slab allocations take place each time a bio
travels down the stack. The object of the fourth patch in this set is
to reduce this to a single allocation per bio in the common case, by
taking advantage of the bio stacking functionality implemented in the
first three patches.

The bio.single.alloc patch removes three slab allocations by allowing a
bio or a clone to be created with a single allocation instead of two.
The dm.reduce.allocs patch, when finished, will replace the three slab
allocations at each device mapper level with a single call to bio_push.
Finally, ddsnap can be modified to use bio_push instead of allocating a
hook structure, which arrives at the desired single allocation for the
entire path.

All About Clone_and_Map

Each device mapper virtual device is actually a concatenation of device
mapper targets. Device mapper has to handle the case where an incoming
bio has to be split across an arbitrary number of target devices that
it overlaps. This is implemented in the call chain:

dm_request ->
__split_bio ->
__clone_and_map ->
__map_bio

The __clone_and_map function is the one of interest (note the CRUD START
and CRUD END comments in dm.c.).

Device mapper proceeds by working from beginning to end of the original
bio, creating and submitting clone bios that do not cross target
boundaries until the remainder fits entirely within a single target.
The final case is the common one, and in fact the only case when the
virtual device has a single target. Today's patch optimizes the
remainder case to use the push/pop mechanism to allow the original bio
to be resubmitted instead of submitting a clone. The other cases are
left alone for now, so the code does not actually get smaller yet.
But you can compare my new "dm_map" function to the __map_bio function
used by the clones and see that it is quite a bit shorter already, and
so is the code that drives it from __clone_and_map.

In the general case where a bio crosses target boundaries, cloning
cannot be avoided because we want to initiate multiple dependent
transfers in parallel. However, the dm_io and dm_target_io allocations
can be replaced by stack allocations within the bio.

Finally, the whole call chain above can be cleaned up considerably,
because it really should be a single function, and suffers from
considerable fluff associated with communicating between the wrongly
factored pieces. The result will look lot like what the generic bio
layer needs in order to do the tricks that device mapper does.

In passing I have to say that __clone_and_map, apart from looking pretty
ugly, is a really nice hack. Joe Thornber surely got out of the right
side of bed the day he dreamed that up. It is just too bad that our
little community managed to annoy him so much that he ran off to work
in a bank, thus depriving us of further insights from him, and leaving
Device Mapper core in a state of suspended animation for the last four
years while Sun did not stand still. Food for thought.

Finally

Once again, thankyou intrepid reader for reading all the way to here. I
have heard at least one complaint that my posts are too long. Well, I
have this to say about that: if you think today's post is a lot of
words to read, don't even think about reading War and Peace :-)

Stay tuned for more yummy patches on the road to lvm3.

[1] ddsetup, http://lkml.org/lkml/2008/3/5/82, An alternative interface
to device mapper

http://zumastor.org/lvm3/bio.single.alloc-2.6.23.12
http://zumastor.org/lvm3/bio.hide.endio-2.6.23.12
http://zumastor.org/lvm3/bio.stack-2.6.23.12
http://zumastor.org/lvm3/dm.reduce.allocs-2.6.23.12


2008-03-11 11:33:37

by Ph. Marek

[permalink] [raw]
Subject: Re: [RFC] Stacking bio support

Hello Daniel!

On Dienstag, 11. M?rz 2008, Daniel Phillips wrote:
> 3: bio.stack-2.6.23.12
>
> This adds an internal stack to each struct bio and introduces two new
> bio operations:
>
> data = bio_push(bio, worksize, endio);
> data = bio_pop(bio);
>
> The first is used before submitting a bio and the second is used in the
> endio handler, which gives the driver a nice way to share context
> between the two events. If the requested amount stack space is not
> available in the bio, the bio stack is automatically extended.
...
> (I learned from Andreas Dilger last month at FAST that Lustre already
> implements a mechanism along these lines.)
Win32 has IRP stacks, which do mostly the same AFAIU.
http://msdn2.microsoft.com/en-us/library/ms796144.aspx

How do you handle the reallocation?
- If you don't do it (but rely on the fact that the initial allocation is
enough), you might end up with NO_MORE_IRP_STACK_LOCATIONS
http://msdn2.microsoft.com/en-us/library/ms793675.aspx
- If you do reallocate, the allocations have to register themselves in
the emergency pool (see the current thread about swapping over NFS)

I don't say that it's impossible ... just that some "interesting" things will
await you.

> Currently the stack size in the bio is set to zero and is always
> extended on the first bio_push. An upcoming revision will add a
> mechanism for a block driver to specify the initial amount of stack
> space it knows will always be needed. Some time further in the future,
> a mechanism for discovering the stack requirements of several block
> devices in a stack may be added, so that a typical bio submission is
> able to traverse the entire stack with only a single bio allocation.
That's different from the Win32 way AFAIK - there it's defined that every
layer *has* to use its own stack location. (But it's been some time since I
needed that, so I might be wrong.)


But I sure hope you succeed!


Regards,

Phil

2008-03-11 12:07:29

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] Stacking bio support

On Tuesday 11 March 2008 04:33, Ph. Marek wrote:
> Win32 has IRP stacks, which do mostly the same AFAIU.
> http://msdn2.microsoft.com/en-us/library/ms796144.aspx

That seems to be filling a similar need all right, though it looks
like a fancier (read: clunkier) solution.

> How do you handle the reallocation?
> - If you don't do it (but rely on the fact that the initial allocation is
> enough), you might end up with NO_MORE_IRP_STACK_LOCATIONS
> http://msdn2.microsoft.com/en-us/library/ms793675.aspx
> - If you do reallocate, the allocations have to register themselves in
> the emergency pool (see the current thread about swapping over NFS)

Yes, I reallocate. I do not currently register these with the
emergency pool, good spotting. I intend to do all such reallocations
with GFP_MEMALLOC (out of tree deadlock-prevention allocation flag) and
rely on (out of tree) bio throttling to prevent the memalloc reserve
from being exhausted. Hopefully these things will be in-tree in due
course.

Incidentally, the bio stack should make the bio throttling somewhat
more elegant, a nice circular effect.

> I don't say that it's impossible ... just that some "interesting" things will
> await you.

Tell me about it :-)

> That's different from the Win32 way AFAIK - there it's defined that every
> layer *has* to use its own stack location. (But it's been some time since I
> needed that, so I might be wrong.)

I think you are right. In fact, I thought about this for a couple of
years, always getting hung up at exactly that point. When I stopped
trying to see the stack as a fixed size object with preassigned frames,
the rest fell into place. One obvious problem with the pre-assigned
approach: you don't always know the path ahead of time that a bio
will take to a physical device.

> But I sure hope you succeed!

Thankyou for your useful comments. I do need to present a solution
complete with deadlock prevention. I guess the bio code will end up
simpler there too, because with the memalloc anti-deadlock approach,
the array of bio mempools can go away.

Regards,

Daniel

2008-03-11 12:38:45

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [RFC] Stacking bio support

On Tue, Mar 11 2008 at 14:07 +0200, Daniel Phillips <[email protected]> wrote:
> On Tuesday 11 March 2008 04:33, Ph. Marek wrote:
>> Win32 has IRP stacks, which do mostly the same AFAIU.
>> http://msdn2.microsoft.com/en-us/library/ms796144.aspx
>
> That seems to be filling a similar need all right, though it looks
> like a fancier (read: clunkier) solution.
>
>> How do you handle the reallocation?
>> - If you don't do it (but rely on the fact that the initial allocation is
>> enough), you might end up with NO_MORE_IRP_STACK_LOCATIONS
>> http://msdn2.microsoft.com/en-us/library/ms793675.aspx
>> - If you do reallocate, the allocations have to register themselves in
>> the emergency pool (see the current thread about swapping over NFS)
>
> Yes, I reallocate. I do not currently register these with the
> emergency pool, good spotting. I intend to do all such reallocations
> with GFP_MEMALLOC (out of tree deadlock-prevention allocation flag) and
> rely on (out of tree) bio throttling to prevent the memalloc reserve
> from being exhausted. Hopefully these things will be in-tree in due
> course.
>

I guess that is not worse then current implementation. So many slab
allocations saved, lets you have a few of your own.
>From passed experience, though, relocation is changed to a link-list
(chaining) of sorts the first time relocation starts to hit consistently.
for 1-in-100 case it can stay, any higher then that better allocate more
space and chain it to the old. It also fits better with the pools paradigm.
I would leave the reallocation for a while but make sure it is all hidden
behind the right API to be easily enhanced later on.

> Incidentally, the bio stack should make the bio throttling somewhat
> more elegant, a nice circular effect.
>
>> I don't say that it's impossible ... just that some "interesting" things will
>> await you.
>
> Tell me about it :-)
>
>> That's different from the Win32 way AFAIK - there it's defined that every
>> layer *has* to use its own stack location. (But it's been some time since I
>> needed that, so I might be wrong.)
>
> I think you are right. In fact, I thought about this for a couple of
> years, always getting hung up at exactly that point. When I stopped
> trying to see the stack as a fixed size object with preassigned frames,
> the rest fell into place. One obvious problem with the pre-assigned
> approach: you don't always know the path ahead of time that a bio
> will take to a physical device.
>
>> But I sure hope you succeed!
>
> Thankyou for your useful comments. I do need to present a solution
> complete with deadlock prevention. I guess the bio code will end up
> simpler there too, because with the memalloc anti-deadlock approach,
> the array of bio mempools can go away.
>
> Regards,
>
> Daniel

Me too, I'll be watching out on your progress, It looks like the building
blocks of some very advanced possibilities. ("Pure Data")

Do you have a public git tree that we can inspect from time to time.

Cheers
Boaz

2008-03-14 13:59:33

by Alan D. Brunelle

[permalink] [raw]
Subject: Re: [RFC] Stacking bio support

Hi Daniel -

I was able to get the first 3 patches ported to the stable tree of 2.6.23, but there were compilation errors - fixed in the 4th patch in the stream I'll post next. Regardless of that, it still failed to boot on a 2-way AMD64. Do you have a set of these patches that builds correctly & boots! :-)

There are checkpatch.pl notices about your patches as well. In any event, like I said above, I'll post the 4 patches that get it to at least build properly.

Alan

2008-03-14 14:06:28

by Alan D. Brunelle

[permalink] [raw]
Subject: Re: [RFC] Stacking bio support

Sorry, this is patch 2/4...

2008-03-16 21:38:30

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] Stacking bio support

On Friday 14 March 2008 06:59, Alan D. Brunelle wrote:
> I was able to get the first 3 patches ported to the stable tree of
> 2.6.23, but there were compilation errors - fixed in the 4th patch
> in the stream I'll post next. Regardless of that, it still failed
> to boot on a 2-way AMD64. Do you have a set of these patches that
> builds correctly & boots! :-)

Here you are:

http://code.google.com/p/zumastor/source/browse/www/lvm3/bio.single.alloc-2.6.24.3
http://code.google.com/p/zumastor/source/browse/www/lvm3/bio.hide.endio-2.6.24.3
http://code.google.com/p/zumastor/source/browse/www/lvm3/bio.stack-2.6.24.3
http://code.google.com/p/zumastor/source/browse/www/lvm3/dm.reduce.allocs-2.6.24.3

This compiles and boots for me, though I only tried device mapper
under UML, and then only lightly. It would be nice to see some reports
on the stability on this, one way or the other. Next move is to clean
up __clone_and_map the rest of the way.

Thanks for your merge. Not easy, hmm? The mainline bio changes
generated a bunch of tracking work, but they are sensible changes so
no regrets (the bogus endio bytes parameter and int return are gone).
Now... it appears to me that bi_size is an entirely redundant field,
so while we are turfing bio junk out...

Regards,

Daniel