The 2.5 port of device-mapper is available from:
bk://device-mapper.bkbits.net/2.5-stable
There are 6 changesets in here that are summarised at the end of this
email.
If you wish to use it you will need to install libdevmapper:
ftp://ftp.sistina.com/pub/LVM2/device-mapper/device-mapper-latest.tgz
and then either use dmsetup, or the LVM tools:
ftp://ftp.sistina.com/pub/LVM2/tools/LVM2.0-latest.tgz
Initial testing has been successful, though it hasn't been exercised
nearly as much as the stable 2.4 releases.
There is a chunk of ~50 lines in dm.c, which I have clearly labelled;
this is stop-gap block-splitting code that will be replaced with the
correct use of queue->merge_bvec_fn etc. However, before I can make
this change there are a couple of other patches to the block layer
that I want to merge.
Since the feature freeze is looming I would appreciate it if this code
were merged now, allowing me to patch/argue to get the extra
performance at my leisure. Good plan?
Slightly more info available at: http://people.sistina.com/~thornber/
- Joe Thornber
[email protected], 2002-10-09 17:06:21+01:00, [email protected]
[mempool]
Most people use mempools in conjunction with slabs; this defines
a couple of utility functions for allocating/freeing.
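As a rough userland analogue of what such glue functions look like: mempool wants alloc/free callbacks, and when the pool is backed by a fixed-size allocator those callbacks are pure boilerplate. Here malloc/free stand in for the slab cache, and all names are hypothetical, not the actual changeset's API:

```c
#include <stdlib.h>

/* pool_data carries the element size, the way a slab cache pointer
 * would be carried in the kernel version (illustrative sketch only). */
static void *pool_alloc_elem(void *pool_data)
{
	size_t elem_size = *(size_t *)pool_data;
	return malloc(elem_size);
}

static void pool_free_elem(void *elem, void *pool_data)
{
	(void)pool_data;	/* malloc/free need no cache handle */
	free(elem);
}
```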
[email protected], 2002-10-09 17:09:58+01:00, [email protected]
[vmalloc]
Introduce vcalloc; I only really want it to automate the size overflow
check when allocating arrays.
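A minimal userland sketch of the overflow check being automated here (malloc stands in for vmalloc; the name and shape are illustrative, not the changeset's code):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Refuse array sizes whose nmemb * elem_size product would overflow
 * size_t, then allocate zeroed memory. */
static void *vcalloc_sketch(size_t nmemb, size_t elem_size)
{
	if (elem_size && nmemb > SIZE_MAX / elem_size)
		return NULL;	/* product would overflow */

	void *p = malloc(nmemb * elem_size);
	if (p)
		memset(p, 0, nmemb * elem_size);
	return p;
}
```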
[email protected], 2002-10-09 17:28:44+01:00, [email protected]
[Device mapper]
The core of the device-mapper driver. No interface or target types
included as yet.
[email protected], 2002-10-09 17:33:00+01:00, [email protected]
[Device mapper]
The linear target maps a contiguous range of logical sectors onto a
contiguous range of physical sectors.
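The remap arithmetic is a single offset calculation; a sketch with illustrative names (not the target's actual code):

```c
#include <stdint.h>

typedef uint64_t sector_t;

/* A sector's offset into the mapped range becomes the same offset
 * from the start of the destination range on the physical device. */
static sector_t linear_map(sector_t bi_sector, sector_t target_begin,
			   sector_t dest_start)
{
	return dest_start + (bi_sector - target_begin);
}
```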
[email protected], 2002-10-09 17:36:55+01:00, [email protected]
[Device mapper]
The stripe target. Maps a range of logical sectors across many
physical volumes.
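The striping arithmetic looks roughly like this sketch (illustrative names; the real target restricts chunk_size to a power of two and uses shifts rather than division):

```c
#include <stdint.h>

typedef uint64_t sector_t;

/* Split the address space into chunks, deal chunks round-robin
 * across the stripes, and keep the offset within the chunk. */
static void stripe_map(sector_t sector, sector_t chunk_size,
		       unsigned stripes, unsigned *dev,
		       sector_t *dev_sector)
{
	sector_t chunk = sector / chunk_size;
	sector_t offset = sector % chunk_size;

	*dev = (unsigned)(chunk % stripes);
	*dev_sector = (chunk / stripes) * chunk_size + offset;
}
```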
[email protected], 2002-10-09 17:41:10+01:00, [email protected]
[Device mapper]
Provide a traditional ioctl based interface to control device-mapper
from userland.
Joe Thornber wrote:
>
> The 2.5 port of device-mapper is available from:
>
> bk://device-mapper.bkbits.net/2.5-stable
Is it available in a form which everyone can access?
> There are 6 changesets in here that are summarised at the end of this
> email.
>
> If you wish to use it you will need to install libdevmapper:
>
> ftp://ftp.sistina.com/pub/LVM2/device-mapper/device-mapper-latest.tgz
>
That appears to have 2.4.19 diffs only?
What I really wanted to know is "are you still using kiobufs"?
If so, what do we need to do to not do that?
On Wed, Oct 09, 2002 at 11:31:34AM -0700, Andrew Morton wrote:
> Joe Thornber wrote:
> >
> > The 2.5 port of device-mapper is available from:
> >
> > bk://device-mapper.bkbits.net/2.5-stable
>
> Is it available in a form which everyone can access?
http://people.sistina.com/~thornber/dm_2002-10-09.tar.bz2
> What I really wanted to know is "are you still using kiobufs"?
What I just submitted doesn't include the snapshot target as yet; I
want people to focus on the dm core rather than getting sidetracked
by one of the targets.
> If so, what do we need to do to not do that?
For the last few months all snapshot io has been going through kcopyd,
*except* the snapshot metadata update (which did use a kiobuf). I
have a patch which sends the metadata through kcopyd in the works, so
when I submit the snapshot target it will not use a kiobuf - I'm no
fan of kiobufs.
- Joe
On Wed, 2002-10-09 at 19:12, Joe Thornber wrote:
> The 2.5 port of device-mapper is available from:
>
> bk://device-mapper.bkbits.net/2.5-stable
>
> There are 6 changesets in here that are summarised at the end of this
> email.
Until I get it in patch form I can't merge it...
On Sat, 2002-10-12 at 21:05, Alan Cox wrote:
> On Wed, 2002-10-09 at 19:12, Joe Thornber wrote:
> > The 2.5 port of device-mapper is available from:
> >
> > bk://device-mapper.bkbits.net/2.5-stable
> >
> > There are 6 changesets in here that are summarised at the end of this
> > email.
>
> Until I get it in patch form I can't merge it...
(Please ignore. I sent an old postponed mail. I don't want to be mailed
lots of copies of the patches 8))
Just curious, but device-mapper and 2.5.42 do not seem to jive very
well. Please advise.
--
Austin Gonyou <[email protected]>
Coremetrics, Inc.
On Mon, Oct 14, 2002 at 11:59:17AM -0500, Austin Gonyou wrote:
> Just curious, but device-mapper and 2.5.42 do not seem to jive very
> well. Please advise.
You need device-jiver. Doubtful it'll make the freeze in time though.
Seriously, what exactly does this bug report mean ?
Dave
--
| Dave Jones. http://www.codemonkey.org.uk
On Mon, 2002-10-14 at 12:11, Dave Jones wrote:
> On Mon, Oct 14, 2002 at 11:59:17AM -0500, Austin Gonyou wrote:
> > Just curious, but device-mapper and 2.5.42 do not seem to jive very
> > well. Please advise.
>
> You need device-jiver. Doubtful it'll make the freeze in time though.
> Seriously, what exactly does this bug report mean ?
Sorry..sorry...I'm saying that the recent device-mapper doesn't seem
to compile against 2.5.42, and was inquiring as to:
1. Is it not supposed to work yet?
2. Are there any patches to make this work as yet?
#2 is one of the main reasons I'm responding to this thread. I'm an
interested party for testing LVM2 on 2.5, but can't seem to get there.
There was discussion about using EVMS, but it didn't seem to work
as seamlessly as LVM upgrades had, and I won't have time to
troubleshoot it until later this week.
> Dave
>
> --
> | Dave Jones. http://www.codemonkey.org.uk
--
Austin Gonyou <[email protected]>
Coremetrics, Inc.
On Mon, Oct 14, 2002 at 11:59:17AM -0500, Austin Gonyou wrote:
> Just curious, but device-mapper and 2.5.42 do not seem to jive very
> well. Please advise.
Try the patches at:
http://people.sistina.com/~thornber/patches/2.5-unstable/
There are three minor changes:
09.patch
[Device-mapper]
Remove linux/iobuf.h include.
10.patch
[Device-mapper]
Add call to blk_queue_bounce() at the beginning of the request function.
11.patch
[Device-mapper]
Pass md into dm_suspend/dm_resume instead of a kdev_t.
- Joe
On Mon, 2002-10-14 at 12:56, Joe Thornber wrote:
> On Mon, Oct 14, 2002 at 11:59:17AM -0500, Austin Gonyou wrote:
> > Just curious, but device-mapper and 2.5.42 do not seem to jive very
> > well. Please advise.
>
> Try the patches at:
>
> http://people.sistina.com/~thornber/patches/2.5-unstable/
>
Thanks Joe, I will. I will let you know ASAP.
--
Austin Gonyou <[email protected]>
Coremetrics, Inc.
On Mon, Oct 14 2002, Joe Thornber wrote:
> 10.patch
> [Device-mapper]
> Add call to blk_queue_bounce() at the beginning of the request function.
What on earth for? I also see that you are setting BLK_BOUNCE_HIGH as
the bounce limit unconditionally for your queue. Puzzled.
When does dm even have to touch the data in the bio?
--
Jens Axboe
On Tue, Oct 15, 2002 at 10:21:52AM +0200, Jens Axboe wrote:
> On Mon, Oct 14 2002, Joe Thornber wrote:
> > 10.patch
> > [Device-mapper]
> > Add call to blk_queue_bounce() at the beginning of the request function.
>
> What on earth for? I also see that you are setting BLK_BOUNCE_HIGH as
> the bounce limit unconditionally for your queue. Puzzled.
This is just me stupidly copying loop.c; I already found out it doesn't
work. Please ignore.
> When does dm even have to touch the data in the bio?
Tell me: if I'm splitting a bio using bio_clone, and then map the bio
to a driver that calls blk_queue_bounce, how can I avoid the

	BUG_ON((*bio_orig)->bi_idx);

triggering? Or is bio_clone not to be used anymore?
- Joe
On Tue, Oct 15 2002, Joe Thornber wrote:
> On Tue, Oct 15, 2002 at 10:21:52AM +0200, Jens Axboe wrote:
> > On Mon, Oct 14 2002, Joe Thornber wrote:
> > > 10.patch
> > > [Device-mapper]
> > > Add call to blk_queue_bounce() at the beginning of the request function.
> >
> > What on earth for? I also see that you are setting BLK_BOUNCE_HIGH as
> > the bounce limit unconditionally for your queue. Puzzled.
>
> This is just me stupidly copying loop.c, already found out it doesn't
> work. Please ignore.
Well it should work, but dm never ever wants to bounce any data on its
own. That should only happen at the target queue, if at all.
> > When does dm even have to touch the data in the bio?
>
> Tell me; if I'm splitting a bio using bio_clone, and then map the bio
> to a driver that calls blk_queue_bounce. How can I avoid the
>
> BUG_ON((*bio_orig)->bi_idx);
>
> triggering ? Or is bio_clone not to be used anymore ?
(btw, do you have a complete patch against a recent kernel?)
Bouncing has just never been used with a cloned bio, so there might be a
corner case or two that needs to be fixed up. But walk me through your
request handling, please. It seems you are always allocating a
clone_info and bio clone for io?
--
Jens Axboe
On Tue, Oct 15, 2002 at 11:36:08AM +0200, Jens Axboe wrote:
> (btw, do you have a complete patch against a recent kernel?)
Split patches are in here and were built against whatever bk pulled
yesterday.
http://people.sistina.com/~thornber/patches/2.5-unstable
> Bouncing has just never been used with a cloned bio, so there might be a
> corner case or two that needs to be fixed up.
OK, in that case I think we have found one of those corner cases.
Please review the BUG_ON at the top of blk_queue_bounce.
> But walk me through your
> request handling, please. It seems you are always allocating a
> clone_info and bio clone for io?
Before I dive into what dm is doing now I'll describe a couple of
changes that I feel need to be made to the block layer.
Flushing mapped io
------------------
Whenever a mapping driver changes the mapping of a device that is in
use, it should ensure that any io already mapped with the previous
mapping/table has completed. At the moment I do this by hooking the
bi_end_io fn and incrementing a pending count (see
dm_suspend/dm_resume). Your new merge_bvec_fn (which I like) means
that the request has effectively been partially remapped *before* it
gets into the request function, i.e. the request is now mapping
specific. So before I can use merge_bvec_fn we need to move the
suspend/resume functionality up into ll_rw_block (eg, move the pending
count into the queue?). *Every* driver that dynamically changes a
mapping needs this functionality; LVM1 ignores it but is so simple it
*might* get away with it, dm certainly needs it, and I presume EVMS
has similar code in their application.
Agree ?
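The pending-count scheme described above can be sketched like this in plain C, with stdatomic standing in for the kernel's atomic_t and all names hypothetical:

```c
#include <stdatomic.h>

/* Every mapped bio bumps a counter; a hooked end_io drops it.
 * A table swap is safe only once the device is suspended and all
 * in-flight io has drained (sketch of the dm_suspend idea). */
struct mapped_dev {
	atomic_int pending;
	int suspended;
};

static void io_mapped(struct mapped_dev *md)
{
	atomic_fetch_add(&md->pending, 1);
}

static void io_done(struct mapped_dev *md)	/* from the end_io hook */
{
	atomic_fetch_sub(&md->pending, 1);
}

static int safe_to_swap_table(struct mapped_dev *md)
{
	return md->suspended && atomic_load(&md->pending) == 0;
}
```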
Splitting pages
---------------
ATM the merge_bvec_fn is a predicate which states whether a page
_can't_ be appended to a bio (a curious way round to do it). The
driver will always have to contain code that copes with the case when
a mapping boundary occurs within a page (this happens). I would
really like to go the whole hog and change the merge_fn so that it
returns how much of the page can be accepted. As far as I can tell
(you're probably about to put me right), the only reason you haven't
done this is because:
/*
* I/O completion handler for multipage BIOs.
*
* The mpage code never puts partial pages into a BIO (except for end-of-file).
* If a page does not map to a contiguous run of blocks then it simply falls
* back to block_read_full_page().
*
* Why is this? If a page's completion depends on a number of different BIOs
* which can complete in any order (or at the same time) then determining the
* status of that page is hard. See end_buffer_async_read() for the details.
* There is no point in duplicating all that complexity.
*/
I believe it _is_ possible to implement this without incurring
significant CPU overhead or extra memory usage in the common case that
the mapping boundaries do occur on page breaks (This was another
reason why I was trying to get the users of bio_add_page using a
slightly higher level api that takes the bio submitting out of their
hands).
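The proposed change amounts to this kind of calculation: rather than a yes/no answer, return how many bytes of the candidate bvec fit before the next mapping boundary. A sketch with hypothetical names, not the actual merge_bvec_fn signature:

```c
#include <stdint.h>

typedef uint64_t sector_t;

/* How many bytes of a bvec_len-byte vec can be accepted, given the
 * bio currently ends at bio_end_sector and the next mapping boundary
 * sits at boundary_sector. */
static unsigned merge_bytes(sector_t bio_end_sector,
			    sector_t boundary_sector, unsigned bvec_len)
{
	uint64_t room = (boundary_sector - bio_end_sector) << 9; /* sectors to bytes */

	return room < bvec_len ? (unsigned)room : bvec_len;
}
```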
BTW: is there any point calling the merge_bvec_fn in bio_add_page when
the bio's length is zero?
dm splitting now
----------------
The splitting code as it exists ATM is ignoring the merge_bvec_fn.
I'm really not worried about performance yet; that said, bonnie++
results are very similar on a raw partition or a dm mapping.
> request handling, please. It seems you are always allocating a
> clone_info and bio clone for io?
The clone_info is a little struct that sits on the stack, keeping
track of the splitting progress. It is not dynamically allocated.
static void __split_bio(struct mapped_device *md, struct bio *bio)
{
	struct clone_info ci;

	...

	while (ci.sector_count)
		__clone_and_map(&ci);
If you look at the 2.4 version of dm you will see I'm using a struct
io_hook to keep track of mapped bios in order to maintain the pending
count for dm_suspend/dm_resume. Since I will often need to allocate a
bio_clone anyway I decided to do away with the struct io_hook, and
always allocate the clone.
I will be very happy when the above two changes have been made to the
block layer and the request function can become a simple, single
remapping. No splitting or memory allocations will then be
necessary.
- Joe
On Tue, Oct 15 2002, Joe Thornber wrote:
> On Tue, Oct 15, 2002 at 11:36:08AM +0200, Jens Axboe wrote:
> > (btw, do you have a complete patch against a recent kernel?)
>
> Split patches are in here and were built against whatever bk pulled
> yesterday.
>
> http://people.sistina.com/~thornber/patches/2.5-unstable
ok
> > Bouncing has just never been used with a cloned bio, so there might be a
> > corner case or two that needs to be fixed up.
>
> OK, in that case I think we have found one of those corner cases.
> Please review the BUG_ON at the top of blk_queue_bounce.
I remember this BUG_ON() and I also remember why I added it. It was
basically laziness; what it tells you is that blk_queue_bounce() is for
now expecting that it should iterate all vecs in a bio. It really needs
to go from bi_idx -> bi_vcnt. I _think_ this was when
bio_for_each_segment() meant iterating from 0 -> bi_vcnt. These days it
correctly does:
#define bio_for_each_segment(bvl, bio, i) \
	__bio_for_each_segment(bvl, bio, i, (bio)->bi_idx)
so blk_queue_bounce() could potentially Just Work if you remove the
BUG_ON() and count segments and size to set in the allocated bio instead
of just copying bio_orig bi_size and bi_vcnt. Oh, and start sector too.
It would be nice to have the bio->bi_voffset patches for this :-)
> > But walk me through your
> > request handling, please. It seems you are always allocating a
> > clone_info and bio clone for io?
>
> Before I dive into what dm is doing now I'll describe a couple of
> changes that I feel need to be made to the block layer.
>
> Flushing mapped io
> ------------------
>
> Whenever a mapping driver changes the mapping of a device that is in
> use, it should ensure that any io already mapped with the previous
> mapping/table has completed. At the moment I do this by hooking the
> bi_end_io fn and incrementing a pending count (see
> dm_suspend/dm_resume). Your new merge_bvec_fn (which I like) means
> that the request has effectively been partially remapped *before* it
> gets into the request function. ie. the request is now mapping
> specific. So before I can use merge_bvec_fn we need to move the
> suspend/resume functionality up into ll_rw_block (eg, move the pending
> count into the queue ?). *Every* driver that dynamically changes a
> mapping needs this functionality, LVM1 ignores it but is so simple it
> *might* get away with it, dm certainly needs it, and I presume EVMS
> has similar code in their application.
>
> Agree ?
Agreed, we definitely need some sort of suspend (and flush) + resume
queue hooks.
> Splitting pages
> ---------------
>
> ATM the merge_bvec_fn is a predicate which states whether a page
> _can't_ be appended to a bio (a curious way round to do it). The
> driver will always have to contain code that copes with the case when
> a mapping boundary occurs within a page (this happens). I would
Yes
> really like to go the whole hog and change the merge_fn so that it
> returns how much of the page can be accepted. As far as I can tell
Irk. This was in the first patches sent to Linus, but it gets ugly due
to several reasons. More below.
> (you're probably about to put me right), the only reason you haven't
> done this is because:
>
> /*
> * I/O completion handler for multipage BIOs.
> *
> * The mpage code never puts partial pages into a BIO (except for end-of-file).
> * If a page does not map to a contiguous run of blocks then it simply falls
> * back to block_read_full_page().
> *
> * Why is this? If a page's completion depends on a number of different BIOs
> * which can complete in any order (or at the same time) then determining the
> * status of that page is hard. See end_buffer_async_read() for the details.
> * There is no point in duplicating all that complexity.
> */
>
> I believe it _is_ possible to implement this without incurring
> significant CPU overhead or extra memory usage in the common case that
> the mapping boundaries do occur on page breaks (This was another
> reason why I was trying to get the users of bio_add_page using a
> slightly higher level api that takes the bio submitting out of their
> hands).
mpage wasn't really considered; it's a simple enough user that it can be
made to use basically any interface. The main reason is indeed the way
it is used right now, directly from any submitter of bios.
I'd be happy to take patches that change how this works. First patch
could change merge_bvec_fn() to return number of bytes you can accept at
this location. This can be done without breaking anything else, as long
as bio_add_page() is changed to read:
	if (q->merge_bvec_fn) {
		ret = q->merge_bvec_fn(q, bio, bvec);
		if (ret != bvec->bv_len)
			return 1;
	}
Second patch can then add a wrapper around bio_add_page() for queueing a
range of pages for io to a given device. That solves the very reason
for not having this in the first place: having a single page in more
than one bio will break several bi_end_io() handling functions. If
you wrap submission and end_io, it could easily be made to work.
> BTW: is there any point calling the merge_bvec_fn in bio_add_page when
> the bio's length is zero?
Nah, not really.
> dm splitting now
> ----------------
>
> The splitting code as it exists ATM is ignoring the merge_bvec_fn.
> I'm really not worried about performance yet, that said bonnie++
> results are very similar on a raw partition or a dm mapping.
>
> > request handling, please. It seems you are always allocating a
> > clone_info and bio clone for io?
>
> The clone_info is a little struct that sits on the stack, keeping
> track of the splitting progress. It is not dynamically allocated.
>
> static void __split_bio(struct mapped_device *md, struct bio *bio)
> {
> 	struct clone_info ci;
>
> 	...
>
> 	while (ci.sector_count)
> 		__clone_and_map(&ci);
>
>
> If you look at the 2.4 version of dm you will see I'm using a struct
> io_hook to keep track of mapped bios in order to maintain the pending
> count for dm_suspend/dm_resume. Since I will often need to allocate a
> bio_clone anyway I decided to do away with the struct io_hook, and
> always allocate the clone.
>
> I will be very happy when the above two changes have been made to the
> block layer and the request function can become a simple, single
> remapping. No splitting or memory allocations will then be
> necessary.
Ok fine.
--
Jens Axboe
On Tuesday 15 October 2002 05:34, Jens Axboe wrote:
> On Tue, Oct 15 2002, Joe Thornber wrote:
> > Flushing mapped io
> > ------------------
> >
> > Whenever a mapping driver changes the mapping of a device that is in
> > use, it should ensure that any io already mapped with the previous
> > mapping/table has completed. At the moment I do this by hooking the
> > bi_end_io fn and incrementing a pending count (see
> > dm_suspend/dm_resume). Your new merge_bvec_fn (which I like) means
> > that the request has effectively been partially remapped *before* it
> > gets into the request function. ie. the request is now mapping
> > specific. So before I can use merge_bvec_fn we need to move the
> > suspend/resume functionality up into ll_rw_block (eg, move the pending
> > count into the queue ?). *Every* driver that dynamically changes a
> > mapping needs this functionality, LVM1 ignores it but is so simple it
> > *might* get away with it, dm certainly needs it, and I presume EVMS
> > has similar code in their application.
> >
> > Agree ?
>
> Agreed, we definitely need some sort of suspend (and flush) + resume
> queue hooks.
Also agreed. Not having to worry about device suspends/resumes will probably
remove a lot of internal complexity in both EVMS and DM. And like Joe
said, moving this functionality into the block layer seems to be a
pre-requisite for the merge_bvec_fn() to really work properly and avoid races
between adding pages to a bio and changing the mapping of the targeted
device.
> > I believe it _is_ possible to implement this without incurring
> > significant CPU overhead or extra memory usage in the common case that
> > the mapping boundaries do occur on page breaks (This was another
> > reason why I was trying to get the users of bio_add_page using a
> > slightly higher level api that takes the bio submitting out of their
> > hands).
>
> mpage wasn't really considered, it's a simple enough user that it can be
> made to use basically any interface. The main reason is indeed the way
> it is used right now, directly from any submitter of bio's.
>
> I'd be happy to take patches that change how this works. First patch
> could change merge_bvec_fn() to return number of bytes you can accept at
> this location. This can be done without breaking anything else, as long
> as bio_add_page() is changed to read:
>
> 	if (q->merge_bvec_fn) {
> 		ret = q->merge_bvec_fn(q, bio, bvec);
> 		if (ret != bvec->bv_len)
> 			return 1;
> 	}
>
> Second patch can then add a wrapper around bio_add_page() for queueing a
> range of pages for io to a given device. That solves the very problem
> for not having this in the first place - having a single page in more
> than one bio, this will break several bi_end_io() handling functions. If
> you wrap submission and end_io, it could easily be made to work.
Is there any reason why the call to merge_bvec_fn() couldn't be moved to the
layer above bio_add_page()? Calling this for every individual page seems to
be a lot of extra overhead if you are trying to construct a very large
request. In other words, in this wrapper around bio_add_page(), call
merge_bvec_fn() with the initial bio (and hence the starting sector of the
request), and the driver could return the max number of bytes/sectors that
could possibly be handled starting at that sector. Then the wrapper would
know up-front how many pages could be safely added to the bio before going
through any of the work to actually add them.
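The wrapper being proposed here reduces to something like this sketch, where the driver is asked once and whole pages are then added up to that budget. All names are hypothetical; the bio_add_page() call itself is elided as a comment:

```c
#include <stdint.h>

typedef uint64_t sector_t;

/* Hypothetical driver callback: max bytes serviceable from 'sector'. */
typedef unsigned (*max_bytes_fn)(sector_t sector);

/* Query the limit once up front, then count how many whole pages fit,
 * instead of calling merge_bvec_fn per page. */
static unsigned pages_to_add(max_bytes_fn limit, sector_t sector,
			     unsigned npages, unsigned page_bytes)
{
	unsigned budget = limit(sector);
	unsigned added = 0;

	while (added < npages && budget >= page_bytes) {
		/* bio_add_page() for the next page would go here */
		budget -= page_bytes;
		added++;
	}
	return added;
}
```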
Also, am I correct in assuming that for merge_bvec_fn() to work properly, a
driver's merge_bvec_fn() must also call the merge_bvec_fn() of the driver
below it? For example, let's say we have a DM linear device that maps to two
underlying devices (or in LVM-speak, a linear LV that spans two PVs), both of
which are MD RAID-1 devices. For a given large request, DM may decide that it
is fully contained within one of its two target devices, and allow all the
requested pages to be added to the bio. However, it also needs to ask MD what
its limits are for that request, or MD would still have to go through the
trouble to split up the bio after it has been submitted.
> > I will be very happy when the above two changes have been made to the
> > block layer and the request function can become a simple, single
> > remapping. No splitting or memory allocations will then be
> > necessary.
>
> Ok fine.
Agreed. With the above block layer changes, most of the bio splitting that
EVMS does could be removed, and replaced with appropriate merge_bvec_fn()'s.
About the only exception I currently see to this is something like
bad-block-relocation, where the driver can dynamically change the mapping at
runtime, even without the device being suspended. However, since such dynamic
remappings should be very rare, they should probably just be handled as a
special case within that driver.
--
Kevin Corry
[email protected]
http://evms.sourceforge.net/
On Tuesday 15 October 2002 09:52, Kevin Corry wrote:
> Also, am I correct in assuming that for merge_bvec_fn() to work properly, a
> driver's merge_bvec_fn() must also call the merge_bvec_fn() of the driver
> below it? For example, let's say we have a DM linear device that maps to two
> underlying devices (or in LVM-speak, a linear LV that spans two PVs), both
> of which are MD RAID-1 devices. For a given large request, DM may decide
> that it is fully contained within one of its two target devices, and allow
> all the requested pages to be added to the bio. However, it also needs to
> ask MD what its limits are for that request, or MD would still have to go
> through the trouble to split up the bio after it has been submitted.
This example actually should have been RAID-0 (striping), not RAID-1
(mirroring). Sorry if this caused any confusion.
--
Kevin Corry
[email protected]
http://evms.sourceforge.net/