In the course of working on potential kernel crashes in
ll_rw_kio and fs/mpage.c, I have been thinking that there has to
be a more efficient way to construct large I/O requests, and I
now have an idea of how to do it, although it may at first
seem to be regressing back to struct buffer_head. Currently, most
block IO operations will at best generate IO errors if bio_alloc
starts to fail, and code that attempts to build big block IO's has
to be a bit complex to build bio's that are as big as possible but
still not too big, and without more IO vectors than whatever the
underlying device driver says it is willing to handle.
PROPOSAL: struct bio *bio_chain(struct bio *old);
newbio = bio_chain(oldbio);
would be similar to:
newbio = bio_alloc(GFP_KERNEL, 1);
newbio->bi_bdev = oldbio->bi_bdev;
newbio->bi_rw = oldbio->bi_rw;
submit_bio(oldbio);
...except for two important differences:
1. bio_chain NEVER gets a memory allocation failure. If it is
not able to allocate any memory, it fiddles with old->bi_destructor
and waits for the "old" bio to complete, and returns the old bio.
This is important because most current block device drivers will, at
best, generate errors if they fail to allocate a bio.
2. bio_chain will set some kind of hint that newbio is
probably contiguous with oldbio. So, if oldbio is still on the
device queue when newbio is eventually submitted, the merge code
will first check the request that oldbio is in for merging. So,
the merging in that case will be O(1). Also, if bio_chain is also
used to submit newbio and it is able to do the merge, bio_chain can
simply return the bio that was going to be discarded by the merge,
since bio_chain is only defined to return a bio capable of holding
one IO vector and all bio's in a request meet that criterion, and
the bio being discarded by the merge will already have bi_bdev and
bi_rw set to the correct values if it was mergeable in the first
place.
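To make the fallback semantics concrete, here is a rough userspace sketch; everything in it (mock_bio, mock_bio_chain, the spin-wait) is a hypothetical stand-in for the real kernel machinery, not proposed code:

```c
#include <stddef.h>
#include <stdlib.h>

/* Minimal mock of the kernel structures involved -- illustration only. */
struct mock_bio {
	int bi_rw;
	void *bi_bdev;
	int completed;		/* stands in for waiting on bi_destructor */
};

static int alloc_should_fail;	/* lets a caller force the fallback path */

static struct mock_bio *mock_bio_alloc(void)
{
	if (alloc_should_fail)
		return NULL;
	return calloc(1, sizeof(struct mock_bio));
}

static void mock_submit_bio(struct mock_bio *bio)
{
	bio->completed = 1;	/* pretend the device finished instantly */
}

/*
 * The proposed bio_chain(): never returns NULL.  On allocation failure
 * it submits "old", waits for it to complete, and hands it back for reuse.
 */
static struct mock_bio *mock_bio_chain(struct mock_bio *old)
{
	struct mock_bio *newbio = mock_bio_alloc();

	if (newbio) {
		newbio->bi_bdev = old->bi_bdev;
		newbio->bi_rw = old->bi_rw;
		mock_submit_bio(old);
		return newbio;
	}
	/* Fallback: wait for old to complete, then recycle it. */
	mock_submit_bio(old);
	while (!old->completed)
		;		/* real code would sleep on a completion */
	return old;
}
```

The point is only the contract: the caller always gets a bio back, so the submission site needs no error branch.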
I realize there may be locking issues in implementing this.
Under this scheme, code that wants to build up big bio's
can be much simpler, because it does not have to think about
q->max_phys_segments or q->max_hw_segments (it just has to make sure
that each request is smaller than q->max_sectors). Also, part of the
IO could be started earlier if the device is idle, although that
trade-off may not be worthwhile.
Code that builds up big bio's would look something like this:
/* Allocate a lot of iovecs in the first bio, so that the merge code
   is less likely to have to do a memory allocation. */
bio = alloced_bio = bio_alloc(GFP_KERNEL, bio_max_iovecs(q));
if (!bio) {	/* Work even if first bio_alloc in loop fails. */
	down(&devprivate->backup_bio_mutex);
	bio = devprivate->backup_bio;
}
bio->bi_bdev = bdev;
bio->bi_rw = rw;
bio->bi_vcnt = 1;
for (;;) {
	bio->bi_sector = sector;
	bio->bi_io_vec[0].bv_page = ...;
	bio->bi_io_vec[0].bv_offset = ...;
	bio->bi_io_vec[0].bv_len = ...;
	if (last_one)
		break;
	bio = bio_chain(bio);
	sector += ....;
}
if (alloced_bio)
	submit_bio(bio);
else {
	devprivate->backup_bio = bio_chain(bio);
	up(&devprivate->backup_bio_mutex);
}
The bio_chain calls should be very fast in the normal repeating
case. They will immediately succeed in merging oldbio with the bio that
was submitted before and just return oldbio. Since oldbio already has
all of the correct values set, bio_chain does not need to fill anything
in for that case. So, while the loop is theoretically building up and
destroying bio's, it is really just assembling big bio's, at least in
terms of CPU cost.
For performance reasons, the semantics of bio_chain could be
restricted further. For example, oldbio could be required to have
oldbio->bi_vcnt == 1 and oldbio->bi_idx == 0, and the bio_chain optimization
might only be done if the bio is submitted by bio_chain, not by
submit_bio, or perhaps we could even require that the bio returned by
bio_chain can only be submitted by bio_chain or freed.
	By the way, I also want to implement a mechanism like this for
USB requests, to make them impervious to memory allocation failures
after device initialization as well.
Adam J. Richter __ ______________ 575 Oroville Road
[email protected] \ / Milpitas, California 95035
+1 408 309-6081 | g g d r a s i l United States of America
"Free Software For The Rest Of Us."
On Sat, Jun 15 2002, Adam J. Richter wrote:
> >> Can't I make a macro to do a table lookup from bio->bi_max?
>
> >Not really. If I do
>
> > bio_alloc(GFP_KERNEL, 27);
>
> >then I'll get a 32-slot bvec. But presumably, I don't
> >want to put more than 27 pages into it.
>
> If you called bio_alloc with a smaller number, that would
> just be the result of a small IO that you knew could not generate
> more iovecs than that. So, that scenario will not happen.
But the macro lookup would fail for bios not coming from bio_alloc().
--
Jens Axboe
On Sat, Jun 15 2002, Andrew Morton wrote:
> "Adam J. Richter" wrote:
> >
> > ...
> > newbio = q->one_more_bvec(q, bio, page, offset, len);
> >
>
> That's a comfortable interface. Or maybe just:
>
> struct bio *bio_add_bvec(bio, page, offset, len);
I agree, that's the interface that I want.
> Couple of points:
>
> - It's tricky to determine how many bvecs are available in
> a bio. There is no straightforward "how big is it" field
> in struct bio.
That's true. It's trivial for bios coming out of bio pools; for
privately allocated ones it gets a bit worse. Should not be hard to
clean up, though.
> - AFAIK, there are no established conventions for BIO assembly.
> We have conventions which state what the various fields do
> while the BIO is being processed by the block layer, but not
> for when the client is assembling the BIO.
>
> What I did, and what I'd suggest as a convention is:
>
> During BIO assembly, bi_vcnt indicates the maximum number of
> bvecs which the BIO can hold. And bi_idx indexes the next-free
> bvec within the BIO.
Hmm, I don't like that too much. For reference, bi_vcnt from the block
layer's point of view is the number of bio_vecs in the bio, and bi_idx
is the index into the 'current' bio_vec. To tie that in with the above,
how about just changing bi_max to be a real number? Internal bios can
still find the pool from that, and private bios can just fill it out.
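A userspace sketch of that interface, with bi_max as the real slot count (mock names, not the eventual kernel API):

```c
#include <stddef.h>

/* Mock bio assembly state (hypothetical names mirroring the discussion):
 * bi_max is the real number of bvec slots, bi_vcnt how many are filled. */
struct mock_bvec { void *bv_page; unsigned bv_offset, bv_len; };
struct mock_bio {
	unsigned short bi_max, bi_vcnt;
	struct mock_bvec bi_io_vec[8];
};

/* The interface the thread converges on: returns 0 when the bio is full,
 * so the caller knows to submit it and start a new one. */
static int mock_bio_add_bvec(struct mock_bio *bio, void *page,
			     unsigned offset, unsigned len)
{
	struct mock_bvec *bv;

	if (bio->bi_vcnt >= bio->bi_max)
		return 0;	/* full: caller must submit and reallocate */
	bv = &bio->bi_io_vec[bio->bi_vcnt++];
	bv->bv_page = page;
	bv->bv_offset = offset;
	bv->bv_len = len;
	return len;
}
```

A zero return is the cue to submit and reallocate, which is how device limits stay out of the submitter's code.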
--
Jens Axboe
On Jun 13, 2002 18:56 -0700, Adam J. Richter wrote:
> 2. bio_chain will set some kind of hint that newbio is
> probably contiguous with oldbio. So, if oldbio is still on the
> device queue when newbio is eventually submitted, the merge code
> will first check the request that oldbio is in for merging. So,
> the merging in that case will be O(1).
I really like this part of it. I always disliked the fact that you
might have a large request at one layer that had to be split up for
submission, only to be re-merged later. The fact that it could do
the merge in O(1) would be a good thing.
> Under this scheme, code that wants to build up big bio's
> can be much simpler, because it does not have to think about
> q->max_phys_segments or q->max_hw_segments (it just has to make sure
> that each request is smaller than q->max_sectors).
The recent discussions in this area have also been unsatisfying, and
this proposal works nicely to remove any knowledge of block device
specific limitations from the submitter.
If you are going to OLS this summer, there is a BOF related to
this "splitting BIO requests in the block layer" or something like
that.
Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
On Thu, Jun 13 2002, Adam J. Richter wrote:
>
> In the course of working on potential kernel crashes in
> ll_rw_kio and fs/mpage.c, I have been thinking that there has got
> to be a better way to more efficiently construct large I/O requests,
> and I now have an idea of how to do it, although it may at first
> seem to be regressing back to struct buffer_head. Currently, most
> block IO operations will at best generate IO errors if bio_alloc
> starts to fail, and code that attempts to build big block IO's has
> to be a bit complex to build bio's that are as big as possible but
> still not too big or having too many IO vectors for whatever the
> underlying device driver says it is willing to handle.
>
>
> PROPOSAL: struct bio *bio_chain(struct bio *old);
>
> newbio = bio_chain(oldbio);
>
> would be similar to:
> newbio = bio_alloc(GFP_KERNEL, 1);
> newbio->bi_bdev = oldbio->bi_bdev;
> newbio->bi_rw = oldbio->bi_rw;
> submit_bio(oldbio);
>
> ...except for two important differences:
>
> 1. bio_chain NEVER gets a memory allocation failure. If it is
> not able to allocate any memory, it fiddles with old->bi_destructor
> and waits for the "old" bio to complete, and returns the old bio.
> This is important because most current block device drivers will, at
> best, generate errors if they fail to allocate a bio.
>
> 2. bio_chain will set some kind of hint that newbio is
> probably contiguous with oldbio. So, if oldbio is still on the
> device queue when newbio is eventually submitted, the merge code
> will first check the request that oldbio is in for merging. So,
> the merging in that case will be O(1). Also, if bio_chain is also
> used to submit newbio and it is able to do the merge, bio_chain can
> simply return the bio that was going to be discarded by the merge,
> since bio_chain is only defined to return a bio capable of holding
> one IO vector and all bio's in a request meet that criterion, and
> the bio being discarded by the merge will already have bi_bdev and
> bi_rw set to the correct values if it was mergeable in the first
> place.
I attempted something similar to this a long time ago, when analyzing
(and attempting to fix) I/O scheduler overhead when submitting lots of
contiguous buffer_heads. I know you mention that...
> I realize there may be locking issues in implementing this.
but I think that statement is very naive :-)
Essentially you are keeping a cookie to the old bio submitted, so that
you can merge new bio with it fairly cheaply (it's still not very cheap,
although you do eliminate the biggest offender, the list merge scan).
The first problem is that once you submit the bio, there are some things
that are not yours to touch anymore. You cannot change bi_next for
instance, you have no idea what state the bio is in wrt I/O completion.
Sure you can hold a reference to the bio, but all that really gets you
in this situation is a handle to it, nothing more. How do you decide
when to invalidate this cookie that you have?
For this to work, you would have to stall the request queue while all
this is in progress.
Now, I do agree with the reasoning for this -- the max_segments etc big
bio submission problem we currently have gets solved in a hell of a lot
nicer way than the max_iovecs() approach you have suggested. I don't
like that at all since I think that we will always just fall back to the
weakest link in the submission chain. The whole 'how big can this bio
be' question is a lot more complex than most people realise. It's not
just a matter of xxx sectors, or xxx pages. Once you get in to the phys
segments and hw segments problem, you end up with just saying 'build a
bio that does not violate the least restrictive rule'. Weakest link.
Then there's the matter of splitting bio's, which is always an ugly
topic. I agree that we don't want to do this, if avoidable.
So what do I like? As I see it, the only nice way to make sure that a
bio always has the best possible size is to just ask 'can we put another
page into this bio'. That way we can apply the right (and not just
minimum) limits every time.
--
Jens Axboe
On Fri, 14 Jun 2002 Jens Axboe wrote:
>On Thu, Jun 13 2002, Adam J. Richter wrote:
[...]
>> newbio = bio_chain(oldbio);
[...]
>> I realize there may be locking issues in implementing this.
>but I think that statement is very naive :-)
>Essentially you are keeping a cookie to the old bio submitted, so that
>you can merge new bio with it fairly cheaply (it's still not very cheap,
>although you do eliminate the biggest offender, the list merge scan).
>The first problem is that once you submit the bio, there are some things
>that are not yours to touch anymore. You cannot change bi_next for
>instance, you have no idea what state the bio is in wrt I/O completion.
>Sure you can hold a reference to the bio, but all that really gets you
>in this situation is a handle to it, nothing more. How do you decide
>when to invalidate this cookie that you have?
At any time, there could be only one "hinted" bio in a
request: the last bio in the request. So you only have to
clear the hint when:
1. you merge bio's,
2. elv_next_request is called,
3. newbio is submitted.
In all three cases q->queue_lock gets taken, so we should
not need to add any additional spin_lock_irq's, and the two lines
to clear the hint pointers should be trivial.
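A toy single-threaded sketch of that bookkeeping (hypothetical names; in the kernel all three sites already run under q->queue_lock, so the locking is elided here):

```c
#include <stddef.h>

/*
 * Mock of the proposed merge-hint bookkeeping.  At most one hinted
 * bio exists per queue: the last bio of the most recent request.
 */
struct mock_queue {
	void *merge_hint;	/* last bio of the last request, or NULL */
};

/* Set when a chained bio ends a request on the queue. */
static void hint_set(struct mock_queue *q, void *bio)
{
	q->merge_hint = bio;
}

/* Called at the three staleness points: a merge, elv_next_request,
 * and submission of the chained newbio. */
static void hint_clear(struct mock_queue *q)
{
	q->merge_hint = NULL;
}

/* The O(1) merge check: did the caller chain off the hinted bio? */
static int hint_matches(struct mock_queue *q, void *oldbio)
{
	return q->merge_hint != NULL && q->merge_hint == oldbio;
}
```

A stale hint is thus never consulted: every path that can invalidate it clears it first, under the same lock the merge check would take.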
>So what do I like? As I see it, the only nice way to make sure that a
>bio always has the best possible size is to just ask 'can we put another
>page into this bio'. That way we can apply the right (and not just
>minimum) limits every time.
	That would be pretty nice too, but that lacks three potential
advantages of my bio_chain approach, in order of importance:
1. bio_chain avoids memory allocation failures,
2. bio_chain can start the first IO slightly sooner,
3. bio_chain makes bio submitters slightly simpler.
	The disadvantage of my single I/O vector bio_chain approach is
that it results in another procedure call per vector. We also already
have code that builds multiple IO vectors in ll_rw_kio and the mpage.c
routines.
I think I would be happy enough with your approach, but, just
to eliminate possibilities of memory allocation failure, I think I
want to still have some kind of bio_chain, perhaps without the merge
hinting, but with a parameter to allow for allocating a bio up to the
size of oldbio, like so:
struct bio *bio_chain(struct bio *oldbio, int gfp_mask,
int nvecs /* <= oldbio->bi_max */);
"Adam J. Richter" wrote:
>
> That would be pretty nice too, but that lacks three potential
> advantages of my bio_chain approach, in order of importance:
>
> 1. bio_chain avoids memory allocation failures,
> 2. bio_chain can start the first IO slightly sooner,
> 3. bio_chain makes bio submitters slightly simpler.
>
> The disadvantage of my single I/O vector bio_chain approach is
> that it results in another procedure call per vector. We also already
> have code that builds multiple IO vectors in ll_rw_kio and the mpage.c
> routines.
>
> ...
I have not yet seen a BIO allocation failure in testing. This
would indicate that either the BIO pool is too large, or I'm
running the wrong tests. Either way, I don't think we have
demonstrated any otherwise-unsolvable problems with BIO allocation.
What bugs me about your approach is this: I have been gradually
nudging the kernel's IO paths away from enormously-long per-page
call paths and per-page locking into the direction of a sequence
of short little loops, each of which does the same stuff against
a bunch of pages.
This is by no means finished. Ultimately, I want to go into the
page allocator and snip off 16 pages in a single acquisition of
the zone lock. Put these into the pagecache in a single acquisition
of the mapping lock. Add them to the LRU with a single acquisition
of the pagemap_lru_lock. Slot them all into a single BIO and send
them all off with a single submit_bio(). So the common-case unit of I/O
in the kernel will not be a page - it will be a great chunk of pages.
Everything is pretty much in place to do this now. The main piece
which is missing is the gang page allocator (Hi, Bill).
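A toy userspace model of the batching idea, counting lock round-trips rather than doing real page allocation (all names hypothetical):

```c
#include <stdlib.h>
#include <stddef.h>

/* Toy model of the batching idea: take one "lock" per batch of pages
 * instead of once per page. */
#define GANG 16

static int lock_acquisitions;	/* counts how often the zone lock is taken */

static void zone_lock(void)   { lock_acquisitions++; }
static void zone_unlock(void) { }

/* Per-page path: one lock round-trip per page. */
static void alloc_pages_one_by_one(void **pages, int n)
{
	for (int i = 0; i < n; i++) {
		zone_lock();
		pages[i] = malloc(4096);
		zone_unlock();
	}
}

/* Gang path: snip a whole batch under a single acquisition. */
static void alloc_pages_gang(void **pages, int n)
{
	for (int i = 0; i < n; i += GANG) {
		int batch = (n - i < GANG) ? n - i : GANG;

		zone_lock();
		for (int j = 0; j < batch; j++)
			pages[i + j] = malloc(4096);
		zone_unlock();
	}
}
```

For 32 pages the per-page path takes the mock zone lock 32 times; the gang path takes it twice.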
It'll be damn fast, and nicely scalable. It's all about reducing the
L1 cache footprint. Making best use of data when it is in cache.
Making best use of locks once they have been acquired. If it is
done right, it'll be almost as fast as 64k PAGE_CACHE_SIZE, with
none of its disadvantages.
In this context, bio_chain() is a regression, because we're back
into doing stuff once-per-page, with longer per-page call graphs.
I'd rather not have to do it if it can be avoided.
-
On Fri, Jun 14, 2002 at 04:00:52PM -0700, Andrew Morton wrote:
> Everything is pretty much in place to do this now. The main piece
> which is missing is the gang page allocator (Hi, Bill).
> It'll be damn fast, and nicely scalable. It's all about reducing the
> L1 cache footprint. Making best use of data when it is in cache.
> Making best use of locks once they have been acquired. If it is
> done right, it'll be almost as fast as 64k PAGE_CACHE_SIZE, with
> none of its disadvantages.
> In this context, bio_chain() is a regression, because we're back
> into doing stuff once-per-page, with longer per-page call graphs.
> I'd rather not have to do it if it can be avoided.
gang_cpu is not quite ready to post, but work is happening on it
and it's happening today -- I have a suitable target in hand and
am preparing it for testing. The bits written thus far consist of
a transparent per-cpu pool layer refilled using the gang transfer
mechanism, and I'm in the process of refining that to non-prototypical
code and extending it with appropriate deadlock avoidance so explicit
gang allocation requests can be satisfied.
Cheers,
Bill
Andrew Morton <[email protected]> wrote:
>I have not yet seen a BIO allocation failure in testing. This
>would indicate that either the BIO pool is too large, or I'm
>running the wrong tests. Either way, I don't think we have
>demonstrated any otherwise-unsolvable problems with BIO allocation.
You need to prove that this can never happen once the
device is initialized, not just that no 2.5 user has reported it
yet.
>What bugs me about your approach is this: I have been gradually
>nudging the kernel's IO paths away from enormously-long per-page
>call paths and per-page locking into the direction of a sequence
>of short little loops, each of which does the same stuff against
>a bunch of pages.
You need to reread the last paragraph of my previous post.
I have added some capitalization to help:
>> I think I would be happy enough with your approach, but, just
>>to eliminate possibilities of memory allocation failure, I think I
>>want to still have some kind of bio_chain, perhaps without the merge
>>hinting, BUT WITH A PARAMETER TO ALLOW FOR ALLOCATING A BIO UP TO THE
>>SIZE OF OLDBIO, like so:
>>
>>struct bio *bio_chain(struct bio *oldbio, int gfp_mask,
>> int nvecs /* <= oldbio->bi_max */);
	This would not interfere with your plans to try to do things
in n-page chunks.
William Lee Irwin III wrote:
>
> On Fri, Jun 14, 2002 at 04:00:52PM -0700, Andrew Morton wrote:
> > Everything is pretty much in place to do this now. The main piece
> > which is missing is the gang page allocator (Hi, Bill).
> > It'll be damn fast, and nicely scalable. It's all about reducing the
> > L1 cache footprint. Making best use of data when it is in cache.
> > Making best use of locks once they have been acquired. If it is
> > done right, it'll be almost as fast as 64k PAGE_CACHE_SIZE, with
> > none of its disadvantages.
> > In this context, bio_chain() is a regression, because we're back
> > into doing stuff once-per-page, with longer per-page call graphs.
> > I'd rather not have to do it if it can be avoided.
>
> gang_cpu is not quite ready to post, but work is happening on it
> and it's happening today -- I have a suitable target in hand and
> am preparing it for testing. The bits written thus far consist of
> a transparent per-cpu pool layer refilled using the gang transfer
> mechanism, and I'm in the process of refining that to non-prototypical
> code and extending it with appropriate deadlock avoidance so explicit
> gang allocation requests can be satisfied.
>
Great, thanks.
Performing gang allocation within generic_file_write may not
be practical, especially if the application is being good and
is issuing 8k writes. So there will still be pressure on the
single-page allocator.
Certainly, reads can perform gang allocation.
Which tends to point us in the direction of using the lockless
per-cpu page allocation for writes, and explicit gang allocation
for reads. So possibly, gang allocation should go straight to
the main page list and not drain the per-cpu pools. Leave them
reserved for the single-page allocators - write(2) and anon pages.
But it's early days yet...
-
"Adam J. Richter" wrote:
>
> Andrew Morton <[email protected]> wrote:
> >I have not yet seen a BIO allocation failure in testing. This
> >would indicate that either the BIO pool is too large, or I'm
> >running the wrong tests. Either way, I don't think we have
> >demonstrated any otherwise-unsolvable problems with BIO allocation.
>
> You need to prove that this can never happen once the
> device is initialized, not just that no 2.5 user has reported it
> yet.
BIOs for reads are allocated with GFP_KERNEL - both mpage_readpages
and mpage_readpage.
They used to be allocated GFP_NOIO. This is a subtle but significant
improvement which came in when we stopped using submit_bh for bulk reads.
It means that the only time we threaten the memory reserves is when
allocating BIOs for writes. And those BIOs free more memory than they
consume. This, plus the PF_MEMALLOC reserves which the radix-tree
allocator leaves behind makes us pretty robust.
A poorly-written gigE driver could zoom in and steal the remaining
few megabytes from interrupt context. But even then, the BIO mempools
would have to be exhausted at the time. And I don't see a way in which
they can be exhausted without us having write BIOs in flight.
No, I can't prove it. But I can't think of a contrary scenario
either.
> >What bugs me about your approach is this: I have been gradually
> >nudging the kernel's IO paths away from enormously-long per-page
> >call paths and per-page locking into the direction of a sequence
> >of short little loops, each of which does the same stuff against
> >a bunch of pages.
>
> You need to reread the last paragraph of my previous post.
> I have added some capitalization to help:
>
> >> I think I would be happy enough with your approach, but, just
> >>to eliminate possibilities of memory allocation failure, I think I
> >>want to still have some kind of bio_chain, perhaps without the merge
> >>hinting, BUT WITH A PARAMETER TO ALLOW FOR ALLOCATING A BIO UP TO THE
> >>SIZE OF OLDBIO, like so:
> >>
> >>struct bio *bio_chain(struct bio *oldbio, int gfp_mask,
> >> int nvecs /* <= oldbio->bi_max */);
>
> This would not interfere with your plans to try to do things
> in n-page chunks.
OK, the caps helped ;)
-
>No, I can't prove it. But I can't think of a contrary scenario
>either.
	People feel that way about almost every piece of code that
eventually gets an unexpected failure. Lots of code written on
this sort of feeling leads to sporadic, difficult-to-reproduce
failures that may never be traced to their origin.
Look at it this way, with bio_chain you don't even have to
write error branches for it.
Also, if everyone eventually used bio_chain then maybe you
would not have to set aside such large memory reserves, etc., but
that would be in the more distant future.
	Anyhow, I can write bio_chain in a separate file without
changing any existing code in the kernel, if I am not going
to implement the merge hint. I think I will do just that, so that
we have something we can discuss more concretely.
On Fri, 14 Jun 2002, Andrew Morton wrote:
> A poorly-written gigE driver could zoom in and steal the remaining
> few megabytes from interrupt context. But even then, the BIO mempools
> would have to be exhausted at the time. And I don't see a way in which
> they can be exhausted without us having write BIOs in flight.
>
> No, I can't prove it. But I can't think of a contrary scenario
> either.
NBD and iSCSI are two examples that come to mind. Error-handling on the
SCSI stack might get you into trouble too (though you'd probably lose
there anyway). I suspect LVM might need to alloc memory to back snapshots,
but I haven't looked at that. I won't even mention loopback.
But for all the boring scenarios, you should be fine.
--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
I wrote:
> Anyhow, I can write bio_chain in a separate file without
>changing any existing code in the kernel, if I am not going
>to implement the merge hint. I think I will do just that so that
>we have something that we can discuss more concretely.
	In case anyone is interested, here is a sample implementation
that compiles. I do not know if it works. I have renamed it
recycle_bio. This version allocates a bio on the stack (56 bytes
on x86). I also wrote a version that avoided the stack allocation,
but added a field "void *bi_destructor_private;" to struct bio.
/* In bio.h: */
extern struct bio *__recycle_bio(struct bio *oldbio);
static inline struct bio *recycle_bio(struct bio *oldbio,
				      int gfp_mask, int nvecs)
{
	struct bio *newbio;

	if (nvecs > oldbio->bi_max)
		BUG();
	if ((newbio = bio_alloc(gfp_mask, nvecs)) != NULL) {
		submit_bio(oldbio->bi_rw, oldbio);
		return newbio;
	} else
		return __recycle_bio(oldbio);
}
/* In bio.c: */
struct bio_completion {
	struct bio bio;
	struct completion done;
};

static void fake_destruct(struct bio *bio)
{
	struct bio_completion *stackbio;

	stackbio = list_entry(bio, struct bio_completion, bio);
	complete(&stackbio->done);
}

struct bio *__recycle_bio(struct bio *bio)
{
	struct bio_completion stackbio;

	stackbio.bio = *bio;
	stackbio.bio.bi_destructor = fake_destruct;
	init_completion(&stackbio.done);
	submit_bio(stackbio.bio.bi_rw, &stackbio.bio);
	wait_for_completion(&stackbio.done);
	return bio;
}
EXPORT_SYMBOL(__recycle_bio);
On Fri, Jun 14 2002, Adam J. Richter wrote:
> Andrew Morton <[email protected]> wrote:
> >I have not yet seen a BIO allocation failure in testing. This
> >would indicate that either the BIO pool is too large, or I'm
> >running the wrong tests. Either way, I don't think we have
> >demonstrated any otherwise-unsolvable problems with BIO allocation.
>
> You need to prove that this can never happen once the
> device is initialized, not just that no 2.5 user has reported it
> yet.
The I/O path allocations all use GFP_NOIO (or GFP_NOFS), which all have
__GFP_WAIT set. So the bio allocations will try normal allocation first,
then fall back to the bio pool. If the bio pool is also empty, we will
block waiting for entries to be freed there. So there never will be a
failure.
--
Jens Axboe
On Fri, Jun 14 2002, Adam J. Richter wrote:
> On Fri, 14 Jun 2002 Jens Axboe wrote:
> >On Thu, Jun 13 2002, Adam J. Richter wrote:
> [...]
> >> newbio = bio_chain(oldbio);
> [...]
> >> I realize there may be locking issues in implementing this.
>
> >but I think that statement is very naive :-)
>
> >Essentially you are keeping a cookie to the old bio submitted, so that
> >you can merge new bio with it fairly cheaply (it's still not very cheap,
> >although you do eliminate the biggest offender, the list merge scan).
> >The first problem is that once you submit the bio, there are some things
> >that are not yours to touch anymore. You cannot change bi_next for
> >instance, you have no idea what state the bio is in wrt I/O completion.
> >Sure you can hold a reference to the bio, but all that really gets you
> >in this situation is a handle to it, nothing more. How do you decide
> >when to invalidate this cookie that you have?
>
> At any time, there could be only one "hinted" bio in a
> request: the last bio in the request. So you only have to
> clear the hint when:
>
> 1. you merge bio's,
> 2. elv_next_request is called,
> 3. newbio is submitted.
>
> In all three cases q->queue_lock gets taken, so we should
> not need to add any additional spin_lock_irq's, and the two lines
> to clear the hint pointers should be trivial.
This logic is flawed. As I said, once you pass the bio to submit_bio,
you can't maintain a pointer to it for these purposes. Grabbing the
queue_lock guarantees absolutely nothing in this regard. Consider loop,
for instance. I/O could be completed by the time submit_bio returns.
Sure you can grab a reference to the bio with bio_get(), but what would
that buy you? Just that the bio at least won't be freed by the time
submit_bio returns, but that's it.
--
Jens Axboe
>> At any time, there could be only one "hinted" bio in a
>> request: the last bio in the request. So you only have to
>> clear the hint when:
>>
>> 1. you merge bio's,
>> 2. elv_next_request is called,
>> 3. newbio is submitted.
>>
>> In all three cases q->queue_lock gets taken, so we should
>> not need to add any additional spin_lock_irq's, and the two lines
>> to clear the hint pointers should be trivial.
>This logic is flawed. As I said, once you pass the bio to submit_bio,
>you can't maintain a pointer to it for these purposes. Grabbing the
>queue_lock guarantees absolutely nothing in this regard. Consider loop,
>for instance. I/O could be completed by the time submit_bio returns.
	So, I need a fourth location in generic_make_request,
just before the call to q->make_request_fn, like so:

if (q->make_request_fn != __make_request) {
	unsigned long flags;

	spin_lock_irqsave(q->queue_lock, flags);
	clear_hint(bio);
	spin_unlock_irqrestore(q->queue_lock, flags);
}
ret = q->make_request_fn(q, bio);
Perhaps it would be clearer for blk_queue_make_request to
set a queue flag, and then have generic_make_request check that flag
rather than checking the value of q->make_request_fn.
	If some other driver wants to implement its own weird queuing
system, then it merely doesn't benefit from the bio hints (because
they are cleared as bio's are passed into the weird queuing system).
Better yet, if the weird queuing system uses struct request the way
__make_request does, then it could conceivably set the new queue flag
and take care of grabbing q->queue_lock and clearing the hints when it
takes requests off of its internal queue, if it wants to.
On Sat, Jun 15 2002, Adam J. Richter wrote:
> >> At any time, there could be only one "hinted" bio in a
> >> request: the last bio in the request. So you only have to
> >> clear the hint when:
> >>
> >> 1. you merge bio's,
> >> 2. elv_next_request is called,
> >> 3. newbio is submitted.
> >>
> >> In all three cases q->queue_lock gets taken, so we should
> >> not need to add any additional spin_lock_irq's, and the two lines
> >> to clear the hint pointers should be trivial.
>
> >This logic is flawed. As I said, once you pass the bio to submit_bio,
> >you can't maintain a pointer to it for these purposes. Grabbing the
> >queue_lock guarantees absolutely nothing in this regard. Consider loop,
> >for instance. I/O could be completed by the time submit_bio returns.
>
> So, I need a fourth location in generic_make_request,
> just before the call to q->make_request_fn, like so:
>
> if (q->make_request_fn != __make_request) {
> 	unsigned long flags;
>
> 	spin_lock_irqsave(q->queue_lock, flags);
> 	clear_hint(bio);
> 	spin_unlock_irqrestore(q->queue_lock, flags);
> }
> ret = q->make_request_fn(q, bio);
Irk, this is ugly. But now you are moving away from the initial goal (or
maybe this was your goal the whole time, just a single merge hint?) of
passing back the hint instead of maintaining it in the queue. So let me
ask: are you aware of the last_merge I/O scheduler hint, which does
exactly this already?
--
Jens Axboe
Jens Axboe wrote:
>The I/O path allocations all use GFP_NOIO (or GFP_NOFS), which all have
>__GFP_WAIT set. So the bio allocations will try normal allocation first,
>then fall back to the bio pool. If the bio pool is also empty, we will
>block waiting for entries to be freed there. So there never will be a
>failure.
I did not realize that allocation with __GFP_WAIT was guaranteed
to _never_ fail.
	Even so, if __GFP_WAIT never fails, then it can deadlock (for
example, if some other device driver has a memory leak). Under a
scheme like bio_chain (provided that it is changed not to call a
memory allocator that can deadlock), the only way you deadlock is
if there really is a deadlock bug in the lower layers that process
the underlying request.
On Sat, Jun 15 2002, Adam J. Richter wrote:
> Jens Axboe wrote:
> >The I/O path allocations all use GFP_NOIO (or GFP_NOFS), which all have
> >__GFP_WAIT set. So the bio allocations will try normal allocation first,
> >then fall back to the bio pool. If the bio pool is also empty, we will
> >block waiting for entries to be freed there. So there never will be a
> >failure.
>
> I did not realize that allocation with __GFP_WAIT was guaranteed
> to _never_ fail.
See the mempool implementation. Even before bio used mempool, it used
the exact same logic to make sure it never fails. Basically the
heuristic in both cases is:
repeat:
	bio = normal_alloc
	if (bio)
		return bio
	bio = get_from_pool
	if (bio)
		return bio
	if (gfp_mask & __GFP_WAIT)
		start disk i/o
	goto repeat;
> Even so, if __GFP_WAIT never fails, then it can deadlock (for
> example, some other device driver has a memory leak). Under a
> scheme like bio_chain (provided that it is changed not to call a
> memory allocator that can deadlock), the only way you deadlock is
> if there really is a deadlock bug in the lower layers that process
> the underlying request.
This whole deadlock debate has been done to death before, I suggest you
find the mempool discussions in the lkml archives from the earlier 2.5
series. Basically we maintain deadlock-free allocation, while never
failing allocs, by saying that a single bio allocation is short lived (or
at least not held indefinitely). That plus a reserve of XX bios makes
sure that someone will always return a bio to the pool sooner or later,
and at least the get_from_pool alloc above will succeed sooner or later
even if vm pressure is ridiculous.
--
Jens Axboe
>> 	So, I need a fourth location in generic_make_request
>> just before the call to q->make_request_fn, like so:
>>
>> 	if (q->make_request_fn != __make_request) {
>> 		unsigned long flags;
>> 		spin_lock_irqsave(q->lock, flags);
>> 		clear_hint(bio);
>> 		spin_unlock_irqrestore(q->lock, flags);
>> 	}
>> 	ret = q->make_request_fn(q, bio);
>Irk, this is ugly. But now you are moving away from the initial goal (or
>maybe this was your goal the whole time, just a single merge hint?) of
>passing back the hint instead of maintaining it in the queue. So let me
>ask, are you aware of the last_merge I/O scheduler hint? Which does
>exactly this already...
The code that I think I more or less have in my head has not
changed (aside from that fourth test).
Although I was not aware of q->last_merge, I see that it is
not what I want. I want up to one hint per request, for the last bio
in the request. bi_hint would be null for all bio's except possibly
the last bio in a request. I do not want just one hint per queue.
	If it is unclear what I mean, perhaps I really need to code it
up to explain it, and we can discuss it from there.
On Sat, Jun 15 2002, Adam J. Richter wrote:
> >> 	So, I need a fourth location in generic_make_request
> >> just before the call to q->make_request_fn, like so:
> >>
> >> 	if (q->make_request_fn != __make_request) {
> >> 		unsigned long flags;
> >> 		spin_lock_irqsave(q->lock, flags);
> >> 		clear_hint(bio);
> >> 		spin_unlock_irqrestore(q->lock, flags);
> >> 	}
> >> 	ret = q->make_request_fn(q, bio);
>
> >Irk, this is ugly. But now you are moving away from the initial goal (or
> >maybe this was your goal the whole time, just a single merge hint?) of
> >passing back the hint instead of maintaining it in the queue. So let me
> >ask, are you aware of the last_merge I/O scheduler hint? Which does
> >exactly this already...
>
> The code that I think I more or less have in my head has not
> changed (aside from that fourth test).
>
> Although I was not aware of q->last_merge, I see that it is
> not what I want. I want up to one hint per request, for the last bio
> in the request. bi_hint would be null for all bio's except possibly
> the last bio in a request. I do not want just one hint per queue.
.. which, as far as I can see, brings us right back into the problems
with stalling the i/o and other nastiness. So far your approach seems
hackish at best, but ->
> 	If it is unclear what I mean, perhaps I really need to code it
> up to explain it, and we can discuss it from there.
go ahead and code it up if you want and we can discuss it some more.
--
Jens Axboe
>> 	If it is unclear what I mean, perhaps I really need to code it
>> up to explain it, and we can discuss it from there.
>
>go ahead and code it up if you want and we can discuss it some more.
	As embarrassing as it is to have offered to put up or shut up
and then not put up, I have to admit that I now see a problem with
my proposal.
	The answer that I gave to you about the /dev/loop scenario
for bio_chain was incorrect. The correct answer was that bi_hint
would just never be set by bio_chain for queues that use
blk_queue_make_request like /dev/loop.
That scenario made me think, and I now see it as a pretty
substantial drawback that the bio_chain hints would be ignored or
not generated from a "stacked" device like /dev/loop or raid.
	I thought about allowing the bi_hint field to propagate down
to the underlying device under /dev/loop/n, but that scheme would
not work for interleaved raid schemes. In comparison, raid really
shines in the scenario where the higher level software submits its
IO as one big request. And the people who care the most about
block IO performance usually care a lot about raid.
	So, I think I'll abandon my proposal and think about your
idea of some kind of incremental accounting function, which
perhaps could be implemented as:
q->bio_one_more_bvec(q, bio, page, offset, len) /* boolean */
...in the hopes of accommodating complex IO size limitations,
such as for interleaved raid.
Come to think of it, an earlier patch that I posted for
ll_rw_kio and the mpage routines kind of did something similar
to bio_chain for this. It had a routine like q->bio_one_more_bvec
that was not a boolean, but rather would either stick the new
bvec in the bio, or submit the current bio, allocate a fresh bio,
stick this new bvec in that, and return the new bio. To illustrate
in incorrect C syntax, I mean:
newbio = q->one_more_bvec(q, bio, page, offset, len);
	The advantage of this is that it simplifies bio generators about
as much as my bio_chain proposal would, but it would implement your
"can we put another page into this bio" model.
	It would be easy enough to code this up conservatively
(I think I basically have that in a previous patch) and then refine
the implementation to think about phys_mergeable, virt_mergeable,
etc. It should at least help with maintenance if "can we put
another page into this bio" is a routine rather than complex
logic replicated in each bio producer.
I would be interested in knowing what you think of this.
I think I'd be willing to try to provide at least a "conservative"
implementation of this if you want.
"Adam J. Richter" wrote:
>
> ...
> newbio = q->one_more_bvec(q, bio, page, offset, len);
>
That's a comfortable interface. Or maybe just:
struct bio *bio_add_bvec(bio, page, offset, len);
Couple of points:
- It's tricky to determine how many bvecs are available in
a bio. There is no straightforward "how big is it" field
in struct bio.
- AFAIK, there are no established conventions for BIO assembly.
We have conventions which state what the various fields do
while the BIO is being processed by the block layer, but not
for when the client is assembling the BIO.
What I did, and what I'd suggest as a convention is:
During BIO assembly, bi_vcnt indicates the maximum number of
bvecs which the BIO can hold. And bi_idx indexes the next-free
bvec within the BIO.
If you do this, *now* you have enough information to write
bio_add_bvec:
struct bio *bio_add_bvec(bio, page, offset, len)
{
	if ((block layer says bio is maximum size) ||
	    bio->bi_idx == bio->bi_vcnt) {
		bio->bi_vcnt = bio->bi_idx;
		bio->bi_idx = 0;
		submit_bio(bio);	/* Caller has set up bi_end_io() */
		bio = NULL;
	} else {
		bio->bi_io_vec[bio->bi_idx].bv_page = page;
		...
		bio->bi_idx++;
	}
	return bio;
}
It returns NULL if the bio was sent so that the caller gets
to allocate the new BIO. If you want to allocate the new BIO
here you'll need to pass in gfp_flags and the new bio's
desired size as well. And copy that size into bi_vcnt.
ll_rw_kio() can be adapted to the above convention too, which will
simplify it a little.
-
>- It's tricky to determine how many bvecs are available in
> a bio. There is no straightforward "how big is it" field
> in struct bio.
Can't I make a macro to do a table lookup from bio->bi_max?
"Adam J. Richter" wrote:
>
> >- It's tricky to determine how many bvecs are available in
> > a bio. There is no straightforward "how big is it" field
> > in struct bio.
>
> Can't I make a macro to do a table lookup from bio->bi_max?
Not really. If I do
bio_alloc(GFP_KERNEL, 27);
then I'll get a 32-slot bvec. But presumably, I don't
want to put more than 27 pages into it.
-
>> Can't I make a macro to do a table lookup from bio->bi_max?
>Not really. If I do
> bio_alloc(GFP_KERNEL, 27);
>then I'll get a 32-slot bvec. But presumably, I don't
>want to put more than 27 pages into it.
If you called bio_alloc with a smaller number, that would
just be the result of a small IO that you knew could not generate
more iovecs than that. So, that scenario will not happen.
If that scenario did happen, by the way, it would still
be safe. If you cannot handle the larger request for some reason
that is not apparent to q->one_more_bvec, then you would make
your own one_more_bvec routine (probably a wrapper).