I'm experiencing a problem using the bio pool created in
2.5.7 and I'm not quite able to put my finger on the cause
and hoped someone might have the knowledge and insight to
understand this problem.
In EVMS, we are adding code to deal with BIO splitting, to
enable our feature modules, such as DriveLinking, LVM, & MD
Linear, etc to break large BIOs up on chunk size or lower
level device boundaries.
In the first implementation of this, EVMS created its own
private pool of BIOs to use both for internally generated
synchronous IOs as well as for the source of BIOs used to
create the resulting split BIOs. In this implementation,
everything worked well, even under heavy loads.
However, after some thought, I concluded it was redundant
of EVMS to create its own pool of BIOs, when 2.5 had already
created a pool along with several support routines for
appropriately dealing with them.
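In other words, the change is roughly from the first form below to
the second (a sketch only, against the 2.5.7-era interfaces; the
private-pool names and sizing are illustrative rather than the actual
EVMS symbols, and error handling is trimmed):

#include <linux/bio.h>
#include <linux/mempool.h>
#include <linux/slab.h>

#define EVMS_BIO_POOL_MIN 64                    /* illustrative sizing */

static kmem_cache_t *evms_bio_cache;            /* made-up names */
static mempool_t *evms_bio_pool;

static void *evms_bio_pool_alloc(int gfp_mask, void *pool_data)
{
        return kmem_cache_alloc(pool_data, gfp_mask);
}

static void evms_bio_pool_free(void *element, void *pool_data)
{
        kmem_cache_free(pool_data, element);
}

/* First implementation: split BIOs come from a private pool. */
static int evms_bio_pool_init(void)
{
        evms_bio_cache = kmem_cache_create("evms_bio", sizeof(struct bio),
                                           0, 0, NULL, NULL);
        if (!evms_bio_cache)
                return -ENOMEM;
        evms_bio_pool = mempool_create(EVMS_BIO_POOL_MIN,
                                       evms_bio_pool_alloc,
                                       evms_bio_pool_free, evms_bio_cache);
        return evms_bio_pool ? 0 : -ENOMEM;
}

static struct bio *evms_split_bio_private(void)
{
        return mempool_alloc(evms_bio_pool, GFP_NOIO);
}

/* Second implementation: draw from the stock 2.5 BIO pool instead. */
static struct bio *evms_split_bio_shared(int nr_vecs)
{
        return bio_alloc(GFP_NOIO, nr_vecs);
}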
So I made the changes to EVMS to use the 2.5 BIO pool. I
started testing a volume using linear concatenation and BIO
splitting. In the test, I have an ext2 filesystem formatted
with a block size of 4096. The BIO split function was
tweaked to maximize the stress by splitting all BIOs into
512 byte pieces. So this test is generating 8 additional
BIOs for each one coming down for this volume.
The allocation and initialization of the resulting split
BIOs seems to be correct and works in light loads. However,
under heavier loads, the assert in scsi_merge.c:82
{BUG_ON(!sgpnt)} fires, due to the fact that the scatter gather
pool for MAX_PHYS_SEGMENTS (128) is empty. This is occurring
at interrupt time when __scsi_end_request is attempting to
queue the next request.
It's not perfectly clear to me how switching from a private
BIO pool to the 2.5 BIO pool should affect the usage of the
scsi driver's scatter gather pools.
Rather than simply increasing the size of the scatter gather
pools, I hope to understand how these changes resulted in
this behaviour so the proper solution can be determined.
Another data point: I have observed that the BIO pool does
get depleted below the 50% point of its minimum value, and
in such cases mempool_alloc (the internal worker for
bio_alloc) tries to free up more memory (I assume to grow
the pool) by waking bdflush. As a result, even more
pressure is put on the BIO pool when the dirty buffers
are being flushed.
<speculation on>
BIO splitting does increase the pressure on the BIO pool.
mempool_alloc increases pressure on all IO pools when it
wakes bdflush. BIO splitting alone (when from a private
pool) didn't create sufficient IO pressure to deplete the
currently sized pools in the IO path. Can the behaviour
of mempool_alloc, triggering bdflush, in addition to BIO
splitting adequately explain why the scsi scatter gather
pool would become depleted?
<speculation off>
Have I caused a problem by unrealistically increasing
pressure on the BIO pool by a factor of 8? Or have I
discovered a problem that can occur on very heavy loads?
What are your thoughts on a recommended solution?
Thanks.
Mark
Mark Peloquin wrote:
>
> I'm experiencing a problem using the bio pool created in
> 2.5.7 and I'm not quite able to put my finger on the cause
> and hoped someone might have the knowledge and insight to
> understand this problem.
>
> In EVMS, we are adding code to deal with BIO splitting, to
> enable our feature modules, such as DriveLinking, LVM, & MD
> Linear, etc to break large BIOs up on chunk size or lower
> level device boundaries.
Could I suggest that this code not be part of EVMS, but that
you implement it as a library within the core kernel? Lots of
stuff is going to need BIO splitting - software RAID, ataraid,
XFS, etc. May as well talk with Jens, Martin Petersen, Arjan,
Neil Brown. Do it once, do it right...
> ...
>
> The allocation and initialization of the resulting split
> BIOs seems to be correct and works in light loads. However,
> under heavier loads, the assert in scsi_merge.c:82
> {BUG_ON(!sgpnt)} fires, due to the fact that the scatter gather
> pool for MAX_PHYS_SEGMENTS (128) is empty. This is occurring
> at interrupt time when __scsi_end_request is attempting to
> queue the next request.
You're not the only one... That is placeholder code which
Jens plans to complete at a later time.
> It's not perfectly clear to me how switching from a private
> BIO pool to the 2.5 BIO pool should affect the usage of the
> scsi driver's scatter gather pools.
>
> Rather than simply increasing the size of the scatter gather
> pools, I hope to understand how these changes resulted in
> this behaviour so the proper solution can be determined.
>
> Another data point: I have observed that the BIO pool does
> get depleted below the 50% point of its minimum value, and
> in such cases mempool_alloc (the internal worker for
> bio_alloc) tries to free up more memory (I assume to grow
> the pool) by waking bdflush. As a result, even more
> pressure is put on the BIO pool when the dirty buffers
> are being flushed.
Makes sense.
> ...
>
> Have I caused a problem by unrealistically increasing
> pressure on the BIO pool by a factor of 8? Or have I
> discovered a problem that can occur on very heavy loads?
> What are your thoughts on a recommended solution?
Hopefully, once scsi_merge is able to handle the allocation
failure correctly, we won't have a problem any more.
As a temp thing I guess you could increase the size of that
mempool.
-
Andrew Morton wrote:
>
> Mark Peloquin wrote:
> >
> ...
> > In EVMS, we are adding code to deal with BIO splitting, to
> > enable our feature modules, such as DriveLinking, LVM, & MD
> > Linear, etc to break large BIOs up on chunk size or lower
> > level device boundaries.
>
> Could I suggest that this code not be part of EVMS, but that
> you implement it as a library within the core kernel? Lots of
> stuff is going to need BIO splitting - software RAID, ataraid,
> XFS, etc. May as well talk with Jens, Martin Petersen, Arjan,
> Neil Brown. Do it once, do it right...
>
I take that back.
We really, really do not want to perform BIO splitting at all.
It requires that the kernel perform GFP_NOIO allocations at
the worst possible time, and it's just broken.
What I would much prefer is that the top-level BIO assembly
code be able to find out, beforehand, what the maximum
permissible BIO size is at the chosen offset. It can then
simply restrict the BIO to that size.
Simply:
max = bio_max_bytes(dev, block);
which gets passed down the exact path as the requests themselves.
Each layer does:
int foo_max_bytes(sector_t sector)
{
        int my_maxbytes, his_maxbytes;
        sector_t my_sector;

        my_sector = my_translation(sector);
        his_maxbytes = next_device(me)->max_bytes(my_sector);
        my_maxbytes = whatever(my_sector);
        return min(my_maxbytes, his_maxbytes);
}
and, at the bottom:
int ide_max_bytes(sector_t sector)
{
        return 248 * 512;
}
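And in the middle, a stacked layer such as a linear concatenation
would presumably do something like this (again just a sketch; the
extent lookup and its fields are made up):

int linear_max_bytes(sector_t sector)
{
        /* Which underlying device does this sector land on? */
        struct linear_extent *e = linear_find_extent(sector);
        sector_t offset = sector - e->start;

        /* Don't run past the end of this extent... */
        int mine = (int)((e->nr_sectors - offset) << 9);
        /* ...and don't exceed what the device below will take. */
        int below = e->lower_max_bytes(e->lower_start + offset);

        return min(mine, below);
}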
BIO_MAX_SECTORS and request_queue.max_sectors go away.
Tell me why this won't work?
-
> Andrew Morton wrote:
> >
> > Mark Peloquin wrote:
> > >
> > ...
> > > In EVMS, we are adding code to deal with BIO splitting, to
> > > enable our feature modules, such as DriveLinking, LVM, & MD
> > > Linear, etc to break large BIOs up on chunk size or lower
> > > level device boundaries.
> >
> > Could I suggest that this code not be part of EVMS, but that
> > you implement it as a library within the core kernel? Lots of
> > stuff is going to need BIO splitting - software RAID, ataraid,
> > XFS, etc. May as well talk with Jens, Martin Petersen, Arjan,
> > Neil Brown. Do it once, do it right...
> >
> I take that back.
>
> We really, really do not want to perform BIO splitting at all.
> It requires that the kernel perform GFP_NOIO allocations at
> the worst possible time, and it's just broken.
>
> What I would much prefer is that the top-level BIO assembly
> code be able to find out, beforehand, what the maximum
> permissible BIO size is at the chosen offset. It can then
> simply restrict the BIO to that size.
>
> Simply:
>
> max = bio_max_bytes(dev, block);
>
> which gets passed down the exact path as the requests themselves.
> Each layer does:
>
> int foo_max_bytes(sector_t sector)
> {
> int my_maxbytes, his_maxbytes;
> sector_t my_sector;
>
> my_sector = my_translation(sector);
> his_maxbytes = next_device(me)->max_bytes(my_sector);
> my_maxbytes = whatever(my_sector);
> return min(my_maxbytes, his_maxbytes);
> }
>
> and, at the bottom:
>
> int ide_max_bytes(sector_t sector)
> {
> return 248 * 512;
> }
>
> BIO_MAX_SECTORS and request_queue.max_sectors go away.
>
> Tell me why this won't work?
This would require the BIO assembly code to make at least one
call to find the current permissible BIO size at offset xyzzy.
Depending on the actual IO size, many foo_max_bytes calls may
be required. Envision the LVM or RAID case, where physical
extent or chunk sizes can be as small as 8Kb, I believe. For
a 64Kb IO, it's conceivable that 9 calls to foo_max_bytes may
be required to package that IO into permissibly sized BIOs.
What you're proposing is doable, but not without a cost.
This cost would be incurred to some degree on every IO, rather
than just on the exception case. Certain underlying storage
layouts would pay a higher cost, but they would also have a
higher cost if they had to split BIOs themselves.
Perhaps if foo_max_bytes also accepted a size and could be coded
to return a list of sizes, only one call would be required
to determine all the permissible BIO sizes needed to package
an IO of a specified size.
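For instance, something along these lines (purely illustrative,
reusing the foo_max_bytes from your sketch):

int foo_max_bytes_list(sector_t sector, int bytes, int *sizes, int nr_max)
{
        int n = 0;

        /* One call fills in every permissible BIO size needed to
         * cover 'bytes' starting at 'sector'. */
        while (bytes > 0 && n < nr_max) {
                int chunk = foo_max_bytes(sector);

                if (chunk > bytes)
                        chunk = bytes;
                sizes[n++] = chunk;
                sector += chunk >> 9;
                bytes -= chunk;
        }
        return n;               /* number of BIOs needed for the IO */
}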
What your proposal guarantees is that BIOs would never have
to be split up at all.
Mark
Mark Peloquin wrote:
>
> ...
> > Tell me why this won't work?
>
> This would require the BIO assembly code to make at least one
> call to find the current permissible BIO size at offset xyzzy.
> Depending on the actual IO size many foo_max_bytes calls may
> be required. Envision the LVM or RAID case where physical
> extent or chunk sizes can be as small as 8Kb, I believe. For
> a 64Kb IO, it's conceivable that 9 calls to foo_max_bytes may
> be required to package that IO into permissibly sized BIOs.
True. But probably the common case is not as bad as that, and
these repeated calls are probably still cheaper than allocating
and populating the redundant top-level BIO.
Also, the top-level code can be cache-friendly. The bad way
to write it would be to do:
while (more to send) {
        maxbytes = bio_max_bytes(block);
        build_and_send_a_bio(block, maxbytes);
        block += maxbytes / whatever;
}
That creates long code paths and L1 cache thrashing. Kernel
tends to do that rather a lot in the IO paths.
The good way is:
int maxbytes[something];
int i = 0;

while (more_to_send) {
        maxbytes[i] = bio_max_bytes(block);
        block += maxbytes[i++] / whatever;
}
i = 0;
while (more_to_send) {
        build_and_send_a_bio(block, maxbytes[i]);
        block += maxbytes[i++] / whatever;
}
if you get my drift. This way the computational costs of
the second and succeeding bio_max_bytes() calls are very
small.
One thing which concerns me about the whole scheme at
present is that the uncommon case (volume managers, RAID,
etc) will end up penalising the common case - boring
old ext2 on boring old IDE/SCSI.
Right now, BIO_MAX_SECTORS is only 64k, and IDE can
take twice that. I'm not sure what the largest
request size is for SCSI - certainly 128k.
Let's not create any designed-in limitations at this
stage of the game.
-
Andrew Morton wrote:
> Mark Peloquin wrote:
> >
> > ...
> > > Tell me why this won't work?
> >
> > This would require the BIO assembly code to make at least one
> > call to find the current permissible BIO size at offset xyzzy.
> > Depending on the actual IO size many foo_max_bytes calls may
> > be required. Envision the LVM or RAID case where physical
> > extent or chunk sizes can be as small as 8Kb, I believe. For
> > a 64Kb IO, it's conceivable that 9 calls to foo_max_bytes may
> > be required to package that IO into permissibly sized BIOs.
> True. But probably the common case is not as bad as that, and
> these repeated calls are probably still cheaper than allocating
> and populating the redundant top-level BIO.
Perhaps, but calls are expensive. Repeated calls down stacked block
devices will add up. In only the most unusual cases will there
be a need to allocate more than one top-level BIO. So the savings
for most cases requiring splitting will just be a single BIO. The
other BIOs will still need to be allocated and populated, but it
would just happen outside the block devices. The savings of doing
repeated calls vs. allocating a single BIO are not obvious to me.
> Also, the top-level code can be cache-friendly. The bad way
> to write it would be to do:
>
> while (more to send) {
> maxbytes = bio_max_bytes(block);
> build_and_send_a_bio(block, maxbytes);
> block += maxbytes / whatever;
> }
> That creates long code paths and L1 cache thrashing. Kernel
> tends to do that rather a lot in the IO paths.
> The good way is:
> int maxbytes[something];
> int i = 0;
> while (more_to_send) {
> maxbytes[i] = bio_max_bytes(block);
> block += maxbytes[i++] / whatever;
> }
> i = 0;
> while (more_to_send) {
> build_and_send_a_bio(block, maxbytes[i]);
> block += maxbytes[i++] / whatever;
> }
> if you get my drift. This way the computational costs of
> the second and succeeding bio_max_bytes() calls are very
> small.
Yup.
> One thing which concerns me about the whole scheme at
> present is that the uncommon case (volume managers, RAID,
> etc) will end up penalising the common case - boring
> old ext2 on boring old IDE/SCSI.
Yes it would.
> Right now, BIO_MAX_SECTORS is only 64k, and IDE can
> take twice that. I'm not sure what the largest
> request size is for SCSI - certainly 128k.
In 2.5.7, Jens allows the BIO vectors to hold up to 256
pages, so it would seem that larger than 64k IOs are
planned.
> Let's not create any designed-in limitations at this
> stage of the game.
Not really trying to create any limitations, just trying
to balance what I think could be a performance concern.
Andrew Morton <[email protected]> wrote:
<snip/>
> Right now, BIO_MAX_SECTORS is only 64k, and IDE can
> take twice that. I'm not sure what the largest
> request size is for SCSI - certainly 128k.
Scatter gather lists in the scsi subsystem have a max
length of 255. The actual maximum size is dictated by
the HBA driver (sg_tablesize). The HBA driver can
further throttle the size of a single transfer with
max_sectors.
Experiments with raw IO (both in 2.4 and 2.5) indicate
that pages are not contiguous when the scatter gather
list is built. On i386 this limits the maximum transfer
size of a single scsi command to just less than 1 MB
(255 segments * one 4096 byte page each).
Doug Gilbert
> Perhaps, but calls are expensive. Repeated calls down stacked block
> devices will add up. In only the most unusual cases will there
You don't need to repeatedly query. At bind time you can compute the
limit for any device hierarchy and be done with it.
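i.e. walk the stack once when the volume is assembled and cache the
answer - roughly (types and fields are illustrative):

struct stacked_dev {                    /* made up */
        struct stacked_dev *lower;      /* NULL at the bottom */
        int max_bytes;                  /* this layer's own limit */
};

int compute_bound_limit(struct stacked_dev *dev)
{
        int limit = INT_MAX;

        /* Keep the most restrictive limit found anywhere in the stack. */
        for (; dev; dev = dev->lower)
                limit = min(limit, dev->max_bytes);
        return limit;
}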
Alan
Alan Cox wrote:
>
> > Perhaps, but calls are expensive. Repeated calls down stacked block
> > devices will add up. In only the most unusual cases will there
>
> You don't need to repeatedly query. At bind time you can compute the
> limit for any device hierarchy and be done with it.
>
S'pose so. The ideal request size is variable, based
on the alignment. So for example if the start block is
halfway into a stripe, the ideal BIO size is half a stripe.
But that's a pretty simple table to generate once-off,
as you say.
-
Andrew Morton wrote:
>Alan Cox wrote:
>
>>>Perhaps, but calls are expensive. Repeated calls down stacked block
>>>devices will add up. In only the most unusual cases will there
>>>
>>You don't need to repeatedly query. At bind time you can compute the
>>limit for any device hierarchy and be done with it.
>>
>
>S'pose so. The ideal request size is variable, based
>on the alignment. So for example if the start block is
>halfway into a stripe, the ideal BIO size is half a stripe.
>
>But that's a pretty simple table to generate once-off,
>as you say.
>
But this gets you lowest common denominator sizes for the whole
volume, which is basically the buffer head approach, chop all I/O up
into a chunk size we know will always work. Any sort of nasty boundary
condition at one spot in a volume means the whole thing is crippled
down to that level. It then becomes a black magic art to configure a
volume which is not restricted to a small request size.
Steve
On Fri, Apr 19, 2002 at 02:29:25AM -0500, Stephen Lord wrote:
> But this gets you lowest common denominator sizes for the whole
> volume, which is basically the buffer head approach, chop all I/O up
> into a chunk size we know will always work. Any sort of nasty boundary
> condition at one spot in a volume means the whole thing is crippled
> down to that level. It then becomes a black magic art to configure a
> volume which is not restricted to a small request size.
This is exactly the problem; I don't think it's going to be unusual to
see volumes that have a variety of mappings. For example the
'journal' area of the lv with a single fast pv, 'small file' area with
a linear mapping across normal pv's, and finally a 'large file' area
that has a few slower disks striped together.
The last thing I want in this situation is to split up all the io into
the lowest common chunk size, in this case the striped area which will
typically be < 64k.
LVM and EVMS need to do the splitting and resubmitting of bios
themselves.
- Joe
> This is exactly the problem; I don't think it's going to be unusual to
> see volumes that have a variety of mappings. For example the
> 'journal' area of the lv with a single fast pv, 'small file' area with
> a linear mapping across normal pv's, and finally a 'large file' area
> that has a few slower disks striped together.
Optimise for the sane cases. Remember the lvm can chain bio's trivially
itself. It's a lot cheaper to chain them than unchain them.
> The last thing I want in this situation is to split up all the io into
> the lowest common chunk size, in this case the striped area which will
> typically be < 64k.
The last thing I want is layers of bio support garbage slowing down a
perfectly sane machine that does not need them...
Alan
> But this gets you lowest common denominator sizes for the whole
> volume, which is basically the buffer head approach, chop all I/O up
> into a chunk size we know will always work. Any sort of nasty boundary
> condition at one spot in a volume means the whole thing is crippled
> down to that level. It then becomes a black magic art to configure a
> volume which is not restricted to a small request size.
It's still cheaper to merge bio chains than split them. The VM issues with
splitting them are not nice at all, since you may need to split a bio to
write out a page and it may be the last page.
On 18 April 2002 16:57, Andrew Morton wrote:
> > This would require the BIO assembly code to make at least one
> > call to find the current permissible BIO size at offset xyzzy.
> > Depending on the actual IO size many foo_max_bytes calls may
> > be required. Envision the LVM or RAID case where physical
> > extent or chunk sizes can be as small as 8Kb, I believe. For
> > a 64Kb IO, it's conceivable that 9 calls to foo_max_bytes may
> > be required to package that IO into permissibly sized BIOs.
[snip]
> The good way is:
>
> int maxbytes[something];
> int i = 0;
>
> while (more_to_send) {
> maxbytes[i] = bio_max_bytes(block);
> block += maxbytes[i++] / whatever;
> }
> i = 0;
> while (more_to_send) {
> build_and_send_a_bio(block, maxbytes[i]);
> block += maxbytes[i++] / whatever;
> }
>
> if you get my drift. This way the computational costs of
> the second and succeeding bio_max_bytes() calls are very
> small.
This has the advantage of being simple too.
> One thing which concerns me about the whole scheme at
> present is that the uncommon case (volume managers, RAID,
> etc) will end up penalising the common case - boring
> old ext2 on boring old IDE/SCSI.
Yes, but since the performance gap between CPU and devices
continues to increase, simplicity outweighs the wasted CPU
cycles here. We are going to wait much longer
for IO to take place anyway.
> Right now, BIO_MAX_SECTORS is only 64k, and IDE can
> take twice that. I'm not sure what the largest
> request size is for SCSI - certainly 128k.
Yep, submitting the largest possible block in one go is a win.
--
vda
On Fri, 2002-04-19 at 03:58, Alan Cox wrote:
> > But this gets you lowest common denominator sizes for the whole
> > volume, which is basically the buffer head approach, chop all I/O up
> > into a chunk size we know will always work. Any sort of nasty boundary
> > condition at one spot in a volume means the whole thing is crippled
> > down to that level. It then becomes a black magic art to configure a
> > volume which is not restricted to a small request size.
>
> It's still cheaper to merge bio chains than split them. The VM issues with
> splitting them are not nice at all since you may need to split a bio to
> write out a page and it may be the last page
> -
I am well aware of the problems of allocating more memory in some of
these places - been the bane of my life for the last couple of years
with XFS ;-)
It just feels so bad to have the ability to build a large request and
use one bio structure and know that 99.9% of the time the lower layers
can handle it in one chunk, but instead have to chop it into the lowest
common denominator pieces for the sake of the other 0.1%.
Just looking at how my disks ended up partitioned, not many of them are
even on 4K boundaries, so any sort of concat built on them would
have a boundary case which required such a split - I think; still
working on my caffeine intake this morning.
Steve
--
Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: [email protected]
> Just looking at how my disks ended up partitioned not many of them are
> even on 4K boundaries, so any sort of concat built on them would
> have a boundary case which required such a split - I think, still
> working on my caffeine intake this morning.
Alignment and concatenation are different things altogether. On the whole I
can blast 64K chunks on a 512 byte alignment out of my controllers. The
partitioning doesn't bother me too much. Do we even want to consider a
device that cannot hit its own sector size boundary ?
Oh and the unusual block size stuff seems to be quite easy for the bottom
layers. The horror lurks up higher. Most file systems can't cope (doesn't
matter too much), isofs can be mixed block size (bletch) but the killer
seems to be how you mmap a file on a device with 2326 byte sectors sanely..
(Just say no ?)
On Fri, 19 Apr 2002, Alan Cox wrote:
> Oh and the unusual block size stuff seems to be quite easy for the bottom
> layers. The horror lurks up higher. Most file systems can't cope (doesn't
> matter too much), isofs can be mixed block size (bletch) but the killer
> seems to be how you mmap a file on a device with 2326 byte sectors sanely..
> (Just say no ?)
mmap() shouldn't be a problem if you manage to stuff the file
into the page cache ;)
Rik
--
http://www.linuxsymposium.org/2002/
"You're one of those condescending OLS attendants"
"Here's a nickle kid. Go buy yourself a real t-shirt"
http://www.surriel.com/ http://distro.conectiva.com/
On Thu, 18 Apr 2002, Andrew Morton wrote:
> Alan Cox wrote:
> >
> > > Perhaps, but calls are expensive. Repeated calls down stacked block
> > > devices will add up. In only the most unusual cases will there
> >
> > You don't need to repeatedly query. At bind time you can compute the
> > limit for any device hierarchy and be done with it.
> >
>
> S'pose so. The ideal request size is variable, based
> on the alignment. So for example if the start block is
> halfway into a stripe, the ideal BIO size is half a stripe.
>
> But that's a pretty simple table to generate once-off,
> as you say.
Perhaps we can return request size _and_ stripe alignment at bind time.
Then we can do the right thing for RAID/LVM/etc in a pretty
straightforward manner. LVMs of RAIDs can return a GCD(LVM stripe, RAID
stripe) and non-striped devices can return 0 to indicate no restrictions.
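Roughly like this (illustrative only; 0 still means "no restriction"):

static unsigned int gcd(unsigned int a, unsigned int b)
{
        while (b) {
                unsigned int t = a % b;
                a = b;
                b = t;
        }
        return a;
}

/* Combine the stripe restrictions of two stacked layers at bind time. */
unsigned int combined_stripe(unsigned int upper, unsigned int lower)
{
        if (!upper)
                return lower;
        if (!lower)
                return upper;
        return gcd(upper, lower);
}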
--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
Alan Cox wrote:
>
> > But this gets you lowest common denominator sizes for the whole
> > volume, which is basically the buffer head approach, chop all I/O up
> > into a chunk size we know will always work. Any sort of nasty boundary
> > condition at one spot in a volume means the whole thing is crippled
> > down to that level. It then becomes a black magic art to configure a
> > volume which is not restricted to a small request size.
>
> It's still cheaper to merge bio chains than split them.
That depends on how small each piece ends up having to be with the
lowest common denominator approach. (It shouldn't end up with
too-small pieces.)
It's easy to miss/forget that merging chains redundantly does have a
bit of extra cost on the completion path - extra callback invocations
(bi_end_io) to collate results.
> The VM issues with
> splitting them are not nice at all since you may need to split a bio to
> write out a page and it may be the last page
Yes, the mempool alloc aspects get quite confusing even when thinking
about the normal bio path ... (e.g. bounce bio's are probably already
an aspect of concern, since we have multiple allocations by the same
thread drawing from the same pool, a generic condition that has
earlier been cited as a source of potential deadlock.) With splitting
it gets worse. (BTW, for similar reasons, drawing from the common pool
may not be the best thing to do when splitting, though multiple pools
probably come with their own source of problems.)
But then, the situation of writeout of the last page again is not a
common case. In that case it makes sense to revert to the lowest
common denominator ... but must we do so in every case?
Again, it really depends on how small the lowest common denominator
turns out to be. If the entire layout information can be abstracted in
a way that lets it be computed in advance for a given block, so that
it doesn't force one to be too conservative, fine ... but I don't know
if that's always feasible.
As such, it's good to avoid splitting in general by relying on good
hints, but perhaps leave room for the stray case that crops up --
either handle the split, or maybe have a way to pass back an error to
retry with a smaller size.
Maybe 2 limits (one that indicates that anything bigger than this is
sure to get split, so always break it up, and another that says that
anything smaller than this is sure not to be split, so use that size
when you can't afford a split).
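Roughly (names made up):

/* Anything larger than hard_max is certain to be split, so break it
 * up in advance; anything no larger than safe_max is certain never
 * to need a split. */
struct split_hints {
        unsigned int safe_max;
        unsigned int hard_max;
};

static unsigned int choose_bio_bytes(struct split_hints *h,
                                     unsigned int wanted)
{
        if (wanted > h->hard_max)
                return h->safe_max;     /* fall back to the always-safe size */
        return wanted;                  /* may occasionally be split below */
}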
Regards
Suparna
In article <[email protected]> you wrote:
> or maybe have a way to pass back an error to retry with a smaller size.
> Maybe 2 limits (one that indicates that anything bigger than this is
> sure to get split, so always break it up, and another that says that
> anything smaller than this is sure not to be split, so use that size
> when you can't afford a split).
Unfortunately it's not always size that's the issue. For example, in my
code I need to split when a request crosses a certain boundary, and without
going into too much detail, that boundary is 62 Kb aligned, not 64
(for technical reasons ;().
Size won't catch this, and while a 64Kb block will always be split (that
you can be sure of), even a 4Kb request, if unlucky, can still need to be
split up.
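The check itself is simple enough - something like (sketch, using the
62Kb figure from above):

#define SPLIT_BOUNDARY_SECTORS  (62 * 1024 / 512)       /* 62Kb, in sectors */

static int crosses_boundary(unsigned long sector, unsigned int nr_sectors)
{
        /* A request needs splitting if its first and last sectors fall
         * in different 62Kb-aligned windows, whatever its size. */
        return (sector / SPLIT_BOUNDARY_SECTORS) !=
               ((sector + nr_sectors - 1) / SPLIT_BOUNDARY_SECTORS);
}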
--
But when you distribute the same sections as part of a whole which is a work
based on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the entire whole,
and thus to each and every part regardless of who wrote it. [sect.2 GPL]
On Mon, Apr 22, 2002 at 08:06:47AM +0100, [email protected] wrote:
> In article <[email protected]> you wrote:
>
> > or maybe have a way to pass back an error to retry with a smaller size.
> > Maybe 2 limits (one that indicates that anything bigger than this is
> > sure to get split, so always break it up, and another that says that
> > anything smaller than this is sure not to be split, so use that size
> > when you can't afford a split).
>
> Unfortunately it's not always size that's the issue. For example, in my
> code I need to split when a request crosses a certain boundary, and without
> going into too much detail, that boundary is 62 Kb aligned, not 64
> (for technical reasons ;().
Yes, I know ... Size alone isn't the only constraint - this was what
the earlier grow_bio discussion (about max BIO sizes) was all
about. Actually, not only that: in cases where the queue already has
requests which we can merge, even the size decision gets more complex ...
That's why allowing for the exception cases when we do need to split
seemed like an option to take.
>
> Size won't catch this, and while a 64Kb block will always be split (that
> you can be sure of), even a 4Kb request, if unlucky, can still need to be
> split up.
>
This would, as you observe, be caught in alignment checks. The limit
doesn't have to be a fixed size (e.g. it could be a function of the
block) or even a size-only thing. Conceptually the question is whether
it can be generalized into a compound worst-case constraint through the
layers of lvm et al (at bind time). It could be expressed as multiple
checks.
Regards
Suparna
Alan Cox wrote:
>
> > But this gets you lowest common denominator sizes for the whole
> > volume, which is basically the buffer head approach, chop all I/O up
> > into a chunk size we know will always work. Any sort of nasty boundary
> > condition at one spot in a volume means the whole thing is crippled
> > down to that level. It then becomes a black magic art to configure a
> > volume which is not restricted to a small request size.
>
> It's still cheaper to merge bio chains than split them. The VM issues with
> splitting them are not nice at all since you may need to split a bio to
> write out a page and it may be the last page
How about reserving a small memory pool for splitting when normal
memory allocation fails? I know we want a clean kernel, so this
mechanism would be implemented only in those drivers that actually
need it. I.e. raid0/5 would keep an emergency split buffer around for
bio's bigger than the stripe size, and devices with all sorts of odd
requirements could do the same.
This might look like duplication, but it isn't really, as the
different devices might need different splitting anyway. I.e. RAID
wants to split into stripe-sized chunks but no smaller; an odd device
might need something different. The disk concatenation in md would
only want to split when you actually hit a boundary.
Also, letting each driver handle the special cases itself works when
someone makes raid-0 on top of weird adapters.
A kernel with just plain disk drivers wouldn't need, and wouldn't
have, such mechanisms either.
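Something like this, say (only a sketch against the 2.5 mempool
interfaces; the raid0 names and the reserve size are made up):

#include <linux/bio.h>
#include <linux/mempool.h>
#include <linux/slab.h>

#define RAID0_SPLIT_RESERVE 4   /* a handful is enough for forward progress */

static mempool_t *raid0_split_pool;

static void *raid0_split_alloc(int gfp_mask, void *data)
{
        return kmalloc(sizeof(struct bio), gfp_mask);
}

static void raid0_split_free(void *element, void *data)
{
        kfree(element);
}

/* Set up once when the raid0 device is assembled; the reserve is only
 * drawn on when a bio actually straddles a stripe boundary and the
 * normal allocation fails. */
static int raid0_split_pool_init(void)
{
        raid0_split_pool = mempool_create(RAID0_SPLIT_RESERVE,
                                          raid0_split_alloc,
                                          raid0_split_free, NULL);
        return raid0_split_pool ? 0 : -ENOMEM;
}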
Helge Hafting
On Thu, Apr 18, 2002 at 01:23:47PM -0500, Mark Peloquin wrote:
>
> > Andrew Morton wrote:
> > >
> > > Mark Peloquin wrote:
> > > >
> > > ...
> > > > In EVMS, we are adding code to deal with BIO splitting, to
> > > > enable our feature modules, such as DriveLinking, LVM, & MD
> > > > Linear, etc to break large BIOs up on chunk size or lower
> > > > level device boundaries.
> > >
> > > Could I suggest that this code not be part of EVMS, but that
> > > you implement it as a library within the core kernel? Lots of
> > > stuff is going to need BIO splitting - software RAID, ataraid,
> > > XFS, etc. May as well talk with Jens, Martin Petersen, Arjan,
> > > Neil Brown. Do it once, do it right...
> > >
> > I take that back.
> >
> > We really, really do not want to perform BIO splitting at all.
> > It requires that the kernel perform GFP_NOIO allocations at
> > the worst possible time, and it's just broken.
> >
> > What I would much prefer is that the top-level BIO assembly
> > code be able to find out, beforehand, what the maximum
> > permissible BIO size is at the chosen offset. It can then
> > simply restrict the BIO to that size.
> >
> > Simply:
> >
> > max = bio_max_bytes(dev, block);
> >
> > which gets passed down the exact path as the requests themselves.
> > Each layer does:
> >
> > int foo_max_bytes(sector_t sector)
> > {
> > int my_maxbytes, his_maxbytes;
> > sector_t my_sector;
> >
> > my_sector = my_translation(sector);
> > his_maxbytes = next_device(me)->max_bytes(my_sector);
> > my_maxbytes = whatever(my_sector);
> > return min(my_maxbytes, his_maxbytes);
> > }
> >
> > and, at the bottom:
> >
> > int ide_max_bytes(sector_t sector)
> > {
> > return 248 * 512;
> > }
> >
> > BIO_MAX_SECTORS and request_queue.max_sectors go away.
> >
> > Tell me why this won't work?
>
> This would require the BIO assembly code to make at least one
> call to find the current permissible BIO size at offset xyzzy.
> Depending on the actual IO size many foo_max_bytes calls may
> be required. Envision the LVM or RAID case where physical
> extent or chunk sizes can be as small as 8Kb, I believe. For
> a 64Kb IO, it's conceivable that 9 calls to foo_max_bytes may
> be required to package that IO into permissibly sized BIOs.
>
> What you're proposing is doable, but not without a cost.
Why not just put the smallest required BIO size in a struct for that device?
Then each read of that struct can be kept in cache...
Is the BIO max size going to change at different offsets?
> > > > > In EVMS, we are adding code to deal with BIO splitting, to
> > > > > enable our feature modules, such as DriveLinking, LVM, & MD
> > > > > Linear, etc to break large BIOs up on chunk size or lower
> > > > > level device boundaries.
> > > >
> > > > Could I suggest that this code not be part of EVMS, but that
> > > > you implement it as a library within the core kernel? Lots of
> > > > stuff is going to need BIO splitting - software RAID, ataraid,
> > > > XFS, etc. May as well talk with Jens, Martin Petersen, Arjan,
> > > > Neil Brown. Do it once, do it right...
> > > >
> > > I take that back.
> > >
> > > We really, really do not want to perform BIO splitting at all.
> > > It requires that the kernel perform GFP_NOIO allocations at
> > > the worst possible time, and it's just broken.
> > >
> > > What I would much prefer is that the top-level BIO assembly
> > > code be able to find out, beforehand, what the maximum
> > > permissible BIO size is at the chosen offset. It can then
> > > simply restrict the BIO to that size.
> > >
[snipped some ideas]
>
> Why not just put the smallest required BIO size in a struct for that device?
> Then each read of that struct can be kept in cache...
>
> Is the BIO max size going to change at different offsets?
My two cents as a non-guru: there are two different reasons for splitting
a large BIO:
1) some layer has hit some uncommon boundary condition, like spanning
linearly appended physical volumes in an LV or something like that
2) a fundamental 'maximum-chunkiness' allowed by some layer has been
exceeded, like stripe size in a raid, or MAX_SECTORS in ide or something
like that.
It would suck if the system generated large BIOs that needed to be split
for every IO operation, due to #2, but it would also suck to add overhead
to every IO operation for #1.
#1 is an exception, and I think it would be acceptable to have a splitting
function/mempool for handling what should be a boundary condition only,
and a call through the layers to find out #2 at open time would handle
it once per device or something like that.
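In other words, something like (hypothetical, of course):

#include <linux/mempool.h>

/* Filled in once per device at open time for case #2; the reserve
 * pool is only touched for the boundary-condition splits of case #1. */
struct dev_split_info {
        unsigned int max_bytes;         /* steady-state limit (case #2) */
        mempool_t *split_reserve;       /* exception path only (case #1) */
};

static unsigned int bio_bytes_for(struct dev_split_info *info,
                                  unsigned int wanted)
{
        /* The common case stays within the advertised limit and never
         * splits; anything larger is trimmed up front. */
        return wanted <= info->max_bytes ? wanted : info->max_bytes;
}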
David
--
/==============================\
| David Mansfield |
| [email protected] |
\==============================/
Mike Fedyk wrote:
>
> ...
> Is the BIO max size going to change at different offsets?
Yes, it is.
And as far as I know, that size is in all cases calculable
before the top-level assembly begins.
-