>Hi,
>
>On Thu, Feb 01, 2001 at 10:25:22AM +0530, [email protected] wrote:
>>
>> >We _do_ need the ability to stack completion events, but as far as the
>> >kiobuf work goes, my current thoughts are to do that by stacking
>> >lightweight "clone" kiobufs.
>>
>> Would that work with stackable filesystems ?
>
>Only if the filesystems were using VFS interfaces which used kiobufs.
>Right now, the only filesystem using kiobufs is XFS, and it only
>passes them down to the block device layer, not to other filesystems.
That would require the vfs interfaces themselves (address space
readpage/writepage ops) to take kiobufs as arguments, instead of struct
page *. That's not the case right now, is it?
To take this example, a filter filesystem would be layered over XFS.
So right now the filter filesystem only sees the struct page * and passes
it along. Any completion event stacking has to be applied with reference
to this.
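A rough sketch of that forwarding path, assuming the 2.4 readpage
signature; the filterfs_* names, the lower-inode lookup and passing the
upper file straight down are all made up / simplified:

#include <linux/fs.h>
#include <linux/mm.h>

/* Hypothetical filter filesystem readpage: it never sees a kiobuf, only
 * the struct page * handed down by the VFS, which it forwards to the
 * filesystem stacked below it (XFS in this example).  Any completion
 * stacking has to hang off this page, not off a kiobuf. */
static int filterfs_readpage(struct file *file, struct page *page)
{
	/* filterfs_lower_inode() is a made-up helper that finds the
	 * hidden lower inode for this stacked inode. */
	struct inode *lower = filterfs_lower_inode(file->f_dentry->d_inode);

	return lower->i_mapping->a_ops->readpage(file, page);
}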
>> Being able to track the children of a kiobuf would help with I/O
>> cancellation (e.g. to pull sub-ios off their request queues if I/O
>> cancellation for the parent kiobuf was issued). Not essential, I guess, in
>> general, but useful in some situations.
>
>What exactly is the justification for IO cancellation? It really
>upsets the normal flow of control through the IO stack to have
>voluntary cancellation semantics.
One reason that I saw is that if the results of an i/o are no longer
required due to some condition (e.g. aio cancellation, or the process
that issued the i/o getting killed), then cancellation avoids unnecessary
disk i/o for requests that haven't been scheduled yet.
Too remote a requirement? If the capability/support doesn't exist at the
driver level, I guess it's difficult.
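Purely to illustrate the idea (nothing below exists as such): a parent
kiobuf could carry a cancellation flag that is checked before each child
request is queued, so sub-ios that haven't reached the request queue are
simply never issued:

#include <linux/iobuf.h>
#include <linux/errno.h>
#include <asm/atomic.h>

/* ki_cancelled and submit_sub_io() are hypothetical. */
static int submit_clone_kiobufs(struct kiobuf *parent,
				struct kiobuf **clones, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		if (atomic_read(&parent->ki_cancelled))
			return -EINTR;	/* caller completes the parent with an error */
		submit_sub_io(clones[i]);
	}
	return 0;
}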
> --Stephen
_______________________________________________
Kiobuf-io-devel mailing list
[email protected]
http://lists.sourceforge.net/lists/listinfo/kiobuf-io-devel
On Thu, Feb 01, 2001 at 08:14:58PM +0530, [email protected] wrote:
>
> >Hi,
> >
> >On Thu, Feb 01, 2001 at 10:25:22AM +0530, [email protected] wrote:
> >>
> >> >We _do_ need the ability to stack completion events, but as far as the
> >> >kiobuf work goes, my current thoughts are to do that by stacking
> >> >lightweight "clone" kiobufs.
> >>
> >> Would that work with stackable filesystems ?
> >
> >Only if the filesystems were using VFS interfaces which used kiobufs.
> >Right now, the only filesystem using kiobufs is XFS, and it only
> >passes them down to the block device layer, not to other filesystems.
>
> That would require the vfs interfaces themselves (address space
> readpage/writepage ops) to take kiobufs as arguments, instead of struct
> page * . That's not the case right now, is it ?
No, and with the current kiobufs it would not make sense, because they
are too heavy-weight. With page,length,offset iobufs this makes sense
and is IMHO the way to go.
Christoph
--
Of course it doesn't work. We've performed a software upgrade.
Hi,
On Thu, Feb 01, 2001 at 04:09:53PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 08:14:58PM +0530, [email protected] wrote:
> >
> > That would require the vfs interfaces themselves (address space
> > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > page * . That's not the case right now, is it ?
>
> No, and with the current kiobufs it would not make sense, because they
> are to heavy-weight.
Really? In what way?
> With page,length,offsett iobufs this makes sense
> and is IMHO the way to go.
What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
lean enough to do the job??
Cheers,
Stephen
On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote:
> Hi,
>
> On Thu, Feb 01, 2001 at 04:09:53PM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 08:14:58PM +0530, [email protected] wrote:
> > >
> > > That would require the vfs interfaces themselves (address space
> > > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > > page * . That's not the case right now, is it ?
> >
> > No, and with the current kiobufs it would not make sense, because they
> > are to heavy-weight.
>
> Really? In what way?
We can't allocate a huge kiobuf structure just for requesting one page of
IO. It might get better with VM-level IO clustering though.
>
> > With page,length,offsett iobufs this makes sense
> > and is IMHO the way to go.
>
> What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> lean enough to do the job??
No. I was speaking about the light-weight kiobuf Linux & Me discussed on
lkml some time ago (though I'd much more like to call it kiovec analogous
to BSD iovecs).
And a page,offset,length tuple is pretty cheap compared to a current kiobuf.
Christoph
--
Of course it doesn't work. We've performed a software upgrade.
On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > lean enough to do the job??
>
> No. I was speaking abou the light-weight kiobuf Linux & Me discussed on
^^^^^ Linus ...
> lkml some time ago (though I'd much more like to call it kiovec analogous
> to BSD iovecs).
>
> And a page,offset,length tuple is pretty cheap compared to a current kiobuf.
Christoph (slapping himself for the stupid typo and self-reply ...)
--
Of course it doesn't work. We've performed a software upgrade.
Hi,
On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote:
> > >
> > > No, and with the current kiobufs it would not make sense, because they
> > > are to heavy-weight.
> >
> > Really? In what way?
>
> We can't allocate a huge kiobuf structure just for requesting one page of
> IO. It might get better with VM-level IO clustering though.
A kiobuf is *much* smaller than, say, a buffer_head, and we currently
allocate a buffer_head per block for all IO.
A kiobuf contains enough embedded page vector space for 16 pages by
default, but I'm happy enough to remove that if people want. However,
note that that memory is not initialised, so there is no memory access
cost at all for that empty space. Remove that space and instead of
one memory allocation per kiobuf, you get two, so the cost goes *UP*
for small IOs.
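In other words (sketch only; the embedded array is called map_array and
sized KIO_STATIC_PAGES here, the exact names in the tree may differ):

/* With the embedded space: one kmalloc() for the kiobuf itself, and
 * maplist simply points at the static array until the IO outgrows it. */
iobuf->maplist   = iobuf->map_array;
iobuf->array_len = KIO_STATIC_PAGES;

/* Without it: every kiobuf, however small the IO, pays for a second
 * allocation just to hold its page vector. */
iobuf->maplist = kmalloc(nr_pages * sizeof(struct page *), GFP_KERNEL);
if (!iobuf->maplist)
	return -ENOMEM;
iobuf->array_len = nr_pages;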
> > > With page,length,offsett iobufs this makes sense
> > > and is IMHO the way to go.
> >
> > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > lean enough to do the job??
>
> No. I was speaking abou the light-weight kiobuf Linux & Me discussed on
> lkml some time ago (though I'd much more like to call it kiovec analogous
> to BSD iovecs).
What is so heavyweight in the current kiobuf (other than the embedded
vector, which I've already noted I'm willing to cut)?
--Stephen
On Thu, Feb 01, 2001 at 05:41:20PM +0000, Stephen C. Tweedie wrote:
> Hi,
>
> On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote:
> > > >
> > > > No, and with the current kiobufs it would not make sense, because they
> > > > are to heavy-weight.
> > >
> > > Really? In what way?
> >
> > We can't allocate a huge kiobuf structure just for requesting one page of
> > IO. It might get better with VM-level IO clustering though.
>
> A kiobuf is *much* smaller than, say, a buffer_head, and we currently
> allocate a buffer_head per block for all IO.
A kiobuf is 124 bytes, a buffer_head 96. And a buffer_head is additionally
used for caching data, a kiobuf is not.
>
> A kiobuf contains enough embedded page vector space for 16 pages by
> default, but I'm happy enough to remove that if people want. However,
> note that that memory is not initialised, so there is no memory access
> cost at all for that empty space. Remove that space and instead of
> one memory allocation per kiobuf, you get two, so the cost goes *UP*
> for small IOs.
You could still embed it into a surrounding structure, even if there are cases
where an additional memory allocation is needed, yes.
>
> > > > With page,length,offsett iobufs this makes sense
> > > > and is IMHO the way to go.
> > >
> > > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > > lean enough to do the job??
> >
> > No. I was speaking abou the light-weight kiobuf Linux & Me discussed on
> > lkml some time ago (though I'd much more like to call it kiovec analogous
> > to BSD iovecs).
>
> What is so heavyweight in the current kiobuf (other than the embedded
> vector, which I've already noted I'm willing to cut)?
array_len, io_count, the presence of wait_queue AND end_io, and the lack of
scatter gather in one kiobuf struct (you always need an array), and AFAICS
that is what the networking guys dislike.
They often just want multiple buffers in one physical page, and an array of
those.
Now one could say: just let the networkers use their own kind of buffers
(and that's exactly what is done in the zerocopy patches), but that again leads
to inefficient buffer passing and ungeneric IO handling.
Something like:

struct kiovec {
	struct page *	kv_page;	/* physical page */
	u_short		kv_offset;	/* offset into page */
	u_short		kv_length;	/* data length */
};

enum kio_flags {
	KIO_LOANED,	/* the calling subsystem wants this buf back */
	KIO_GIFTED,	/* thanks for the buffer, man! */
	KIO_COW		/* copy on write (XXX: not yet) */
};

struct kio {
	struct kiovec *		kio_data;	/* our kiovecs */
	int			kio_ndata;	/* # of kiovecs */
	int			kio_flags;	/* loaned or gifted? */
	void *			kio_priv;	/* caller private data */
	wait_queue_head_t	kio_wait;	/* wait queue */
};
makes it a lot simpler for the subsystems to integrate.
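For illustration, here is how the networking case mentioned above (two
partial buffers sharing one physical page) would fill these in; pg and
my_cookie are assumed to come from the caller:

struct kiovec vecs[2];
struct kio kio;

/* Two 1480-byte buffers living in the same physical page. */
vecs[0].kv_page = pg;  vecs[0].kv_offset = 0;     vecs[0].kv_length = 1480;
vecs[1].kv_page = pg;  vecs[1].kv_offset = 1480;  vecs[1].kv_length = 1480;

kio.kio_data  = vecs;
kio.kio_ndata = 2;
kio.kio_flags = KIO_LOANED;	/* we want these buffers back */
kio.kio_priv  = my_cookie;	/* whatever the caller needs at completion */
init_waitqueue_head(&kio.kio_wait);

/* ...hand &kio to the consuming subsystem... */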
Christoph
--
Of course it doesn't work. We've performed a software upgrade.
> array_len, io_count, the presence of wait_queue AND end_io, and the lack of
> scatter gather in one kiobuf struct (you always need an array), and AFAICS
> that is what the networking guys dislike.
You need a completion pointer. It's arguable whether you want the wait_queue
in the default structure or as part of whatever it's contained in and handled
by the completion pointer.
And I've actually bothered to talk to the networking people and they don't have
a problem with the completion pointer.
> Now one could say: just let the networkers use their own kind of buffers
> (and that's exactly what is done in the zerocopy patches), but that again leds
> to inefficient buffer passing and ungeneric IO handling.
Careful. This is the line of reasoning which also says
Aeroplanes are good for travelling long distances
Cars are better for getting to my front door
Therefore everyone should drive a 747 home
It is quite possible that the right thing to do is to do conversions in the
cases where it happens. That might seem a good reason for having offset/length
pairs on each block, because streaming from the network to disk you may well
get a collection of partial pages of data you need to write to disk.
Unfortunately the reality of DMA support on almost (but not quite) all
disk controllers is that you don't get that degree of scatter gather.
My I2O controllers and I think the fusion controllers could indeed benefit
and cope with being given a pile of randomly located 1480 byte chunks of
data and being asked to put them on disk.
I do seriously doubt there are any real world situations where this is useful.
On Thu, 1 Feb 2001, Alan Cox wrote:
> > Now one could say: just let the networkers use their own kind of buffers
> > (and that's exactly what is done in the zerocopy patches), but that again leds
> > to inefficient buffer passing and ungeneric IO handling.
[snip]
> It is quite possible that the right thing to do is to do
> conversions in the cases it happens.
OTOH, somehow a zero-copy system which converts the zero-copy
metadata every time the buffer is handed to another subsystem
just doesn't sound right ...
(well, maybe it _is_, but it looks quite inefficient at first
glance)
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
> OTOH, somehow a zero-copy system which converts the zero-copy
> metadata every time the buffer is handed to another subsystem
> just doesn't sound right ...
>
> (well, maybe it _is_, but it looks quite inefficient at first
> glance)
I would certainly be a lot happier if there were a single sensible zero copy
format doing the lot, but only if it doesn't turn into a cross between a 747
and a bicycle.
On Thu, Feb 01, 2001 at 06:25:16PM +0000, Alan Cox wrote:
> > array_len, io_count, the presence of wait_queue AND end_io, and the lack of
> > scatter gather in one kiobuf struct (you always need an array), and AFAICS
> > that is what the networking guys dislike.
>
> You need a completion pointer. Its arguable whether you want the wait_queue
> in the default structure or as part of whatever its contained in and handled
> by the completion pointer.
I personally think that Ben's function-pointer-on-wakeup work is the
alternative in this area.
> And I've actually bothered to talk to the networking people and they dont have
> a problem with the completion pointer.
I have never said that they don't like it - but having both the waitqueue and the
completion handler in the kiobuf makes it bigger.
> > Now one could say: just let the networkers use their own kind of buffers
> > (and that's exactly what is done in the zerocopy patches), but that again leds
> > to inefficient buffer passing and ungeneric IO handling.
>
> Careful. This is the line of reasoning which also says
>
> Aeroplanes are good for travelling long distances
> Cars are better for getting to my front door
> Therefore everyone should drive a 747 home
Hehe ;)
> It is quite possible that the right thing to do is to do conversions in the
> cases it happens.
Yes, this would be THE alternative to my suggestion.
> That might seem a good reason for having offset/length
> pairs on each block, because streaming from the network to disk you may well
> get a collection of partial pages of data you need to write to disk.
> Unfortunately the reality of DMA support on almost (but not quite) all
> disk controllers is that you don't get that degree of scatter gather.
>
> My I2O controllers and I think the fusion controllers could indeed benefit
> and cope with being given a pile of randomly located 1480 byte chunks of
> data and being asked to put them on disk.
It doesn't really matter that much, because we write to the pagecache
first anyway.
The real thing is that we want to have some common data structure for
describing physical memory used for IO. We could either use special
structures in every subsystem and then copy between them or pass
struct page * and lose meta information. Or we could try to find a
structure that holds enough information to make passing it from one
subsystem to another useful. The cut-down kio design (heavily inspired
by Larry McVoy's splice paper) should allow just that, nothing more and
nothing less. For use in disk-io and networking or v4l there are probably
other primary data structures needed, and that's ok.
Christoph
--
Of course it doesn't work. We've performed a software upgrade.
> It doesn't really matter that much, because we write to the pagecache
> first anyway.
Not for raw I/O. Although for the drivers that can't cope, going via
the page cache is certainly the next best alternative.
> The real thing is that we want to have some common data structure for
> describing physical memory used for IO. We could either use special
Yes. You also need a way to describe it in terms of page * in order to do
mm locking for raw I/O (like the video capture stuff wants)
> by Larry McVoy's splice paper) should allow just that, nothing more an
> nothing less. For use in disk-io and networking or v4l there are probably
> other primary data structures needed, and that's ok.
Certainly having the lightweight one as a subset of the heavyweight one is a
good target.
On Thu, Feb 01, 2001 at 06:57:41PM +0000, Alan Cox wrote:
> Not for raw I/O. Although for the drivers that can't cope then going via
> the page cache is certainly the next best alternative
True - but raw-io has its own alignment issues anyway.
> Yes. You also need a way to describe it in terms of page * in order to do
> mm locking for raw I/O (like the video capture stuff wants)
Right. (That's why we have the struct page * always as part of the structure)
> Certainly having the lightweight one a subset of the heavyweight one is a good
> target.
Yes, I'm trying to address that...
Christoph
--
Of course it doesn't work. We've performed a software upgrade.
Hi,
On Thu, Feb 01, 2001 at 07:14:03PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 05:41:20PM +0000, Stephen C. Tweedie wrote:
> > >
> > > We can't allocate a huge kiobuf structure just for requesting one page of
> > > IO. It might get better with VM-level IO clustering though.
> >
> > A kiobuf is *much* smaller than, say, a buffer_head, and we currently
> > allocate a buffer_head per block for all IO.
>
> A kiobuf is 124 bytes,
... the vast majority of which is room for the page vector to expand
without having to be copied. You don't touch that in the normal case.
> a buffer_head 96. And a buffer_head is additionally
> used for caching data, a kiobuf not.
Buffer_heads are _sometimes_ used for caching data. That's one of the
big problems with them, they are too overloaded, being both IO
descriptors _and_ cache descriptors. If you've got 128k of data to
write out from user space, do you want to set up one kiobuf or 256
buffer_heads? Buffer_heads become really very heavy indeed once you
start doing non-trivial IO.
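(To put numbers on that, assuming 512-byte blocks and 4k pages: 128k is
131072/512 = 256 buffer_heads at ~96 bytes each, roughly 24k of descriptors,
against one kiobuf whose page vector needs only 32 entries.)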
> > What is so heavyweight in the current kiobuf (other than the embedded
> > vector, which I've already noted I'm willing to cut)?
>
> array_len
kiobufs can be reused after IO. You can depopulate a kiobuf,
repopulate it with new pages and submit new IO without having to
deallocate the kiobuf. You can't do this without knowing how big the
data vector is. Removing that functionality will prevent reuse,
making them _more_ heavyweight.
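The reuse pattern being defended here, sketched with the 2.4-era helpers
(alloc_kiovec, map_user_kiobuf, brw_kiovec, unmap_kiobuf); error handling
trimmed and signatures approximate:

#include <linux/fs.h>
#include <linux/iobuf.h>

static int two_ios_one_kiobuf(kdev_t dev, int blocksize, size_t len,
			      unsigned long uaddr1, unsigned long blocks1[],
			      unsigned long uaddr2, unsigned long blocks2[])
{
	struct kiobuf *iobuf;
	int err;

	err = alloc_kiovec(1, &iobuf);		/* allocated once... */
	if (err)
		return err;

	err = map_user_kiobuf(READ, iobuf, uaddr1, len);	/* populate */
	if (!err) {
		err = brw_kiovec(READ, 1, &iobuf, dev, blocks1, blocksize);
		unmap_kiobuf(iobuf);				/* depopulate */
	}

	if (err >= 0) {
		err = map_user_kiobuf(READ, iobuf, uaddr2, len); /* repopulate */
		if (!err) {
			err = brw_kiovec(READ, 1, &iobuf, dev, blocks2, blocksize);
			unmap_kiobuf(iobuf);
		}
	}

	free_kiovec(1, &iobuf);			/* ...and freed once */
	return err < 0 ? err : 0;
}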
> io_count,
Right now we can take a kiobuf and turn it into a bunch of
buffer_heads for IO. The io_count lets us track all of those sub-IOs
so that we know when all submitted IO has completed, so that we can
pass the completion callback back up the chain without having to
allocate yet more descriptor structs for the IO.
Again, remove this and the IO becomes more heavyweight because we need
to create a separate struct for the info.
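Roughly what that accounting looks like, as a sketch (assuming io_count is
an atomic_t as in the 2.4 kiobuf; the real code is spread across the
buffer_head completion path):

/* Each buffer_head carved out of the kiobuf bumped io_count when it was
 * submitted; the last sub-IO to finish fires the one completion callback
 * and wakes any sleeper. */
static void kiobuf_sub_io_done(struct kiobuf *iobuf, int uptodate)
{
	if (!uptodate)
		iobuf->errno = -EIO;

	if (atomic_dec_and_test(&iobuf->io_count)) {
		if (iobuf->end_io)
			iobuf->end_io(iobuf);	/* one callback up the chain */
		wake_up(&iobuf->wait_queue);
	}
}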
> the presence of wait_queue AND end_io,
That's fine, I'm happy scrapping the wait queue: people can always use
the kiobuf private data field to refer to a wait queue if they want
to.
> and the lack of
> scatter gather in one kiobuf struct (you always need an array)
Again, _all_ data being sent down through the block device layer is
either in buffer heads or is page aligned. You want us to triple the
size of the "heavyweight" kiobuf's data vector for what gain, exactly?
Obviously, extra code will be needed to scan kiobufs if we do that,
and unless we have both per-page _and_ per-kiobuf start/offset pairs
(adding even further to the complexity), those scatter-gather lists
would prevent us from carving up a kiobuf into smaller sub-ios without
copying the whole (expanded) vector.
That's a _lot_ of extra complexity in the disk IO layers.
I'm all for a fast kiobuf_to_sglist converter. But I haven't seen any
evidence that such scatter-gather lists will do anything in the block
device case except complicate the code and decrease performance.
> S.th. like:
...
> makes it a lot simpler for the subsytems to integrate.
Possibly, but I remain to be convinced, because you may end up with a
mechanism which is generic but is not well-tuned for any specific
case, so everything goes slower.
--Stephen
In article <[email protected]> you wrote:
> Buffer_heads are _sometimes_ used for caching data.
Actually they are mostly used for that, but that shouldn't have any bearing
on the discussion...
> That's one of the
> big problems with them, they are too overloaded, being both IO
> descriptors _and_ cache descriptors.
Agreed.
> If you've got 128k of data to
> write out from user space, do you want to set up one kiobuf or 256
> buffer_heads? Buffer_heads become really very heavy indeed once you
> start doing non-trivial IO.
Sure - I was never arguing in favor of buffer_heads ...
>> > What is so heavyweight in the current kiobuf (other than the embedded
>> > vector, which I've already noted I'm willing to cut)?
>>
>> array_len
> kiobufs can be reused after IO. You can depopulate a kiobuf,
> repopulate it with new pages and submit new IO without having to
> deallocate the kiobuf. You can't do this without knowing how big the
> data vector is. Removing that functionality will prevent reuse,
> making them _more_ heavyweight.
>> io_count,
> Right now we can take a kiobuf and turn it into a bunch of
> buffer_heads for IO. The io_count lets us track all of those sub-IOs
> so that we know when all submitted IO has completed, so that we can
> pass the completion callback back up the chain without having to
> allocate yet more descriptor structs for the IO.
> Again, remove this and the IO becomes more heavyweight because we need
> to create a separate struct for the info.
No. Just allow passing a multiple of the device's blocksize over
ll_rw_block. XFS is doing that, and it just needs an audit of the
lesser-used block drivers.
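i.e. something along these lines, heavily trimmed (buffer state and
locking setup elided, my_end_io is made up); the only point is that
b_size carries a multiple of the device blocksize and the driver has to
honour it:

bh->b_dev     = dev;
bh->b_blocknr = blocknr;		/* in units of b_size */
bh->b_size    = 4 * blocksize;		/* several device blocks in one bh */
bh->b_data    = page_address(page);
bh->b_end_io  = my_end_io;		/* completion handler */

ll_rw_block(READ, 1, &bh);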
>> and the lack of
>> scatter gather in one kiobuf struct (you always need an array)
> Again, _all_ data being sent down through the block device layer is
> either in buffer heads or is page aligned.
That's the point. You are always talking about the block-layer only.
And I think it should be generic instead.
Looks like that is the major point.
> You want us to triple the
> size of the "heavyweight" kiobuf's data vector for what gain, exactly?
double.
> Obviously, extra code will be needed to scan kiobufs if we do that,
> and unless we have both per-page _and_ per-kiobuf start/offset pairs
> (adding even further to the complexity), those scatter-gather lists
> would prevent us from carving up a kiobuf into smaller sub-ios without
> copying the whole (expanded) vector.
No. I think I explained that in my last mail.
> That's a _lot_ of extra complexity in the disk IO layers.
> Possibly, but I remain to be convinced, because you may end up with a
> mechanism which is generic but is not well-tuned for any specific
> case, so everything goes slower.
As kiobufs are widely used just as containers for real IO, this is
better than nothing.
And IMHO a nice generic concept that lets different subsystems work
together is a _lot_ better than a bunch of over-optimized, rather isolated
subsystems. The IO-Lite people have done nice research on the effect of
a unified IO-caching system vs. the typical isolated systems.
Christoph
--
Of course it doesn't work. We've performed a software upgrade.
Hi,
On Thu, Feb 01, 2001 at 09:46:27PM +0100, Christoph Hellwig wrote:
> > Right now we can take a kiobuf and turn it into a bunch of
> > buffer_heads for IO. The io_count lets us track all of those sub-IOs
> > so that we know when all submitted IO has completed, so that we can
> > pass the completion callback back up the chain without having to
> > allocate yet more descriptor structs for the IO.
>
> > Again, remove this and the IO becomes more heavyweight because we need
> > to create a separate struct for the info.
>
> No. Just allow passing the multiple of the devices blocksize over
> ll_rw_block.
That was just one example: you need the sub-ios just as much when
you split up an IO over stripe boundaries in LVM or raid0, for
example. Secondly, ll_rw_block needs to die anyway: you can expand
the blocksize up to PAGE_SIZE but not beyond, whereas something like
ll_rw_kiobuf can submit a much larger IO atomically (and we have
devices which don't start to deliver good throughput until you use
IO sizes of 1MB or more).
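(For scale, with 4k pages: a 1MB IO is 256 pages, i.e. a 256-entry page
vector of about 1k of pointers on a 32-bit box, versus 2048 512-byte
buffer_heads at ~96 bytes each, about 192k of descriptors.)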
> >> and the lack of
> >> scatter gather in one kiobuf struct (you always need an array)
>
> > Again, _all_ data being sent down through the block device layer is
> > either in buffer heads or is page aligned.
>
> That's the point. You are always talking about the block-layer only.
I'm talking about why the minimal, generic solution doesn't provide
what the block layer needs.
> > Obviously, extra code will be needed to scan kiobufs if we do that,
> > and unless we have both per-page _and_ per-kiobuf start/offset pairs
> > (adding even further to the complexity), those scatter-gather lists
> > would prevent us from carving up a kiobuf into smaller sub-ios without
> > copying the whole (expanded) vector.
>
> No. I think I explained that in my last mail.
How?
If I've got a vector (page X, offset 0, length PAGE_SIZE) and I want
to split it in two, I have to make two new vectors (page X, offset 0,
length n) and (page X, offset n, length PAGE_SIZE-n). That implies
copying both vectors.
If I have a page vector with a single offset/length pair, I can build
a new header with the same vector and modified offset/length to split
the vector in two without copying it.
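As a sketch (clone semantics simplified: a real clone would reinitialise
its wait queue and IO counters rather than copying them wholesale):

/* Split the parent's [offset, offset+length) range at byte n without
 * touching the page vector: both halves share parent->maplist. */
static void kiobuf_split(struct kiobuf *parent, int n,
			 struct kiobuf *a, struct kiobuf *b)
{
	*a = *parent;			/* lightweight clones... */
	*b = *parent;

	a->length = n;			/* first n bytes */

	b->offset = parent->offset + n;	/* the remainder */
	b->length = parent->length - n;

	/* With per-entry (page, offset, length) tuples the vector itself
	 * would have to be rewritten at the split point instead. */
}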
> > Possibly, but I remain to be convinced, because you may end up with a
> > mechanism which is generic but is not well-tuned for any specific
> > case, so everything goes slower.
>
> As kiobufs are widely used for real IO, just as containers, this is
> better then nothing.
Surely having all of the subsystems working fast is better still?
> And IMHO a nice generic concepts that lets different subsystems work
> toegther is a _lot_ better then a bunch of over-optimized, rather isolated
> subsytems. The IO-Lite people have done a nice research of the effect of
> an unified IO-Caching system vs. the typical isolated systems.
I know, and IO-Lite has some major problems (the close integration of
that code into the cache, for example, makes it harder to expose the
zero-copy to user-land).
--Stephen
On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> Hi,
>
> On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote:
> > > >
> > > > No, and with the current kiobufs it would not make sense, because they
> > > > are to heavy-weight.
> > >
> > > Really? In what way?
> >
> > We can't allocate a huge kiobuf structure just for requesting one page of
> > IO. It might get better with VM-level IO clustering though.
>
> A kiobuf is *much* smaller than, say, a buffer_head, and we currently
> allocate a buffer_head per block for all IO.
>
> A kiobuf contains enough embedded page vector space for 16 pages by
> default, but I'm happy enough to remove that if people want. However,
> note that that memory is not initialised, so there is no memory access
> cost at all for that empty space. Remove that space and instead of
> one memory allocation per kiobuf, you get two, so the cost goes *UP*
> for small IOs.
>
> > > > With page,length,offsett iobufs this makes sense
> > > > and is IMHO the way to go.
> > >
> > > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > > lean enough to do the job??
> >
> > No. I was speaking abou the light-weight kiobuf Linux & Me discussed on
> > lkml some time ago (though I'd much more like to call it kiovec analogous
> > to BSD iovecs).
>
> What is so heavyweight in the current kiobuf (other than the embedded
> vector, which I've already noted I'm willing to cut)?
Hi,
It'd seem that "array_len", "locked", "bounced", "io_count" and "errno"
are the fields that need to go away (apart from the "maplist").
The field "nr_pages" would reincarnate in the kiovec struct (which is
is not a plain array anymore) as the field "nbufs". See below.
Based on what I've seen fly by on the lists here's my understanding of
the proposed new kiobuf/kiovec structures:
===========================================================================
/*
 * a simple page,offset,length tuple like Linus wants it
 */
struct kiobuf {
	struct page *	page;		/* The page itself */
	u_16		offset;		/* Offset to start of valid data */
	u_16		length;		/* Number of valid bytes of data */
};

struct kiovec {
	int		nbufs;		/* Kiobufs actually referenced */
	struct kiobuf *	bufs;
};

/*
 * the name is just plain stupid, but that shouldn't matter
 */
struct vfs_kiovec {
	struct kiovec *	iov;

	/* private data, mostly for the callback */
	void *		private;

	/* completion callback */
	void		(*end_io) (struct vfs_kiovec *);

	wait_queue_head_t wait_queue;
};
===========================================================================
Is this correct?
If so, I have a few questions/clarifications:
- The [ll_rw_blk, scsi/ide request-functions, scsi/ide
  I/O completion handling] functions would be handed the
  "X_kiovec" struct, correct?

- So, the soft-RAID / LVM layers need to construct their
  own "lvm_kiovec" structs to handle request splits and
  the partial completions, correct?

- Then, what are the semantics of request-merges containing
  the "X_kiovec" structs in the block I/O queueing layers?
  Do we add "X_kiovec->next", "X_kiovec->prev" etc. fields?
  It will also require a re-allocation of a new and longer
  kiovec->bufs array, correct?

- How are I/O error codes to be propagated back to the
  higher (calling) layers? I think that needs to be added
  into the "X_kiovec" struct, no?

- How is bouncing to be handled with this setup? (some state
  is needed to (a) determine that bouncing occurred, (b) find
  out which pages have been bounced and, (c) find out the
  bounce-page for each of these bounced pages).
Cheers,
-Chait.
On Thu, 1 Feb 2001, Christoph Hellwig wrote:
> A kiobuf is 124 bytes, a buffer_head 96. And a buffer_head is additionally
> used for caching data, a kiobuf not.
Go measure the cost of a distant cache miss, then complain about having
everything in one structure. Also, 1 kiobuf maps 16-128 times as much
data as a single buffer head.
> enum kio_flags {
> KIO_LOANED, /* the calling subsystem wants this buf back */
> KIO_GIFTED, /* thanks for the buffer, man! */
> KIO_COW /* copy on write (XXX: not yet) */
> };
This is a Really Bad Idea. Having semantics depend on a subtle flag
determined by a caller is a sure way to
>
>
> struct kio {
> struct kiovec * kio_data; /* our kiovecs */
> int kio_ndata; /* # of kiovecs */
> int kio_flags; /* loaned or giftet? */
> void * kio_priv; /* caller private data */
> wait_queue_head_t kio_wait; /* wait queue */
> };
>
> makes it a lot simpler for the subsytems to integrate.
Keep in mind that using distant memory allocations for kio_data will incur
additional cache misses. The atomic count is probably going to be widely
used; I see it being applicable to the network stack, block io layers and
others. Also, how is information about io completion status passed back
to the caller? That information is required across layers so that io can
be properly aborted or proceed with the partial amount of io. Add those
back in and we're right back to the original kiobuf structure.
-ben
On Thu, Feb 01, 2001 at 09:25:08PM +0000, Stephen C. Tweedie wrote:
> > No. Just allow passing the multiple of the devices blocksize over
> > ll_rw_block.
>
> That was just one example: you need the sub-ios just as much when
> you split up an IO over stripe boundaries in LVM or raid0, for
> example.
IIRC that's why you designed (and I thought of independently) clone-kiobufs.
> Secondly, ll_rw_block needs to die anyway: you can expand
> the blocksize up to PAGE_SIZE but not beyond, whereas something like
> ll_rw_kiobuf can submit a much larger IO atomically (and we have
> devices which don't start to deliver good throughput until you use
> IO sizes of 1MB or more).
Completely agreed.
> If I've got a vector (page X, offset 0, length PAGE_SIZE) and I want
> to split it in two, I have to make two new vectors (page X, offset 0,
> length n) and (page X, offset n, length PAGE_SIZE-n). That implies
> copying both vectors.
>
> If I have a page vector with a single offset/length pair, I can build
> a new header with the same vector and modified offset/length to split
> the vector in two without copying it.
You just say in the higher-level structure: ignore from x to y, even if
the entries have an offset in their own vector.
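One way to read that, purely as a sketch (none of these names exist):

/* The shared kio and its vector stay untouched; the consumer is told to
 * skip the first 'skip' bytes and stop after 'count' bytes, on top of
 * whatever kv_offset/kv_length each entry already carries. */
struct kio_view {
	struct kio	*kio;		/* the shared, unmodified kio */
	size_t		skip;		/* bytes to ignore at the front */
	size_t		count;		/* bytes that are actually valid */
};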
Christoph
--
Of course it doesn't work. We've performed a software upgrade.
On Thu, Feb 01, 2001 at 11:18:56PM -0500, [email protected] wrote:
> On Thu, 1 Feb 2001, Christoph Hellwig wrote:
>
> > A kiobuf is 124 bytes, a buffer_head 96. And a buffer_head is additionally
> > used for caching data, a kiobuf not.
>
> Go measure the cost of a distant cache miss, then complain about having
> everything in one structure. Also, 1 kiobuf maps 16-128 times as much
> data as a single buffer head.
I'd never dispute that. It was just an answer to Stephen's "a kiobuf is
already smaller".
> > enum kio_flags {
> > KIO_LOANED, /* the calling subsystem wants this buf back */
> > KIO_GIFTED, /* thanks for the buffer, man! */
> > KIO_COW /* copy on write (XXX: not yet) */
> > };
>
> This is a Really Bad Idea. Having semantics depend on a subtle flag
> determined by a caller is a sure way to
The semantics aren't different for the using subsystem. LOANED vs GIFTED
is an issue for the free function, COW will probably be a page-level mm
thing - though I haven't thought a lot about it yet and am not sure whether
it actually makes sense.
>
> >
> >
> > struct kio {
> > struct kiovec * kio_data; /* our kiovecs */
> > int kio_ndata; /* # of kiovecs */
> > int kio_flags; /* loaned or giftet? */
> > void * kio_priv; /* caller private data */
> > wait_queue_head_t kio_wait; /* wait queue */
> > };
> >
> > makes it a lot simpler for the subsytems to integrate.
>
> Keep in mind that using distant memory allocations for kio_data will incur
> additional cache misses.
It could also be a [0] array at the end, allowing for a single allocation,
but that looks more like an implementation detail than a design problem to me.
> The atomic count is probably going to be widely
> used; I see it being applicable to the network stack, block io layers and
> others.
Hmm. Currently it is used only for the multiple buffer_heads per iobuf
cruft, and I don't see why multiple outstanding IOs should be noted in a
kiobuf.
> Also, how is information about io completion status passed back
> to the caller?
Yes, there needs to be a kio_errno field - though I wanted to get rid of
it, I had to re-add it in later versions of my design.
Christoph
--
Of course it doesn't work. We've performed a software upgrade.
Hi,
On Fri, Feb 02, 2001 at 12:51:35PM +0100, Christoph Hellwig wrote:
> >
> > If I have a page vector with a single offset/length pair, I can build
> > a new header with the same vector and modified offset/length to split
> > the vector in two without copying it.
>
> You just say in the higher-level structure ignore from x to y even if
> they have an offset in their own vector.
Exactly --- and so you end up with something _much_ uglier, because
you end up with all sorts of combinations of length/offset fields all
over the place.
This is _precisely_ the mess I want to avoid.
Cheers,
Stephen