2001-02-01 17:03:41

by Christoph Hellwig

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

On Thu, Feb 01, 2001 at 04:49:58PM +0000, Stephen C. Tweedie wrote:
> > Enquiring minds would like to know if you are working towards this
> > revamp of the kiobuf structure at the moment, you have been very quiet
> > recently.
>
> I'm in the middle of some parts of it, and am actively soliciting
> feedback on what cleanups are required.

The real issue is that Linus dislikes the current kiobuf scheme.
I do not like everything he proposes, but lots of things make sense.

> I've been merging all of the 2.2 fixes into a 2.4 kiobuf tree, and
> have started doing some of the cleanups needed --- removing the
> embedded page vector, and adding support for lightweight stacking of
> kiobufs for completion callback chains.

Ok, great.

> However, filesystem IO is almost *always* page aligned: O_DIRECT IO
> comes from VM pages, and internal filesystem IO comes from page cache
> pages. Buffer cache IOs are the only exception, and kiobufs only fail
> for such IOs once you have multiple buffer_heads being merged into
> single requests.
>
> So, what are the benefits in the disk IO stack of adding length/offset
> pairs to each page of the kiobuf?

I don't see any real advantage for disk IO. The real advantage is that
we can have a generic structure that is also useful in e.g. networking
and can lead to a unified IO buffering scheme (a little like IO-Lite).

Christoph

--
Of course it doesn't work. We've performed a software upgrade.


2001-02-01 17:35:04

by Alan

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

> > I'm in the middle of some parts of it, and am actively soliciting
> > feedback on what cleanups are required.
>
> The real issue is that Linus dislikes the current kiobuf scheme.
> I do not like everything he proposes, but lots of things make sense.

Linus basically designed the original kiobuf scheme of course so I guess
he's allowed to dislike it. Linus disliking something however doesn't mean
it's wrong. It's not a technically valid basis for argument.

Linus's list of reasons, like the amount of state, is more interesting

> > So, what are the benefits in the disk IO stack of adding length/offset
> > pairs to each page of the kiobuf?
>
> I don't see any real advantage for disk IO. The real advantage is that
> we can have a generic structure that is also useful in e.g. networking
> and can lead to a unified IO buffering scheme (a little like IO-Lite).

Networking wants something lighter rather than heavier. Adding tons of
base/limit pairs to kiobufs makes it worse not better

2001-02-01 17:51:34

by Christoph Hellwig

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > > I'm in the middle of some parts of it, and am actively soliciting
> > > feedback on what cleanups are required.
> >
> > The real issue is that Linus dislikes the current kiobuf scheme.
> > I do not like everything he proposes, but lots of things make sense.
>
> Linus basically designed the original kiobuf scheme of course so I guess
> he's allowed to dislike it. Linus disliking something however doesn't mean
> it's wrong. It's not a technically valid basis for argument.

Sure. But Linus saying that he doesn't want more of that (shit, crap,
I don't remember what he said exactly) in the kernel is a very good reason
for thinking a little more about it.

Especially if most arguments look right to one after thinking more about
it...

> Linus's list of reasons, like the amount of state, is more interesting

True. The argument that they are too heavyweight also.
That they should allow scatter-gather without an array of structs also.


> > > So, what are the benefits in the disk IO stack of adding length/offset
> > > pairs to each page of the kiobuf?
> >
> > I don't see any real advantage for disk IO. The real advantage is that
> > we can have a generic structure that is also useful in e.g. networking
> > and can lead to a unified IO buffering scheme (a little like IO-Lite).
>
> Networking wants something lighter rather than heavier.

Right. That's what the new design was about, besides adding an offset and
length to every page instead of the page array, something also wanted by
the networking in the first place.
Look at the skb_frag struct in the zero-copy patch for what networking
thinks it needs for physical page based buffers.
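
(From memory, the fragment descriptor in that patch is roughly the
following; treat the exact field names as approximate and check the
patch itself:)

struct skb_frag_struct {
        struct page *page;      /* physical page holding the data */
        __u16 page_offset;      /* where the fragment starts in the page */
        __u16 size;             /* length of the fragment in bytes */
};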

> Adding tons of base/limit pairs to kiobufs makes it worse not better

From looking at the networking code and listening to Dave and Ingo it looks
like it makes the thing better for networking, although I cannot verify
this due to the lack of familiarity with the networking code.

For disk I/O it makes the handling a little easier at the cost of the
additional offset/length fields.

Christoph

P.S. The tuple thing is also what Larry had in his initial slice paper.
--
Of course it doesn't work. We've performed a software upgrade.

2001-02-01 17:53:14

by Stephen C. Tweedie

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

Hi,

On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> >
> > I don't see any real advantage for disk IO. The real advantage is that
> > we can have a generic structure that is also useful in e.g. networking
> > and can lead to a unified IO buffering scheme (a little like IO-Lite).
>
> Networking wants something lighter rather than heavier. Adding tons of
> base/limit pairs to kiobufs makes it worse not better

Networking has fundamentally different requirements. In a network
stack, you want the ability to add fragments to unaligned chunks of
data to represent headers at any point in the stack.

In the disk IO case, you basically don't get that (the only thing
which comes close is raid5 parity blocks). The data which the user
started with is the data sent out on the wire. You do get some
interesting cases such as soft raid and LVM, or even in the scsi stack
if you run out of mailbox space, where you need to send only a
sub-chunk of the input buffer.

In that case, having offset/len as the kiobuf limit markers is ideal:
you can clone a kiobuf header using the same page vector as the
parent, narrow down the start/end points, and continue down the stack
without having to copy any part of the page list. If you had the
offset/len data encoded implicitly into each entry in the sglist, you
would not be able to do that.
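
As a rough sketch of what I mean (a hypothetical helper, nothing that
exists in the tree today; it assumes the reworked kiobuf keeps its
offset/length fields plus the shared maplist page vector):

/* Sketch only: clone a kiobuf header, share the parent's page vector,
 * and narrow the clone to a sub-range for the next layer down. */
struct kiobuf *kiobuf_clone_narrow(struct kiobuf *parent,
                                   int offset, int length)
{
        struct kiobuf *clone = kmalloc(sizeof(*clone), GFP_KERNEL);

        if (!clone)
                return NULL;
        *clone = *parent;       /* header copy: the maplist pointer is
                                 * shared, the page list itself is not
                                 * duplicated */
        clone->offset += offset;        /* narrow the start point... */
        clone->length = length;         /* ...and the end point */
        return clone;
}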

--Stephen

2001-02-01 17:57:54

by Alan

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

> > Linus basically designed the original kiobuf scheme of course so I guess
> > he's allowed to dislike it. Linus disliking something however doesn't mean
> > it's wrong. It's not a technically valid basis for argument.
>
> Sure. But Linus saying that he doesn't want more of that (shit, crap,
> I don't remember what he said exactly) in the kernel is a very good reason
> for thinking a little more about it.

No. Linus is not a God, Linus is fallible, regularly makes mistakes and
frequently opens his mouth and says stupid things when he is far too busy.

> Especially if most arguments look right to one after thinking more about
> it...

I agree with the issues about networking wanting lightweight objects; I'm
unconvinced, however, that the existing setup for networking is sanely
applicable for real-world applications in other spaces.

Take video capture. I want to stream 60Mbytes/second in multi-megabyte
chunks between my capture cards and a high end raid array. The array wants
1Mbyte or larger blocks per I/O to reach 60Mbytes/second performance.

This btw isn't benchmark crap like most of the zero-copy networking, this is
a real-world application.

The current buffer head stuff is already heavier than the kio stuff. The
networking stuff isn't oriented to that kind of I/O and would end up
needing to do tons of extra processing.

> For disk I/O it makes the handling a little easier at the cost of the
> additional offset/length fields.

I remain to be convinced by that. However you do get 64bytes/cacheline on
a real processor nowadays, so if you touch any of that 64-byte block,
filling in the rest is practically zero cost.

Alan

2001-02-01 18:34:11

by Rik van Riel

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

On Thu, 1 Feb 2001, Alan Cox wrote:

> > Sure. But Linus saying that he doesn't want more of that (shit, crap,
> > I don't remember what he said exactly) in the kernel is a very good reason
> > for thinking a little more about it.
>
> No. Linus is not a God, Linus is fallible, regularly makes mistakes and
> frequently opens his mouth and says stupid things when he is far too busy.

People may remember Linus saying a resolute no to SMP
support in Linux ;)

In my experience, when Linus says "NO" to a certain
idea, he's usually objecting to bad design decisions
in the proposed implementation of the idea and the
lack of a nice alternative solution ...

... but as soon as a clean, efficient and maintainable
alternative to the original bad idea surfaces, it seems
to be quite easy to convince Linus to include it.

cheers,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/

2001-02-01 18:53:31

by Benjamin LaHaise

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

On Thu, 1 Feb 2001, Alan Cox wrote:

> Linus's list of reasons, like the amount of state, is more interesting

The state is required, not optional, if we are to have a decent basis for
building asynchronous IO into the kernel.

> Networking wants something lighter rather than heavier. Adding tons of
> base/limit pairs to kiobufs makes it worse not better

I'm still not seeing what I consider valid arguments from the networking
people regarding the use of kiobufs as the interface they present to the
VFS for asynchronous/bulk IO. I agree with their need for a lightweight
mechanism for getting small IO requests from userland, and even the need
for using lightweight scatter-gather lists within the network layer
itself. If the statement is that map_user_kiobuf is too heavy for use on
every single IO, sure. But that is a separate issue.

-ben


2001-02-01 19:00:31

by Victor Yodaiken

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

On Thu, Feb 01, 2001 at 04:32:48PM -0200, Rik van Riel wrote:
> On Thu, 1 Feb 2001, Alan Cox wrote:
>
> > > Sure. But Linus saying that he doesn't want more of that (shit, crap,
> > > I don't remember what he said exactly) in the kernel is a very good reason
> > > for thinking a little more about it.
> >
> > No. Linus is not a God, Linus is fallible, regularly makes mistakes and
> > frequently opens his mouth and says stupid things when he is far too busy.
>
> People may remember Linus saying a resolute no to SMP
> support in Linux ;)

And perhaps he was right!

--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com

2001-02-01 19:37:10

by Stephen C. Tweedie

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

Hi,

On Thu, Feb 01, 2001 at 06:49:50PM +0100, Christoph Hellwig wrote:
>
> > Adding tons of base/limit pairs to kiobufs makes it worse not better
>
> For disk I/O it makes the handling a little easier at the cost of the
> additional offset/length fields.

Umm, actually, no, it makes it much worse for many of the cases.

--Stephen

2001-02-01 20:12:58

by Chaitanya Tumuluri

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> Hi,
>
> On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > >
> > > I don't see any real advantage for disk IO. The real advantage is that
> > > we can have a generic structure that is also useful in e.g. networking
> > > and can lead to a unified IO buffering scheme (a little like IO-Lite).
> >
> > Networking wants something lighter rather than heavier. Adding tons of
> > base/limit pairs to kiobufs makes it worse not better
>
> Networking has fundamentally different requirements. In a network
> stack, you want the ability to add fragments to unaligned chunks of
> data to represent headers at any point in the stack.
>
> In the disk IO case, you basically don't get that (the only thing
> which comes close is raid5 parity blocks). The data which the user
> started with is the data sent out on the wire. You do get some
> interesting cases such as soft raid and LVM, or even in the scsi stack
> if you run out of mailbox space, where you need to send only a
> sub-chunk of the input buffer.

Or the case of BSD-style UIO implementing the readv() and writev() calls.
This may or may not align perfectly, so address-length lists per page could
be helpful.

I did try an implementation of this for rawio and found that I had to
restrict the a-len lists coming in via the user iovecs to be aligned.

> In that case, having offset/len as the kiobuf limit markers is ideal:
> you can clone a kiobuf header using the same page vector as the
> parent, narrow down the start/end points, and continue down the stack
> without having to copy any part of the page list. If you had the
> offset/len data encoded implicitly into each entry in the sglist, you
> would not be able to do that.

This would solve the issue with UIO, yes. Also, I think Martin Petersen
(mkp) had taken a stab at doing "clone-kiobufs" for LVM at some point.

Martin?

Cheers,
-Chait.

2001-02-01 20:34:37

by Christoph Hellwig

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

In article <[email protected]> you wrote:
> Hi,

> On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> In the disk IO case, you basically don't get that (the only thing
> which comes close is raid5 parity blocks). The data which the user
> started with is the data sent out on the wire. You do get some
> interesting cases such as soft raid and LVM, or even in the scsi stack
> if you run out of mailbox space, where you need to send only a
> sub-chunk of the input buffer.

Though your description is right, I don't think the case is very common:
Sometimes in LVM on a pv boundary and maybe sometimes in the scsi code.

In raid1 you need some kind of clone iobuf, which should work with both
cases. In raid0 you need a completely new pagelist anyway, same for raid5.


> In that case, having offset/len as the kiobuf limit markers is ideal:
> you can clone a kiobuf header using the same page vector as the
> parent, narrow down the start/end points, and continue down the stack
> without having to copy any part of the page list. If you had the
> offset/len data encoded implicitly into each entry in the sglist, you
> would not be able to do that.

Sure you could: you embed that information in a higher-level structure.
I think you want the whole kio concept only for disk-like IO. Then many
of the things you do are completely right and I don't see many problems
(besides thinking that some things may go away - but that's no major point).

With a generic object that is used across subsystem boundaries, things are
different.

Christoph

--
Of course it doesn't work. We've performed a software upgrade.

2001-02-01 21:47:24

by Stephen C. Tweedie

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:
>
> > On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > In the disk IO case, you basically don't get that (the only thing
> > which comes close is raid5 parity blocks). The data which the user
> > started with is the data sent out on the wire. You do get some
> > interesting cases such as soft raid and LVM, or even in the scsi stack
> > if you run out of mailbox space, where you need to send only a
> > sub-chunk of the input buffer.
>
> Though your description is right, I don't think the case is very common:
> Sometimes in LVM on a pv boundary and maybe sometimes in the scsi code.

On raid0 stripes, it's common to have stripes of between 16k and 64k,
so it's rather more common there than you'd like. In any case, you
need the code to handle it, and I don't want to make the code paths
any more complex than necessary.

> In raid1 you need some kind of clone iobuf, which should work with both
> cases. In raid0 you need a completely new pagelist anyway

No you don't. You take the existing one, specify which region of it
is going to the current stripe, and send it off. Nothing more.

> > In that case, having offset/len as the kiobuf limit markers is ideal:
> > you can clone a kiobuf header using the same page vector as the
> > parent, narrow down the start/end points, and continue down the stack
> > without having to copy any part of the page list. If you had the
> > offset/len data encoded implicitly into each entry in the sglist, you
> > would not be able to do that.
>
> Sure you could: you embed that information in a higher-level structure.

What's the point in a common data container structure if you need
higher-level information to make any sense out of it?

--Stephen

2001-02-01 22:09:51

by Stephen C. Tweedie

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:

> I think you want the whole kio concept only for disk-like IO.

No. I want something good for zero-copy IO in general, but a lot of
that concerns the problem of interacting with the user, and the basic
center of that interaction in 99% of the interesting cases is either a
user VM buffer or the page cache --- all of which are page-aligned.

If you look at the sorts of models being proposed (even by Linus) for
splice, you get

len = prepare_read();
prepare_write();
pull_fd();
commit_write();

in which the read is being pulled into a known location in the page
cache -- it's page-aligned, again. I'm perfectly willing to accept
that there may be a need for scatter-gather boundaries including
non-page-aligned fragments in this model, but I can't see one if
you're using the page cache as a mediator, nor if you're doing it
through a user mmapped buffer.

The only reason you need finer scatter-gather boundaries --- and it
may be a compelling reason --- is if you are merging multiple IOs
together into a single device-level IO. That makes perfect sense for
the zerocopy tcp case where you're doing MSG_MORE-type coalescing. It
doesn't help the existing SGI kiobuf block device code, because that
performs its merging in the filesystem layers and the block device
code just squirts the IOs to the wire as-is, but if we want to start
merging those kiobuf-based IOs within make_request() then the block
device layer may want it too.

And Linus is right, the old way of using a *kiobuf[] for that was
painful, but the solution of adding start/length to every entry in
the page vector just doesn't sit right with many components of the
block device environment either.

I may still be persuaded that we need the full scatter-gather list
fields throughout, but for now I tend to think that, at least in the
disk layers, we may get cleaner results by allowing linked lists of
page-aligned kiobufs instead. That allows for merging of kiobufs
without having to copy all of the vector information each time.
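
Purely as a sketch of the shape I have in mind (the struct and its link
field are invented for illustration; the point is only that merging
becomes pointer manipulation instead of vector copying):

/* Sketch: chain page-aligned kiobuf-like headers into one request
 * without touching their per-kiobuf page vectors. */
struct kiobuf_sketch {
        struct page **maplist;                  /* per-kiobuf page vector */
        int offset, length;
        void (*end_io)(struct kiobuf_sketch *); /* per-kiobuf completion */
        struct kiobuf_sketch *next;             /* proposed chaining field */
};

static void kiobuf_chain_append(struct kiobuf_sketch **headp,
                                struct kiobuf_sketch *iobuf)
{
        while (*headp)                  /* walk to the end of the chain */
                headp = &(*headp)->next;
        iobuf->next = NULL;
        *headp = iobuf;                 /* no vector information copied */
}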

The killer, however, is what happens if you want to split such a
merged kiobuf. Right now, that's something that I can only imagine
happening in the block layers if we start encoding buffer_head chains
as kiobufs, but if we do that in the future, or if we start merging
genuine kiobuf requests, then doing that split later on (for
raid0 etc) may require duplicating whole chains of kiobufs. At that
point, just doing scatter-gather lists is cleaner.

But for now, the way to picture what I'm trying to achieve is that
kiobufs are a bit like buffer_heads --- they represent the physical
pages of some VM object that a higher layer has constructed, such as
the page cache or a user VM buffer. You can chain these objects
together for IO, but that doesn't stop the individual objects from
being separate entities with independent IO completion callbacks to be
honoured.

Cheers,
Stephen

2001-02-02 12:03:26

by Christoph Hellwig

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

On Thu, Feb 01, 2001 at 10:07:44PM +0000, Stephen C. Tweedie wrote:
> No. I want something good for zero-copy IO in general, but a lot of
> that concerns the problem of interacting with the user, and the basic
> center of that interaction in 99% of the interesting cases is either a
> user VM buffer or the page cache --- all of which are page-aligned.

Yes.

> If you look at the sorts of models being proposed (even by Linus) for
> splice, you get
>
> len = prepare_read();
> prepare_write();
> pull_fd();
> commit_write();

Yepp.

> in which the read is being pulled into a known location in the page
> cache -- it's page-aligned, again. I'm perfectly willing to accept
> that there may be a need for scatter-gather boundaries including
> non-page-aligned fragments in this model, but I can't see one if
> you're using the page cache as a mediator, nor if you're doing it
> through a user mmapped buffer.

True.

> The only reason you need finer scatter-gather boundaries --- and it
> may be a compelling reason --- is if you are merging multiple IOs
> together into a single device-level IO. That makes perfect sense for
> the zerocopy tcp case where you're doing MSG_MORE-type coalescing. It
> doesn't help the existing SGI kiobuf block device code, because that
> performs its merging in the filesystem layers and the block device
> code just squirts the IOs to the wire as-is,

Yes - but that is no solution for a generic model. AFAICS even XFS
falls back to buffer_heads for small requests.

> but if we want to start
> merging those kiobuf-based IOs within make_request() then the block
> device layer may want it too.

Yes.

> And Linus is right, the old way of using a *kiobuf[] for that was
> painful, but the solution of adding start/length to every entry in
> the page vector just doesn't sit right with many components of the
> block device environment either.

What do you think is the alternative?

> I may still be persuaded that we need the full scatter-gather list
> fields throughout, but for now I tend to think that, at least in the
> disk layers, we may get cleaner results by allowing linked lists of
> page-aligned kiobufs instead. That allows for merging of kiobufs
> without having to copy all of the vector information each time.

But it will have the same problems as the array solution: there will
be one complete kio structure for each kiobuf, with its own end_io
callback, etc.

Christoph

--
Of course it doesn't work. We've performed a software upgrade.

2001-02-03 20:29:55

by Linus Torvalds

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains



On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
>
> On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:
>
> > I think you want the whole kio concept only for disk-like IO.
>
> No. I want something good for zero-copy IO in general, but a lot of
> that concerns the problem of interacting with the user, and the basic
> center of that interaction in 99% of the interesting cases is either a
> user VM buffer or the page cache --- all of which are page-aligned.
>
> If you look at the sorts of models being proposed (even by Linus) for
> splice, you get
>
> len = prepare_read();
> prepare_write();
> pull_fd();
> commit_write();
>
> in which the read is being pulled into a known location in the page
> cache -- it's page-aligned, again.

Wrong.

Neither the read nor the write are page-aligned. I don't know where you
got that idea. It's obviously not true even in the common case: it depends
_entirely_ on what the file offsets are, and expecting the offset to be
zero is just being stupid. It's often _not_ zero. With networking it is in
fact seldom zero, because the network packets are seldom aligned either in
size or in location.

Also, there are many reasons why "page" may have different meaning. We
will eventually have a page-cache where the pagecache granularity is not
the same as the user-level visible one. User-level may do mmap at 4kB
boundaries, even if the page cache itself uses 8kB or 16kB pages.

THERE IS NO PAGE-ALIGNMENT. And anything that even _mentions_ the word
page-aligned is going into my trash-can faster than you can say "bug".

Linus

2001-02-05 11:07:29

by Stephen C. Tweedie

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

Hi,

On Sat, Feb 03, 2001 at 12:28:47PM -0800, Linus Torvalds wrote:
>
> On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> >
> Neither the read nor the write are page-aligned. I don't know where you
> got that idea. It's obviously not true even in the common case: it depends
> _entirely_ on what the file offsets are, and expecting the offset to be
> zero is just being stupid. It's often _not_ zero. With networking it is in
> fact seldom zero, because the network packets are seldom aligned either in
> size or in location.

The underlying buffer is. The VFS (and the current kiobuf code) is
already happy about IO happening at odd offsets within a page.
However, the more general case --- doing zero-copy IO on arbitrary
unaligned buffers --- simply won't work if you expect to be able to
push those buffers to disk without a copy.

The splice case you talked about is fine because it's doing the normal
prepare/commit logic where the underlying buffer is page aligned, even
if the splice IO is not to a page aligned location. That's _exactly_
what kiobufs were intended to support. The prepare_read/prepare_write/
pull/push cycle lets the caller tell the pull() function where to
store its data, because there are alignment constraints which just
can't be ignored: you simply cannot do physical disk IO on
non-sector-aligned memory or in chunks which aren't a multiple of
sector size. (The buffer address alignment can sometimes be relaxed
--- obviously if you're doing PIO then it doesn't matter --- but the
length granularity is rigidly enforced.)
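
In code terms, the per-fragment constraint every disk driver effectively
imposes is something like this (a sketch; 512 stands in for the device's
real hardware sector size):

/* Sketch: the length granularity is rigid; the address constraint can
 * sometimes be relaxed (e.g. for PIO), but is safest to assume. */
static int disk_fragment_ok(unsigned long addr, unsigned int len)
{
        if (len == 0 || (len & 511))
                return 0;       /* not a whole number of sectors */
        if (addr & 511)
                return 0;       /* many controllers insist on this too */
        return 1;
}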

Cheers,
Stephen

2001-02-05 12:01:43

by Manfred Spraul

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

"Stephen C. Tweedie" wrote:
>
> You simply cannot do physical disk IO on
> non-sector-aligned memory or in chunks which aren't a multiple of
> sector size.

Why not?

Obviously the disk access itself must be sector aligned and the total
length must be a multiple of the sector length, but there shouldn't be
any restrictions on the data buffers.

I remember that even Windoze 95 has scatter-gather support for physical
disk IO with arbitrary buffer chunks. (If the hardware supports it;
otherwise the IO subsystem will copy the data into a contiguous
temporary buffer)

--
Manfred

2001-02-05 12:22:03

by Stephen C. Tweedie

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

Hi,

On Fri, Feb 02, 2001 at 01:02:28PM +0100, Christoph Hellwig wrote:
>
> > I may still be persuaded that we need the full scatter-gather list
> > fields throughout, but for now I tend to think that, at least in the
> > disk layers, we may get cleaner results by allowing linked lists of
> > page-aligned kiobufs instead. That allows for merging of kiobufs
> > without having to copy all of the vector information each time.
>
> But it will have the same problems as the array solution: there will
> be one complete kio structure for each kiobuf, with its own end_io
> callback, etc.

And what's the problem with that?

You *need* this. You have to have that multiple-completion concept in
the disk layers. Think about chains of buffer_heads being sent to
disk as a single IO --- you need to know which buffers make it to disk
successfully and which had IO errors.

And no, the IO success is *not* necessarily sequential from the start
of the IO: if you are doing IO to raid0, for example, and the IO gets
striped across two disks, you might find that the first disk gets an
error so the start of the IO fails but the rest completes. It's the
completion code which notifies the caller of what worked and what did
not.

And for readahead, you want to notify the caller as early as possible
about completion for the first part of the IO, even if the device
driver is still processing the rest.

Multiple completions are a necessary feature of the current block
device interface. Removing that would be a step backwards.

Cheers,
Stephen

2001-02-05 15:07:40

by Stephen C. Tweedie

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

Hi,

On Mon, Feb 05, 2001 at 01:00:51PM +0100, Manfred Spraul wrote:
> "Stephen C. Tweedie" wrote:
> >
> > You simply cannot do physical disk IO on
> > non-sector-aligned memory or in chunks which aren't a multiple of
> > sector size.
>
> Why not?
>
> Obviously the disk access itself must be sector aligned and the total
> length must be a multiple of the sector length, but there shouldn't be
> any restrictions on the data buffers.

But there are. Many controllers just break down and corrupt things
silently if you don't align the data buffers (Jeff Merkey found this
by accident when he started generating unaligned IOs within page
boundaries in his NWFS code). And a lot of controllers simply cannot
break a sector dma over a page boundary (at least not without some
form of IOMMU remapping).

Yes, it's the sort of thing that you would hope should work, but in
practice it's not reliable.

Cheers,
Stephen

2001-02-05 15:19:53

by Alan

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

> Yes, it's the sort of thing that you would hope should work, but in
> practice it's not reliable.

So the less smart devices need to call something like

kiovec_align(kiovec, 512);

and have it do the bounce buffers ?


2001-02-05 16:37:04

by Linus Torvalds

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains



On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
>
> On Sat, Feb 03, 2001 at 12:28:47PM -0800, Linus Torvalds wrote:
> >
> > Neither the read nor the write are page-aligned. I don't know where you
> > got that idea. It's obviously not true even in the common case: it depends
> > _entirely_ on what the file offsets are, and expecting the offset to be
> > zero is just being stupid. It's often _not_ zero. With networking it is in
> > fact seldom zero, because the network packets are seldom aligned either in
> > size or in location.
>
> The underlying buffer is. The VFS (and the current kiobuf code) is
> already happy about IO happening at odd offsets within a page.

Stephen.

Don't bother even talking about this. You're so damn hung up about the
page cache that it's not funny.

Have you ever thought about other things, like networking, special
devices, stuff like that? They can (and do) have packet boundaries that
have nothing to do with pages what-so-ever. They can have such notions as
packets that contain multiple streams in one packet, where it ends up
being split up into several pieces. Where neither the original packet
_nor_ the final pieces have _anything_ to do with "pages".

THERE IS NO PAGE ALIGNMENT.

So stop blathering about it.

Of _course_ the current kiobuf code has page-alignment assumptions. You
_designed_ it that way. So bringing it up as an example is a circular
argument. And a really stupid one at that, as that's the thing I've been
quoting as the single biggest design bug in all of kiobufs. It's the thing
that makes them entirely useless for things like describing "struct
msghdr" etc.

We should get _away_ from this page-alignment fallacy. It's not true. It's
not necessarily even true for the page cache - which has no real
fundamental reasons any more for not being able to be a "variable-size"
cache some time in the future (ie it might be a per-address-space decision
on whether the granularity is 1, 2, 4 or more pages).

Anything that designs for "everything is a page" will automatically be
limited for cases where you might sometimes have 64kB chunks of data.

Instead, just face the realization that "everything is a bunch of ranges",
and leave it at that. It's true _already_ - think about fragmented IP
packets. We may not handle it that way completely yet, but the zero-copy
networking is going in this direction.

And as long as you keep on harping about page alignment, you're not going
to play in this game. End of story.

Linus

2001-02-05 16:57:26

by Linus Torvalds

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains



On Mon, 5 Feb 2001, Manfred Spraul wrote:
> "Stephen C. Tweedie" wrote:
> >
> > You simply cannot do physical disk IO on
> > non-sector-aligned memory or in chunks which aren't a multiple of
> > sector size.
>
> Why not?
>
> Obviously the disk access itself must be sector aligned and the total
> length must be a multiple of the sector length, but there shouldn't be
> any restrictions on the data buffers.

In fact, regular IDE DMA allows arbitrary scatter-gather at least in
theory. Linux has never used it, so I don't know how well it works in
practice - I would not be surprised if it ends up causing no end of nasty
corner-cases that have bugs. It's not as if IDE controllers always follow
the documentation ;)

The _total_ length of the buffers has to be a multiple of the sector
size, and there are some alignment issues (each scatter-gather area has to
be at least 16-bit aligned both in physical memory and in length, and
apparently many controllers need 32-bit alignment). And I'd almost be
surprised if there wouldn't be hardware that wanted cache alignment
because they always expect to burst.

But despite a lot of likely practical reasons why it won't work for
arbitrary sg lists on plain IDE DMA, there is no _theoretical_ reason it
wouldn't. And there are bound to be better controllers that could handle
it.

Linus

2001-02-05 17:25:18

by Stephen C. Tweedie

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

Hi,

On Mon, Feb 05, 2001 at 03:19:09PM +0000, Alan Cox wrote:
> > Yes, it's the sort of thing that you would hope should work, but in
> > practice it's not reliable.
>
> So the less smart devices need to call something like
>
> kiovec_align(kiovec, 512);
>
> and have it do the bounce buffers ?

_All_ drivers would have to do that in the degenerate case, because
none of our drivers can deal with a dma boundary in the middle of a
sector, and even in those places where the hardware supports it in
theory, you are still often limited to word-alignment.

--Stephen

2001-02-05 17:28:29

by Alan

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

> In fact, regular IDE DMA allows arbitrary scatter-gather at least in
> theory. Linux has never used it, so I don't know how well it works in

Purely in theory, as Jeff found out.

> But despite a lot of likely practical reasons why it won't work for
> arbitrary sg lists on plain IDE DMA, there is no _theoretical_ reason it
> wouldn't. And there are bound to be better controllers that could handle
> it.

I2O controllers are required to handle it (most don't) and some of the high
end scsi/fc controllers even get it right


2001-02-05 17:30:59

by Alan

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

> > kiovec_align(kiovec, 512);
> > and have it do the bounce buffers ?
>
> _All_ drivers would have to do that in the degenerate case, because
> none of our drivers can deal with a dma boundary in the middle of a
> sector, and even in those places where the hardware supports it in
> theory, you are still often limited to word-alignment.

That's true for _block_ disk devices, but if we want a generic kiovec then
if I am going from video capture to network I don't need to force anything more
than 4-byte alignment

2001-02-05 18:51:28

by Stephen C. Tweedie

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

Hi,

On Mon, Feb 05, 2001 at 05:29:47PM +0000, Alan Cox wrote:
> >
> > _All_ drivers would have to do that in the degenerate case, because
> > none of our drivers can deal with a dma boundary in the middle of a
> > sector, and even in those places where the hardware supports it in
> > theory, you are still often limited to word-alignment.
>
> That's true for _block_ disk devices, but if we want a generic kiovec then
> if I am going from video capture to network I don't need to force anything more
> than 4-byte alignment

Kiobufs have never, ever required the IO to be aligned on any
particular boundary. They simply make the assumption that the
underlying buffered object can be described in terms of pages with
some arbitrary (non-aligned) start/offset. Every video framebuffer
I've ever seen satisfies that, so you can easily map an arbitrary
contiguous region of the framebuffer with a kiobuf already.

--Stephen

2001-02-05 19:05:58

by Alan

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

> Kiobufs have never, ever required the IO to be aligned on any
> particular boundary. They simply make the assumption that the
> underlying buffered object can be described in terms of pages with
> some arbitrary (non-aligned) start/offset. Every video framebuffer

start/length per page ?

> I've ever seen satisfies that, so you can easily map an arbitrary
> contiguous region of the framebuffer with a kiobuf already.

Video is non-contiguous ranges. In fact if you are blitting to a card with
tiled memory it gets very interesting in its video lists

2001-02-05 19:10:28

by Linus Torvalds

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains



On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
> > That's true for _block_ disk devices, but if we want a generic kiovec then
> > if I am going from video capture to network I don't need to force anything more
> > than 4-byte alignment
>
> Kiobufs have never, ever required the IO to be aligned on any
> particular boundary. They simply make the assumption that the
> underlying buffered object can be described in terms of pages with
> some arbitrary (non-aligned) start/offset. Every video framebuffer
> I've ever seen satisfies that, so you can easily map an arbitrary
> contiguous region of the framebuffer with a kiobuf already.

Stop this idiocy, Stephen. You're _this_ close to being the first person I
ever blacklist from my mailbox.

Network. Packets. Fragmentation. Or just non-page-sized MTU's.

It is _not_ a "series of contiguous pages". Never has been. Never will be.
So stop making excuses.

Also, think of protocols that may want to gather stuff from multiple
places, where the boundaries have little to do with pages but are
specified some other way. Imagine doing "writev()" style operations to
disk, gathering stuff from multiple sources into one operation.

Think of GART remappings - you can have multiple pages that show up as one
"linear" chunk to the graphics device behind the AGP bridge, but that are
_not_ contiguous in real memory.

There just is NO excuse for the "linear series of pages" view. And if you
cannot realize that, then I don't know what's wrong with you. Your
arguments are obviously crap, and the stuff you seem unable to argue
against (like networking) you decide to just ignore. Get your act
together.

Linus

2001-02-05 19:12:08

by Stephen C. Tweedie

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

Hi,

On Mon, Feb 05, 2001 at 08:36:31AM -0800, Linus Torvalds wrote:

> Have you ever thought about other things, like networking, special
> devices, stuff like that? They can (and do) have packet boundaries that
> have nothing to do with pages what-so-ever. They can have such notions as
> packets that contain multiple streams in one packet, where it ends up
> being split up into several pieces. Where neither the original packet
> _nor_ the final pieces have _anything_ to do with "pages".
>
> THERE IS NO PAGE ALIGNMENT.

And kiobufs don't require IO to be page aligned, and they have never
done. The only page alignment they assume is that if a *single*
scatter-gather element spans multiple pages, then the joins between
those pages occur on page boundaries.

Remember, a kiobuf is only designed to represent one scatter-gather
fragment, not a full sg list. That was the whole reason for having a
kiovec as a separate concept: if you have more than one independent
fragment in the sg-list, you need more than one kiobuf.
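
(For reference, the current kiobuf looks roughly like this --- abridged
and quoted from memory, so check include/linux/iobuf.h for the
authoritative layout:)

struct kiobuf {
        int             nr_pages;       /* pages actually referenced */
        int             array_len;      /* space in the page list */
        int             offset;         /* start of valid data in page 0 */
        int             length;         /* number of valid bytes */
        struct page   **maplist;        /* the page vector itself */

        unsigned int    locked : 1;     /* pages locked for IO? */

        /* dynamic state for IO completion */
        atomic_t        io_count;       /* IOs still in flight */
        int             errno;          /* status of completed IO */
        void          (*end_io)(struct kiobuf *);
        wait_queue_head_t wait_queue;
};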

And the reason why we created sg fragments which can span pages was so
that we can encode IOs which interact with the VM: any arbitrary
virtually-contiguous user data buffer can be mapped into a *single*
kiobuf for a write() call, so it's a generic way of supporting things
like O_DIRECT without the IO layers having to know anything about VM
(and Ben's async IO patches also use kiobufs in this way to allow
read()s to write to the user's data buffer once the IO completes,
without having to have a context switch back into that user's
context.) Similarly, any extent of a file in the page cache can be
encoded in a single kiobuf.
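
The usage pattern is roughly what the raw device driver does today ---
heavily simplified here and written from memory, with block-number setup
and most error handling omitted, so treat it as illustrative rather than
as the in-tree code:

/* Sketch: write a user buffer straight to disk without copying it. */
static int direct_write_sketch(kdev_t dev, unsigned long uaddr,
                               size_t len, unsigned long *blocks,
                               int blocksize)
{
        struct kiobuf *iobuf;
        int err;

        err = alloc_kiovec(1, &iobuf);          /* a kiovec of one kiobuf */
        if (err)
                return err;

        /* Pin the user pages: maplist/offset/length now describe the
         * buffer, whatever its alignment within those pages. */
        err = map_user_kiobuf(WRITE, iobuf, uaddr, len);
        if (!err) {
                err = brw_kiovec(WRITE, 1, &iobuf, dev, blocks, blocksize);
                unmap_kiobuf(iobuf);
        }
        free_kiovec(1, &iobuf);
        return err;
}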

And no, the simpler networking-style sg-list does not cut it for block
device IO, because for block devices, we want to have separate
completion status made available for each individual sg fragment in
the IO. *That* is why the kiobuf is more heavyweight than the
networking variant: each fragment [kiobuf] in the scatter-gather list
[kiovec] has its own completion information.

If we have a bunch of separate data buffers queued for sequential disk
IO as a single request, then we still want things like readahead and
error handling to work. That means that we want the first kiobuf in
the chain to get its completion wakeup as soon as that segment of the
IO is complete, without having to wait for the remaining sectors of
the IO to be transferred. It also means that if we've done something
like split the IO over a raid stripe, then when an error occurs, we
still want to know which of the callers' buffers succeeded and which
failed.

Yes, I agree that the original kiovec mechanism of using a *kiobuf[]
array to assemble the scatter-gather fragments sucked. But I don't
believe that just throwing away the concept of kiobuf as an sg-fragment
will work either when it comes to disk IOs: the need for per-fragment
completion is too compelling. I'd rather shift to allowing kiobufs to
be assembled into linked lists for IO to avoid *kiobuf[] vectors, in
just the same way that we currently chain buffer_heads for IO.

Cheers,
Stephen

2001-02-05 19:18:28

by Alan

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

> Stop this idiocy, Stephen. You're _this_ close to being the first person I
> ever blacklist from my mailbox.

I think I've just figured out what the miscommunication is around here

kiovecs can describe arbitrary scatter-gather

It's just that they can also cleanly describe the common case of contiguous
pages in one entry.

After all, a subpage block is simply a contiguous set of one page.

Alan


2001-02-05 19:29:09

by Linus Torvalds

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Mon, 5 Feb 2001, Alan Cox wrote:

> > Stop this idiocy, Stephen. You're _this_ close to being the first person I
> > ever blacklist from my mailbox.
>
> I think I've just figured out what the miscommunication is around here
>
> kiovecs can describe arbitrary scatter-gather

I know. But they are entirely useless for anything that requires low
latency handling. They are big, bloated, and slow.

It is also an example of layering gone horribly horribly wrong.

The _vectors_ are needed at the very lowest levels: the levels that do not
necessarily have to worry at all about completion notification etc. You
want the arbitrary scatter-gather vectors passed down to the stuff that
sets up the SG arrays etc, the stuff that doesn't care AT ALL about the
high-level semantics.

This all proves that the lowest level of layering should be pretty much
nothing but the vectors. No callbacks, no crap like that. That's already a
level of abstraction away, and should not get tacked on. Your lowest level
of abstraction should be just the "area". Something like

struct buffer {
        struct page *page;
        u16 offset, length;
};

int nr_buffers;
struct buffer *array;

should be the low-level abstraction.

And on top of _that_ you build a more complex entity (so a "kiobuf" would
be defined not just by the memory area, but by the operation you want to
do on it, and the callback on completion etc).
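
Spelled out --- and purely as a sketch of the layering, not a worked-out
interface --- that higher-level entity might look something like

struct io_thing {
        /* the dumb vector from above: usable on its own by the
         * low-level code that only sets up SG arrays */
        int nr_buffers;
        struct buffer *array;

        /* the higher-level semantics layered on top of it */
        int rw;                                 /* READ or WRITE */
        void (*complete)(struct io_thing *, int status);
        void *private;                          /* caller's own context */
};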

Currently kiobufs do it the other way around: you can build up an array,
but only by having the overhead of passing kiovecs around - i.e. you have
to pass the _highest_ level of abstraction around just to get the lowest
level of details. That's wrong.

And that wrongness comes _exactly_ from Stephen's opinion that the
fundamental IO entity is an array of contiguous pages.

And, btw, this is why the networking layer will never be able to use
kiobufs.

Which makes kiobufs as they stand now basically useless for anything but
some direct disk stuff. And I'd rather work on making the low-level disk
drivers use something saner.

Linus

2001-02-05 20:58:56

by Stephen C. Tweedie

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi,

On Mon, Feb 05, 2001 at 11:28:17AM -0800, Linus Torvalds wrote:

> The _vectors_ are needed at the very lowest levels: the levels that do not
> necessarily have to worry at all about completion notification etc. You
> want the arbitrary scatter-gather vectors passed down to the stuff that
> sets up the SG arrays etc, the stuff that doesn't care AT ALL about the
> high-level semantics.

OK, this is exactly where we have a problem: I can see too many cases
where we *do* need to know about completion stuff at a fine
granularity when it comes to disk IO (unlike network IO, where we can
usually rely on a caller doing retransmit at some point in the stack).

If we are doing readahead, we want completion callbacks raised as soon
as possible on IO completions, no matter how many other IOs have been
merged with the current one. More importantly though, when we are
merging multiple page or buffer_head IOs in a request, we want to know
exactly which buffer/page contents are valid and which are not once
the IO completes.

The current request struct's buffer_head list provides that quite
naturally, but is a hugely heavyweight way of performing large IOs.
What I'm really after is a way of sending IOs to make_request in such
a way that if the caller provides an array of buffer_heads, it gets
back completion information on each one, but if the IO is requested in
large chunks (eg. XFS's pagebufs or large kiobufs from raw IO), then
the request code can deal with it in those large chunks.

What worries me is things like the soft raid1/5 code: pretending that
we can skimp on the return information about which blocks were
transferred successfully and which were not sounds like a really bad
idea when you've got a driver which relies on that completion
information in order to do intelligent error recovery.

Cheers,
Stephen

2001-02-05 21:11:50

by David Lang

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

so you have two concepts in one here

1. SG items that can be more than a single page

2. a container for #1 that includes details for completion callbacks, etc

it looks like Linus is objecting to having both in the same structure and
then using that structure as your generic low-level bucket.

define these as two separate structures, the #1 structure may now be
lightweight enough to be used for networking and other functions, and when
you go to use it with disk IO you then wrap it in the #2 structure. this
still lets you have the completion callbacks at as low a level as you
want, you just have to explicitly add this layer when it makes sense.

David Lang



On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:

> Date: Mon, 5 Feb 2001 20:54:29 +0000
> From: Stephen C. Tweedie <[email protected]>
> To: Linus Torvalds <[email protected]>
> Cc: Alan Cox <[email protected]>, Stephen C. Tweedie <[email protected]>,
> Manfred Spraul <[email protected]>,
> Christoph Hellwig <[email protected]>, Steve Lord <[email protected]>,
> [email protected], [email protected]
> Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
>
> Hi,
>
> On Mon, Feb 05, 2001 at 11:28:17AM -0800, Linus Torvalds wrote:
>
> > The _vectors_ are needed at the very lowest levels: the levels that do not
> > necessarily have to worry at all about completion notification etc. You
> > want the arbitrary scatter-gather vectors passed down to the stuff that
> > sets up the SG arrays etc, the stuff that doesn't care AT ALL about the
> > high-level semantics.
>
> OK, this is exactly where we have a problem: I can see too many cases
> where we *do* need to know about completion stuff at a fine
> granularity when it comes to disk IO (unlike network IO, where we can
> usually rely on a caller doing retransmit at some point in the stack).
>
> If we are doing readahead, we want completion callbacks raised as soon
> as possible on IO completions, no matter how many other IOs have been
> merged with the current one. More importantly though, when we are
> merging multiple page or buffer_head IOs in a request, we want to know
> exactly which buffer/page contents are valid and which are not once
> the IO completes.
>
> The current request struct's buffer_head list provides that quite
> naturally, but is a hugely heavyweight way of performing large IOs.
> What I'm really after is a way of sending IOs to make_request in such
> a way that if the caller provides an array of buffer_heads, it gets
> back completion information on each one, but if the IO is requested in
> large chunks (eg. XFS's pagebufs or large kiobufs from raw IO), then
> the request code can deal with it in those large chunks.
>
> What worries me is things like the soft raid1/5 code: pretending that
> we can skimp on the return information about which blocks were
> transferred successfully and which were not sounds like a really bad
> idea when you've got a driver which relies on that completion
> information in order to do intelligent error recovery.
>
> Cheers,
> Stephen

2001-02-05 21:29:42

by Ingo Molnar

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains


On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:

> And no, the IO success is *not* necessarily sequential from the start
> of the IO: if you are doing IO to raid0, for example, and the IO gets
> striped across two disks, you might find that the first disk gets an
> error so the start of the IO fails but the rest completes. It's the
> completion code which notifies the caller of what worked and what did
> not.

it's exactly these 'compound' structures i'm vehemently against. I do
think it's a design nightmare. I can picture these monster kiobufs
complicating the whole code for no good reason - we couldn't even get the
bh-list code in block_device.c right - why do you think kiobufs *all
across the kernel* will be any better?

RAID0 is not an issue. Split it up, use separate kiobufs for every
different disk. We need simple constructs - i cannot believe that nobody
sees that these big fat monster-trucks of IO workload are *trouble*. They
keep things localized, instead of putting workload components into the
system immediately. We'll have performance bugs nobody has seen before.
bhs have one very nice property: they are simple, modularized. I think
this is like CISC vs. RISC: CISC designs ended up splitting 'fat
instructions' up into RISC-like instructions.

fragmented skbs are a different matter: they are simply a slightly more
generic abstraction of 'memory buffer'. Clear goal, clear solution. I do not
think kiobufs have clear goals.

and i do not buy the performance arguments. In 2.4.1 we improved block-IO
performance dramatically by fixing high-load IO scheduling. Write
performance suddenly improved dramatically, there is a 30-40% improvement
in dbench performance. To put it another way: *we needed 5 years to fix a
serious IO-subsystem performance bug*. Block IO was already too complex -
and Alex & Andrea have done a nice job streamlining and cleaning it up for
2.4. We should simplify it further - and optimize the components, instead
of bringing in yet another *big* complication into the API.

and what is the goal of having multi-page kiobufs? To avoid having to do
multiple function calls via a simpler interface? Shouldn't we optimize that
codepath instead?

Ingo

2001-02-05 21:58:35

by Alan

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

> OK, this is exactly where we have a problem: I can see too many cases
> where we *do* need to know about completion stuff at a fine
> granularity when it comes to disk IO (unlike network IO, where we can
> usually rely on a caller doing retransmit at some point in the stack).

Ok so what's wrong with embedding kiovecs into something bigger: one
kmalloc can allocate two arrays, one of buffers (shared with networking etc)
followed by a second of block io completion data.

Now you can also kind of cast from the bigger to the smaller object and get
the right result if the kiovec array is the start of the combined allocation


2001-02-05 22:10:10

by Ingo Molnar

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains


On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:

> > Obviously the disk access itself must be sector aligned and the total
> > length must be a multiple of the sector length, but there shouldn't be
> > any restrictions on the data buffers.
>
> But there are. Many controllers just break down and corrupt things
> silently if you don't align the data buffers (Jeff Merkey found this
> by accident when he started generating unaligned IOs within page
> boundaries in his NWFS code). And a lot of controllers simply cannot
> break a sector dma over a page boundary (at least not without some
> form of IOMMU remapping).

so we are putting workarounds for hardware bugs into the design?

Ingo

2001-02-05 23:01:48

by Stephen C. Tweedie

Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

Hi,

On Mon, Feb 05, 2001 at 10:28:37PM +0100, Ingo Molnar wrote:
>
> On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
>
> it's exactly these 'compound' structures i'm vehemently against. I do
> think it's a design nightmare. I can picture these monster kiobufs
> complicating the whole code for no good reason - we couldn't even get the
> bh-list code in block_device.c right - why do you think kiobufs *all
> across the kernel* will be any better?
>
> RAID0 is not an issue. Split it up, use separate kiobufs for every
> different disk.

Umm, that's not the point --- of course you can use separate kiobufs
for the communication between raid0 and the underlying disks, but what
do you then tell the application _above_ raid0 if one of the
underlying IOs succeeds and the other fails halfway through?

And what about raid1? Are you really saying that raid1 doesn't need
to know which blocks succeeded and which failed? That's the level of
completion information I'm worrying about at the moment.

> fragmented skbs are a different matter: they are simply a bit more generic
> abstraction of a 'memory buffer'. Clear goal, clear solution. I do not
> think kiobufs have clear goals.

The goal: allow arbitrary IOs to be pushed down through the stack in
such a way that the callers can get meaningful information back about
what worked and what did not. If the write was a 128kB raw IO, then
you obviously get coarse granularity of completion callback. If the
write was a series of independent pages which happened to be
contiguous on disk, you actually get told which pages hit disk and
which did not.

> and what is the goal of having multi-page kiobufs? To avoid having to do
> multiple function calls via a simpler interface? Shouldn't we optimize that
> codepath instead?

The original multi-page buffers came from the map_user_kiobuf
interface: they represented a user data buffer. I'm not wedded to
that format --- we can happily replace it with a fine-grained sg list
--- but the reason they have been pushed so far down the IO stack is
the need for accurate completion information on the originally
requested IOs.

In other words, even if we expand the kiobuf into a sg vector list,
when it comes to merging requests in ll_rw_blk.c we still need to
track the callbacks on each independent source kiobuf.

--Stephen

2001-02-05 23:07:30

by Alan

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

> do you then tell the application _above_ raid0 if one of the
> underlying IOs succeeds and the other fails halfway through?

struct
{
        u32 flags;                      /* because everything needs flags */
        struct io_completion *completions;
        kiovec_t sglist[0];
} thingy;

Now kmalloc one object: the header, the sglist of the right size, and the
completion list. Shove the completion list on the end of it as another
array of objects, and what is the problem?
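
As a very rough sketch of the allocation (illustrative only - it assumes
kiovec_t and a struct io_completion exist somewhere, and none of this is
real code):

        struct thingy *t;
        int nvecs = 16;         /* however many fragments this IO has */

        /* one kmalloc: header, then the sg entries, then the completions */
        t = kmalloc(sizeof(*t) + nvecs * sizeof(kiovec_t)
                               + nvecs * sizeof(struct io_completion), GFP_KERNEL);
        if (!t)
                return -ENOMEM;

        t->flags = 0;
        /* the completion array lives right after the sg array */
        t->completions = (struct io_completion *) &t->sglist[nvecs];

Anything that only understands the plain kiovec array can simply be handed
&t->sglist[0].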

> In other words, even if we expand the kiobuf into a sg vector list,
> when it comes to merging requests in ll_rw_blk.c we still need to
> track the callbacks on each independent source kiobuf.

But that can be two arrays

2001-02-05 23:20:15

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

Hi,

On Mon, Feb 05, 2001 at 11:06:48PM +0000, Alan Cox wrote:
> > do you then tell the application _above_ raid0 if one of the
> > underlying IOs succeeds and the other fails halfway through?
>
> struct
> {
>         u32 flags;                      /* because everything needs flags */
>         struct io_completion *completions;
>         kiovec_t sglist[0];
> } thingy;
>
> Now kmalloc one object: the header, the sglist of the right size, and the
> completion list. Shove the completion list on the end of it as another
> array of objects, and what is the problem?

XFS uses both small metadata items in the buffer cache and large
pagebufs. You may have merged a 512-byte read with a large pagebuf
read: one completion callback is associated with a single sg fragment,
the next callback belongs to a dozen different fragments. Associating
the two lists becomes non-trivial, although it could be done.

--Stephen

2001-02-06 00:11:41

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi,

OK, if we take a step back, what does this look like:

On Mon, Feb 05, 2001 at 08:54:29PM +0000, Stephen C. Tweedie wrote:
>
> If we are doing readahead, we want completion callbacks raised as soon
> as possible on IO completions, no matter how many other IOs have been
> merged with the current one. More importantly though, when we are
> merging multiple page or buffer_head IOs in a request, we want to know
> exactly which buffer/page contents are valid and which are not once
> the IO completes.

This is the current situation. If the page cache submits a 64K IO to
the block layer, it does so in pieces, and then expects to be told on
return exactly which pages succeeded and which failed.

That's where the mess of having multiple completion objects in a
single IO request comes from. Can we just forbid this case?

That's the short cut that SGI's kiobuf block dev patches do when they
get kiobufs: they currently deal with either buffer_heads or kiobufs
in struct requests, but they don't merge kiobuf requests. (XFS
already clusters the IOs for them in that case.)

Is that a realistic basis for a cleaned-up ll_rw_blk.c?

It implies that the caller has to do IO merging. For read, that's not
much pain, as the most important case --- readahead --- is already
done in a generic way which could submit larger IOs relatively easily.
It would be harder for writes, but high-level write clustering code
has already been started.

It implies that for any IO, on IO failure you don't get told which
part of the IO failed. That adds code to the caller: the page cache
would have to retry per-page to work out which pages are readable and
which are not. It means that for soft raid, you don't get told which
blocks are bad if a stripe has an error anywhere. Ingo, is that a
potential problem?

But it gives very, very simple semantics to the request layer: single
IOs go in (with a completion callback and a single scatter-gather
list), and results go back with success or failure.

With that change, it becomes _much_ more natural to push a simple sg
list down through the disk layers.
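
Just to illustrate how little state that leaves in the request-layer
object (all names below are invented for the example; this is not a
concrete proposal):

        struct sg_frag {
                struct page *page;
                u16 offset, length;
        };

        struct simple_io {
                int nr_frags;
                struct sg_frag *frags;  /* flat list, no per-fragment callbacks */

                /* one completion for the whole IO: it either worked or it didn't */
                void (*end_io)(struct simple_io *io, int uptodate);
                void *private;
        };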

--Stephen

2001-02-06 00:20:31

by Manfred Spraul

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

"Stephen C. Tweedie" wrote:
>
> The original multi-page buffers came from the map_user_kiobuf
> interface: they represented a user data buffer. I'm not wedded to
> that format --- we can happily replace it with a fine-grained sg list
>
Could you change that interface?

<<< from Linus mail:

struct buffer {
        struct page *page;
        u16 offset, length;
};

>>>>>>

/* returns the number of used buffers, or <0 on error */
int map_user_buffer(struct buffer *ba, int max_bcount,
                    void *addr, int len);
void unmap_buffer(struct buffer *ba, int bcount);

That's enough for the zero copy pipe code ;-)

Real hw drivers probably need a replacement for pci_map_single()
(pci_map_and_align_and_bounce_buffer_array())

The kiobuf structure could contain these 'struct buffer' instead of the
current 'struct page' pointers.
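
A rough usage sketch for the pipe case (purely illustrative: error paths
and the actual pipe plumbing are left out, the array size is arbitrary,
and addr/len are just the user buffer being transferred):

        struct buffer ba[16];
        int bcount;

        /* pin the user pages and describe them as (page, offset, length) */
        bcount = map_user_buffer(ba, 16, addr, len);
        if (bcount < 0)
                return bcount;

        /* ... hand ba[0..bcount-1] to whoever consumes the data ... */

        unmap_buffer(ba, bcount);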

>
> In other words, even if we expand the kiobuf into a sg vector list,
> when it comes to merging requests in ll_rw_blk.c we still need to
> track the callbacks on each independent source kiobuf.
>
Probably.


--
Manfred

2001-02-06 00:50:13

by Roman Zippel

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi,

On Mon, 5 Feb 2001, Linus Torvalds wrote:

> This all proves that the lowest level of layering should be pretty much
> nothing but the vectors. No callbacks, no crap like that. That's already a
> level of abstraction away, and should not get tacked on. Your lowest level
> of abstraction should be just the "area". Something like
>
> struct buffer {
>         struct page *page;
>         u16 offset, length;
> };
>
> int nr_buffers;
> struct buffer *array;
>
> should be the low-level abstraction.

Does it have to be vectors? What about lists? I've been thinking about this
for some time now, and I think lists are more flexible. At a higher level we
can easily generate a list of pages, and at a lower level you can still split
them up as needed. It would be basically the same structure, but you
could use it everywhere with the same kind of operations.

bye, Roman

2001-02-06 01:02:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Tue, 6 Feb 2001, Roman Zippel wrote:
> >
> > int nr_buffers;
> > struct buffer *array;
> >
> > should be the low-level abstraction.
>
> Does it have to be vectors? What about lists?

I'd prefer to avoid lists unless there is some overriding concern, like a
real implementation issue. But I don't care much one way or the other -
what I care about is that the setup and usage time is as low as possible.
I suspect arrays are better for that.

I have this strong suspicion that networking is going to be the most
latency-critical and complex part of this, and the fact that the
networking code wanted arrays is what makes me think that arrays are the
right way to go. But talk to Davem and ank about why they wanted vectors.

Linus

2001-02-06 01:10:16

by David Miller

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


Linus Torvalds writes:
> But talk to Davem and ank about why they wanted vectors.

SKB setup and free need to be as light as possible.
Using vectors leads to code like:

skb_data_free(...)
{
        ...
        for (i = 0; i < MAX_SKB_FRAGS; i++)
                put_page(skb_shinfo(skb)->frags[i].page);
}

Currently, the ZC patches have a fixed frag vector size
(MAX_SKB_FRAGS). But a part of me wants this to be
made dynamic (to handle HIPPI etc. properly) whereas
another part of me doesn't want to do it that way because
it would increase the complexity of paged SKB handling
and add yet another member to the SKB structure.
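
(For reference, a frags[] entry in the ZC patches is basically just a
(page, offset, length) triple - roughly the following, though treat the
exact field names here as approximate:)

        typedef struct skb_frag_struct skb_frag_t;

        struct skb_frag_struct {
                struct page *page;      /* where the fragment lives */
                __u16 page_offset;      /* offset into that page */
                __u16 size;             /* length of the fragment */
        };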

Later,
David S. Miller
[email protected]

2001-02-06 09:32:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Mon, 5 Feb 2001, Linus Torvalds wrote:

> [...] But talk to Davem and ank about why they wanted vectors.

one issue is allocation overhead. The fragment array is a natural and
constant-size part of an skb, thus we get all the control structures in
place while allocating a structure that we have to allocate anyway.

another issue is that certain cards have (or can have) SG-limits, so we
have to be prepared to have a 'limited' array of fragments anyway, and
have to be prepared to split/refragment packets. Whether there is a global
MAX_SKB_FRAGS limit or not makes no difference.

Ingo

2001-02-06 09:45:47

by Roman Zippel

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi,

On Mon, 5 Feb 2001, Linus Torvalds wrote:

> > Does it has to be vectors? What about lists?
>
> I'd prefer to avoid lists unless there is some overriding concern, like a
> real implementation issue. But I don't care much one way or the other -
> what I care about is that the setup and usage time is as low as possible.
> I suspect arrays are better for that.

I was thinking more about the higher layers. Here it's simpler to set up a
list of pages which can be sent to a lower layer. In the page cache we
already have per-address-space lists, so it would be very easy to use
that. A lower layer can of course generate anything it wants out of this,
e.g. it can generate sublists or vectors.

bye, Roman

2001-02-06 17:03:37

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06, 2001 at 12:07:04AM +0000, Stephen C. Tweedie wrote:
> This is the current situation. If the page cache submits a 64K IO to
> the block layer, it does so in pieces, and then expects to be told on
> return exactly which pages succeeded and which failed.
>
> That's where the mess of having multiple completion objects in a
> single IO request comes from. Can we just forbid this case?
>
> That's the short cut that SGI's kiobuf block dev patches do when they
> get kiobufs: they currently deal with either buffer_heads or kiobufs
> in struct requests, but they don't merge kiobuf requests.

IIRC Jens Axboe has done some work on merging kiobuf-based requests.

> (XFS already clusters the IOs for them in that case.)
>
> Is that a realistic basis for a cleaned-up ll_rw_blk.c?

I don't think so. If we minimize the state in the IO container object,
the lower levels could split them as they see fit, and the IO completion
function just has to handle the case that it might be called for a smaller
object.

Christoph

--
Of course it doesn't work. We've performed a software upgrade.
Whip me. Beat me. Make me maintain AIX.

2001-02-06 17:09:28

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi,

On Tue, Feb 06, 2001 at 06:00:58PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 06, 2001 at 12:07:04AM +0000, Stephen C. Tweedie wrote:
> >
> > Is that a realistic basis for a cleaned-up ll_rw_blk.c?
>
> I don't think so. If we minimize the state in the IO container object,
> the lower levels could split them as they see fit, and the IO completion
> function just has to handle the case that it might be called for a smaller
> object.

The whole point of the post was that it is merging, not splitting,
which is troublesome. How are you going to merge requests without
having chains of scatter-gather entities each with their own
completion callbacks?

--Stephen

2001-02-06 17:15:41

by Jens Axboe

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06 2001, Stephen C. Tweedie wrote:
> > I don't think so. If we minimize the state in the IO container object,
> > the lower levels could split them as they see fit, and the IO completion
> > function just has to handle the case that it might be called for a smaller
> > object.
>
> The whole point of the post was that it is merging, not splitting,
> which is troublesome. How are you going to merge requests without
> having chains of scatter-gather entities each with their own
> completion callbacks?

You can't - the stuff I played with turned out to be horrible. At
least with the current kiobuf I/O stuff, merging will have to be
done before it's submitted. And IMO we don't want to lose the
ability to cluster buffers and requests in ll_rw_blk.

--
Jens Axboe

2001-02-06 17:25:41

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06, 2001 at 05:05:06PM +0000, Stephen C. Tweedie wrote:
> The whole point of the post was that it is merging, not splitting,
> which is troublesome. How are you going to merge requests without
> having chains of scatter-gather entities each with their own
> completion callbacks?

The object passed down to the low-level driver just needs to be able
to contain multiple end-io callbacks. The decision about what to call when
some of the scatter-gather entities fail is of course not so easy to
handle and needs further discussion.
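
To make that a bit more concrete, such a container might look roughly like
the following - entirely hypothetical, reusing the struct buffer
{ page, offset, length } from Linus' mail, and only meant to show where the
per-range callbacks would hang:

        struct io_range_end {
                unsigned int first, nr;         /* which sg entries this covers */
                void (*end_io)(void *private, int uptodate);
                void *private;
        };

        struct io_container {
                unsigned int nr_frags;
                struct buffer *frags;           /* flat sg list */
                unsigned int nr_ends;
                struct io_range_end *ends;      /* one entry per original IO */
        };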

Christoph

--
Whip me. Beat me. Make me maintain AIX.

2001-02-06 17:39:54

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hey folks,

On Tue, 6 Feb 2001, Stephen C. Tweedie wrote:

> The whole point of the post was that it is merging, not splitting,
> which is troublesome. How are you going to merge requests without
> having chains of scatter-gather entities each with their own
> completion callbacks?

Let me just emphasize what Stephen is pointing out: if requests are
properly merged at higher layers, then merging is neither required nor
desired. Traditionally, ext2 has not done merging because the underlying
system doesn't support it. This leads to rather convoluted code for
readahead which doesn't result in appropriately merged requests on
indirect block boundaries, and in fact leads to suboptimal performance.
The only case I see where merging of requests can improve things is when
dealing with lots of small files. But we already know that small files
need to be treated differently (e.g. tail merging). Besides, most of the
benefit of merging can be had by doing readaround for these small files.

As for io completion, can't we just issue separate requests for the
critical data and the readahead? That way for SCSI disks, the important
io should be finished while the readahead can continue. Thoughts?

-ben

2001-02-06 18:01:18

by Jens Axboe

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06 2001, Ben LaHaise wrote:
> > The whole point of the post was that it is merging, not splitting,
> > which is troublesome. How are you going to merge requests without
> > having chains of scatter-gather entities each with their own
> > completion callbacks?
>
> Let me just emphasize what Stephen is pointing out: if requests are
> properly merged at higher layers, then merging is neither required nor
> desired. Traditionally, ext2 has not done merging because the underlying
> system doesn't support it. This leads to rather convoluted code for
> readahead which doesn't result in appropriately merged requests on
> indirect block boundaries, and in fact leads to suboptimal performance.
> The only case I see where merging of requests can improve things is when
> dealing with lots of small files. But we already know that small files
> need to be treated differently (e.g. tail merging). Besides, most of the
> benefit of merging can be had by doing readaround for these small files.

Stephen already covered this point: the merging is not a problem
to deal with for read-ahead. The underlying system can easily
queue that in nice big chunks. Delayed allocation makes it
easier to flush big chunks as well. I seem to recall the xfs people
having problems with the lack of merging causing a performance hit
on smaller I/O.

Of course merging doesn't have to happen in ll_rw_blk.

> As for io completion, can't we just issue separate requests for the
> critical data and the readahead? That way for SCSI disks, the important
> io should be finished while the readahead can continue. Thoughts?

Priorities?

--
Jens Axboe

2001-02-06 18:10:37

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, 6 Feb 2001, Jens Axboe wrote:

> Stephen already covered this point, the merging is not a problem
> to deal with for read-ahead. The underlying system can easily

I just wanted to make sure that was clear =)

> queue that in nice big chunks. Delayed allocation makes it
> easier to flush big chunks as well. I seem to recall the xfs people
> having problems with the lack of merging causing a performance hit
> on smaller I/O.

That's where readaround buffers come into play. If we have a fixed number
of readaround buffers that are used when small ios are issued, they should
provide a low-overhead means of substantially improving things like find
(which reads many nearby inodes out of order but sequentially). I need to
implement this and get cache hit rates for various workloads. ;-)

> Of course merging doesn't have to happen in ll_rw_blk.
>
> > As for io completion, can't we just issue separate requests for the
> > critical data and the readahead? That way for SCSI disks, the important
> > io should be finished while the readahead can continue. Thoughts?
>
> Priorities?

Definitely. I'd like to be able to issue readaheads with a "don't bother
executing this request unless the cost is low" bit set. It might also
be helpful for heavy multiuser loads (or even a single user with multiple
processes) to ensure progress is made for others.

-ben

2001-02-06 18:15:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> On Tue, 6 Feb 2001, Stephen C. Tweedie wrote:
>
> > The whole point of the post was that it is merging, not splitting,
> > which is troublesome. How are you going to merge requests without
> > having chains of scatter-gather entities each with their own
> > completion callbacks?
>
> Let me just emphasize what Stephen is pointing out: if requests are
> properly merged at higher layers, then merging is neither required nor
> desired.

I will claim that you CANNOT merge at higher levels and get good
performance.

Sure, you can do read-ahead, and try to get big merges that way at a high
level. Good for you.

But you'll have a bitch of a time trying to merge multiple
threads/processes reading from the same area on disk at roughly the same
time. Your higher levels won't even _know_ that there is merging to be
done until the IO requests hit the wall in waiting for the disk.

Quite frankly, this whole discussion sounds worthless. We have solved this
problem already: it's called a "buffer head". Deceptively simple at higher
levels, and lower levels can easily merge them together into chains and do
fancy scatter-gather structures of them that can be dynamically extended
at any time.

The buffer heads together with "struct request" do a hell of a lot more
than just a simple scatter-gather: it's able to create ordered lists of
independent sg-events, together with full call-backs etc. They are
low-cost, fairly efficient, and they have worked beautifully for years.

The fact that kiobufs can't be made to do the same thing is somebody else's
problem. I _know_ that merging has to happen late, and if others are
hitting their heads against this issue until they turn silly, then that's
their problem. You'll eventually learn, or you'll hit your heads into a
pulp.

Linus

2001-02-06 18:19:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> Let me just emphasize what Stephen is pointing out: if requests are
> properly merged at higher layers, then merging is neither required nor
> desired. [...]

this is just so incorrect that it's not funny anymore.

- higher levels just do not have the kind of knowledge lower levels have.

- merging decisions are often not even *deterministic*.

- higher levels do not have the kind of state to e.g. merge requests done
by different users. The only chance for merging is often the lowest
level, where we already know what disk, which sector.

- merging is not even *required* for some devices - and chances are high
that we'll get away from this inefficient and unreliable 'rotating array
of disks' business of storing bulk data in this century. (solid state
disks, holographic storage, whatever.)

i'm truly shocked that you and Stephen are both saying this.

Ingo

2001-02-06 18:28:37

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, 6 Feb 2001, Ingo Molnar wrote:

> - higher levels do not have the kind of state to eg. merge requests done
> by different users. The only chance for merging is often the lowest
> level, where we already know what disk, which sector.

That's what a readaround buffer is for, and I suspect that readaround will
give us a big performance boost.

> - merging is not even *required* for some devices - and chances are high
> that we'll get away from this inefficient and unreliable 'rotating array
> of disks' business of storing bulk data in this century. (solid state
> disks, holographic storage, whatever.)

Interesting that you've brought up this point, as it's an example

> i'm truly shocked that you and Stephen are both saying this.

Merging != sorting. Sorting of requests has to be carried out at the
lower layers, and the specific block device should be able to choose the
Right Thing To Do for the next item in a chain of sequential requests.

-ben

2001-02-06 18:30:37

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi,

On Tue, Feb 06, 2001 at 06:22:58PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 06, 2001 at 05:05:06PM +0000, Stephen C. Tweedie wrote:
> > The whole point of the post was that it is merging, not splitting,
> > which is troublesome. How are you going to merge requests without
> > having chains of scatter-gather entities each with their own
> > completion callbacks?
>
> The object passed down to the low-level driver just needs to ne able
> to contain multiple end-io callbacks. The decision what to call when
> some of the scatter-gather entities fail is of course not so easy to
> handle and needs further discussion.

Umm, and if you want the separate higher-level IOs to be told which
IOs succeeded and which ones failed on error, you need to associate
each of the multiple completion callbacks with its particular
scatter-gather fragment or fragments. So you end up with the same
sort of kiobuf/kiovec concept where you have chains of sg chunks, each
chunk with its own completion information.

This is *precisely* what I've been trying to get people to address.
Forget whether the individual sg fragments are based on pages or not:
if you want to have IO merging and accurate completion callbacks, you
need not just one sg list but multiple lists each with a separate
callback header.

Abandon the merging of sg-list requests (by moving that functionality
into the higher-level layers) and that problem disappears: flat
sg-lists will then work quite happily at the request layer.

Cheers,
Stephen

2001-02-06 18:36:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> > - higher levels do not have the kind of state to eg. merge requests done
> > by different users. The only chance for merging is often the lowest
> > level, where we already know what disk, which sector.
>
> That's what a readaround buffer is for, [...]

If you are merging based on (device, offset) values, then that's lowlevel
- and this is what we have been doing for years.

If you are merging based on (inode, offset), then it has flaws like not
being able to merge through a loopback or stacked filesystem.

Ingo

2001-02-06 18:57:22

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, 6 Feb 2001, Ingo Molnar wrote:

> If you are merging based on (device, offset) values, then that's lowlevel
> - and this is what we have been doing for years.
>
> If you are merging based on (inode, offset), then it has flaws like not
> being able to merge through a loopback or stacked filesystem.

I disagree. Loopback filesystems typically have their data contiguously
on disk and won't split up incoming requests any further.

Here are the points I'm trying to address:

- reduce the overhead in submitting block ios, especially for
  large ios. Look at the %CPU usage differences between 512-byte
  blocks and 4KB blocks; this can be better.
- make asynchronous io possible in the block layer. This is
  impossible with the current ll_rw_block scheme and io request
  plugging.
- provide a generic mechanism for reordering io requests for
  devices which will benefit from this. Make it a library for
  drivers to call into. IDE for example will probably make use of
  it, but some high end devices do this on the controller. This
  is the important point: Make it OPTIONAL.

You mentioned non-spindle-based io devices in your last message. Take
something like a big RAM disk. Now compare kiobuf-based io to buffer-head
based io. Tell me which one is going to perform better.

-ben

2001-02-06 18:59:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> - reduce the overhead in submitting block ios, especially for
> large ios. Look at the %CPU usages differences between 512 byte
> blocks and 4KB blocks, this can be better.

my system is already submitting 4KB bhs. If anyone's raw-IO setup submits
512-byte bhs, that's a problem of the raw IO code ...

> - make asynchronous io possible in the block layer. This is
> impossible with the current ll_rw_block scheme and io request
> plugging.

why is it impossible?

> You mentioned non-spindle base io devices in your last message. Take
> something like a big RAM disk. Now compare kiobuf base io to buffer
> head based io. Tell me which one is going to perform better.

roughly equal performance when using 4K bhs. And a hell of a lot more
complex and volatile code in the kiobuf case.

Ingo

2001-02-06 19:14:54

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, 6 Feb 2001, Ingo Molnar wrote:

>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> > - reduce the overhead in submitting block ios, especially for
> > large ios. Look at the %CPU usages differences between 512 byte
> > blocks and 4KB blocks, this can be better.
>
> my system is already submitting 4KB bhs. If anyone's raw-IO setup submits
> 512 byte bhs thats a problem of the raw IO code ...
>
> > - make asynchronous io possible in the block layer. This is
> > impossible with the current ll_rw_block scheme and io request
> > plugging.
>
> why is it impossible?

s/impossible/unpleasant/. ll_rw_blk blocks; it should be possible to have
a non-blocking variant that does all of the setup in the caller's context.
Yes, I know that we can do it with a kernel thread, but that isn't as
clean and it significantly penalises small ios (hint: databases issue
*lots* of small random ios and a good chunk of large ios).

> > You mentioned non-spindle base io devices in your last message. Take
> > something like a big RAM disk. Now compare kiobuf base io to buffer
> > head based io. Tell me which one is going to perform better.
>
> roughly equal performance when using 4K bhs. And a hell of a lot more
> complex and volatile code in the kiobuf case.

I'm willing to benchmark you on this.

-ben

2001-02-06 19:21:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Tue, 6 Feb 2001, Ben LaHaise wrote:
> On Tue, 6 Feb 2001, Ingo Molnar wrote:
>
> > If you are merging based on (device, offset) values, then that's lowlevel
> > - and this is what we have been doing for years.
> >
> > If you are merging based on (inode, offset), then it has flaws like not
> > being able to merge through a loopback or stacked filesystem.
>
> I disagree. Loopback filesystems typically have their data contiguously
> on disk and won't split up incoming requests any further.

Face it.

You NEED to merge and sort late. You _cannot_ do a good job early. Early
on, you don't have any concept of what the final IO pattern will be: you
will only have that once you've seen which requests are still pending etc,
something that the higher level layers CANNOT do.

Do you really want the higher levels to know about per-controller request
locking etc? I don't think so.

Trust me. You HAVE to make the final decisions late in the game. You
absolutely _cannot_ get the best performance otherwise, except for trivial
and uninteresting cases (i.e. one process that wants to read gigabytes of
data in one single stream).

(It should be pointed out, btw, that SGI etc were often interested exactly
in the trivial and uninteresting cases. When you have the DoD asking you
to stream satellite pictures over the net as fast as you can, money being
no object, you get a rather twisted picture of what is important and what
is not)

And I will turn your own argument against you: if you do merging at a low
level anyway, there's little point in trying to do it at a higher level.

Higher levels should do high-level sequencing. They can (and should) do
some amount of sorting - the lower levels will still do their own sort as
part of the merging anyway, and the lower-level sorting may actually end
up being _different_ from a high-level sort because the lower levels know
about the topology of the device. But higher levels giving data with
"patterns" to it only makes it easier for the lower levels to do a good
job. So high-level sorting is not _necessary_, but it's probably a good
idea.

High-level merging is almost certainly not even a good idea - higher
levels should try to _batch_ the requests, but that's a different issue,
and is again all about giving lower levels "patterns". It can also be about
simple issues like cache locality - batching things tends to make for
better icache (and possibly dcache) behaviour.

So you should separate out the issues of batching and merging. And you
absolutely should realize that you should NOT ignore Ingo's arguments
about loopback etc just because they don't fit the model you WANT them to
fit. The fact is that higher levels should NOT know about things like RAID
striping etc, yet that has a HUGE impact on the issue of merging (you do
_not_ want to merge requests to separate disks - you'll just have to split
them up again).

> Here are the points I'm trying to address:
>
> - reduce the overhead in submitting block ios, especially for
> large ios. Look at the %CPU usages differences between 512 byte
> blocks and 4KB blocks, this can be better.

This is often a filesystem layer issue. Design your filesystem well, and
you get a lot of batching for free.

You can also batch the requests - this is basically what "readahead" is.
That helps a lot. But that is NOT the same thing as merging. Not at all.
The "batched" read-ahead requests may actually be split up among many
different disks - and they will each then get separately merged with
_other_ requests to those disks. See?

And trust me, THAT is how you get good performance. Not by merging early.
By merging late, and letting the disk layers do their own thing.

> - make asynchronous io possible in the block layer. This is
> impossible with the current ll_rw_block scheme and io request
> plugging.

I'm surprised you say that. It's not only possible, but we do it all the
time. What do you think the swapout and writing is? How do you think that
read-ahead is actually _implemented_? Right. Read-ahead is NOT done as a
"merge" operation. It's done as several asynchronous IO operations that
the low-level stuff can choose (or not) to merge.

What do you think happens if you do a "submit_bh()"? It's a _purely_
asynchronous operation. It turns synchronous when you wait for the bh, not
before.

Your argument is nonsense.

> - provide a generic mechanism for reordering io requests for
> devices which will benefit from this. Make it a library for
> drivers to call into. IDE for example will probably make use of
> it, but some high end devices do this on the controller. This
> is the important point: Make it OPTIONAL.

Ehh. You've just described exactly what we have.

This is what the whole elevator thing _is_. It's a library of routines.
You don't have to use them, and in fact many things DO NOT use them. The
loopback driver, for example, doesn't bother with sorting or merging at
all, because it knows that it's only supposed to pass the request on to
somebody else - who will do a hell of a lot better job of it.

Some high-end drivers have their own merging stuff, exactly because they
don't need the overhead - you're better off just feeding the request to
the controller as soon as you can, as the controller itself will do all
the merging and sorting anyway.

> You mentioned non-spindle base io devices in your last message. Take
> something like a big RAM disk. Now compare kiobuf base io to buffer head
> based io. Tell me which one is going to perform better.

Buffer heads?

Go and read the code.

Sure, it has some historical baggage still, but the fact is that it works
a hell of a lot better than kiobufs and it _does_ know about merging
multiple requests and handling errors in the middle of one request etc.
You can get the full advantage of streaming megabytes of data in one
request, AND still get proper error handling if it turns out that one
sector in the middle was bad.

Linus

2001-02-06 19:36:05

by Jens Axboe

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06 2001, Ben LaHaise wrote:
> > > As for io completion, can't we just issue separate requests for the
> > > critical data and the readahead? That way for SCSI disks, the important
> > > io should be finished while the readahead can continue. Thoughts?
> >
> > Priorities?
>
> Definitely. I'd like to be able to issue readaheads with a "don't bother
> executing this request unless the cost is low" bit set. It might also
> be helpful for heavy multiuser loads (or even a single user with multiple
> processes) to ensure progress is made for others.

And in other contexts it might be handy to assign priorities to
requests as well. I don't know how SGI plan on handling grio (or already
handle it in IRIX); maybe Steve can fill us in on that :)

--
Jens Axboe

2001-02-06 19:33:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> s/impossible/unpleasant/. ll_rw_blk blocks; it should be possible to have
> a non blocking variant that does all of the setup in the caller's context.
> Yes, I know that we can do it with a kernel thread, but that isn't as
> clean and it significantly penalises small ios (hint: databases issue
> *lots* of small random ios and a good chunk of large ios).

Ehh.. submit_bh() does everything you want. And, btw, ll_rw_block() does
NOT block. Never has. Never will.

(Small correction: it doesn't block on anything else than allocating a
request structure if needed, and quite frankly, you have to block
SOMETIME. You can't just try to throw stuff at the device faster than it
can take it. Think of it as a "there can only be this many IO's in
flight")

If you want to use kiobuf's because you think they are asynchronous and
bh's aren't, then somebody has been feeding you a lot of crap. The kiobuf
PR department seems to have been working overtime on some FUD strategy.

The fact is that bh's can do MORE than kiobuf's. They have all the
callbacks in place etc. They merge and sort correctly. Oh, they have
limitations: one "bh" always describes just one memory area with a
"start,len" kind of thing. That's fine - scatter-gather is pushed
downwards, and the upper layers do not even need to know about it. Which
is what layering is all about, after all.

Traditionally, a "bh" is only _used_ for small areas, but that's not a
"bh" issue, that's a memory management issue. The code should pretty much
handle the issue of a single 64kB bh pretty much as-is, but nothing
creates them: the VM layer only creates bh's in sizes ranging from 512
bytes to a single page.

The IO layer could do more, but there has yet to be anybody who needed
more (because once you hit a page-size, you tend to get into
scatter-gather, so you want to have one bh per area - and let the
low-level IO level handle the actual merging etc).

Right now, on many normal setups, the thing that limits our ability to do
big IO requests is actually the fact that IDE cannot do more than 128kB
per request, for example (256 sectors). It's not the bh's or the VM layer.

If you want to make a "raw disk device", you can do so TODAY with bh's.
How? Don't use "bread()" (which allocates the backing store and creates
the cache). Allocate a separate anonymous bh (or multiple), and set them
up to point to whatever data source/sink you have, and let it rip. All
asynchronous. All with nice completion callbacks. All with existing code,
no kiobuf's in sight.
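
A rough sketch of that, modelled on what fs/buffer.c already does
internally for kiobuf IO (treat the exact state bits and field setup as an
approximation, and note that whoever waits for completion still owns and
eventually frees the bh):

        /* completion handler: called from interrupt context when the IO is done */
        static void my_end_io(struct buffer_head *bh, int uptodate)
        {
                mark_buffer_uptodate(bh, uptodate);
                unlock_buffer(bh);
                /* ... wake up or notify whoever is waiting ... */
        }

        /* read one block of 'size' bytes at block number 'block' on 'dev' */
        static int submit_anon_read(kdev_t dev, unsigned long block,
                                    char *data, unsigned int size)
        {
                struct buffer_head *bh = kmalloc(sizeof(*bh), GFP_KERNEL);

                if (!bh)
                        return -ENOMEM;
                memset(bh, 0, sizeof(*bh));

                init_buffer(bh, my_end_io, NULL);       /* b_end_io + b_private */
                bh->b_dev = dev;
                bh->b_blocknr = block;
                bh->b_size = size;                      /* 512 bytes .. PAGE_SIZE today */
                bh->b_data = data;                      /* our buffer, not the buffer cache */
                bh->b_page = virt_to_page(data);
                bh->b_state = (1 << BH_Mapped) | (1 << BH_Lock);
                atomic_set(&bh->b_count, 1);

                submit_bh(READ, bh);                    /* purely asynchronous from here on */
                return 0;
        }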

What more do you think your kiobuf's should be able to do?

Linus

2001-02-06 19:33:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> > > - make asynchronous io possible in the block layer. This is
> > > impossible with the current ll_rw_block scheme and io request
> > > plugging.
> >
> > why is it impossible?
>
> s/impossible/unpleasant/. ll_rw_blk blocks; it should be possible to
> have a non blocking variant that does all of the setup in the caller's
> context. [...]

sorry, but exactly what code are you comparing this to? The aio code you
sent a few days ago does not do this either. (And you did not answer my
questions regarding this issue.) What i saw is some scheme that at a point
relies on keventd (a kernel thread) to do the blocking stuff. [or, unless
i have misread the code, does the ->bmap() synchronously.]

indeed an asynchronous ll_rw_block() is possible and desirable (and not hard
at all - all structures are interrupt-safe already, as opposed to the kiovec
code), but this is only half of the story. The big issue for me is
an async ->bmap(). And we won't access ext2fs data structures from IRQ
handlers anytime soon - so true async IO right now is damn near
impossible, no matter what the IO-submission interface is: kiobufs/kiovecs
or bhs/requests.

Ingo

2001-02-06 19:33:15

by Jens Axboe

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06 2001, Ben LaHaise wrote:
> > > - make asynchronous io possible in the block layer. This is
> > > impossible with the current ll_rw_block scheme and io request
> > > plugging.
> >
> > why is it impossible?
>
> s/impossible/unpleasant/. ll_rw_blk blocks; it should be possible to have
> a non blocking variant that does all of the setup in the caller's context.
> Yes, I know that we can do it with a kernel thread, but that isn't as
> clean and it significantly penalises small ios (hint: databases issue
> *lots* of small random ios and a good chunk of large ios).

So make a non-blocking variant, not a big deal. Users of async I/O
know how to deal with resource limits anyway.

--
Jens Axboe

2001-02-06 19:45:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Tue, 6 Feb 2001, Linus Torvalds wrote:

> (Small correction: it doesn't block on anything else than allocating a
> request structure if needed, and quite frankly, you have to block
> SOMETIME. You can't just try to throw stuff at the device faster than
> it can take it. Think of it as a "there can only be this many IO's in
> flight")

yep. The/my goal would be to get some sort of async IO capability that is
able to read the pagecache without holding up the process. And just
because i've already implemented the helper-kernel-thread async IO variant
[in fact what TUX does is that there are per-CPU async IO helper threads,
and we always pick the 'localized' thread, to avoid unnecessary cross-CPU
traffic], i'd like to explore the possibility of getting this done via a
pure, IRQ-driven state-machine - which arguably has the lowest overhead.

but i just cannot find any robust way to do this with ext2fs (or any other
disk-based FS for that matter). The horror scenario: the inode block is
not cached yet, and the block resides in a triple-indirected block which
triggers 3 other block reads, before the actual data block can be read.
And i definitely do not see why kiobufs would help make this any easier.

Ingo

2001-02-06 19:47:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> > > You mentioned non-spindle base io devices in your last message. Take
> > > something like a big RAM disk. Now compare kiobuf base io to buffer
> > > head based io. Tell me which one is going to perform better.
> >
> > roughly equal performance when using 4K bhs. And a hell of a lot more
> > complex and volatile code in the kiobuf case.
>
> I'm willing to benchmark you on this.

sure. Could you specify the actual workload, and desired test-setups?

Ingo

2001-02-06 19:52:48

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, 6 Feb 2001, Linus Torvalds wrote:

>
>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
> >
> > s/impossible/unpleasant/. ll_rw_blk blocks; it should be possible to have
> > a non blocking variant that does all of the setup in the caller's context.
> > Yes, I know that we can do it with a kernel thread, but that isn't as
> > clean and it significantly penalises small ios (hint: databases issue
> > *lots* of small random ios and a good chunk of large ios).
>
> Ehh.. submit_bh() does everything you want. And, btw, ll_rw_block() does
> NOT block. Never has. Never will.
>
> (Small correction: it doesn't block on anything else than allocating a
> request structure if needed, and quite frankly, you have to block
> SOMETIME. You can't just try to throw stuff at the device faster than it
> can take it. Think of it as a "there can only be this many IO's in
> flight")

This small correction is the crux of the problem: if it blocks, it takes
away from the ability of the process to continue doing useful work. If it
returns -EAGAIN, then that's okay; the io will be resubmitted later when
other disk io has completed. But it should be possible to continue
servicing network requests or user io while disk io is underway.

> If you want to use kiobuf's because you think they are asycnrhonous and
> bh's aren't, then somebody has been feeding you a lot of crap. The kiobuf
> PR department seems to have been working overtime on some FUD strategy.

I'm using bh's to refer to what is currently being done, and kiobuf when
talking about what could be done. It's probably the wrong thing to do,
and if bh's are extended to operate on arbitrarily sized blocks then there
is no difference between the two.

> If you want to make a "raw disk device", you can do so TODAY with bh's.
> How? Don't use "bread()" (which allocates the backing store and creates
> the cache). Allocate a separate anonymous bh (or multiple), and set them
> up to point to whatever data source/sink you have, and let it rip. All
> asynchronous. All with nice completion callbacks. All with existing code,
> no kiobuf's in sight.

> What more do you think your kiobuf's should be able to do?

That's what my code is doing today. There are a ton of bh's set up for a
single kiobuf request that is issued. For something like a single 256kB
io, this is the difference between the batched io requests being passed
into submit_bh fitting in L1 cache and overflowing it. Resizable bh's
would certainly improve this.

-ben

2001-02-06 19:58:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> This small correction is the crux of the problem: if it blocks, it
> takes away from the ability of the process to continue doing useful
> work. If it returns -EAGAIN, then that's okay, the io will be
> resubmitted later when other disk io has completed. But, it should be
> possible to continue servicing network requests or user io while disk
> io is underway.

typical blocking point is waiting for page completion, not
__wait_request(). But, this is really not an issue, NR_REQUESTS can be
increased anytime. If NR_REQUESTS is large enough then think of it as the
'absolute upper limit of doing IO', and think of the blocking as 'the
kernel pulling the brakes'.

[overhead of 512-byte bhs in the raw IO code is an artificial problem of
the raw IO code.]

Ingo

2001-02-06 20:07:50

by Jens Axboe

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06 2001, Ingo Molnar wrote:
> > This small correction is the crux of the problem: if it blocks, it
> > takes away from the ability of the process to continue doing useful
> > work. If it returns -EAGAIN, then that's okay, the io will be
> > resubmitted later when other disk io has completed. But, it should be
> > possible to continue servicing network requests or user io while disk
> > io is underway.
>
> typical blocking point is waiting for page completion, not
> __wait_request(). But, this is really not an issue, NR_REQUESTS can be
> increased anytime. If NR_REQUESTS is large enough then think of it as the
> 'absolute upper limit of doing IO', and think of the blocking as 'the
> kernel pulling the brakes'.

Not just __get_request_wait, but also the limit on max locked buffers
in ll_rw_block. Serves the same purpose though, brake effect.

--
Jens Axboe

2001-02-06 20:20:02

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, 6 Feb 2001, Ingo Molnar wrote:

>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> > > > You mentioned non-spindle base io devices in your last message. Take
> > > > something like a big RAM disk. Now compare kiobuf base io to buffer
> > > > head based io. Tell me which one is going to perform better.
> > >
> > > roughly equal performance when using 4K bhs. And a hell of a lot more
> > > complex and volatile code in the kiobuf case.
> >
> > I'm willing to benchmark you on this.
>
> sure. Could you specify the actual workload, and desired test-setups?

Sure. General parameters will be as follows (since I think we both have
access to these machines):

- 4xXeon, 4GB memory, 3GB to be used for the ramdisk (enough for a
  base install plus data files).
- data to/from the ram block device must be copied within the ram
  block driver.
- the filesystem used must be ext2. optimisations to ext2 for
  tweaks to the interface are permitted & encouraged.

The main item I'm interested in is read (page cache cold) / synchronous
write performance for blocks from 256 bytes to 16MB in powers of two, much
like what I've done in testing the aio patches, which shows where
improvement in latency is needed. Including a few other items on disk,
like the timings of find / make -s dep / bonnie / dbench, would probably
show changes in throughput. Sound fair?

-ben

2001-02-06 20:24:01

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Tue, 6 Feb 2001, Ben LaHaise wrote:

> Sure. General parameters will be as follows (since I think we both have
> access to these machines):
>
> - 4xXeon, 4GB memory, 3GB to be used for the ramdisk (enough for a
> base install plus data files.
> - data to/from the ram block device must be copied within the ram
> block driver.
> - the filesystem used must be ext2. optimisations to ext2 for
> tweaks to the interface are permitted & encouraged.
>
> The main item I'm interested in is read (page cache cold)/synchronous
> write performance for blocks from 256 bytes to 16MB in powers of two,
> much like what I've done in testing the aio patches that shows where
> improvement in latency is needed. Including a few other items on disk
> like the timings of find/make -s dep/bonnie/dbench is probably to show
> changes in throughput. Sound fair?

yep, sounds fair.

Ingo


2001-02-06 20:28:21

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, 6 Feb 2001, Ingo Molnar wrote:

>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> > This small correction is the crux of the problem: if it blocks, it
> > takes away from the ability of the process to continue doing useful
> > work. If it returns -EAGAIN, then that's okay, the io will be
> > resubmitted later when other disk io has completed. But, it should be
> > possible to continue servicing network requests or user io while disk
> > io is underway.
>
> typical blocking point is waiting for page completion, not
> __wait_request(). But, this is really not an issue, NR_REQUESTS can be
> increased anytime. If NR_REQUESTS is large enough then think of it as the
> 'absolute upper limit of doing IO', and think of the blocking as 'the
> kernel pulling the brakes'.

=) This is what I'm seeing: lots of processes waiting with wchan ==
__get_request_wait. With async io and a database flushing lots of io
asynchronously spread out across the disk, the NR_REQUESTS limit is hit
very quickly.

> [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> the raw IO code.]

True, and in the tests I've run, raw io is using 2KB blocks (same as the
database).

-ben

2001-02-06 20:27:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> This small correction is the crux of the problem: if it blocks, it takes
> away from the ability of the process to continue doing useful work. If it
> returns -EAGAIN, then that's okay, the io will be resubmitted later when
> other disk io has completed. But, it should be possible to continue
> servicing network requests or user io while disk io is underway.

Ehh.. The support for this is actually all there already. It's just not
used, because nobody asked for it.

Check the "rw_ahead" variable in __make_request(). Notice how it does
everything you ask for.

So remind me again why we should need a whole new interface for something
that already exists but isn't exported because nobody needed it? It got
created for READA, but that isn't used any more.

You could absolutely _trivially_ re-introduce it (along with WRITEA), but
you should probably change the semantics of what happens when it doesn't
get a request. Something like making "submit_bh()" return an error value
for the case, instead of doing "bh->b_end_io(0..)" which is what I think
it does right now. That would make it easier for the submitter to say "oh,
the queue is full".

This is probably all of 5 lines of code.
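
In caller terms the idea is roughly this (hypothetical: today's submit_bh()
returns void, so the return value below assumes the small change described
above, and the retry helper is made up):

        /* try to queue the IO without ever sleeping for a request slot */
        if (submit_bh(READA, bh) < 0) {
                /*
                 * Request queue full: nothing was submitted and b_end_io was
                 * not called.  Back off and resubmit the bh later, e.g. when
                 * one of our other IOs completes.
                 */
                defer_for_retry(bh);    /* made-up bookkeeping helper */
        }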

I really think that people don't give the block device layer enough
credit. Some of it is quite ugly due to 10 years of history, and there is
certainly a lack of some interesting capabilities (there is no "barrier"
operation right now to enforce ordering, for example, and it really would
be sensible to support a wider range of ops than just read/write and
let the ioctl's use it to pass commands too).

These issues are things that I've been discussing with Jens for the last
few months, and are things that he has already been toying with to some
degree, and we already decided to try to do this during 2.5.x.

There's already been a _lot_ of clean-up with the per-queue request lists
etc, and there's more to be done in the cleanup section too. But the fact
is that too many people seem to have ignored the support that IS there,
and that actually works very well indeed - and is very generic.

> > What more do you think your kiobuf's should be able to do?
>
> That's what my code is doing today. There are a ton of bh's setup for a
> single kiobuf request that is issued. For something like a single 256kb
> io, this is the difference between the batched io requests being passed
> into submit_bh fitting in L1 cache and overflowing it. Resizable bh's
> would certainly improve this.

bh's _are_ resizeable. You just change bh->b_size, and you're done.

Of course, you'll need to do your own memory management for the backing
store. The generic bread() etc layer makes memory management simpler by
having just one size per page and making "struct page" their native mm
entity, but that's not really a bh issue - it's a MM issue and stems from
the fact that this is how all traditional block filesystems tend to want
to work.

NOTE! If you do start to resize the buffer heads, please give me a ping.
The code has never actually been _tested_ with anything but 512, 1024,
2048, 4096 and 8192-byte blocks. I would not be surprised at all if some
low-level drivers actually have asserts that the sizes are ones they
"recognize". The generic layer should be happy with anything that is a
multiple of 512, but as with all things, you'll probably find some gotchas
when you actually try something new.

Linus

2001-02-06 20:28:21

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06, 2001 at 11:32:43AM -0800, Linus Torvalds wrote:
> Traditionally, a "bh" is only _used_ for small areas, but that's not a
> "bh" issue, that's a memory management issue. The code should pretty much
> handle the issue of a single 64kB bh pretty much as-is, but nothing
> creates them: the VM layer only creates bh's in sizes ranging from 512
> bytes to a single page.
>
> The IO layer could do more, but there has yet to be anybody who needed
> more (because once you hit a page-size, you tend to get into
> scatter-gather, so you want to have one bh per area - and let the
> low-level IO level handle the actual merging etc).

Yes. That's one disadvantage blown away.

The second is that bh's are two things:

- a caching object
- an io buffer

This is not really a clean approach, and I would really like to
get away from it.

Christoph

--
Whip me. Beat me. Make me maintain AIX.

2001-02-06 20:36:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Tue, 6 Feb 2001, Christoph Hellwig wrote:

> The second is that bh's are two things:
>
> - a caching object
> - an io buffer
>
> This is not really a clean approach, and I would really like to get
> away from it.

caching bmap() blocks was a recent addition around 2.3.20, and i suggested
some time ago to cache pagecache blocks via explicit entries in struct
page. That would be one solution - but it creates overhead.

but there isn't anything wrong with having the bhs around to cache blocks -
think of it as a 'cached and recycled IO buffer entry, with the block
information cached'.

frankly, my quick (and limited) hack to abuse bhs to cache blocks just
cannot be a reason to replace bhs ...

Ingo

2001-02-06 20:41:43

by Manfred Spraul

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Ben LaHaise wrote:
>
> On Tue, 6 Feb 2001, Ingo Molnar wrote:
>
> >
> > On Tue, 6 Feb 2001, Ben LaHaise wrote:
> >
> > > This small correction is the crux of the problem: if it blocks, it
> > > takes away from the ability of the process to continue doing useful
> > > work. If it returns -EAGAIN, then that's okay, the io will be
> > > resubmitted later when other disk io has completed. But, it should be
> > > possible to continue servicing network requests or user io while disk
> > > io is underway.
> >
> > typical blocking point is waiting for page completion, not
> > __wait_request(). But, this is really not an issue, NR_REQUESTS can be
> > increased anytime. If NR_REQUESTS is large enough then think of it as the
> > 'absolute upper limit of doing IO', and think of the blocking as 'the
> > kernel pulling the brakes'.
>
> =) This is what I'm seeing: lots of processes waiting with wchan ==
> __get_request_wait. With async io and a database flushing lots of io
> asynchronously spread out across the disk, the NR_REQUESTS limit is hit
> very quickly.
>
Has that anything to do with kiobuf or buffer head?

Several kernel functions need a "dontblock" parameter (or a callback, or
a waitqueue address, or a tq_struct pointer).

--
Manfred

2001-02-06 20:50:14

by Jens Axboe

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06 2001, Ben LaHaise wrote:
> =) This is what I'm seeing: lots of processes waiting with wchan ==
> __get_request_wait. With async io and a database flushing lots of io
> asynchronously spread out across the disk, the NR_REQUESTS limit is hit
> very quickly.

You can't do async I/O this way! Going by what Linus said, make submit_bh
return an int telling you if it failed to queue the buffer and use
READA/WRITEA to submit it.
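
(A sketch of the calling convention being suggested here - note that the
stock submit_bh() returns void, so the int return and the helper below are
hypothetical:)

if (submit_bh(READA, bh)) {
        /* no free request slot: park the bh on a private list and
         * resubmit it later, e.g. from a completion callback */
        my_defer_for_resubmit(bh);      /* made-up helper */
}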

--
Jens Axboe

2001-02-06 20:51:33

by Jens Axboe

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06 2001, Manfred Spraul wrote:
> > =) This is what I'm seeing: lots of processes waiting with wchan ==
> > __get_request_wait. With async io and a database flushing lots of io
> > asynchronously spread out across the disk, the NR_REQUESTS limit is hit
> > very quickly.
> >
> Has that anything to do with kiobuf or buffer head?

Nothing

> Several kernel functions need a "dontblock" parameter (or a callback, or
> a waitqueue address, or a tq_struct pointer).

We don't even need that, non-blocking is implicitly applied with READA.

--
Jens Axboe

2001-02-06 20:56:13

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Tue, 6 Feb 2001, Ingo Molnar wrote:

>
> On Tue, 6 Feb 2001, Christoph Hellwig wrote:
>
> > The second is that bh's are two things:
> >
> > - a cacheing object
> > - an io buffer
> >
> > This is not really an clean appropeach, and I would really like to get
> > away from it.
>
> caching bmap() blocks was a recent addition around 2.3.20, and i suggested
> some time ago to cache pagecache blocks via explicit entries in struct
> page. That would be one solution - but it creates overhead.

Think about a given number of pages which are physically contiguous on
disk -- you don't need to cache the block number for each page, you just
need to cache the physical block number of the first page of the
"cluster".

SGI's pagebuf does that, and it would be great if we had something similar
in 2.5.

It allows us to have fast IO clustering.
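
(Roughly the kind of per-cluster state being described - the names below
are invented for illustration, not SGI's actual pagebuf structures:)

struct io_cluster {
        struct page     *first_page;    /* first page of the cluster  */
        unsigned long    start_block;   /* on-disk block of that page */
        unsigned int     nr_pages;      /* pages contiguous on disk   */
};

/* the block of page i is then start_block + i * (PAGE_SIZE >> blkbits),
 * instead of one cached block number per page */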

> but there isnt anything wrong with having the bhs around to cache blocks -
> think of it as a 'cached and recycled IO buffer entry, with the block
> information cached'.

Usually we need to cache only block information (for clustering), and not
all the other stuff which buffer_head holds.

> frankly, my quick (and limited) hack to abuse bhs to cache blocks just
> cannot be a reason to replace bhs ...


2001-02-06 21:00:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Tue, 6 Feb 2001, Marcelo Tosatti wrote:

> Think about a given number of pages which are physically contiguous on
> disk -- you dont need to cache the block number for each page, you
> just need to cache the physical block number of the first page of the
> "cluster".

ranges are a hell of a lot more trouble to get right than page or
block-sized objects - and typical access patterns are rarely 'ranged'. As
long as the basic unit is not 'too small' (ie. not 512 bytes, but something
more sane, like 4096 bytes), i don't think ranging done in higher levels
buys us anything valuable. And we do ranging at the request layer already
... Guess why most CPUs ended up having pages, and not "memory ranges"?
It's simpler, thus faster in the common case and easier to debug.

> Usually we need to cache only block information (for clustering), and
> not all the other stuff which buffer_head holds.

well, the other issue is that buffer_heads hold buffer-cache details as
well. But i think it's too small right now to justify any splitup - and
those issues are related enough to have significant allocation-merging
effects.

Ingo

2001-02-06 21:00:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Tue, 6 Feb 2001, Christoph Hellwig wrote:
>
> The second is that bh's are two things:
>
> - a cacheing object
> - an io buffer

Actually, they really aren't.

They kind of _used_ to be, but more and more they've moved away from that
historical use. Check in particular the page cache, and as a really
extreme case the swap cache version of the page cache.

It certainly _used_ to be true that "bh"s were actually first-class memory
management citizens, and actually had a data buffer and a cache associated
with them. And because of that historical baggage, that's how many people
still think of them.

These days, it's really not true any more. A "bh" doesn't really have an
IO buffer intrinsically associated with it any more - all memory management
is done on a _page_ level, and it really works the other way around, ie a
page can have one or more bh's associated with it as the IO entity.

This _does_ show up in the bh itself: you find that bh's end up having the
bh->b_page pointer in it, which is really a layering violation these days,
but you'll notice that it's actually not used very much, and it could
probably be largely removed.

The most fundamental use of it (from an IO standpoint) is actually to
handle high memory issues, because high-memory handling is very
fundamentally based on "struct page", and in order to be able to have
high-memory IO buffers you absolutely have to have the "struct page" the
way things are done now.

(all the other uses tend to not be IO-related at all: they are stuff like
the callbacks that want to find the page that should be free'd up)

The other part of "struct bh" is that it _does_ have support for fast
lookups, and the bh hashing. Again, from a pure IO standpoint you can
easily choose to just ignore this. It's often not used at all (in fact,
_most_ bh's aren't hashed, because the only way to find them are through
the page cache).

> This is not really an clean appropeach, and I would really like to
> get away from it.

Trust me, you really _can_ get away from it. It's not designed into the
bh's at all. You can already just allocate a single (or multiple) "struct
buffer_head" and just use them as IO objects, and give them your _own_
pointers to the IO buffer etc.

In fact, if you look at how the page cache is organized, this is what the
page cache already does. The page cache has its own IO buffer (the page
itself), and it just uses "struct buffer_head" to allocate temporary IO
entities. It _also_ uses the "struct buffer_head" to cache the meta-data
in the sense of having the buffer head also contain the physical address
on disk so that the page cache doesn't have to ask the low-level
filesystem all the time, so in that sense it actually has a double use for
it.

But you can (and _should_) think of that as a "we got the meta-data
address caching for free, and it fit with our historical use, so why not
use it?".

So you can easily do the equivalent of

- maintain your own buffers (possibly by looking up pages directly from
user space, if you want to do zero-copy kind of things)

- allocate a private buffer head ("get_unused_buffer_head()")

- make that buffer head point into your buffer

- submit the IO by just calling "submit_bh()", using the b_end_io()
callback as your way to maintain _your_ IO buffer ownership.

In particular, think of the things that you do NOT have to do:

- you do NOT have to allocate a bh-private buffer. Just point the bh at
your own buffer.

- you do NOT have to "give" your buffer to the bh. You do, of course,
want to know when the bh is done with _your_ buffer, but that's what
the b_end_io callback is all about.

- you do NOT have to hash the bh you allocated and thus expose it to
anybody else. It is YOUR private bh, and it does not show up on ANY
other lists. There are various helper functions to insert the bh on
various global lists ("mark_bh_dirty()" to put it on the dirty list,
"buffer_insert_inode_queue()" to put it on the inode lists etc), but
there is nothing in the thing that _forces_ you to expose your bh.

So don't think of "bh->b_data" as being something that the bh owns. It's
just a pointer. Think of "bh->b_data" and "bh->b_size" as _nothing_ more
than a data range in memory.

In short, you can, and often should, think of "struct buffer_head" as
nothing but an IO entity. It has some support for being more than that,
but that's secondary. That can validly be seen as another layer, that is
just so common that there is little point in splitting it up (and a lot of
purely historical reasons for not splitting it).
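
(Pulling the recipe above together into one rough, untested sketch: it uses
only the interfaces named in this mail, the bookkeeping helper is made up,
and the bh locking/state setup a real caller also needs is left out:)

static void my_end_io(struct buffer_head *bh, int uptodate)
{
        /* the IO is done: the data range is ours again */
        my_io_complete(bh, uptodate);   /* made-up bookkeeping */
        /* recycle or free the private bh here */
}

static int my_submit(int rw, kdev_t dev, unsigned long sector,
                     char *buffer, unsigned int bytes)
{
        struct buffer_head *bh = get_unused_buffer_head(0);

        if (!bh)
                return -ENOMEM;

        /* the bh does not own the buffer: it just describes a data
         * range in memory plus a location on disk */
        bh->b_data = buffer;            /* your own backing store */
        bh->b_size = bytes;             /* multiple of 512        */
        bh->b_rdev = dev;
        bh->b_rsector = sector;         /* in 512-byte units      */
        bh->b_end_io = my_end_io;

        generic_make_request(rw, bh);   /* never hashed, never exposed */
        return 0;
}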

Linus

2001-02-06 21:26:51

by Manfred Spraul

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Jens Axboe wrote:
>
> > Several kernel functions need a "dontblock" parameter (or a callback, or
> > a waitqueue address, or a tq_struct pointer).
>
> We don't even need that, non-blocking is implicitly applied with READA.
>
READA just returns - I doubt that the aio functions should poll until
there are free entries in the request queue.

The pending aio requests should be "included" into the wait_for_requests
waitqueue (ok, they don't have a process context, thus a wait queue
entry doesn't help, but these requests belong in that wait queue).

--
Manfred

2001-02-06 21:42:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Tue, 6 Feb 2001, Manfred Spraul wrote:
> Jens Axboe wrote:
> >
> > > Several kernel functions need a "dontblock" parameter (or a callback, or
> > > a waitqueue address, or a tq_struct pointer).
> >
> > We don't even need that, non-blocking is implicitly applied with READA.
> >
> READA just returns - I doubt that the aio functions should poll until
> there are free entries in the request queue.

The aio functions should NOT use READA/WRITEA. They should just use the
normal operations, waiting for requests. The thing that makes them
asynchronous is not waiting for the requests to _complete_. Which you can
already do, trivially enough.

The case for using READA/WRITEA is not that you want to do asynchronous
IO (all Linux IO is asynchronous unless you do extra work), but because
you have a case where you _might_ want to start IO, but if you don't have
a free request slot (ie there's already tons of pending IO happening), you
want the option of doing something else. This is not about aio - with aio
you _need_ to start the IO, you're just not willing to wait for it.

An example of READA/WRITEA is if you want to do opportunistic dirty page
cleaning - you might not _have_ to clean it up, but you say

"Hmm.. if you can do this simply without having to wait for other
requests, start doing the writeout in the background. If not, I'll come
back to you later after I've done more real work.."

And the Linux block device layer supports both of these kinds of "delayed
IO" already. It's all there. Today.

Linus

2001-02-06 21:57:34

by Manfred Spraul

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Linus Torvalds wrote:
>
> On Tue, 6 Feb 2001, Manfred Spraul wrote:
> > Jens Axboe wrote:
> > >
> > > > Several kernel functions need a "dontblock" parameter (or a callback, or
> > > > a waitqueue address, or a tq_struct pointer).
> > >
> > > We don't even need that, non-blocking is implicitly applied with READA.
> > >
> > READA just returns - I doubt that the aio functions should poll until
> > there are free entries in the request queue.
>
> The aio functions should NOT use READA/WRITEA. They should just use the
> normal operations, waiting for requests.

But then you end up with lots of threads blocking in get_request()

Quoting Ben's mail:
<<<<<<<<<
>
> =) This is what I'm seeing: lots of processes waiting with wchan ==
> __get_request_wait. With async io and a database flushing lots of io
> asynchronously spread out across the disk, the NR_REQUESTS limit is hit
> very quickly.
>
>>>>>>>>>

On an io bound server the request queue is always full - waiting for the
next request might take longer than the actual io.

--
Manfred

2001-02-06 22:06:24

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Tue, 6 Feb 2001, Linus Torvalds wrote:

>
>
> On Tue, 6 Feb 2001, Manfred Spraul wrote:
> > Jens Axboe wrote:
> > >
> > > > Several kernel functions need a "dontblock" parameter (or a callback, or
> > > > a waitqueue address, or a tq_struct pointer).
> > >
> > > We don't even need that, non-blocking is implicitly applied with READA.
> > >
> > READA just returns - I doubt that the aio functions should poll until
> > there are free entries in the request queue.
>
> The aio functions should NOT use READA/WRITEA. They should just use the
> normal operations, waiting for requests. The things that makes them
> asycnhronous is not waiting for the requests to _complete_. Which you can
> already do, trivially enough.

Reading write(2):

EAGAIN Non-blocking I/O has been selected using O_NONBLOCK and there was
no room in the pipe or socket connected to fd to write the data
immediately.

I see no reason why "aio functions have to block waiting for requests".

_Why_ do they?

2001-02-06 22:10:45

by Jens Axboe

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06 2001, Marcelo Tosatti wrote:
> > > > We don't even need that, non-blocking is implicitly applied with READA.
> > > >
> > > READA just returns - I doubt that the aio functions should poll until
> > > there are free entries in the request queue.
> >
> > The aio functions should NOT use READA/WRITEA. They should just use the
> > normal operations, waiting for requests. The things that makes them
> > asycnhronous is not waiting for the requests to _complete_. Which you can
> > already do, trivially enough.
>
> Reading write(2):
>
> EAGAIN Non-blocking I/O has been selected using O_NONBLOCK and there was
> no room in the pipe or socket connected to fd to write the data
> immediately.
>
> I see no reason why "aio function have to block waiting for requests".

That was my reasoning too with READA etc, but Linus seems to want us to
be able to block while submitting the I/O (as throttling, Linus?), just
not until completion.

--
Jens Axboe

2001-02-06 22:14:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Tue, 6 Feb 2001, Manfred Spraul wrote:
> >
> > The aio functions should NOT use READA/WRITEA. They should just use the
> > normal operations, waiting for requests.
>
> But then you end with lots of threads blocking in get_request()

So?

What the HELL do you expect to happen if somebody writes faster than the
disk can take?

You don't like busy-waiting. Fair enough.

So maybe blocking on a wait-queue is the right thing? Just MAYBE?

Linus

2001-02-06 22:27:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Tue, 6 Feb 2001, Jens Axboe wrote:

> On Tue, Feb 06 2001, Marcelo Tosatti wrote:
> >
> > Reading write(2):
> >
> > EAGAIN Non-blocking I/O has been selected using O_NONBLOCK and there was
> > no room in the pipe or socket connected to fd to write the data
> > immediately.
> >
> > I see no reason why "aio function have to block waiting for requests".
>
> That was my reasoning too with READA etc, but Linus seems to want that we
> can block while submitting the I/O (as throttling, Linus?) just not
> until completion.

Note the "in the pipe or socket" part.
                 ^^^^    ^^^^^^

EAGAIN is _not_ a valid return value for block devices or for regular
files. And in fact it _cannot_ be, because select() is defined to always
return 1 on them - so if a write() were to return EAGAIN, user space would
have nothing to wait on. Busy waiting is evil.

So READA/WRITEA are only useful inside the kernel, and when the caller has
some data structures of its own that it can use to gracefully handle the
case of a failure - it will try to do the IO later for some reason, maybe
deciding to do it with blocking because it has nothing better to do at the
later date, or because it decides that it can have only so many
outstanding requests.

Remember: in the end you HAVE to wait somewhere. You're always going to be
able to generate data faster than the disk can take it. SOMETHING has to
throttle - if you don't allow generic_make_request() to throttle, you have
to do it on your own at some point. It is stupid and counter-productive to
argue against throttling. The only argument can be _where_ that throttling
is done, and READA/WRITEA leaves the possibility open of doing it
somewhere else (or just delaying it and letting a future call with
READ/WRITE do the throttling).

Linus

2001-02-06 22:29:11

by Andre Hedrick

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, 6 Feb 2001, Linus Torvalds wrote:

>
>
> On Tue, 6 Feb 2001, Manfred Spraul wrote:
> > >
> > > The aio functions should NOT use READA/WRITEA. They should just use the
> > > normal operations, waiting for requests.
> >
> > But then you end with lots of threads blocking in get_request()
>
> So?
>
> What the HELL do you expect to happen if somebody writes faster than the
> disk can take?
>
> You don't like busy-waiting. Fair enough.
>
> So maybe blocking on a wait-queue is the right thing? Just MAYBE?

Did I miss a portion of the thread?
Is the block layer ignoring the status of a device?

--Andre

2001-02-06 23:03:54

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Tue, 6 Feb 2001, Linus Torvalds wrote:

> Remember: in the end you HAVE to wait somewhere. You're always going to be
> able to generate data faster than the disk can take it. SOMETHING has to
> throttle - if you don't allow generic_make_request() to throttle, you have
> to do it on your own at some point. It is stupid and counter-productive to
> argue against throttling. The only argument can be _where_ that throttling
> is done, and READA/WRITEA leaves the possibility open of doing it
> somewhere else (or just delaying it and letting a future call with
> READ/WRITE do the throttling).

It's not "arguing against throttling".

It's arguing against making a smart application block on the disk while
it's able to use the CPU for other work.

An application which sets non-blocking behavior and busy waits for a
request (which seems to be your argument) is just stupid, of course.



2001-02-06 23:27:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Tue, 6 Feb 2001, Marcelo Tosatti wrote:
>
> Its arguing against making a smart application block on the disk while its
> able to use the CPU for other work.

There are currently no other alternatives in user space. You'd have to
create whole new interfaces for aio_read/write, and ways for the kernel to
inform user space that "now you can re-try submitting your IO".

Could be done. But that's a big thing.

> An application which sets non blocking behavior and busy waits for a
> request (which seems to be your argument) is just stupid, of course.

Tell me what else it could do at some point? You need something like
select() to wait on it. There are no such interfaces right now...

(besides, latency would suck. I bet you're better off waiting for the
requests if they are all used up. It takes too long to get deep into the
kernel from user space, and you cannot use the exclusive waiters with its
anti-herd behaviour etc).

Simple rule: if you want to optimize concurrency and avoid waiting - use
several processes or threads instead. At which point you can get real work
done on multiple CPU's, instead of worrying about what happens when you
have to wait on the disk.

Linus

2001-02-07 00:27:00

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi,

On Tue, Feb 06, 2001 at 08:57:13PM +0100, Ingo Molnar wrote:
>
> [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> the raw IO code.]

No, it is a problem of the ll_rw_block interface: buffer_heads need to
be aligned on disk at a multiple of their buffer size. Under the Unix
raw IO interface it is perfectly legal to begin a 128kB IO at offset
512 bytes into a device.

--Stephen

2001-02-07 00:28:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:

> No, it is a problem of the ll_rw_block interface: buffer_heads need to
> be aligned on disk at a multiple of their buffer size. Under the Unix
> raw IO interface it is perfectly legal to begin a 128kB IO at offset
> 512 bytes into a device.

then we should either fix this limitation, or the raw IO code should split
the request up into several, variable-size bhs, so that the range is
filled out optimally with aligned bhs.
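
(For illustration, the splitting itself is only a few lines: pick the
largest power-of-two chunk that is both aligned at the current sector and
still fits. Sizes are in 512-byte sectors, and MAX_SECTORS is just a
made-up cap for the example:)

#define MAX_SECTORS     8       /* cap each bh at 4kB for the example */

static unsigned int next_bh_sectors(unsigned long sector,
                                    unsigned long sectors_left)
{
        unsigned int chunk = MAX_SECTORS;

        /* shrink until the chunk is aligned at 'sector' and fits */
        while (chunk > 1 &&
               ((sector & (chunk - 1)) || chunk > sectors_left))
                chunk >>= 1;

        return chunk;           /* sectors to cover with this bh */
}

Starting 512 bytes into the device, a 128kB IO then comes out as a 512-byte
bh, a 1kB bh, a 2kB bh, a run of 4kB bhs and a 512-byte tail, all aligned.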

Ingo

2001-02-07 00:36:20

by Jens Axboe

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Wed, Feb 07 2001, Stephen C. Tweedie wrote:
> > [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> > the raw IO code.]
>
> No, it is a problem of the ll_rw_block interface: buffer_heads need to
> be aligned on disk at a multiple of their buffer size. Under the Unix
> raw IO interface it is perfectly legal to begin a 128kB IO at offset
> 512 bytes into a device.

Submitting buffers to lower layers that are not hw sector aligned
can't be supported below ll_rw_blk anyway (they can, but look at the
problems this has always created), and I would much rather see stuff
like this handled outside of there.

--
Jens Axboe

2001-02-07 00:41:22

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi,

On Tue, Feb 06, 2001 at 07:25:19PM -0500, Ingo Molnar wrote:
>
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
>
> > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > be aligned on disk at a multiple of their buffer size. Under the Unix
> > raw IO interface it is perfectly legal to begin a 128kB IO at offset
> > 512 bytes into a device.
>
> then we should either fix this limitation, or the raw IO code should split
> the request up into several, variable-size bhs, so that the range is
> filled out optimally with aligned bhs.

That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
enforces a single blocksize on all requests but relaxing that
requirement is no big deal). Buffer_heads can't deal with data which
spans more than a page right now.

--Stephen

2001-02-07 00:42:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Tue, 6 Feb 2001, Ingo Molnar wrote:
>
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
>
> > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > be aligned on disk at a multiple of their buffer size. Under the Unix
> > raw IO interface it is perfectly legal to begin a 128kB IO at offset
> > 512 bytes into a device.
>
> then we should either fix this limitation, or the raw IO code should split
> the request up into several, variable-size bhs, so that the range is
> filled out optimally with aligned bhs.

As mentioned, no such limitation exists if you just use the right
interfaces.

Linus

2001-02-07 00:42:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
>
> On Tue, Feb 06, 2001 at 08:57:13PM +0100, Ingo Molnar wrote:
> >
> > [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> > the raw IO code.]
>
> No, it is a problem of the ll_rw_block interface: buffer_heads need to
> be aligned on disk at a multiple of their buffer size.

Ehh.. True of ll_rw_block() and submit_bh(), which are meant for the
traditional block device setup, where "b_blocknr" is the "virtual
blocknumber" and that indeed is tied in to the block size.

That's the whole _point_ of ll_rw_block() and friends - they show the
device at a different "virtual blocking" level than the low-level physical
accesses necessarily are. Which very much means that if you have a 4kB
"view" of the device, you get a stream of 4kB blocks. Not 4kB-sized
blocks at 512-byte offsets (or whatever the hardware blocking size is).

This way the interfaces are independent of the hardware blocksize. Which
is logical and what you'd expect. You need to go to a lower level to see
those kinds of blocking issues.

But it is _not_ true of "generic_make_request()" and the block IO layer in
general. It obviously _cannot_ be true, because the block I/O layer has
always had the notion of merging consecutive blocks together - regardless
of whether the end result is even a power of two or anything like that in
size. You can make an IO request for pretty much any size, as long as it's
a multiple of the hardware blocksize (normally 512 bytes, but there are
certainly devices out there with other blocksizes).

The fact is, if you have problems like the above, then you don't
understand the interfaces. And it sounds like you designed kiobuf support
around the wrong set of interfaces.

If you want to get at the _sector_ level, then you do

lock_bh();
bh->b_rdev = device;
bh->b_rsector = sector-number (where linux defines "sector" to be 512 bytes)
bh->b_size = size in bytes (must be a multiple of 512);
bh->b_data = pointer;
bh->b_end_io = callback;
generic_make_request(rw, bh);

which doesn't look all that complicated to me. What's the problem?

Linus

2001-02-07 00:48:12

by Jeff V. Merkey

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Wed, Feb 07, 2001 at 12:36:29AM +0000, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, Feb 06, 2001 at 07:25:19PM -0500, Ingo Molnar wrote:
> >
> > On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> >
> > > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > > be aligned on disk at a multiple of their buffer size. Under the Unix
> > > raw IO interface it is perfectly legal to begin a 128kB IO at offset
> > > 512 bytes into a device.
> >
> > then we should either fix this limitation, or the raw IO code should split
> > the request up into several, variable-size bhs, so that the range is
> > filled out optimally with aligned bhs.
>
> That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> enforces a single blocksize on all requests but that relaxing that
> requirement is no big deal). Buffer_heads can't deal with data which
> spans more than a page right now.


I can handle requests larger than a page (64K) but I am not using
the buffer cache in Linux. We really need an NT/NetWare like model
to support the non-Unix FS's properly.

i.e.

a disk request should be

<disk> <lba> <length> <buffer> and get rid of this fixed block
stuff with buffer heads. :-)

I understand that the way the elevator is implemented in Linux makes
this very hard at this point to support, since it's very troublesome
to handle requests that overlap sector boundaries.

Jeff


>
> --Stephen

2001-02-07 00:51:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
>
> That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> enforces a single blocksize on all requests but that relaxing that
> requirement is no big deal). Buffer_heads can't deal with data which
> spans more than a page right now.

Stephen, you're so full of shit lately that it's unbelievable. You're
batting a clear 0.000 so far.

"struct buffer_head" can deal with pretty much any size: the only thing it
cares about is bh->b_size.

It so happens that if you have highmem support, then "create_bounce()"
will work on a per-page thing, but that just means that you'd better have
done your bouncing into low memory before you call generic_make_request().

Have you ever spent even just 5 minutes actually _looking_ at the block
device layer, before you decided that you think it needs to be completely
re-done some other way? It appears that you never bothered to.

Sure, I would not be surprised if some device driver ends up being
surprised if you start passing it different request sizes than it is used
to. But that's a driver and testing issue, nothing more.

(Which is not to say that "driver and testing" issues aren't important as
hell: it's one of the more scary things in fact, and it can take a long
time to get right if you start doing something that historically has never
been done and thus has historically never gotten any testing. So I'm not
saying that it should work out-of-the-box. But I _am_ saying that there's
no point in trying to re-design upper layers that already do ALL of this
with no problems at all).

Linus

2001-02-07 00:56:42

by Jeff V. Merkey

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06, 2001 at 04:50:19PM -0800, Linus Torvalds wrote:
>
>
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> >
> > That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> > enforces a single blocksize on all requests but that relaxing that
> > requirement is no big deal). Buffer_heads can't deal with data which
> > spans more than a page right now.
>
> Stephen, you're so full of shit lately that it's unbelievable. You're
> batting a clear 0.000 so far.
>
> "struct buffer_head" can deal with pretty much any size: the only thing it
> cares about is bh->b_size.
>
> It so happens that if you have highmem support, then "create_bounce()"
> will work on a per-page thing, but that just means that you'd better have
> done your bouncing into low memory before you call generic_make_request().
>
> Have you ever spent even just 5 minutes actually _looking_ at the block
> device layer, before you decided that you think it needs to be completely
> re-done some other way? It appears that you never bothered to.
>
> Sure, I would not be surprised if some device driver ends up being
> surpised if you start passing it different request sizes than it is used
> to. But that's a driver and testing issue, nothing more.
>
> (Which is not to say that "driver and testing" issues aren't important as
> hell: it's one of the more scary things in fact, and it can take a long
> time to get right if you start doing somehting that historically has never
> been done and thus has historically never gotten any testing. So I'm not
> saying that it should work out-of-the-box. But I _am_ saying that there's
> no point in trying to re-design upper layers that already do ALL of this
> with no problems at all).
>
> Linus
>

I remember Linus asking to try this variable buffer head chaining
thing 512-1024-512 kind of stuff several months back, and mixing them to
see what would happen -- result: about half the drivers break with it.
The interface allows you to do it, and I've tried it (it works on Andre's
drivers, but a lot of SCSI drivers break), but a lot of drivers seem to
have assumptions about these things all being the same size in a
buffer head chain.

:-)

Jeff



2001-02-07 01:03:25

by Jens Axboe

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> I remember Linus asking to try this variable buffer head chaining
> thing 512-1024-512 kind of stuff several months back, and mixing them to
> see what would happen -- result. About half the drivers break with it.
> The interface allows you to do it, I've tried it, (works on Andre's
> drivers, but a lot of SCSI drivers break) but a lot of drivers seem to
> have assumptions about these things all being the same size in a
> buffer head chain.

I don't see anything that would break doing this, in fact you can
do this as long as the buffers are all at least a multiple of the
block size. All the drivers I've inspected handle this fine, no one
assumes that rq->bh->b_size is the same in all the buffers attached
to the request. This includes SCSI (scsi_lib.c builds sg tables),
IDE, and the Compaq array + Mylex driver. This mostly leaves the
"old-style" drivers using CURRENT etc, the kernel helpers for these
handle it as well.
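
(As a sketch of why that is - in an sg-building driver each bh in the
request simply becomes one segment of its own length. This is illustrative
only, not the actual scsi_lib.c code, and it ignores the merging of
physically adjacent segments:)

struct buffer_head *bh;
int nsegs = 0;

for (bh = rq->bh; bh; bh = bh->b_reqnext) {
        sg[nsegs].address = bh->b_data;
        sg[nsegs].length  = bh->b_size; /* per-bh size: nothing here
                                           assumes they are all equal */
        nsegs++;
}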

So I would appreciate pointers to these devices that break so we
can inspect them.

--
Jens Axboe

2001-02-07 01:02:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Tue, 6 Feb 2001, Jeff V. Merkey wrote:

> I remember Linus asking to try this variable buffer head chaining
> thing 512-1024-512 kind of stuff several months back, and mixing them
> to see what would happen -- result. About half the drivers break with
> it. [...]

time to fix them then - instead of rewriting the rest of the kernel ;-)

Ingo

2001-02-07 01:05:12

by Jeff V. Merkey

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Wed, Feb 07, 2001 at 02:01:54AM +0100, Ingo Molnar wrote:
>
> On Tue, 6 Feb 2001, Jeff V. Merkey wrote:
>
> > I remember Linus asking to try this variable buffer head chaining
> > thing 512-1024-512 kind of stuff several months back, and mixing them
> > to see what would happen -- result. About half the drivers break with
> > it. [...]
>
> time to fix them then - instead of rewriting the rest of the kernel ;-)
>
> Ingo

I agree.

Jeff

2001-02-07 01:06:32

by Jeff V. Merkey

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Wed, Feb 07, 2001 at 02:02:21AM +0100, Jens Axboe wrote:
> On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> > I remember Linus asking to try this variable buffer head chaining
> > thing 512-1024-512 kind of stuff several months back, and mixing them to
> > see what would happen -- result. About half the drivers break with it.
> > The interface allows you to do it, I've tried it, (works on Andre's
> > drivers, but a lot of SCSI drivers break) but a lot of drivers seem to
> > have assumptions about these things all being the same size in a
> > buffer head chain.
>
> I don't see anything that would break doing this, in fact you can
> do this as long as the buffers are all at least a multiple of the
> block size. All the drivers I've inspected handle this fine, noone
> assumes that rq->bh->b_size is the same in all the buffers attached
> to the request. This includes SCSI (scsi_lib.c builds sg tables),
> IDE, and the Compaq array + Mylex driver. This mostly leaves the
> "old-style" drivers using CURRENT etc, the kernel helpers for these
> handle it as well.
>
> So I would appreciate pointers to these devices that break so we
> can inspect them.
>
> --
> Jens Axboe

Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.

Jeff

2001-02-07 01:07:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Tue, 6 Feb 2001, Jeff V. Merkey wrote:

> > I don't see anything that would break doing this, in fact you can
> > do this as long as the buffers are all at least a multiple of the
> > block size. All the drivers I've inspected handle this fine, noone
> > assumes that rq->bh->b_size is the same in all the buffers attached
> > to the request. This includes SCSI (scsi_lib.c builds sg tables),
> > IDE, and the Compaq array + Mylex driver. This mostly leaves the
> > "old-style" drivers using CURRENT etc, the kernel helpers for these
> > handle it as well.
> >
> > So I would appreciate pointers to these devices that break so we
> > can inspect them.
> >
> > --
> > Jens Axboe
>
> Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.

most likely some coding error on your side. buffer-size mismatches should
show up as filesystem corruption or random DMA scribble, not in-driver
oopses.

Ingo

2001-02-07 01:09:42

by Jens Axboe

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.

Do you still have this oops?

--
Jens Axboe

2001-02-07 01:10:42

by Jens Axboe

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Wed, Feb 07 2001, Ingo Molnar wrote:
> > > So I would appreciate pointers to these devices that break so we
> > > can inspect them.
> > >
> > > --
> > > Jens Axboe
> >
> > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.
>
> most likely some coding error on your side. buffer-size mismatches should
> show up as filesystem corruption or random DMA scribble, not in-driver
> oopses.

I would suspect so, aic7xxx shouldn't care about anything except the
sg entries and I would seriously doubt that it makes any such
assumptions on them :-)

--
Jens Axboe

2001-02-07 01:13:22

by Jeff V. Merkey

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Wed, Feb 07, 2001 at 02:06:27AM +0100, Ingo Molnar wrote:
>
> On Tue, 6 Feb 2001, Jeff V. Merkey wrote:
>
> > > I don't see anything that would break doing this, in fact you can
> > > do this as long as the buffers are all at least a multiple of the
> > > block size. All the drivers I've inspected handle this fine, noone
> > > assumes that rq->bh->b_size is the same in all the buffers attached
> > > to the request. This includes SCSI (scsi_lib.c builds sg tables),
> > > IDE, and the Compaq array + Mylex driver. This mostly leaves the
> > > "old-style" drivers using CURRENT etc, the kernel helpers for these
> > > handle it as well.
> > >
> > > So I would appreciate pointers to these devices that break so we
> > > can inspect them.
> > >
> > > --
> > > Jens Axboe
> >
> > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.
>
> most likely some coding error on your side. buffer-size mismatches should
> show up as filesystem corruption or random DMA scribble, not in-driver
> oopses.
>
> Ingo

The oops was in my code, but was caused by these drivers. The Adaptec
driver did have an oops that was at its own code address; AIC7XXX
crashed in my code.

Jeff

2001-02-07 01:14:02

by Jeff V. Merkey

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Wed, Feb 07, 2001 at 02:08:53AM +0100, Jens Axboe wrote:
> On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.
>
> Do you still have this oops?
>

I can recreate. Will work on it tomorrow. SCI testing today.

Jeff

> --
> Jens Axboe

2001-02-07 01:12:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Wed, 7 Feb 2001, Jens Axboe wrote:

> > > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.
> >
> > most likely some coding error on your side. buffer-size mismatches should
> > show up as filesystem corruption or random DMA scribble, not in-driver
> > oopses.
>
> I would suspect so, aic7xxx shouldn't care about anything except the
> sg entries and I would seriously doubt that it makes any such
> assumptions on them :-)

yep - and not a single reference to b_size in aic7xxx.c.

Ingo

2001-02-07 01:20:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Wed, 7 Feb 2001, Jens Axboe wrote:
>
> I don't see anything that would break doing this, in fact you can
> do this as long as the buffers are all at least a multiple of the
> block size. All the drivers I've inspected handle this fine, noone
> assumes that rq->bh->b_size is the same in all the buffers attached
> to the request.

It's really easy to get this wrong when going forward in the request list:
you need to make sure that you update "request->current_nr_sectors" each
time you move on to the next bh.

I would not be surprised if some of them have been seriously buggered.

On the other hand, I would _also_ not be surprised if we've actually fixed
a lot of them: one of the things that the RAID code and the loopback code
test is exactly getting these kinds of issues right (not this exact one,
but similar ones).

And let's remember things like the old ultrastor driver that was totally
unable to handle anything but 1kB devices etc. I would not be _totally_
surprised if it turns out that there are still drivers out there that
remember the time when Linux only ever had 1kB buffers. Even if it is 7
years ago or so ;)

(Also, there might be drivers that are "optimized" - they set the IO
length once per request, and just never set it again as they do partial
end_io() calls. None of those kinds of issues would ever be found under
normal load, so I would be _really_ nervous about just turning it on
silently. This is all very much a 2.5.x-kind of thing ;)

Linus

2001-02-07 01:26:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Wed, 7 Feb 2001, Ingo Molnar wrote:
>
> most likely some coding error on your side. buffer-size mismatches should
> show up as filesystem corruption or random DMA scribble, not in-driver
> oopses.

I'm not sure. If I was a driver writer (and I'm happy those days are
mostly behind me ;), I would not be totally dis-inclined to check for
various limits and things.

There can be hardware out there that simply has trouble with non-native
alignment, ie be unhappy about getting a 1kB request that is aligned in
memory at a 512-byte boundary. So there are real reasons why drivers might
need updating. Don't dismiss the concerns out-of-hand.

Linus

2001-02-07 01:31:55

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi,

On Tue, Feb 06, 2001 at 04:41:21PM -0800, Linus Torvalds wrote:
>
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > be aligned on disk at a multiple of their buffer size.
>
> Ehh.. True of ll_rw_block() and submit_bh(), which are meant for the
> traditional block device setup, where "b_blocknr" is the "virtual
> blocknumber" and that indeed is tied in to the block size.
>
> The fact is, if you have problems like the above, then you don't
> understand the interfaces. And it sounds like you designed kiobuf support
> around the wrong set of interfaces.

They used the only interfaces available at the time...

> If you want to get at the _sector_ level, then you do
...
> which doesn't look all that complicated to me. What's the problem?

Doesn't this break nastily as soon as the IO hits an LVM or soft raid
device? I don't think we are safe if we create a larger-sized
buffer_head which spans a raid stripe: the raid mapping is only
applied once per buffer_head.

--Stephen

2001-02-07 01:41:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> >
> > The fact is, if you have problems like the above, then you don't
> > understand the interfaces. And it sounds like you designed kiobuf support
> > around the wrong set of interfaces.
>
> They used the only interfaces available at the time...

Ehh.. "generic_make_request()" goes back a _loong_ time. It used to be
called just "make_request()", but all my points still stand.

It's even exported to modules. As far as I know, the raid code has always
used this interface exactly because raid needed to feed back the remapped
stuff and get around the blocksizing in ll_rw_block().

This really isn't anything new. I _know_ it's there in 2.2.x, and I
would not be surprised if it was there even in 2.0.x.

> > If you want to get at the _sector_ level, then you do
> ...
> > which doesn't look all that complicated to me. What's the problem?
>
> Doesn't this break nastily as soon as the IO hits an LVM or soft raid
> device? I don't think we are safe if we create a larger-sized
> buffer_head which spans a raid stripe: the raid mapping is only
> applied once per buffer_head.

Absolutely. This is exactly what I mean by saying that low-level drivers
may not actually be able to handle new cases that they've never been asked
to do before - they just never saw anything like a 64kB request before or
something that crossed its own alignment.

But the _higher_ levels are there. And there's absolutely nothing in the
design that is a real problem. But there's no question that you might need
to fix up more than one or two low-level drivers.

(The only drivers I know better are the IDE ones, and as far as I can tell
they'd have no trouble at all with any of this. Most other normal drivers
are likely to be in this same situation. But because I've not had a reason
to test, I certainly won't guarantee even that).

Linus

2001-02-07 01:40:38

by Jens Axboe

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06 2001, Linus Torvalds wrote:
> > I don't see anything that would break doing this, in fact you can
> > do this as long as the buffers are all at least a multiple of the
> > block size. All the drivers I've inspected handle this fine, noone
> > assumes that rq->bh->b_size is the same in all the buffers attached
> > to the request.
>
> It's really easy to get this wrong when going forward in the request list:
> you need to make sure that you update "request->current_nr_sectors" each
> time you move on to the next bh.
>
> I would not be surprised if some of them have been seriously buggered.

Maybe have been, but it looks good at least with the general drivers
that I mentioned.

> [...] so I would be _really_ nervous about just turning it on
> silently. This is all very much a 2.5.x-kind of thing ;)

Then you might want to apply this :-)

--- drivers/block/ll_rw_blk.c~ Wed Feb  7 02:38:31 2001
+++ drivers/block/ll_rw_blk.c  Wed Feb  7 02:38:42 2001
@@ -1048,7 +1048,7 @@
        /* Verify requested block sizes. */
        for (i = 0; i < nr; i++) {
                struct buffer_head *bh = bhs[i];
-               if (bh->b_size % correct_size) {
+               if (bh->b_size != correct_size) {
                        printk(KERN_NOTICE "ll_rw_block: device %s: "
                               "only %d-char blocks implemented (%u)\n",
                               kdevname(bhs[0]->b_dev),

--
Jens Axboe

2001-02-07 01:46:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Wed, 7 Feb 2001, Jens Axboe wrote:
>
> > [...] so I would be _really_ nervous about just turning it on
> > silently. This is all very much a 2.5.x-kind of thing ;)
>
> Then you might want to apply this :-)
>
> --- drivers/block/ll_rw_blk.c~ Wed Feb 7 02:38:31 2001
> +++ drivers/block/ll_rw_blk.c Wed Feb 7 02:38:42 2001
> @@ -1048,7 +1048,7 @@
> /* Verify requested block sizes. */
> for (i = 0; i < nr; i++) {
> struct buffer_head *bh = bhs[i];
> - if (bh->b_size % correct_size) {
> + if (bh->b_size != correct_size) {
> printk(KERN_NOTICE "ll_rw_block: device %s: "
> "only %d-char blocks implemented (%u)\n",
> kdevname(bhs[0]->b_dev),

Actually, I'd rather leave it in, but speed it up with the saner and
faster

if (bh->b_size & (correct_size-1)) {
...

That way people who _want_ to test the odd-size thing can do so. And
normal code (that never generates requests on any other size than the
"native" size) won't ever notice either way.

(Oh, we'll eventually need to move to "correct_size == hardware
blocksize", not the "virtual blocksize" that it is now. As it it a tester
needs to set the soft-blk size by hand now).

Linus

2001-02-07 01:54:17

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi,

On Tue, Feb 06, 2001 at 04:50:19PM -0800, Linus Torvalds wrote:
>
>
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> >
> > That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> > enforces a single blocksize on all requests but that relaxing that
> > requirement is no big deal). Buffer_heads can't deal with data which
> > spans more than a page right now.
>
> "struct buffer_head" can deal with pretty much any size: the only thing it
> cares about is bh->b_size.

Right now, anything larger than a page is physically non-contiguous,
and sorry if I didn't make that explicit, but I thought that was
obvious enough that I didn't need to. We were talking about raw IO,
and as long as we're doing IO out of user anonymous data allocated
from individual pages, buffer_heads are limited to that page size in
this context.

> Have you ever spent even just 5 minutes actually _looking_ at the block
> device layer, before you decided that you think it needs to be completely
> re-done some other way? It appears that you never bothered to.

Yes. We still have this fundamental property: if a user sends in a
128kB IO, we end up having to split it up into buffer_heads and doing
a separate submit_bh() on each single one. Given our VM, PAGE_SIZE
(*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in
this case.

THAT is the overhead that I'm talking about: having to split a large
IO into small chunks, each of which just ends up having to be merged
back again into a single struct request by the *make_request code.

A constructed IO request basically doesn't care about anything in the
buffer_head except for the data pointer and size, and the completion
status info and callback. All of the physical IO description is in
the struct request by this point. The chain of buffer_heads is
carrying around a huge amount of information which isn't used by the
IO, and if the caller is something like the raw IO driver which isn't
using the buffer cache, that extra buffer_head data is just overhead.

--Stephen

2001-02-07 01:56:18

by Jens Axboe

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06 2001, Linus Torvalds wrote:
> > > [...] so I would be _really_ nervous about just turning it on
> > > silently. This is all very much a 2.5.x-kind of thing ;)
> >
> > Then you might want to apply this :-)
> >
> > --- drivers/block/ll_rw_blk.c~ Wed Feb 7 02:38:31 2001
> > +++ drivers/block/ll_rw_blk.c Wed Feb 7 02:38:42 2001
> > @@ -1048,7 +1048,7 @@
> > /* Verify requested block sizes. */
> > for (i = 0; i < nr; i++) {
> > struct buffer_head *bh = bhs[i];
> > - if (bh->b_size % correct_size) {
> > + if (bh->b_size != correct_size) {
> > printk(KERN_NOTICE "ll_rw_block: device %s: "
> > "only %d-char blocks implemented (%u)\n",
> > kdevname(bhs[0]->b_dev),
>
> Actually, I'd rather leave it in, but speed it up with the saner and
> faster
>
> if (bh->b_size & (correct_size-1)) {
> ...
>
> That way people who _want_ to test the odd-size thing can do so. And
> normal code (that never generates requests on any other size than the
> "native" size) won't ever notice either way.

Fine, as I said I didn't spot anything bad so that's why it was changed.

> (Oh, we'll eventually need to move to "correct_size == hardware
> blocksize", not the "virtual blocksize" that it is now. As it it a tester
> needs to set the soft-blk size by hand now).

Exactly, wrt earlier mail about submitting < hw block size requests to
the lower levels.

--
Jens Axboe

2001-02-07 02:38:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
>
> > "struct buffer_head" can deal with pretty much any size: the only thing it
> > cares about is bh->b_size.
>
> Right now, anything larger than a page is physically non-contiguous,
> and sorry if I didn't make that explicit, but I thought that was
> obvious enough that I didn't need to. We were talking about raw IO,
> and as long as we're doing IO out of user anonymous data allocated
> from individual pages, buffer_heads are limited to that page size in
> this context.

Sure. That's obviously also one of the reasons why the IO layer has never
seen bigger requests anyway - the data _does_ tend to be fundamentally
broken up into page-size entities, if for no other reason than that this
is how user-space sees memory.

However, I really _do_ want to have the page cache have a bigger
granularity than the smallest memory mapping size, and there are always
special cases that might be able to generate IO in bigger chunks (ie
in-kernel services etc)

> Yes. We still have this fundamental property: if a user sends in a
> 128kB IO, we end up having to split it up into buffer_heads and doing
> a separate submit_bh() on each single one. Given our VM, PAGE_SIZE
> (*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in
> this case.

Absolutely. And this is independent of what kind of interface we end up
using, whether it be kiobuf of just plain "struct buffer_head". In that
respect they are equivalent.

> THAT is the overhead that I'm talking about: having to split a large
> IO into small chunks, each of which just ends up having to be merged
> back again into a single struct request by the *make_request code.

You could easily just generate the bh then and there, if you wanted to.

Your overhead comes from the fact that you want to gather the IO together.

And I'm saying that you _shouldn't_ gather the IO. There's no point. The
gathering is sufficiently done by the low-level code anyway, and I've
tried to explain why the low-level code _has_ to do that work regardless
of what upper layers do.

You need to generate a separate sg entry for each page anyway. So why not
just use the existing one? The "struct buffer_head". Which already
_handles_ all the issues that you have complained are hard to handle.
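
(A rough sketch of "generate the bh then and there" for a multi-page IO:
one private bh per page, each submitted on its own and merged by the
request layer. Only interfaces named in this thread are used; error
handling, locking/state setup and highmem bouncing are left out, and
my_end_io is the kind of private completion callback sketched earlier:)

int i;

for (i = 0; i < nr_pages; i++) {
        struct buffer_head *bh = get_unused_buffer_head(0);

        bh->b_data = page_address(pages[i]);    /* low memory assumed */
        bh->b_size = PAGE_SIZE;
        bh->b_rdev = dev;
        bh->b_rsector = sector + i * (PAGE_SIZE >> 9);
        bh->b_end_io = my_end_io;

        generic_make_request(rw, bh);
}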

Linus

2001-02-07 09:10:57

by David Howells

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


Linus Torvalds <[email protected]> wrote:
> Actually, I'd rather leave it in, but speed it up with the saner and
> faster
>
> if (bh->b_size & (correct_size-1)) {

I presume that correct_size will always be a power of 2...

David

2001-02-07 12:19:27

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi,

On Wed, Feb 07, 2001 at 09:10:32AM +0000, David Howells wrote:
>
> I presume that correct_size will always be a power of 2...

Yes.

--Stephen

2001-02-07 14:57:37

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi,

On Tue, Feb 06, 2001 at 06:37:41PM -0800, Linus Torvalds wrote:
> >
> However, I really _do_ want to have the page cache have a bigger
> granularity than the smallest memory mapping size, and there are always
> special cases that might be able to generate IO in bigger chunks (ie
> in-kernel services etc)

No argument there.

> > Yes. We still have this fundamental property: if a user sends in a
> > 128kB IO, we end up having to split it up into buffer_heads and doing
> > a separate submit_bh() on each single one. Given our VM, PAGE_SIZE
> > (*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in
> > this case.
>
> Absolutely. And this is independent of what kind of interface we end up
> using, whether it be kiobuf or just plain "struct buffer_head". In that
> respect they are equivalent.

Sorry? I'm not sure where communication is breaking down here, but
we really don't seem to be talking about the same things. SGI's
kiobuf request patches already let us pass a large IO through the
request layer in a single unit without having to split it up to
squeeze it through the API.

> > THAT is the overhead that I'm talking about: having to split a large
> > IO into small chunks, each of which just ends up having to be merged
> > back again into a single struct request by the *make_request code.
>
> You could easily just generate the bh then and there, if you wanted to.

In the current 2.4 tree, we already do: brw_kiovec creates the
temporary buffer_heads on demand to feed them to the IO layers.

> Your overhead comes from the fact that you want to gather the IO together.

> And I'm saying that you _shouldn't_ gather the IO. There's no point.

I don't --- the underlying layer does. And that is where the overhead
is: for every single large IO being created by the higher layers,
make_request is doing a dozen or more merges because I can only feed
the IO through make_request in tiny pieces.

> The
> gathering is sufficiently done by the low-level code anyway, and I've
> tried to explain why the low-level code _has_ to do that work regardless
> of what upper layers do.

I know. The problem is the low-level code doing it a hundred times
for a single injected IO.

> You need to generate a separate sg entry for each page anyway. So why not
> just use the existing one? The "struct buffer_head". Which already
> _handles_ all the issues that you have complained are hard to handle.

Two issues here. First is that the buffer_head is an enormously
heavyweight object for a sg-list fragment. It contains a ton of
fields of interest only to the buffer cache. We could mitigate this
to some extent by ensuring that the relevant fields for IO (rsector,
size, req_next, state, data, page etc) were in a single cache line.

Secondly, the cost of adding each single buffer_head to the request
list is O(n) in the number of requests already on the list. We end up
walking potentially the entire request queue before finding the
request to merge against, and we do that again and again, once for
every single buffer_head in the list. We do this even if the caller
went in via a multi-bh ll_rw_block() call in which case we know in
advance that all of the buffer_heads are contiguous on disk.


There is a side problem: right now, things like raid remapping occur
during generic_make_request, before we have a request built. That
means that all of the raid0 remapping or raid1/5 request expanding is
being done on a per-buffer_head, not per-request, basis, so again
we're doing a whole lot of unnecessary duplicate work when an IO
larger than a buffer_head is submitted.


If you really don't mind the size of the buffer_head as a sg fragment
header, then at least I'd like us to be able to submit a pre-built
chain of bh's all at once without having to go through the remap/merge
cost for each single bh.

Cheers,
Stephen
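
To make the splitting overhead concrete: a brw_kiovec-style submission has to
manufacture a temporary buffer_head for every page-sized piece and hand each
one to submit_bh() separately. The following is only a rough 2.4-flavoured
sketch -- get_temp_bh() and region_end_io() are made-up placeholders, and
allocation/error handling is omitted -- not the actual brw_kiovec code:

/* Sketch: submit a large, disk-contiguous region one page at a time. */
static void submit_region(kdev_t dev, unsigned long first_sector,
                          struct page **pages, int nr_pages, int rw)
{
        int i;

        for (i = 0; i < nr_pages; i++) {
                struct buffer_head *bh = get_temp_bh();   /* hypothetical helper */

                bh->b_dev     = dev;
                bh->b_rdev    = dev;
                bh->b_rsector = first_sector + i * (PAGE_SIZE >> 9);
                bh->b_size    = PAGE_SIZE;
                bh->b_page    = pages[i];
                bh->b_data    = page_address(pages[i]);
                bh->b_end_io  = region_end_io;            /* hypothetical completion fn */

                /* Each call goes through make_request and may trigger an
                 * O(queue length) merge scan - once per page. */
                submit_bh(rw, bh);
        }
}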

2001-02-07 18:29:23

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote:
>
>
> On Tue, 6 Feb 2001, Christoph Hellwig wrote:
> >
> > The second is that bh's are two things:
> >
> > - a cacheing object
> > - an io buffer
>
> Actually, they really aren't.
>
> They kind of _used_ to be, but more and more they've moved away from that
> historical use. Check in particular the page cache, and as a really
> extreme case the swap cache version of the page cache.

Yes. And that's exactly why I think it's ugly to have the left-over
caching stuff in the same data structure as the IO buffer.

> It certainly _used_ to be true that "bh"s were actually first-class memory
> management citizens, and actually had a data buffer and a cache associated
> with them. And because of that historical baggage, that's how many people
> still think of them.

I do know that the pagecache is our primary cache now :)
Anyway, having that caching cruft still in there is ugly.

> > This is not really a clean approach, and I would really like to
> > get away from it.
>
> Trust me, you really _can_ get away from it. It's not designed into the
> bh's at all. You can already just allocate a single (or multiple) "struct
> buffer_head" and just use them as IO objects, and give them your _own_
> pointers to the IO buffer etc.

So true. Exactly because of that, the data structures should be
separated as well.

Christoph

--
Of course it doesn't work. We've performed a software upgrade.

2001-02-07 18:30:43

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06, 2001 at 09:35:58PM +0100, Ingo Molnar wrote:
> caching bmap() blocks was a recent addition around 2.3.20, and i suggested
> some time ago to cache pagecache blocks via explicit entries in struct
> page. That would be one solution - but it creates overhead.
>
> but there isnt anything wrong with having the bhs around to cache blocks -
> think of it as a 'cached and recycled IO buffer entry, with the block
> information cached'.

I was not talking about caching physical blocks but the remaining
buffer-cache support stuff.

Christoph

--
Of course it doesn't work. We've performed a software upgrade.
Whip me. Beat me. Make me maintain AIX.

2001-02-07 18:37:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Wed, 7 Feb 2001, Christoph Hellwig wrote:

> On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote:
> >
> > Actually, they really aren't.
> >
> > They kind of _used_ to be, but more and more they've moved away from that
> > historical use. Check in particular the page cache, and as a really
> > extreme case the swap cache version of the page cache.
>
> Yes. And that's exactly why I think it's ugly to have the left-over
> caching stuff in the same data structure as the IO buffer.

I do agree.

I would not be opposed to factoring out the "pure block IO" part from the
bh struct. It should not even be very hard. You'd do something like

struct block_io {
        .. here is the stuff needed for block IO ..
};

struct buffer_head {
        struct block_io io;
        .. here is the stuff needed for hashing etc ..
}

and then you make "generic_make_request()" and everything lower down take
just the "struct block_io".

You'd still leave "ll_rw_block()" and "submit_bh()" operating on bh's,
because they know about bh semantics (ie things like scaling the sector
number to the bh size etc). Which means that pretty much all the code
outside the block layer wouldn't even _notice_. Which is a sign of good
layering.

If you want to do this, please do go ahead.

But do realize that this is not exactly a 2.4.x thing ;)

Linus
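
A slightly fleshed-out version of the split sketched above, purely as an
illustration -- the field selection here is a guess at what the request layer
actually touches, not anything that was proposed or merged:

/* Hypothetical illustration of the split - the field selection is a guess. */
struct block_io {
        kdev_t          b_rdev;         /* device to do IO on */
        unsigned long   b_rsector;      /* real sector for the IO */
        unsigned int    b_size;         /* IO size in bytes */
        char            *b_data;        /* pointer into the page */
        struct page     *b_page;        /* backing page */
        unsigned long   b_state;        /* lock/uptodate/dirty bits */
        struct block_io *b_reqnext;     /* chain within a struct request */
        void            (*b_end_io)(struct block_io *, int uptodate);
        void            *b_private;     /* owner's completion cookie */
};

struct buffer_head {
        struct block_io io;             /* what generic_make_request() sees */
        /* ... hashing, LRU and buffer-cache state stays out here ... */
};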

2001-02-07 18:47:33

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Wed, Feb 07, 2001 at 10:36:47AM -0800, Linus Torvalds wrote:
>
>
> On Wed, 7 Feb 2001, Christoph Hellwig wrote:
>
> > On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote:
> > >
> > > Actually, they really aren't.
> > >
> > > They kind of _used_ to be, but more and more they've moved away from that
> > > historical use. Check in particular the page cache, and as a really
> > > extreme case the swap cache version of the page cache.
> >
> > Yes. And that's exactly why I think it's ugly to have the left-over
> > caching stuff in the same data structure as the IO buffer.
>
> I do agree.
>
> I would not be opposed to factoring out the "pure block IO" part from the
> bh struct. It should not even be very hard. You'd do something like
>
> struct block_io {
>         .. here is the stuff needed for block IO ..
> };
>
> struct buffer_head {
>         struct block_io io;
>         .. here is the stuff needed for hashing etc ..
> }
>
> and then you make "generic_make_request()" and everything lower down take
> just the "struct block_io".

Yep. (besides the name block_io sucks :))

> You'd still leave "ll_rw_block()" and "submit_bh()" operating on bh's,
> because they know about bh semantics (ie things like scaling the sector
> number to the bh size etc). Which means that pretty much all the code
> outside the block layer wouldn't even _notice_. Which is a sign of good
> layering.

Yep.

> If you want to do this, please do go ahead.

I'll take a look at it.

> But do realize that this is not exactly a 2.4.x thing ;)

Sure.

Christoph

--
Whip me. Beat me. Make me maintain AIX.

2001-02-07 19:13:43

by Richard Gooch

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Stephen C. Tweedie writes:
> Hi,
>
> On Tue, Feb 06, 2001 at 06:37:41PM -0800, Linus Torvalds wrote:
> > Absolutely. And this is independent of what kind of interface we end up
> > using, whether it be kiobuf or just plain "struct buffer_head". In that
> > respect they are equivalent.
>
> Sorry? I'm not sure where communication is breaking down here, but
> we really don't seem to be talking about the same things. SGI's
> kiobuf request patches already let us pass a large IO through the
> request layer in a single unit without having to split it up to
> squeeze it through the API.

Isn't Linus saying that you can use (say) 4 kiB buffer_heads, so you
don't need kiobufs? IIRC, kiobufs are page containers, so a 4 kiB
buffer_head is effectively the same thing.

> If you really don't mind the size of the buffer_head as a sg fragment
> header, then at least I'd like us to be able to submit a pre-built
> chain of bh's all at once without having to go through the remap/merge
> cost for each single bh.

Even if you are limited to feeding one buffer_head at a time, the
merge costs should be somewhat mitigated, since you'll decrease your
calls into the API by a factor of 8 or 16.
But an API extension to allow passing a pre-built chain would be even
better.

Hopefully I haven't missed the point. I've got the flu so I'm not
running on all 4 cylinders :-(

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-02-07 20:08:13

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi,

On Wed, Feb 07, 2001 at 12:12:44PM -0700, Richard Gooch wrote:
> Stephen C. Tweedie writes:
> >
> > Sorry? I'm not sure where communication is breaking down here, but
> > we really don't seem to be talking about the same things. SGI's
> > kiobuf request patches already let us pass a large IO through the
> > request layer in a single unit without having to split it up to
> > squeeze it through the API.
>
> Isn't Linus saying that you can use (say) 4 kiB buffer_heads, so you
> don't need kiobufs? IIRC, kiobufs are page containers, so a 4 kiB
> buffer_head is effectively the same thing.

kiobufs let you encode _any_ contiguous region of user VA or of an
inode's page cache contents in one kiobuf, no matter how many pages
there are in it. A write of a megabyte to a raw device can be encoded
as a single kiobuf if we want to pass the entire 1MB IO down to the
block layers untouched. That's what the page vector in the kiobuf is
for.

Doing the same thing with buffer_heads would still require a couple of
hundred of them, and you'd have to submit each such buffer_head to the
IO subsystem independently. And then the IO layer will just have to
reassemble them on the other side (and it may have to scan the
device's entire request queue once for every single buffer_head to do
so).

> But an API extension to allow passing a pre-built chain would be even
> better.

Yep.

--Stephen
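
A deliberately simplified picture of the difference being described here --
this is not the real 2.4 struct kiobuf, just the shape of the idea:

/* Simplified illustration only - not the kernel's actual definitions. */
struct toy_kiobuf {
        int             nr_pages;       /* how many pages the vector holds */
        int             offset;         /* byte offset into the first page */
        int             length;         /* total bytes covered */
        struct page     **maplist;      /* one pointer per page */
        void            (*end_io)(struct toy_kiobuf *, int uptodate);
};

/*
 * A 1MB write to a raw device, with 4kB pages:
 *   kiobuf-style:      1 descriptor + 256 page pointers, one submission.
 *   buffer_head-style: 256 separate buffer_heads, 256 submissions,
 *                      each one re-merged by the request layer.
 */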

2001-02-08 11:22:00

by Andi Kleen

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, Feb 06, 2001 at 10:14:21AM -0800, Linus Torvalds wrote:
> I will claim that you CANNOT merge at higher levels and get good
> performance.
>
> Sure, you can do read-ahead, and try to get big merges that way at a high
> level. Good for you.
>
> But you'll have a bitch of a time trying to merge multiple
> threads/processes reading from the same area on disk at roughly the same
> time. Your higher levels won't even _know_ that there is merging to be
> done until the IO requests hit the wall in waiting for the disk.

Hi,

I've tried to experimentally check this statement.

I instrumented a kernel with the following patch. It keeps a counter
for every merge between unrelated requests. An unrelated merge is
defined as a merge between requests that were allocated from different
`current' tasks. I did various tests and surprisingly I was not able
to trigger a single unrelated merge on my IDE system with various IO
loads (dbench, news expire, news sort, kernel compile, swapping ...)

So either my patch is wrong (if yes, what is wrong?), or they simply do not
happen in usual IO loads. I know that it has a few holes (e.g. it doesn't
count unrelated merges that happen from the same process, and if a process
quits and another one gets its kernel stack and the IO of both is merged, it
will be counted as a related merge), but if unrelated merges were relevant,
more of them should still show up, no?

My pet theory is that the page and buffer caches filter most unrelated
merges out. I haven't tried to use raw IO to avoid this problem, but I
expect that anything that does raw IO will do some intelligent IO
scheduling on its own anyway.

It would be interesting to know whether other people are able to
trigger unrelated merges in real loads.
Here is a patch. Display statistics using:

(echo print unrelated_merge ; echo print related_merge) | gdb vmlinux /proc/kcore


--- linux/drivers/block/ll_rw_blk.c-REQSTAT	Tue Jan 30 13:33:25 2001
+++ linux/drivers/block/ll_rw_blk.c	Thu Feb 8 01:13:57 2001
@@ -31,6 +31,9 @@
 
 #include <linux/module.h>
 
+int unrelated_merge;
+int related_merge;
+
 /*
  * MAC Floppy IWM hooks
  */
@@ -478,6 +481,7 @@
 		rq->rq_status = RQ_ACTIVE;
 		rq->special = NULL;
 		rq->q = q;
+		rq->originator = current;
 	}
 
 	return rq;
@@ -668,6 +672,11 @@
 	if (!q->merge_requests_fn(q, req, next, max_segments))
 		return;
 
+	if (next->originator != req->originator)
+		unrelated_merge++;
+	else
+		related_merge++;
+
 	q->elevator.elevator_merge_req_fn(req, next);
 	req->bhtail->b_reqnext = next->bh;
 	req->bhtail = next->bhtail;
--- linux/include/linux/blkdev.h-REQSTAT	Tue Jan 30 17:17:01 2001
+++ linux/include/linux/blkdev.h	Wed Feb 7 23:33:35 2001
@@ -45,6 +45,8 @@
 	struct buffer_head * bh;
 	struct buffer_head * bhtail;
 	request_queue_t *q;
+
+	struct task_struct *originator;
 };
 
 #include <linux/elevator.h>




-Andi

2001-02-08 11:37:51

by Pavel Machek

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi!

> > > Reading write(2):
> > >
> > > EAGAIN Non-blocking I/O has been selected using O_NONBLOCK and there was
> > > no room in the pipe or socket connected to fd to write the data
> > > immediately.
> > >
> > > I see no reason why "aio function have to block waiting for requests".
> >
> > That was my reasoning too with READA etc, but Linus seems to want that we
> > can block while submitting the I/O (as throttling, Linus?) just not
> > until completion.
>
> Note the "in the pipe or socket" part.
> ^^^^ ^^^^^^
>
> EAGAIN is _not_ a valid return value for block devices or for regular
> files. And in fact it _cannot_ be, because select() is defined to always
> return 1 on them - so if a write() were to return EAGAIN, user space would
> have nothing to wait on. Busy waiting is evil.

So you consider the inability to select() on regular files a _feature_?

It can be a pretty serious problem with slow block devices
(floppy). It also hurts when you are trying to do high-performance
reads/writes. [I know it hurt in the userspace Sherlock search engine --
a kind of small AltaVista.]

How do you write a high-performance ftp server without threads if select
on a regular file always returns "ready"?


> Remember: in the end you HAVE to wait somewhere. You're always going to be
> able to generate data faster than the disk can take it. SOMETHING

Userspace wants to _know_ when to stop. It asks politely using
"select()".
Pavel
--
I'm [email protected]. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [email protected]

2001-02-08 11:36:51

by Pavel Machek

[permalink] [raw]
Subject: select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]

Hi!

> > Its arguing against making a smart application block on the disk while its
> > able to use the CPU for other work.
>
> There are currently no other alternatives in user space. You'd have to
> create whole new interfaces for aio_read/write, and ways for the kernel to
> inform user space that "now you can re-try submitting your IO".

Why is the current select() interface not good enough?

Defining that select may say a regular file is not ready should be
enough. Okay, maybe you'd want a new fcntl() flag saying "I _really_
want this regular file to be non-blocking". No need for new
interfaces.
Pavel
--
I'm [email protected]. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [email protected]
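
For contrast, the current behaviour Pavel is arguing against can be seen from
user space: select() reports a regular file as readable immediately, even
though the following read() may still sleep on the disk. A small standalone
demo of the existing semantics (not of the proposed change); the file name is
arbitrary:

#include <fcntl.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        fd_set rfds;
        struct timeval tv = { 0, 0 };   /* poll, do not wait */
        int fd, ready;

        fd = open(argc > 1 ? argv[1] : "/etc/hosts", O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);

        /* Regular files are always reported as readable... */
        ready = select(fd + 1, &rfds, NULL, NULL, &tv);
        printf("select() says %d fd(s) ready\n", ready);

        /* ...but this read() may still sleep waiting for the disk. */
        {
                char buf[4096];
                ssize_t n = read(fd, buf, sizeof(buf));
                printf("read() returned %zd\n", n);
        }

        close(fd);
        return 0;
}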

2001-02-08 13:23:28

by Martin Dalecki

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Linus Torvalds wrote:
>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
> >
> > On Tue, 6 Feb 2001, Stephen C. Tweedie wrote:
> >
> > > The whole point of the post was that it is merging, not splitting,
> > > which is troublesome. How are you going to merge requests without
> > > having chains of scatter-gather entities each with their own
> > > completion callbacks?
> >
> > Let me just emphasize what Stephen is pointing out: if requests are
> > properly merged at higher layers, then merging is neither required nor
> > desired.
>
> I will claim that you CANNOT merge at higher levels and get good
> performance.
>
> Sure, you can do read-ahead, and try to get big merges that way at a high
> level. Good for you.
>
> But you'll have a bitch of a time trying to merge multiple
> threads/processes reading from the same area on disk at roughly the same
> time. Your higher levels won't even _know_ that there is merging to be
> done until the IO requests hit the wall in waiting for the disk.

Merging is an optimization tied tightly to the hardware, so it should
happen where you really have full knowledge and control of the hardware
-- namely in the device driver.

> Qutie frankly, this whole discussion sounds worthless. We have solved this
> problem already: it's called a "buffer head". Deceptively simple at higher
> levels, and lower levels can easily merge them together into chains and do
> fancy scatter-gather structures of them that can be dynamically extended
> at any time.
>
> The buffer heads together with "struct request" do a hell of a lot more
> than just a simple scatter-gather: it's able to create ordered lists of
> independent sg-events, together with full call-backs etc. They are
> low-cost, fairly efficient, and they have worked beautifully for years.
>
> The fact that kiobufs can't be made to do the same thing is somebody elses
> problem. I _know_ that merging has to happen late, and if others are
> hitting their heads against this issue until they turn silly, then that's
> their problem. You'll eventually learn, or you'll hit your heads into a
> pulp.

Amen.

--
- phone: +49 214 8656 283
- job: STOCK-WORLD Media AG, LEV .de (MY OPPINNIONS ARE MY OWN!)
- langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last ressort:
ru_RU.KOI8-R

2001-02-08 13:31:09

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi,

On Thu, Feb 08, 2001 at 12:15:13AM +0100, Pavel Machek wrote:
>
> > EAGAIN is _not_ a valid return value for block devices or for regular
> > files. And in fact it _cannot_ be, because select() is defined to always
> > return 1 on them - so if a write() were to return EAGAIN, user space would
> > have nothing to wait on. Busy waiting is evil.
>
> So you consider inability to select() on regular files _feature_?

Select might make some sort of sense for sequential access to files,
and for random access via lseek/read but it makes no sense at all for
pread and pwrite where select() has no idea _which_ part of the file
the user is going to want to access next.

> How do you write high-performance ftp server without threads if select
> on regular file always returns "ready"?

Select can work if the access is sequential, but async IO is a more
general solution.

Cheers,
Stephen

2001-02-08 13:55:01

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Thu, 8 Feb 2001, Stephen C. Tweedie wrote:

<snip>

> > How do you write high-performance ftp server without threads if select
> > on regular file always returns "ready"?
>
> Select can work if the access is sequential, but async IO is a more
> general solution.

Even async IO (ie aio_read/aio_write) should block on the request queue if
it's full, in Linus' mind.

2001-02-08 14:00:51

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]

On Thu, 8 Feb 2001, Pavel Machek wrote:

> Hi!
>
> > > Its arguing against making a smart application block on the disk while its
> > > able to use the CPU for other work.
> >
> > There are currently no other alternatives in user space. You'd have to
> > create whole new interfaces for aio_read/write, and ways for the kernel to
> > inform user space that "now you can re-try submitting your IO".
>
> Why is current select() interface not good enough?

Think of random disk io scattered across the disk. Think about aio_write
providing a means to perform zero copy io without needing to resort to
playing mm tricks write protecting pages in the user's page tables. It's
also a means for dealing efficiently with thousands of outstanding
requests for network io. Using a select based interface is going to be an
ugly kludge that still has all the overhead of select/poll.

-ben
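
For reference, the aio_read/aio_write calls being discussed are the POSIX AIO
interface; a minimal user-space sketch of submitting one read and waiting for
its completion looks roughly like this (link with -lrt; the file name is
arbitrary):

#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        static char buf[65536];
        struct aiocb cb;
        const struct aiocb *list[1];
        int fd;

        fd = open(argc > 1 ? argv[1] : "/etc/hosts", O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof(buf);
        cb.aio_offset = 0;

        if (aio_read(&cb) < 0) {        /* submit; does not wait for the data */
                perror("aio_read");
                return 1;
        }

        /* ... the caller is free to do other work here ... */

        list[0] = &cb;
        aio_suspend(list, 1, NULL);     /* block until this request completes */

        if (aio_error(&cb) == 0)
                printf("read %zd bytes\n", aio_return(&cb));
        else
                perror("aio_error");

        close(fd);
        return 0;
}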

2001-02-08 14:54:29

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi!

> So you consider inability to select() on regular files _feature_?

select on files is unimplementable. You can't do background file IO the
same way you do background receiving of packets on a socket. The filesystem
is synchronous. It can block.

> It can be a pretty serious problem with slow block devices
> (floppy). It also hurts when you are trying to do high-performance
> reads/writes. [I know it hurt in userspace sherlock search engine --
> kind of small altavista.]
>
> How do you write high-performance ftp server without threads if select
> on regular file always returns "ready"?

No, it's not really possible on Linux. Use the SYS$QIO call on VMS :-)

You can emulate asynchronous IO with kernel threads like FreeBSD and some
commercial Unices do, but you still need as many (possibly kernel) threads
as you have requests in flight.

> > Remember: in the end you HAVE to wait somewhere. You're always going to be
> > able to generate data faster than the disk can take it. SOMETHING
>
> Userspace wants to _know_ when to stop. It asks politely using
> "select()".

And how do you want to wait for other select()ed events if you are blocked
in wait_for_buffer in get_block (former bmap)?

Making real async IO would require rewriting all filesystems and the whole
VFS _from_scratch_. It won't happen.

Mikulas

2001-02-08 15:09:20

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Tue, 6 Feb 2001, Linus Torvalds wrote:

> There are currently no other alternatives in user space. You'd have to
> create whole new interfaces for aio_read/write, and ways for the kernel to
> inform user space that "now you can re-try submitting your IO".
>
> Could be done. But that's a big thing.

Has been done. Still needs some work, but it works pretty well. As for
throttling io, having ios submitted does not have to correspond to them
being queued in the lower layers. The main issue with async io is
limiting the amount of pinned memory for ios; if that's taken care of, I
don't think it matters how many ios are in flight.

> > An application which sets non blocking behavior and busy waits for a
> > request (which seems to be your argument) is just stupid, of course.
>
> Tell me what else it could do at some point? You need something like
> select() to wait on it. There are no such interfaces right now...
>
> (besides, latency would suck. I bet you're better off waiting for the
> requests if they are all used up. It takes too long to get deep into the
> kernel from user space, and you cannot use the exclusive waiters with its
> anti-herd behaviour etc).

Ah, but no. In fact for some things, the wait queue extensions I'm using
will be more efficient as things like test_and_set_bit for obtaining a
lock gets executed without waking up a task.

> Simple rule: if you want to optimize concurrency and avoid waiting - use
> several processes or threads instead. At which point you can get real work
> done on multiple CPU's, instead of worrying about what happens when you
> have to wait on the disk.

There do exist plenty of cases where threads are not efficient enough.
Just the stack overhead alone with 8000 threads makes things really suck.
Event based io completion means that server processes don't need to have
the overhead of select/poll. Add in NT style completion ports for waking
up the right number of worker threads off of the completion queue, and

That said, I don't expect all devices to support async io. But given
support for files, raw and sockets all the important cases are covered.
The remainder can be supported via userspace helpers.

-ben

2001-02-08 15:34:19

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Thu, 8 Feb 2001, Ben LaHaise wrote:

<snip>

> > (besides, latency would suck. I bet you're better off waiting for the
> > requests if they are all used up. It takes too long to get deep into the
> > kernel from user space, and you cannot use the exclusive waiters with its
> > anti-herd behaviour etc).
>
> Ah, but no. In fact for some things, the wait queue extensions I'm using
> will be more efficient as things like test_and_set_bit for obtaining a
> lock gets executed without waking up a task.

The latency argument is somewhat bogus because there is no problem with
checking the request queue in the aio syscalls and simply failing if it's full.

2001-02-08 15:35:39

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait


On Thu, 8 Feb 2001, Marcelo Tosatti wrote:

>
> On Thu, 8 Feb 2001, Ben LaHaise wrote:
>
> <snip>
>
> > > (besides, latency would suck. I bet you're better off waiting for the
> > > requests if they are all used up. It takes too long to get deep into the
> > > kernel from user space, and you cannot use the exclusive waiters with its
> > > anti-herd behaviour etc).
> >
> > Ah, but no. In fact for some things, the wait queue extensions I'm using
> > will be more efficient as things like test_and_set_bit for obtaining a
> > lock gets executed without waking up a task.
>
> The latency argument is somewhat bogus because there is no problem to
> check the request queue, in the aio syscalls, and simply fail if its full.

Ugh, I forgot to say: check the request queue before doing any filesystem
work.

2001-02-08 15:47:12

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

> > > How do you write high-performance ftp server without threads if select
> > > on regular file always returns "ready"?
> >
> > Select can work if the access is sequential, but async IO is a more
> > general solution.
>
> Even async IO (ie aio_read/aio_write) should block on the request queue if
> its full in Linus mind.

This is not a problem (you can create a queue big enough to handle the load).

The problem is that aio_read and aio_write are pretty useless for an ftp or
http server. You need aio_open.

Mikulas

2001-02-08 15:56:35

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Thu, 8 Feb 2001, Mikulas Patocka wrote:

> > > > How do you write high-performance ftp server without threads if select
> > > > on regular file always returns "ready"?
> > >
> > > Select can work if the access is sequential, but async IO is a more
> > > general solution.
> >
> > Even async IO (ie aio_read/aio_write) should block on the request queue if
> > its full in Linus mind.
>
> This is not problem (you can create queue big enough to handle the load).

The point is that you want to be able to not block if the queue is full (and
the queue size has nothing to do with that).

> The problem is that aio_read and aio_write are pretty useless for ftp or
> http server. You need aio_open.

Could you explain this?

2001-02-08 15:57:35

by Jens Axboe

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Thu, Feb 08 2001, Mikulas Patocka wrote:
> > Even async IO (ie aio_read/aio_write) should block on the request queue if
> > its full in Linus mind.
>
> This is not problem (you can create queue big enough to handle the load).

Well in theory, but in practice this isn't a very good idea. At some
point throwing yet more requests in there doesn't make a whole lot
of sense. You are basically _always_ going to be able to empty the request
list by dirtying lots of data.

--
Jens Axboe

2001-02-08 16:13:12

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

> > The problem is that aio_read and aio_write are pretty useless for ftp or
> > http server. You need aio_open.
>
> Could you explain this?

If the server is sending many small files, the disk spends a huge amount of
time walking the directory tree and seeking to inodes. Opening the file may
even be slower than reading it - the read is usually sequential, but the
open needs to seek to a few different areas of the disk.

And if you have a single-threaded server using open, close, aio_read and
aio_write, you actually block the whole server while it is opening a
single file. This is not how async io is supposed to work.

Mikulas
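
The usual workaround, and what "userspace helpers" elsewhere in this thread
boils down to, is to push open() onto a worker thread so the main event loop
never blocks on the directory walk. A minimal pthread sketch -- async_open()
and the callback are invented names, and a real server would use a bounded
thread pool rather than one thread per open:

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct open_req {
        char path[256];
        void (*done)(int fd, void *arg);        /* completion callback */
        void *arg;
};

static void *open_worker(void *p)
{
        struct open_req *req = p;
        int fd = open(req->path, O_RDONLY);     /* may block on disk; fine here */

        req->done(fd, req->arg);
        free(req);
        return NULL;
}

/* "aio_open" in userspace: returns immediately, callback fires later. */
static int async_open(const char *path, void (*done)(int, void *), void *arg)
{
        pthread_t tid;
        struct open_req *req = malloc(sizeof(*req));

        if (!req)
                return -1;
        strncpy(req->path, path, sizeof(req->path) - 1);
        req->path[sizeof(req->path) - 1] = '\0';
        req->done = done;
        req->arg = arg;

        if (pthread_create(&tid, NULL, open_worker, req) != 0) {
                free(req);
                return -1;
        }
        pthread_detach(tid);
        return 0;
}

static void print_fd(int fd, void *arg)
{
        printf("%s opened as fd %d\n", (const char *)arg, fd);
        if (fd >= 0)
                close(fd);
}

int main(void)
{
        async_open("/etc/hosts", print_fd, "/etc/hosts");
        sleep(1);       /* crude wait so the demo can print */
        return 0;
}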


2001-02-08 16:36:44

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Thu, 8 Feb 2001, Mikulas Patocka wrote:

> > > The problem is that aio_read and aio_write are pretty useless for ftp or
> > > http server. You need aio_open.
> >
> > Could you explain this?
>
> If the server is sending many small files, disk spends huge amount time
> walking directory tree and seeking to inodes. Maybe opening the file is
> even slower than reading it - read is usually sequential but open needs to
> seek at few areas of disk.
>
> And if you have one-threaded server using open, close, aio_read and
> aio_write, you actually block the whole server while it is opening a
> single file. This is not how async io is supposed to work.

Ok but this is not the point of the discussion.

2001-02-08 17:00:28

by Rik van Riel

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Thu, 8 Feb 2001, Mikulas Patocka wrote:

> > > You need aio_open.
> > Could you explain this?
>
> If the server is sending many small files, disk spends huge
> amount time walking directory tree and seeking to inodes. Maybe
> opening the file is even slower than reading it

Not if you have a big enough inode_cache and dentry_cache.

OTOH ... if you have enough memory the whole async IO argument
is moot anyway because all your files will be in memory too.

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-02-08 17:13:58

by James A Sutherland

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

On Thu, 8 Feb 2001, Rik van Riel wrote:

> On Thu, 8 Feb 2001, Mikulas Patocka wrote:
>
> > > > You need aio_open.
> > > Could you explain this?
> >
> > If the server is sending many small files, disk spends huge
> > amount time walking directory tree and seeking to inodes. Maybe
> > opening the file is even slower than reading it
>
> Not if you have a big enough inode_cache and dentry_cache.

Eh? However big the caches are, you can still get misses which will
require multiple (blocking) disk accesses to handle...

> OTOH ... if you have enough memory the whole async IO argument
> is moot anyway because all your files will be in memory too.

Only for cache hits. If you're doing a Mindcraft benchmark or something
with everything in RAM, you're fine - for real world servers, that's not
really an option ;-)

Really, you want/need cache MISSES to be handled without blocking. However
big the caches, short of running EVERYTHING from a ramdisk, these will
still happen!


James.

2001-02-08 17:53:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]



On Thu, 8 Feb 2001, Pavel Machek wrote:
> >
> > There are currently no other alternatives in user space. You'd have to
> > create whole new interfaces for aio_read/write, and ways for the kernel to
> > inform user space that "now you can re-try submitting your IO".
>
> Why is current select() interface not good enough?

Ehh..

One major reason is rather simple: disk request wait times tend to be on
the order of sub-millisecond (remember: if we run out of requests, that
means that we have 256 of them already queued, which means that it's very
likely that several of them will be freed up in the very near future due
to completion).

The fact is, that if you start doing write/select loops, you're going to
waste a _large_ portion of your CPU speed on it. Especially considering
that the select() call would have to go all the way down to the ll_rw_blk
layer to figure out whether there are more requests etc.

So there is (a) historical reasons that say that regular files can never
wait and EAGAIN is not an acceptable return value and (b) practical
reasons for why such an interface would be a bad one.

There are better ways to do it. Either using threads, or just having a
better aio-like interface.

Linus

2001-02-08 18:00:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Thu, 8 Feb 2001, Martin Dalecki wrote:
> >
> > But you'll have a bitch of a time trying to merge multiple
> > threads/processes reading from the same area on disk at roughly the same
> > time. Your higher levels won't even _know_ that there is merging to be
> > done until the IO requests hit the wall in waiting for the disk.
>
> Merging is an optimization tied tightly to the hardware, so it should happen
> where you really have full knowledge and control of the hardware -- namely in
> the device driver.

Or, in many cases, the device itself. There are valid reasons for not
doing merging in the driver, but they all tend to boil down to "even lower
layers can do a better job of it". They basically _never_ boil down to
"upper layers already did it for us".

That said, there tend to be advantages to doing "appropriate" clustering
at each level. Upper layers can (and do) use read-ahead to help the lower
levels. The write-out can (and currently does not) try to sort the
requests for better elevator behaviour.

The driver level can (and does) further cluster the requests - even if the
low-level device does a perfect job of ordering and merging on its own,
it's usually advantageous to have fewer (and bigger) commands in-flight in
order to have fewer completion interrupts and less command traffic on the
bus.

So it's obviously not entirely black-and-white. Upper layers can help, but
it's a mistake to think that they should "do the work".

(Note: a lot of people seem to think that "layering" means that the
complexity is in upper layers, and that lower layers should be simple and
"stupid". This is not true. A well-balanced layering would have all layers
doing potentially equally complex things - but the complexity should be
_independent_. Complex interactions are bad. But it's also bad to think
that lower levels shouldn't be allowed to optimize because they should be
"simple".)

Linus

2001-02-08 18:09:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Thu, 8 Feb 2001, Marcelo Tosatti wrote:
>
> On Thu, 8 Feb 2001, Stephen C. Tweedie wrote:
>
> <snip>
>
> > > How do you write high-performance ftp server without threads if select
> > > on regular file always returns "ready"?
> >
> > Select can work if the access is sequential, but async IO is a more
> > general solution.
>
> Even async IO (ie aio_read/aio_write) should block on the request queue if
> its full in Linus mind.

Not necessarily. I said that "READA/WRITEA" are only worth exporting
inside the kernel - because the latencies and complexities are low-level
enough that it should not be exported to user space as such.

But I could imagine a kernel aio package that does the equivalent of

        bh->b_end_io = completion_handler;
        generic_make_request(WRITE, bh);        /* this may block */
        bh = bh->b_next;

        /* Now, fill it up as much as we can.. */
        current->state = TASK_INTERRUPTIBLE;
        while (more data to be written) {
                if (generic_make_request(WRITEA, bh) < 0)
                        break;
                bh = bh->b_next;
        }

        return;

and then you make the _completion handler_ thing continue to feed more
requests. Yes, you may block at some points (because you need to always
have at least _one_ request in-flight in order to have the state machine
active), but you can basically try to avoid blocking more than necessary.

But do you see why the above can't be done from user space? It requires
that the completion handler (which runs in an interrupt context) be able
to continue to feed requests and keep the queue filled. If you don't do
that, you'll never have good throughput, because it takes too long to send
signals, re-schedule or whatever to user mode.

And do you see how it has to block _sometimes_? If people do hundreds of
AIO requests, we can't let memory just fill up with pending writes..

Linus
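
A rough sketch of the completion-handler side of the state machine described
above. It follows the return-value convention of the pseudocode in this mail
(the real 2.4 generic_make_request() returns void), and report_completion()
plus the batch structure are invented, so treat it as an illustration of the
idea rather than working kernel code:

/* Illustration only - follows the conventions of the pseudocode above. */
struct aio_batch {
        struct buffer_head *next;       /* next bh waiting to be submitted */
};

static void aio_end_io(struct buffer_head *bh, int uptodate)
{
        struct aio_batch *batch = bh->b_private;

        report_completion(bh, uptodate);        /* hypothetical: notify the submitter */

        /* Refill from the completion path so the device never goes idle.
         * (Locking against the submitting context is omitted here.) */
        while (batch->next) {
                struct buffer_head *next = batch->next;

                if (generic_make_request(WRITEA, next) < 0)
                        break;          /* queue full; retry on the next completion */
                batch->next = next->b_reqnext;
        }
}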

2001-02-08 18:39:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait



On Thu, 8 Feb 2001, Rik van Riel wrote:

> On Thu, 8 Feb 2001, Mikulas Patocka wrote:
>
> > > > You need aio_open.
> > > Could you explain this?
> >
> > If the server is sending many small files, disk spends huge
> > amount time walking directory tree and seeking to inodes. Maybe
> > opening the file is even slower than reading it
>
> Not if you have a big enough inode_cache and dentry_cache.
>
> OTOH ... if you have enough memory the whole async IO argument
> is moot anyway because all your files will be in memory too.

Note that this _is_ an important point.

You should never _ever_ think about pure IO speed as the most important
thing. Even if you get absolutely perfect IO streaming off the fastest
disk you can find, I will beat you every single time with a cached setup
that doesn't need to do IO at all.

90% of the VFS layer is all about caching, and trying to avoid IO. Of the
rest, about 9% is about trying to avoid even calling down to the low-level
filesystem, because it's faster if we can handle it at a high level
without any need to even worry about issues like physical disk addresses.
Even if those addresses are cached.

The remaining 1% is about actually getting the IO done. At that point we
end up throwing our hands in the air and saying "ok, this will be slow".

So if you design your system for disk load, you are missing a big portion
of the picture.

There are cases where IO really matter. The most notable one being
databases, certainly _not_ web or ftp servers. For web- or ftp-servers you
buy more memory if you want high performance, and you tend to be limited
by the network speed anyway (if you have multiple gigabit networks and
network speed isn't an issue, then I can also tell you that buying a few
gigabyte of RAM isn't an issue, because you are obviously working for
something like the DoD and have very little regard for the cost of the
thing ;)

For databases (and for file servers that you want to be robust over a
crash), IO throughput is an issue mainly because you need to put the damn
requests in stable memory somewhere. Which tends to mean that _write_
speed is what really matters, because the reads you can still try to cache
as efficiently as humanly possible (and the issue of database design then
turns into trying to find every single piece of locality you can, so that
the read caching works as well as possible).

Short and sweet: "aio_open()" is basically never supposed to be an issue.
If it is, you've misdesigned something, or you're trying too damn hard to
single-thread everything (and "hiding" the threading that _does_ happen by
just calling it "AIO" instead - lying to yourself, in short).

Linus

2001-02-08 19:56:23

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi,

On Thu, Feb 08, 2001 at 03:52:35PM +0100, Mikulas Patocka wrote:
>
> > How do you write high-performance ftp server without threads if select
> > on regular file always returns "ready"?
>
> No, it's not really possible on Linux. Use SYS$QIO call on VMS :-)

Ahh, but even VMS SYS$QIO is synchronous at doing opens, allocation of
the IO request packets, and mapping file location to disk blocks.
Only the data IO is ever async (and Ben's async IO stuff for Linux
provides that too).

--Stephen

2001-02-09 11:28:25

by Martin Dalecki

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Linus Torvalds wrote:
>
> On Thu, 8 Feb 2001, Rik van Riel wrote:
>
> > On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> >
> > > > > You need aio_open.
> > > > Could you explain this?
> > >
> > > If the server is sending many small files, disk spends huge
> > > amount time walking directory tree and seeking to inodes. Maybe
> > > opening the file is even slower than reading it
> >
> > Not if you have a big enough inode_cache and dentry_cache.
> >
> > OTOH ... if you have enough memory the whole async IO argument
> > is moot anyway because all your files will be in memory too.
>
> Note that this _is_ an important point.
>
> You should never _ever_ think about pure IO speed as the most important
> thing. Even if you get absolutely perfect IO streaming off the fastest
> disk you can find, I will beat you every single time with a cached setup
> that doesn't need to do IO at all.
>
> 90% of the VFS layer is all about caching, and trying to avoid IO. Of the
> rest, about 9% is about trying to avoid even calling down to the low-level
> filesystem, because it's faster if we can handle it at a high level
> without any need to even worry about issues like physical disk addresses.
> Even if those addresses are cached.
>
> The remaining 1% is about actually getting the IO done. At that point we
> end up throwing our hands in the air and saying "ok, this will be slow".
>
> So if you design your system for disk load, you are missing a big portion
> of the picture.
>
> There are cases where IO really matter. The most notable one being
> databases, certainly _not_ web or ftp servers. For web- or ftp-servers you
> buy more memory if you want high performance, and you tend to be limited
> by the network speed anyway (if you have multiple gigabit networks and
> network speed isn't an issue, then I can also tell you that buying a few
> gigabyte of RAM isn't an issue, because you are obviously working for
> something like the DoD and have very little regard for the cost of the
> thing ;)
>
> For databases (and for file servers that you want to be robust over a
> crash), IO throughput is an issue mainly because you need to put the damn
> requests in stable memory somewhere. Which tends to mean that _write_
> speed is what really matters, because the reads you can still try to cache
> as efficiently as humanly possible (and the issue of database design then
> turns into trying to find every single piece of locality you can, so that
> the read caching works as well as possible).
>
> Short and sweet: "aio_open()" is basically never supposed to be an issue.
> If it is, you've misdesigned something, or you're trying too damn hard to
> single-thread everything (and "hiding" the threading that _does_ happen by
> just calling it "AIO" instead - lying to yourself, in short).

Right - I agree with you that an AIO design is basically hiding an
inherently multi-threaded program flow. This argument is indeed very
catchy. And looking at it from another angle, one will see that most AIO
designs date from times when multi-threading in applications wasn't as
common as it is now. Most prominently, coprocesses in a shell come to my
mind as a very good example of how to handle AIO (sort of)...

2001-02-12 10:10:31

by Jamie Lokier

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Linus Torvalds wrote:
> Absolutely. This is exactly what I mean by saying that low-level drivers
> may not actually be able to handle new cases that they've never been asked
> to do before - they just never saw anything like a 64kB request before or
> something that crossed its own alignment.
>
> But the _higher_ levels are there. And there's absolutely nothing in the
> design that is a real problem. But there's no question that you might need
> to fix up more than one or two low-level drivers.
>
> (The only drivers I know better are the IDE ones, and as far as I can tell
> they'd have no trouble at all with any of this. Most other normal drivers
> are likely to be in this same situation. But because I've not had a reason
> to test, I certainly won't guarantee even that).

PCI has dma_mask, which distinguishes different device capabilities.
This nice interface handles 64-bit capable devices, 32-bit ones, ISA
limitations (the old 16MB limit) and some other strange devices.

This mask appears in block devices one way or another so that bounce
buffers are used for high addresses.

How about a mask for block devices which indicates the kinds of
alignment and lengths that the driver can handle? For old drivers that
can't be thoroughly tested, we assume the worst. Some devices have
hardware limitations. Newer, tested drivers can relax the limits.

It's probably not difficult to say, "this 64k request can't be handled
so split it into 1k requests". It integrates naturally with the
decision to use bounce buffers -- alignment restrictions cause copying
just as high addresses cause copying.

-- Jamie
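
None of this exists in the kernel, so the following is purely a sketch of the
proposal: a per-device limits structure that the block layer could consult
before deciding to split a request or bounce the buffer. All names are made up:

/* Sketch of the proposal only - none of these names exist in the kernel. */
struct blk_limits {
        unsigned int    max_segment_bytes;      /* largest single sg segment */
        unsigned int    dma_alignment_mask;     /* buffer must be aligned to this */
        unsigned long   dma_addr_mask;          /* like pci_dev.dma_mask */
};

/* Conservative defaults for old, untested drivers. */
static const struct blk_limits conservative_limits = {
        .max_segment_bytes  = 1024,
        .dma_alignment_mask = 511,
        .dma_addr_mask      = 0x00ffffff,       /* ISA-style 16MB limit */
};

/* Would a request of this size/alignment need splitting or bouncing? */
static int needs_fixup(const struct blk_limits *lim,
                       unsigned long phys_addr, unsigned int len)
{
        if (len > lim->max_segment_bytes)
                return 1;                       /* split into smaller pieces */
        if (phys_addr & lim->dma_alignment_mask)
                return 1;                       /* misaligned: bounce or split */
        if ((phys_addr + len - 1) & ~lim->dma_addr_mask)
                return 1;                       /* beyond the device's reach: bounce */
        return 0;
}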

2001-02-12 10:47:24

by Pavel Machek

[permalink] [raw]
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

Hi!

> > So you consider inability to select() on regular files _feature_?
>
> select on files is unimplementable. You can't do background file IO the
> same way you do background receiving of packets on socket. Filesystem is
> synchronous. It can block.

You can use helper friends if the VFS layer is not able to handle
background IO. Then we can do it right in linux-4.4.
Pavel

--
I'm [email protected]. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [email protected]