2005-01-07 03:41:49

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Make pipe data structure be a circular list of pages, rather than

On Fri, Jan 07, 2005 at 12:29:13AM +0000, Linux Kernel Mailing List wrote:
> ChangeSet 1.2229.1.1, 2005/01/06 16:29:13-08:00, [email protected]
> Make pipe data structure be a circular list of pages, rather than
> a circular list of one page.
> This improves pipe throughput, and allows us to (eventually)
> use these lists of page buffers for moving data around efficiently.

Interesting; how big a gain did you see?


-- wli


2005-01-07 06:35:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: Make pipe data structure be a circular list of pages, rather than



On Thu, 6 Jan 2005, William Lee Irwin III wrote:

> On Fri, Jan 07, 2005 at 12:29:13AM +0000, Linux Kernel Mailing List wrote:
> > ChangeSet 1.2229.1.1, 2005/01/06 16:29:13-08:00, [email protected]
> > Make pipe data structure be a circular list of pages, rather than
> > a circular list of one page.
> > This improves pipe throughput, and allows us to (eventually)
> > use these lists of page buffers for moving data around efficiently.
>
> Interesting; how big a gain did you see?

On my ppc970 (which Alan correctly points out is not necessarily the best
platform to test), lmbench throughput looks like this:

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                             UNIX      reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
before:
ppc970.os Linux 2.6.10 753. 1305 550. 1110.3 2051.3 932.3 926.4 2043 1291.
ppc970.os Linux 2.6.10 738. 1596 558. 1099.4 2038.8 936.6 925.0 2033 1288.
ppc970.os Linux 2.6.10 752. 1535 557. 1107.2 2043.5 937.9 932.1 2037 1292.
ppc970.os Linux 2.6.10 750. 1524 556. 1107.5 2046.0 937.2 930.4 2037 1289.
after:
ppc970.os Linux 2.6.10 976. 1594 561. 1110.6 2045.8 934.9 923.6 2039 1288.
ppc970.os Linux 2.6.10 1195 1595 549. 1102.0 2049.0 909.5 930.5 2044 1292.
ppc970.os Linux 2.6.10 1669 1418 507. 1122.9 2051.3 943.4 916.4 2035 1280.
ppc970.os Linux 2.6.10 1472 1612 557. 1092.4 2032.9 932.1 935.1 2030 1281.


ie that's a clear improvement (30-120% higher bandwidth across those four
runs). That's from having <n> (currently 16) pages of buffers in flight
instead of just one.

Of course, the full 16 you'll seldom see in real life - stdio etc tends to
buffer up at 8kB or similar, but it does mean that such an 8kB write won't
need to block for the reader to empty the pipe.
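
A minimal user-space sketch of that last point (assuming a kernel with the
multi-page pipe buffers; with the old one-page pipe the second half of this
write would have to wait for a reader):

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		int fds[2];
		char buf[8192];

		memset(buf, 'x', sizeof(buf));
		if (pipe(fds) < 0)
			return 1;

		/* Nothing has read from fds[0] yet: with a single page of
		 * pipe buffer this write would block after 4kB, with 16
		 * pages in flight it completes immediately. */
		ssize_t n = write(fds[1], buf, sizeof(buf));
		printf("wrote %zd bytes without blocking\n", n);
		return 0;
	}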

I worried a bit about latency, but my test-run of four before/after was in
the noise. It's bound to be worse because of the extra allocation, but
it's not clearly visible.

However, I've got a long-term cunning plan, which was the _real_ reason
for the pipe changes. I'm finally going to implement Larry McVoy's
"splice()" operation, and it turns out that a pipe (with the new
organization) ends up being the exact conduit between two files that you
need to "splice" an input an an output together - without any copying (the
new pipe buffering is just pointers to pages and byte ranges within those
pages).
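
Concretely, each slot in the new circular list is essentially such a
page-plus-byte-range descriptor - roughly:

	struct pipe_buffer {
		struct page *page;		/* where the data lives */
		unsigned int offset, len;	/* byte range within that page */
		struct pipe_buf_operations *ops; /* map/unmap/release hooks */
	};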

Together with a "tee()" system call (duplicate it in two), you can
basically generate pipelines and forks of data, and you'll only need to
update a reference count (and populate a new "struct pipe_buffer" thing).
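
To sketch what such a data fork could look like (hedged - tee() doesn't
exist yet, and these 3-argument calls follow the same illustrative
interface as the splice() sketches later in this thread, not a final ABI):

	/* Sketch only: fork one pipe's contents into a second pipe, then
	 * drain each copy to its own destination.  tee() would only bump
	 * reference counts on the underlying pages - no data is copied. */
	for (;;) {
		ssize_t n = tee(src[0], dup[1], ~0UL);	/* duplicate the stream */
		if (n <= 0)
			break;
		splice(src[0], out1, n);	/* consume the original */
		splice(dup[0], out2, n);	/* consume the duplicate */
	}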

But making sure that every step along the way is an improvement is part of
the philosophy.

Linus

2005-01-07 06:37:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: Make pipe data structure be a circular list of pages, rather than



On Thu, 6 Jan 2005, Linus Torvalds wrote:
>
> I worried a bit about latency, but my test-run of four before/after was in
> the noise. It's bound to be worse because of the extra allocation, but
> it's not clearly visible.

Btw, this is not to say that it isn't measurable: I'm sure it is. It's
just not as _obvious_ as the throughput improvement. If somebody were to
do some more controlled latency tests... Hint, hint.

Linus

2005-01-19 16:29:13

by Larry McVoy

[permalink] [raw]
Subject: Re: Make pipe data structure be a circular list of pages, rather than

I think you are going to regret making splice() a system call, it shouldn't
be, you'll find cases where it won't work. Make it a library call built
out of pull() and push(), see my original postings on this and you'll
follow the logic. Making splice a system call is like putting ftp
into the kernel, not a good idea.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com

2005-01-19 17:17:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: Make pipe data structure be a circular list of pages, rather than



On Wed, 19 Jan 2005, Larry McVoy wrote:
>
> I think you are going to regret making splice() a system call, it shouldn't
> be, you'll find cases where it won't work. Make it a library call built
> out of pull() and push()

Actually, if you look at my current (ugly) test-patch, the one part of it
that isn't ugly is the actual "sys_splice()" + "do_splice()" part, and the
only thing they actually do is (a) do all the required "fd -> struct file"
setup (including locking, testing readability vs writability etc), and
then (b) they check which side is a pipe, and split it up into "pull" vs
"push" at that point.

Of course, the "pull" is actually called "do_splice_from()", and the
"push" is called "do_splice_to()", but I can rename them if you want to ;)

So yes, internally it is actually a push/pull thing, with separate actions
for both. I absolutely agree that you cannot have a "splice()" action, you
need to have a _direction_. But that's actually exactly what the pipe
gives you: since all data has to originate or end in a pipe, the direction
is implicit by the file descriptors involved.

(The exception is a pipe->pipe transfer, but then the direction doesn't
matter, and you can choose whichever just ends up being easier. We happen
to consider it a "do_splice_from()" from the source pipe right now, but
that's purely a matter of "which way do you test first").

Could we show it as two system calls? Sure. It wouldn't give you anything,
though, since you need to do all the tests anyway - including the test
for whether the source or destination is a pipe which currently is the
thing that splits up the "splice()" into two cases. So it's not even like
you could optimize out the direction test - it would just become an
error case test instead.

And having just one system call means that the user doesn't need to care
(other than making sure that one end _is_ a pipe), and also allows us to
share all the code that _is_ shared (namely the "struct file *" setups and
testing - not a whole lot, but it's cleaner that way).

I think you are perhaps confused about the fact that what makes this all
possible in the first place really is the _pipe_. You probably think of
"splice()" as going from one file descriptor to another. It's not. It goes
from one (generic) file descriptor into a _pipe_, or from one pipe into
another (generic) file descriptor. So the "push"/"pull" part really is
there, but it's there in another form: if you want to copy from one file
to another, you have to

	1) open a pipe - it's the splice equivalent of allocating a
	   temporary buffer.
	repeat:
	2) "pull" from the fd into the pipe:  splice(in, pipe[1], size)
	3) "push" from the pipe into the fd:  splice(pipe[0], out, size)

so it's all there. The above is 100% conceptually equivalent to

	1) buf = malloc()
	repeat:
	2) read(in, buf, size);
	3) write(out, buf, size);

See? I think you'll find that the "push" and "pull" you look for really is
there.

The advantage here is that if either the input or the output _already_ is
a pipe, you can optimize it to just do a loop of a single splice(), with
no new temporary buffer needed. So while the above three steps are the
generic case, depending on how you do things the _common_ case may be just
a simple

	repeat:
	1) splice(in, out, maxsize);

which is why you do not want to _force_ the split. The determination of
how to do this efficiently at run-time is also very easy:

int copy_fd(int in, int out)
{
	/* Sketch - assumes <errno.h>, <stdlib.h>, <unistd.h> and some
	 * BUFSIZE definition; ECANNOTSPLICE doesn't exist yet (see below). */
	char *buf;
	int fds[2];

	for (;;) {
		int count = splice(in, out, ~0UL);
		if (count < 0) {
			if (errno == EAGAIN)
				continue;
			/* Old kernel without "splice()"? */
			if (errno == ENOSYS)
				goto read_write_case;
			/* No pipe involved? */
			if (errno == EINVAL)
				goto pipe_buffer_case;
			/* We found a pipe, but the other end cannot accept this splice */
			if (errno == ECANNOTSPLICE)
				goto read_write_case;
			return -1;
		}
		if (!count)
			return 0;
	}

pipe_buffer_case:
	if (pipe(fds) < 0)
		goto read_write_case;
	for (;;) {
		int count = splice(in, fds[1], ~0UL);
		if (count < 0) {
			if (errno == EAGAIN)
				continue;
			return -1;
		}
		if (!count)
			return 0;
		do {
			int n = splice(fds[0], out, count);
			if (n < 0) {
				if (errno == EAGAIN)
					continue;
				return -1;
			}
			if (!n) {
				errno = ENOSPC;
				return -1;
			}
			count -= n;
		} while (count > 0);
	}

read_write_case:
	buf = malloc(BUFSIZE);
	for (;;) {
		int count = read(in, buf, BUFSIZE);
		...

See? THAT is the kind of library routine you want to have (btw, there's no
"ECANNOTSPLICE" error code right now, my current example code incorrectly
returns EINVAL for both the "no pipe" case and the "found pipe but not
splicable" case).

Linus

2005-01-19 19:04:41

by Larry McVoy

[permalink] [raw]
Subject: Re: Make pipe data structure be a circular list of pages, rather than

On Wed, Jan 19, 2005 at 09:14:33AM -0800, Linus Torvalds wrote:
> I think you are perhaps confused about the fact that what makes this all
> possible in the first place really is the _pipe_. You probably think of
> "splice()" as going from one file descriptor to another. It's not.

You seem to have misunderstood the original proposal; it had little to do
with file descriptors. The idea was that different subsystems in the OS
export pull() and push() interfaces and you use them. The file descriptors
are only involved if you provide them with those interfaces (which you
would, it makes sense). You are hung up on the pipe idea; the idea I
see in my head is far more generic. Anything can play, and you don't
need a pipe at all - you need

	struct pagev {
		pageno	page;	// or whatever the type is to get the page
		u16	offset;	// could be u32 if you have large pages
		u16	len;	// ditto
	};
	struct splice {
		pagev	pages[];
		u8	gift:1;
		u8	loan:1;
	};

You can feed these into pipes for sure, but you can source and sink
them from/to anything which exports a pull()/push() entry point.
It's as generic as read()/write() in the driver object.
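
A hypothetical sketch of that interface - none of these names exist
anywhere, they just illustrate the shape of it:

	/* Each subsystem that wants to play exports the pair, the way a
	 * driver exports read()/write() in its file_operations: */
	struct splice_operations {
		/* fill "sp" with (page, offset, len) references from this object */
		int (*pull)(void *object, struct splice *sp, size_t len);
		/* consume the page references in "sp" into this object */
		int (*push)(void *object, struct splice *sp, size_t len);
	};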

I never proposed restricting this to file descriptors, that makes
no sense. File descriptors are nothing more than something which
gets you to the underlying object (socket/file/pipe/whatever) and
lets you call into that object to move some data.

See how much more generic that is? Pipes are just one source/sink, but
everything else needs to play as well. How are you going to
implement a socket sending data to a file without the VM nonsense
and the extra copies?
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com

2005-01-20 00:12:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: Make pipe data structure be a circular list of pages, rather than



On Wed, 19 Jan 2005, Larry McVoy wrote:
>
> See how much more generic that is? Pipes are just one source/sink, but
> everything else needs to play as well. How are you going to
> implement a socket sending data to a file without the VM nonsense
> and the extra copies?

Heh. This is exactly why I originally told you it is undoable - because
your model implies O(M x N) complexity, where everybody has to agree on
how to push things to everybody else.

The thing that made me so excited about the pipes is exactly that the
pipe is what finally made me see how to actually _implement_ something
like what you wanted. Instead of an "everything to everything" thing that
has O(n^2) complexity, it's an "everything to a pipe", where the pipe ends
up being the common medium for things to talk to each other.

So you have an extra level of indirection, but it also means that now the
interface complexity is purely linear - everybody needs to know how to
accept data from (and push data to) a pipe, but they do _not_ need to know
how to do it to anything else.

I agree that it's not what you described, and one of my first emails
describing it even said so, but I now notice that you weren't cc'd on
that one (I tried to cc you on some others, but not the whole discussion):

"...

And to be honest, my pipes aren't really what Larry's "splice()" was: his
notion was more of a direct "fd->fd" thing, with nothing in between. That
was what I still consider unworkable.

The problem with splicing two file descriptors together is that it needs
for them to agree on the interchange format, and these things tend to be
very expensive from an interface angle (ie the format has to _exactly_
match the "impedance" at both ends, since any conversion would make the
whole exercise pointless - you'd be better off just doing a read and a
write).

..."

so what makes the pipes special is that they are the "impedance matcher".

For example, absolutely _none_ of the regular file IO stuff wants to work
with page fragments: it's all very much about whole-page stuff in the page
cache, and that is how it has to be, considering the semantics of a
coherent mmap etc.

So your notion of having all the subsystems natively use a common format
just never struck me as workable, because they very fundamentally do _not_
want to work with the same kind of data structures, and forcing them to just
didn't seem valid. So I ignored splice as unworkable.

So think of my "splice" as giving you credit through the name, but at the
same time it definitely is something different (and in my opinion a lot
more workable). If you object to the name, I can call it "wire()" instead,
which was somebody else's suggestion, but I thought you might appreciate
the name even if it isn't _really_ what you want ;)

Linus