2005-01-07 13:27:49

by Oleg Nesterov

Subject: Re: Make pipe data structure be a circular list of pages, rather than

Hello.

pipe_writev:
> +	if (bufs < PIPE_BUFFERS) {
> +		ssize_t chars;
> +		int newbuf = (info->curbuf + bufs) & (PIPE_BUFFERS-1);

If I understand this patch correctly, then this code

	for (;;)
		write(pipe_fd, &byte, 1);

will block after writing PIPE_BUFFERS == 16 characters, no?
And pipe_inode_info will use 64K to hold 16 bytes!

Is it ok?

Maybe it makes sense to add data to the last allocated page
until buf->len > PAGE_SIZE?
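
(For reference, this is roughly the bookkeeping as I read the patch - a
simplified sketch with approximate field names, not the exact code:)

	#define PIPE_BUFFERS 16

	struct pipe_buffer {
		struct page *page;		/* one full page per slot */
		unsigned int offset, len;	/* valid bytes within that page */
	};

	struct pipe_inode_info {
		unsigned int curbuf;		/* first non-empty slot */
		unsigned int nrbufs;		/* slots currently in use */
		struct pipe_buffer bufs[PIPE_BUFFERS];
	};

	/*
	 * Without coalescing, every write - even a single byte - claims a
	 * fresh slot and a fresh page, so 16 one-byte writes occupy all
	 * 16 slots (16 * 4K = 64K) and the 17th write blocks.
	 */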

Oleg.


2005-01-07 16:19:14

by Linus Torvalds

Subject: Re: Make pipe data structure be a circular list of pages, rather than



On Fri, 7 Jan 2005, Oleg Nesterov wrote:
>
> If I understand this patch correctly, then this code
>
> 	for (;;)
> 		write(pipe_fd, &byte, 1);
>
> will block after writing PIPE_BUFFERS == 16 characters, no?
> And pipe_inode_info will use 64K to hold 16 bytes!

Yes.

> Is it ok?

If you want throughput, don't do single-byte writes. Obviously we _could_
do coalescing, but there's a reason I'd prefer to avoid it. So I consider
it a "don't do that then", and I'll wait to see if people do. I can't
think of anything that cares about performance that does that anyway:
because system calls are reasonably expensive regardless, anybody who
cares at all about performance will have buffered things up in user space.

> Maybe it makes sense to add data to the last allocated page
> until buf->len > PAGE_SIZE?

The reason I don't want to coalesce is that I don't ever want to modify a
page that is on a pipe buffer (well, at least not through the pipe buffer
- it might get modified some other way). Why? Because the long-term plan
for pipe-buffers is to allow the data to come from _other_ sources than
just a user space copy. For example, it might be a page directly from the
page cache, or a partial page that contains the data part of an skb that
just came in off the network.

With this organization, a pipe ends up being able to act as a "conduit"
for pretty much any data, including some high-bandwidth things like video
streams, where you really _really_ don't want to copy the data. So the
next stage is:

 - allow the buffer size to be set dynamically per-pipe (probably only
   increased by root, due to obvious issues, although a per-user limit is
   not out of the question - it's just a "mlock" in kernel buffer space,
   after all)
 - add per-"struct pipe_buffer" ops pointer to a structure with
   operation function pointers: "release()", "wait_for_ready()", "poll()"
   (and possibly "merge()", if we want to coalesce things, although I
   really hope we won't need to)
 - add a "splice(fd, fd)" system call that copies pages (by incrementing
   their reference count, not by copying the data!) from an input source
   to the pipe, or from a pipe to an output.
 - add a "tee(in, out1, out2)" system call that duplicates the pages
   (again, incrementing their reference count, not copying the data) from
   one pipe to two other pipes.

All of the above is basically a few lines of code (the "splice()" thing
requires some help from drivers/networking/pagecache etc, but it's not
complex help, and not everybody needs to do it - I'll start off with
_just_ a generic page cache helper to get the thing rolling, that's easy).
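
To make the per-buffer ops point concrete, here is a sketch of what such an
operations structure might look like (purely illustrative - these names and
signatures are not a committed interface):

	struct pipe_buf_operations {
		/* drop our reference to the page: return it to the page
		   cache, free it, or hand it back to the driver */
		void (*release)(struct pipe_inode_info *pipe,
				struct pipe_buffer *buf);
		/* sleep until the data in this buffer is actually there
		   (eg the page cache read has completed) */
		int (*wait_for_ready)(struct pipe_inode_info *pipe,
				      struct pipe_buffer *buf);
		/* non-blocking readiness check, for poll()/select() */
		int (*poll)(struct pipe_inode_info *pipe,
			    struct pipe_buffer *buf);
		/* optional: try to append a small write to this buffer,
		   if we end up doing coalescing after all */
		int (*merge)(struct pipe_inode_info *pipe,
			     struct pipe_buffer *buf,
			     const char __user *data, size_t len);
	};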

Now, imagine using the above in a media server, for example. Let's say
that a year or two has passed, so that the video drivers have been updated
to be able to do the splice thing, and what can you do? You can:

 - splice from the (mpeg or whatever - let's just assume that the video
   input is either digital or does the encoding on its own - like they
   pretty much all do) video input into a pipe (remember: no copies - the
   video input will just DMA directly into memory, and splice will just
   set up the pages in the pipe buffer)
 - tee that pipe to split it up
 - splice one end to a file (ie "save the compressed stream to disk")
 - splice the other end to a real-time video decoder window for your
   real-time viewing pleasure.
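
In terms of the hypothetical splice()/tee() calls above, that flow would look
roughly like this (illustrative only - these aren't real system calls yet, the
device names are made up, and error handling is omitted):

	int video = open("/dev/video0", O_RDONLY);	/* encoded input */
	int disk = open("stream.mpg", O_WRONLY | O_CREAT, 0644);
	int decoder = open("/dev/decoder", O_WRONLY);	/* made-up device */
	int src[2], to_disk[2], to_dec[2];

	pipe(src); pipe(to_disk); pipe(to_dec);

	for (;;) {
		splice(video, src[1], ~0, 0);		/* DMA'd pages -> pipe, no copy */
		tee(src[0], to_disk[1], to_dec[1]);	/* bump refcounts, no copy */
		splice(to_disk[0], disk, ~0, 0);	/* save the compressed stream */
		splice(to_dec[0], decoder, ~0, 0);	/* real-time playback */
	}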

That's the plan, at least. I think it makes sense, and the thing that
convinced me about it was (a) how simple all of this seems to be
implementation-wise (modulo details - but there are no "conceptually
complex" parts: no horrid asynchronous interfaces, no questions about
how to buffer things, no direct user access to pages that might
partially contain protected data etc etc) and (b) it's so UNIXy. If
there's something that says "the UNIX way", it's pipes between entities
that act on the data.

For example, let's say that you wanted to serve a file from disk (or any
other source) with a header to another program (or to a TCP connection, or
to whatever - it's just a file descriptor). You'd do

	fd = create_pipe_to_destination();

	input = open("filename", O_RDONLY);
	write(fd, "header goes here", length_of_header);
	for (;;) {
		ssize_t err;
		err = splice(input, fd,
			~0 /* maxlen */,
			0 /* msg flags - think "sendmsg" */);
		if (err > 0)
			continue;
		if (!err)	/* EOF */
			break;
		.. handle input errors here ..
	}

(obviously, if this is a real server, this would likely all be in a
select/epoll loop, but that just gets too hard to describe concisely, so
I'm showing the single-threaded simple version).

Further, this also ends up giving a nice pollable interface to regular
files too: just splice from the file (at any offset) into a pipe, and poll
on the result. The "splice()" will just do the page cache operations and
start the IO if necessary, the "poll()" will wait for the first page to be
actually available. All _trivially_ done with the "struct pipe_buffer"
operations.
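
(A rough sketch of that, re-using the hypothetical splice() from the example
above - "file_fd" is just some open regular file:)

	int pfd[2];
	struct pollfd pollfd;

	pipe(pfd);
	/* queue up page cache pages, starting readahead IO if needed */
	splice(file_fd, pfd[1], 64 * 1024, 0);

	/* wakes up when the first page is actually available */
	pollfd.fd = pfd[0];
	pollfd.events = POLLIN;
	poll(&pollfd, 1, -1);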

So the above kind of "send a file to another destination" should
automatically work very naturally in any poll loop: instead of filling a
writable pipe with a "write()", you just fill it with "splice()" instead
(and you can read it with a 'read()' or you just splice it to somewhere
else, or you tee() it to two destinations....).

I think the media server kind of environment is the one most easily
explained, where you have potentially tons of data that the server process
really never actually wants to _see_ - it just wants to push it on to
another process or connection or save it to a log-file or something. But
as with regular pipes, it's not a specialized interface: it really is just
a channel of communication.

The difference being that a historical UNIX pipe is always a channel
between two process spaces (ie you can only fill it and empty it into the
process address space), and the _only_ thing I'm trying to do is to have
it be able to be a channel between two different file descriptors too. You
still need the process to "control" the channel, but the data doesn't have
to touch the address space any more.

Think of all the servers or other processes that really don't care about
the data. Think of something as simple as encrypting a file, for example.
Imagine that you have hardware encryption support that does DMA from the
source, and writes the results using DMA. I think it's pretty obvious how
you'd connect this up using pipes and two splices (one for the input, one
for the output).

And notice how _flexible_ it is (both the input and the output can be any
kind of fd you want - the pipes end up doing both the "conversion" into a
common format of "list of (possibly partial) pages" and the buffering,
which is why the different "engines" don't need to care where the data
comes from, or where it goes). So while you can use it to encrypt a file
into another file, you could equally easily use it for something like

tar cvf - my_source_tree | hw_engine_encrypt | splice_to_network

and the whole pipeline would not have a _single_ actual data copy: the
pipes are channels.

Of course, since it's a pipe, the nice thing is that people don't have to
use "splice()" to access it - the above pipeline has a perfectly regular
"tar" process that probably just does regular writes. You can have a
process that does "splice()" to fill the pipe, and the other end is a
normal thing that just uses regular "read()" and doesn't even _know_ that
the pipe is using new-fangled technology to be filled.

I'm clearly enamoured with this concept. I think it's one of those few
"RightThing(tm)" that doesn't come along all that often. I don't know of
anybody else doing this, and I think it's both useful and clever. If you
now prove me wrong, I'll hate you forever ;)

Linus

2005-01-07 16:41:04

by Davide Libenzi

Subject: Re: Make pipe data structure be a circular list of pages, rather than

On Fri, 7 Jan 2005, Linus Torvalds wrote:

> I'm clearly enamoured with this concept. I think it's one of those few
> "RightThing(tm)" that doesn't come along all that often. I don't know of
> anybody else doing this, and I think it's both useful and clever. If you
> now prove me wrong, I'll hate you forever ;)

I don't know which kind of poison Larry put in your glass when you two
met, but it clearly worked :=)



- Davide

2005-01-07 16:53:02

by Alan

Subject: Re: Make pipe data structure be a circular list of pages, rather than

On Gwe, 2005-01-07 at 14:30, Oleg Nesterov wrote:
> will block after writing PIPE_BUFFERS == 16 characters, no?
> And pipe_inode_info will use 64K to hold 16 bytes!
>
> Is it ok?

That would break stuff, but holding the last page until it fills 4K
would work, or just basic sg coalescing when possible. The pipe
behaviour - particularly buffer size and the size of atomic writes - is
defined by SuS, and there are people who use pipes two ways between apps
and use the guarantees to avoid deadlocks.

2005-01-07 17:09:28

by Linus Torvalds

Subject: Re: Make pipe data structure be a circular list of pages, rather than



On Fri, 7 Jan 2005, Davide Libenzi wrote:
>
> I don't know which kind of poison Larry put in your glass when you two
> met, but it clearly worked :=)

Oh, Larry introduced the notion of "splice()" to me at least five years
ago, so whatever he put in that glass was quite slow-acting. At the time I
told him to forget it - I thought it intriguing, but couldn't see any
reasonable way to do it.

And to be honest, my pipes aren't really what Larry's "splice()" was: his
notion was more of a direct "fd->fd" thing, with nothing in between. That
was what I still consider unworkable.

The problem with splicing two file descriptors together is that it needs
for them to agree on the interchange format, and these things tend to be
very expensive from an interface angle (ie the format has to _exactly_
match the "impedance" at both ends, since any conversion would make the
whole exercise pointless - you'd be better off just doing a read and a
write).

And note that the "format" is not just the actual data transfer thing, but
also the rules about what to do with partial reads, partial writes, file
offsets, streams _without_ file offsets etc etc. And that's really what
the pipe generalization does: it decouples all the format things into an
intermediate thing that is very unambiguous, and has none of those issues.

The pipe approach also decouples things in another way: a true "splice()"
needs all entities to participate in the new world order for it to be
useful. It shares that brokenness with "sendfile()", which to some degree
exemplifies the problems with splice() (sendfile ends up being a very very
specialized uni-directional form of Larry's splice()). But realizing that
pipes _are_ the conduit and that they already exist and work with very old
interfaces suddenly means that the new splice() can be incrementally
useful.

So it took me five years of thinking about other things, until an actual
approach that would work and make sense from an implementation standpoint
finally came together.

What made me think about this particular thing is that I really thought a
lot of things want to just mmap the data, but do not actually want to go
to the expense of finding virtual addresses, handling the virtual->
physical translation at each stage. A lot of things just want to access
the buffers, _without_ the user-visible mapping. User-visibility is not
only expensive, it ends up being an impossible interface for many things
(memory mapping has page granularity, something that simply isn't true for
a lot of devices).

So in many ways, think of the new pipes as a "buffer mapping". Nothing
else. It's a way to carry arbitrary buffers along in a secure manner.

Linus

2005-01-07 17:11:15

by Alan

Subject: Re: Make pipe data structure be a circular list of pages, rather than

On Gwe, 2005-01-07 at 16:17, Linus Torvalds wrote:
> it a "don't do that then", and I'll wait to see if people do. I can't
> think of anything that cares about performance that does that anyway:
> because system calls are reasonably expensive regardless, anybody who
> cares at all about performance will have buffered things up in user space.

Actually I found a load of apps that do this but they don't care about
performance. Lots of people have signal handlers that just do

write(pipe_fd, &signr, sizeof(signr))

so they can drop signalled events into their event loops
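
Something like this, in other words (a minimal sketch of that pattern, the
usual self-pipe trick):

	static int sigpipe[2];			/* set up with pipe() at startup */

	static void handler(int signr)
	{
		/* async-signal-safe: just push the number into the pipe */
		write(sigpipe[1], &signr, sizeof(signr));
	}

	/* ... and the main loop select()s/poll()s on sigpipe[0] and read()s
	   the signal numbers out when it wakes up */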


> > Maybe it makes sense to add data to the last allocated page
> > until buf->len > PAGE_SIZE?
>
> The reason I don't want to coalesce is that I don't ever want to modify a
> page that is on a pipe buffer (well, at least not through the pipe buffer

If I can't write 4096 bytes down it one at a time without blocking from
an empty pipe then it's not a pipe in the eyes of the Unix world and the
standards.

> With this organization, a pipe ends up being able to act as a "conduit"
> for pretty much any data, including some high-bandwidth things like video
> streams, where you really _really_ don't want to copy the data. So the
> next stage is:

The data copying impact isn't very high even if it is just done for the
pipe() case for standards behaviour. You end up with one page that is
written to and then sent and then freed rather than many.

Possibly what should be done is to not use pipe() for this at all but to
create a more explicit object so we don't confuse existing code and code
that reuses buffers heavily (plenty of that around). Add "sewer()"[1]
or "conduit()" and we don't have to change the behaviour of the Unix
pipe world and risk screwing up apps, and you don't have to write
grungy corner cases to ruin the plan.

2005-01-07 17:24:18

by Linus Torvalds

Subject: Re: Make pipe data structure be a circular list of pages, rather than



On Fri, 7 Jan 2005, Alan Cox wrote:
>
> That would break stuff, but holding the last page until it fills 4K
> would work, or just basic sg coalescing when possible. The pipe
> behaviour - particularly buffer size and the size of atomic writes - is
> defined by SuS, and there are people who use pipes two ways between apps
> and use the guarantees to avoid deadlocks.

Hmm.. Are there really any other guarantees than the atomicity guarantee
for a PIPE_BUF transfer? I don't see any in SuS..

That said, existing practice obviously always trumps paper standards, and
yes, coalescing is possible.

Linus

2005-01-07 17:35:03

by Linus Torvalds

Subject: Re: Make pipe data structure be a circular list of pages, rather than



On Fri, 7 Jan 2005, Alan Cox wrote:
>
> > The reason I don't want to coalesce is that I don't ever want to modify a
> > page that is on a pipe buffer (well, at least not through the pipe buffer
>
> If I can't write 4096 bytes down it one at a time without blocking from
> an empty pipe then its not a pipe in the eyes of the Unix world and the
> standards.

Absolutely. In fact, with the new implementation, you can often write
_several_ packets of 4096 bytes without blocking (but only writes less
than PIPE_BUF are guaranteed to be done all-or-nothing). I'm very aware of
the atomicity guarantees, I'm just saying that if you try to write 4096
bytes by doing it one byte at a time, that has changed.

> > With this organization, a pipe ends up being able to act as a "conduit"
> > for pretty much any data, including some high-bandwidth things like video
> > streams, where you really _really_ don't want to copy the data. So the
> > next stage is:
>
> The data copying impact isn't very high even if it is just done for the
> pipe() case for standards behaviour. You end up with one page that is
> written to and then sent and then freed rather than many.

I absolutely agree. A regular read()/write() still copies the data, and
that's because I'm a firm believer that copying even a few kB of data is
likely to be cheaper than trying to play MM games (not just the lookup of
the physical address - all the locking, COW, etc crud that VM games
require).

So while this shares some of the issues with the zero-copy pipes of yore,
it doesn't actually do any of that for regular pipe read/writes. And
never will, as far as I'm concerned. I just don't think user zero-copy is
interesting at that level: if the user wants to access something without
copying, he uses "mmap()".

So only when the data is _not_ in user VM space, that's when increasing a
reference count is cheaper than copying. Pretty much by definition, you
already have a "struct page *" at that point, along with which part of the
page contains the data.

So the "standard behaviour" (aka just plain read/write on the pipe) is all
the same copies that it used to be. The "just move pages around" issue
only happens when you want to duplicate the stream, or if you splice
around stuff that is already in kernel buffers (or needs a kernel buffer
anyway).

Linus

2005-01-07 17:51:49

by Chris Friesen

Subject: Re: Make pipe data structure be a circular list of pages, rather than

Alan Cox wrote:
> On Gwe, 2005-01-07 at 16:17, Linus Torvalds wrote:
>
>>it a "don't do that then", and I'll wait to see if people do. I can't
>>think of anything that cares about performance that does that anyway:
>>because system calls are reasonably expensive regardless, anybody who
>>cares at all about performance will have buffered things up in user space.
>
>
> Actually I found a load of apps that do this but they don't care about
> performance. Lots of people have signal handlers that just do
>
> write(pipe_fd, &signr, sizeof(signr))
>
> so they can drop signalled events into their event loops

I would be one of those people, although I do pass the signal
information as well.

Chris

2005-01-07 17:52:41

by Linus Torvalds

Subject: Re: Make pipe data structure be a circular list of pages, rather than



On Fri, 7 Jan 2005, Linus Torvalds wrote:
>
> So the "standard behaviour" (aka just plain read/write on the pipe) is all
> the same copies that it used to be.

I want to highlight this again. The increase in throughput did _not_ come
from avoiding a copy. It came purely from the fact that we have multiple
buffers, and thus a throughput-intensive load gets to do several bigger
chunks before it needs to wait for the recipient. So the increase in
throughput comes from fewer synchronization points (scheduling and
locking), not from anything else.

Another way of saying the same thing: pipes actually used to have clearly
_lower_ bandwidth than UNIX domain sockets, even though clearly pipes are
simpler and should thus be able to be faster. The reason? UNIX domain
sockets allow multiple packets in flight, and pipes only used to have a
single buffer. With the new setup, pipes get roughly comparable
performance to UNIX domain sockets for me.

Sockets can still outperform pipes, truth be told - there's been more
work on socket locking than on pipe locking. For example, the new pipe
code should conceptually really allow one CPU to empty one pipe buffer
while another CPU fills another pipe buffer, but because I kept the
locking structure the same as it used to be, right now read/write
serialize over the whole pipe rather than anything else.

This is the main reason why I want to avoid coalescing if possible: if you
never coalesce, then each "pipe_buffer" is complete in itself, and that
simplifies locking enormously.

(The other reason to potentially avoid coalescing is that I think it might
be useful to allow the "sendmsg()/recvmsg()" interfaces that honour packet
boundaries. The new pipe code _is_ internally "packetized" after all).

Linus

2005-01-07 21:03:22

by Mike Waychison

Subject: Re: Make pipe data structure be a circular list of pages, rather than


Linus Torvalds wrote:
>
> On Fri, 7 Jan 2005, Linus Torvalds wrote:
>
>>So the "standard behaviour" (aka just plain read/write on the pipe) is all
>>the same copies that it used to be.
>
>
> I want to highlight this again. The increase in throughput did _not_ come
> from avoiding a copy. It came purely from the fact that we have multiple
> buffers, and thus a throughput-intensive load gets to do several bigger
> chunks before it needs to wait for the recipient. So the increase in
> throughput comes from fewer synchronization points (scheduling and
> locking), not from anything else.
>
> Another way of saying the same thing: pipes actually used to have clearly
> _lower_ bandwidth than UNIX domain sockets, even though clearly pipes are
> simpler and should thus be able to be faster. The reason? UNIX domain
> sockets allow multiple packets in flight, and pipes only used to have a
> single buffer. With the new setup, pipes get roughly comparable
> performance to UNIX domain sockets for me.
>
> Sockets can still outperform pipes, truth be told - there's been more
> work on socket locking than on pipe locking. For example, the new pipe
> code should conceptually really allow one CPU to empty one pipe buffer
> while another CPU fills another pipe buffer, but because I kept the
> locking structure the same as it used to be, right now read/write
> serialize over the whole pipe rather than anything else.
>
> This is the main reason why I want to avoid coalescing if possible: if you
> never coalesce, then each "pipe_buffer" is complete in itself, and that
> simplifies locking enormously.

This got me to thinking about how you can heuristically optimize away
coalescing support and still allow PAGE_SIZE bytes minimum in the
effective buffer.


Method 1)

Simply keep track of how much data is in the effective buffer. If the
count is < PAGE_SIZE, then coalesce, otherwise, don't. Very simple, but
means a high traffic pipe might see 'stuttering' between coalescing mode
and fastpath.

Method 2)

It would be interesting to see the performance of a simple heuristic
whereby writes smaller than a given threshold (PAGE_SIZE >> 3?) are
coalesced with the previous iff the previous also was less than the
threshold and there is room for the write in the given page.

With this heuristic, the minimum size of the effective buffer is only
possible if the sequence consists of alternating threshold (T) bytes and
1 byte writes.

If 'number of pages in circular buffer' (C) is even, this produces a
minimum of (T*C + C)/2 bytes in the effective buffer. We can figure out
a minimum T value for a given page count C by isolating T from:

(T*C + C)/2 = PAGE_SIZE

which is:

T = (2 * PAGE_SIZE / C) - 1

So with the assumption that we only coalesce if

- the previous write count was < T
- the current write count is < T
- there is room for the current write count in the current page

and if we set C = 16, we can safely let T = (PAGE_SIZE >> 3) - 1 and the
minimum effective buffer is guaranteed to be at least PAGE_SIZE bytes.

This should reduce the amount of coalescing work required and thus
reduce the amount of locking required for the case where performance
counts (on x86, the writer would have to write 511 bytes to use the
non-coalescing fast path).

This method is probably better as many writers will be buffering their
writes, and the size also nicely fits with block layer's blocksize.
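
In code, the check might look something like this (a sketch only - the names
here are made up, not from the actual patch):

	#define COALESCE_THRESHOLD ((PAGE_SIZE >> 3) - 1)	/* T = 511 on x86 */

	/* "last" is the most recently filled pipe_buffer (or NULL),
	   "prev_len" the size of the previous write, "this_len" of this one */
	static int should_coalesce(struct pipe_buffer *last,
				   size_t prev_len, size_t this_len)
	{
		return prev_len < COALESCE_THRESHOLD &&
		       this_len < COALESCE_THRESHOLD &&
		       last != NULL &&
		       last->offset + last->len + this_len <= PAGE_SIZE;
	}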

Just my two cents.

>
> (The other reason to potentially avoid coalescing is that I think it might
> be useful to allow the "sendmsg()/recvmsg()" interfaces that honour packet
> boundaries. The new pipe code _is_ internally "packetized" after all).
>
> Linus

--
Mike Waychison
Sun Microsystems, Inc.
1 (650) 352-5299 voice
1 (416) 202-8336 voice


2005-01-07 22:01:41

by Linus Torvalds

Subject: Re: Make pipe data structure be a circular list of pages, rather than



On Fri, 7 Jan 2005, Linus Torvalds wrote:
>
> I want to highlight this again. The increase in throughput did _not_ come
> from avoiding a copy. It came purely from the fact that we have multiple
> buffers, and thus a throughput-intensive load gets to do several bigger
> chunks before it needs to wait for the recipient. So the increase in
> throughput comes from fewer synchronization points (scheduling and
> locking), not from anything else.

Btw, from limited testing this effect seems to be much more pronounced on
SMP.

I've done some more benchmarking, and on a single-CPU Banias laptop,
bandwidth didn't really change much, and latency goes up measurably
(roughly 2.5us -> 3.0us). In contrast, on a P4 with HT, the latency change
was not noticeable (looks like 10.4us -> 10.7us), since there the locking and
system call costs dominate, and throughput goes up from 370MB/s to
900+MB/s (ie > 100% improvement).

So it's not always a win, but at the same time sometimes it's quite a big
win. We might tune the number of buffers down for the UP case, or
something.

Side note: the UNIX domain socket lmbench numbers are higher than the
memory copy numbers, implying that at least the UNIX domain socket tests
have the data set fit in the L2. That may be the reason why lmbench says
that sockets outperform pipes.

Linus

2005-01-07 22:57:34

by Diego Calleja

Subject: Re: Make pipe data structure be a circular list of pages, rather than

El Fri, 7 Jan 2005 13:59:53 -0800 (PST) Linus Torvalds <[email protected]> escribió:

> Btw, from limited testing this effect seems to be much more pronounced on
> SMP.

I've tried it on a 2xPIII, 512 MB of RAM. I've done kernel compiles... yeah
yeah, I know kernel compiles are not a good benchmark, but since Linux uses
-pipe for gcc... I thought I'd try it to see if it makes a difference
(suggestions about better methods to test this are accepted :). The results
could very well be statistical noise (it looks like it only takes a second less
on average), but I thought it might be interesting.


without the patch:
952.84user 98.64system 8:54.64elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318007minor)pagefaults 0swaps
951.78user 98.25system 8:53.71elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318009minor)pagefaults 0swaps
954.53user 99.12system 8:53.02elapsed 197%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318041minor)pagefaults 0swaps

with it:
951.67user 97.59system 8:53.12elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318010minor)pagefaults 0swaps
949.04user 97.68system 8:52.04elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318018minor)pagefaults 0swaps
948.40user 97.37system 8:51.48elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+7318011minor)pagefaults 0swaps

2005-01-07 23:21:07

by Linus Torvalds

Subject: Re: Make pipe data structure be a circular list of pages, rather than



On Fri, 7 Jan 2005, Diego Calleja wrote:
>
> I've tried it on a 2xPIII, 512 MB of RAM. I've done kernel compiles... yeah
> yeah, I know kernel compiles are not a good benchmark, but since Linux uses
> -pipe for gcc... I thought I'd try it to see if it makes a difference
> (suggestions about better methods to test this are accepted :). The results
> could very well be statistical noise (it looks like it only takes a second less
> on average), but I thought it might be interesting.

Hmm.. I'm gratified, but a bit suspicious. I'd not have expected it to be
noticeable on a kernel compile (that's a 1%+ change in total system time,
which considering all the file ops etc that a compile does means that it
would have to have made quite a big difference for the pipe case).

So your data is intriguing, but it looks a bit _too_ good to be true ;)

Random things like page cache placement (and thus cache effects) might
also account for it. But hey, if true, that's quite an improvement.

(Of course, anything the kernel does tends to be totally drowned out by
all the actual user-level work gcc does, so in the _big_ picture the 1%
change in system time means almost nothing. So even "good news" doesn't
actually matter in this case).

Linus

2005-01-07 23:49:53

by Chris Friesen

Subject: Re: Make pipe data structure be a circular list of pages, rather than

Mike Waychison wrote:

> This got me to thinking about how you can heuristically optimize away
> coalescing support and still allow PAGE_SIZE bytes minimum in the
> effective buffer.

While coalescing may be a win in some cases, there should also be some
way to tell the kernel to NOT coalesce, to handle the case where you
want minimum latency at the cost of some throughput.

Chris

2005-01-08 18:26:01

by Hugh Dickins

Subject: Re: Make pipe data structure be a circular list of pages, rather than

On Fri, 7 Jan 2005, Linus Torvalds wrote:
>
> Hmm.. Are there really any other guarantees than the atomicity guarantee
> for a PIPE_BUF transfer? I don't see any in SuS..
>
> That said, existing practice obviously always trumps paper standards, and
> yes, coalescing is possible.

Would a kernel build with "make -j18" count as existing practice?
Immediately hangs for me...

Hugh

2005-01-08 18:55:11

by Linus Torvalds

Subject: Re: Make pipe data structure be a circular list of pages, rather than



On Sat, 8 Jan 2005, Hugh Dickins wrote:
>
> Would a kernel build with "make -j18" count as existing practice?

Absolutely.

Looks like it does a single-byte write of '+' for each process started
(and I assume it will do a '-' for each process that exits ;)

Oh, well. So much for my hope to avoid coalescing.

Linus

2005-01-08 21:38:49

by Lee Revell

Subject: Re: Make pipe data structure be a circular list of pages, rather than

On Fri, 2005-01-07 at 17:46 -0600, Chris Friesen wrote:
> Mike Waychison wrote:
>
> > This got me to thinking about how you can heuristically optimize away
> > coalescing support and still allow PAGE_SIZE bytes minimum in the
> > effective buffer.
>
> While coalescing may be a win in some cases, there should also be some
> way to tell the kernel to NOT coalesce, to handle the case where you
> want minimum latency at the cost of some throughput.

Many latency critical apps use (tmpfs mounted) FIFO's for IPC; the Linux
FIFO being one of the fastest known IPC mechanisms. Each client in the
JACK (http://jackit.sf.net) graph wakes the next one by writing a single
byte to a FIFO. Ardour's GUI, control, and audio threads interact via a
similar mechanism. How would you expect this change to impact the inter
thread wakeup latency? It's confusing when people say "performance",
meaning "increased throughput albeit with more latency". For many
people that's a regression.

These apps *certainly* care about performance, they just don't define it
in terms of throughput.

And yes we do know futexes are the right tool for this but they weren't
available at the time and aren't available on 2.4.

Lee

2005-01-08 21:54:39

by Linus Torvalds

Subject: Re: Make pipe data structure be a circular list of pages, rather than



On Sat, 8 Jan 2005, Lee Revell wrote:
>
> Many latency critical apps use (tmpfs mounted) FIFO's for IPC; the Linux
> FIFO being one of the fastest known IPC mechanisms. Each client in the
> JACK (http://jackit.sf.net) graph wakes the next one by writing a single
> byte to a FIFO. Ardour's GUI, control, and audio threads interact via a
> similar mechanism. How would you expect this change to impact the inter
> thread wakeup latency? It's confusing when people say "performance",
> meaning "increased throughput albeit with more latency". For many
> people that's a regression.

I posted the performance numbers in the thread already, and with every
single throughput number I also talked about what the latency difference
was. So quite frankly, if you were confused, I suspect it was because you
didn't read them. Tssk, tssk.

Short and sweet: the latency changes are in the noise for SMP, but can be
seen on UP. I'll look at it a bit more: since I had to add the coalescing
code anyway, I might also decide to re-use a buffer page rather than free
it immediately, since that may help latency for small writes.

Linus

2005-01-08 22:09:27

by Lee Revell

Subject: Re: Make pipe data structure be a circular list of pages, rather than

On Sat, 2005-01-08 at 13:51 -0800, Linus Torvalds wrote:
>
> On Sat, 8 Jan 2005, Lee Revell wrote:
> >
> > Many latency critical apps use (tmpfs mounted) FIFO's for IPC; the Linux
> > FIFO being one of the fastest known IPC mechanisms. Each client in the
> > JACK (http://jackit.sf.net) graph wakes the next one by writing a single
> > byte to a FIFO. Ardour's GUI, control, and audio threads interact via a
> > similar mechanism. How would you expect this change to impact the inter
> > thread wakeup latency? It's confusing when people say "performance",
> > meaning "increased throughput albeit with more latency". For many
> > people that's a regression.
>
> I posted the performance numbers in the thread already, and with every
> single throughput number I also talked abotu what the latency difference
> was. So quite frankly, if you were confused, I suspect it was because you
> didn't read them. Tssk, tssk.
>

How embarrassing, I found that message immediately after hitting send.
Looks like a big win. I'll try the JACK stress test on it just to make
sure.

Lee


2005-01-08 22:33:49

by Davide Libenzi

Subject: Re: Make pipe data structure be a circular list of pages, rather than

On Sat, 8 Jan 2005, Linus Torvalds wrote:

> Short and sweet: the latency changes are in the noise for SMP, but can be
> seen on UP. I'll look at it a bit more: since I had to add the coalescing
> code anyway, I might also decide to re-use a buffer page rather than free
> it immediately, since that may help latency for small writes.

Yeah, I noticed you clean the page's magazine down to the bottom. Maybe
keeping one (or more) page in there might help something, due to cache
improvements and alloc/free page savings.


- Davide

2005-01-09 04:07:55

by Linus Torvalds

Subject: Re: Make pipe data structure be a circular list of pages, rather than



On Sat, 8 Jan 2005, Linus Torvalds wrote:
>
> Short and sweet: the latency changes are in the noise for SMP, but can be
> seen on UP. I'll look at it a bit more: since I had to add the coalescing
> code anyway, I might also decide to re-use a buffer page rather than free
> it immediately, since that may help latency for small writes.

Side note: the SMP numbers seem to fluctuate wildly making any
benchmarking a bit suspect. They probably depend very much on just how
things get scheduled, and on inter-CPU cache behaviour.

In contrast, the UP numbers are a lot more reliable, so I'll try to make
sure that everything looks good on UP - which definitely means that I'll
work to make sure that latency looks good too.

Linus

2005-01-09 23:19:11

by Davide Libenzi

Subject: Re: Make pipe data structure be a circular list of pages, rather than

On Sat, 8 Jan 2005, Linus Torvalds wrote:

> Side note: the SMP numbers seem to fluctuate wildly making any
> benchmarking a bit suspect. They probably depend very much on just how
> things get scheduled, and on inter-CPU cache behaviour.

Lmbench results on SMP are bogus. OTOH you can easily make a more
deterministic SMP pipe bench by making the two tasks affine to two
different CPUs.


- Davide

2005-01-10 23:33:09

by Robert White

Subject: RE: Make pipe data structure be a circular list of pages, rather than

Howdy,

This entire discussion of the pipe and splice() is very much the same (issue,
problem, approach) as the STREAMS interface from UNIX SVR4, which had the pipe device
as its trivial example device. [This is all from vague recall, so forgive the
sketchiness 8-)]

[Note: check out this STREAMS programmers guide
http://docs-pdf.sun.com/802-1933/802-1933.pdf from Sun's documentation site.]

A Message was a fixed-size Message structure with a linked list of Buffers attached
to it. The Buffers were a structure-and-payload arrangement as in (over-expressed
here). The whole arrangement could be zero-copy or self condensing as needed.

In Linux parlance each file descriptor was basically a reference to a pair of linked
list headers (one for messages available for reading and one to accept data from
writes), a pointer to the driver below the head, and an optional tasklet. The driver
itself was virtually the same data structure.

STREAMS was a little overdone and could have used some of the Linuxisms like
tasklets (instead of "the STREAMS Scheduler"), and the bulk of the nonsense for
getting Message and Buffer and STREAMS Head structures just screams Slab Cache, but
it has a lot of interesting bits.

The reason I bring it up is four-fold.

1) They had push and pop ioctl operations on the streams so that you could actually
interpose translators and multiplexors in the streams, so you could use the design to
do the splice() operation directly by building something that basically looked like:

	fd0a        fd0b
	     mux
	pipe-driver
	     fd1

basically by pushing the mux onto the stream head and then connecting the second
and subsequent fds to the mux object.

2) The Messages had type fields so you could do interesting things like pass an open
file descriptor to another process across a pipe because the special "this is an FD
message" could arrive unmolested and trigger the necessary file table manipulations
at the receiving end.

3) The optional tasklet (equivalent) at the various points in the STREAM could do
things like buffer consolidation, which in the Linux model could compete with actual
data consumption fairly easily/cheaply.

4) Some of the things that I have run into dealing with line disciplines and
streaming data through character devices (and USB) could gain a lot from the
Message/Buffer/STREAM paradigm. [e.g. the fixed-size flip buffers in ntty would be
rendered moot by having the linked lists above and below the line discipline.]

Anyway, thick as it is, I think the pdf referenced above might be interesting reading
vis-à-vis the pipe() discussion.

Rob White,
Casabyte, Inc.


2005-01-14 10:38:45

by Peter Chubb

Subject: Re: Make pipe data structure be a circular list of pages, rather than

>>>>> "Chris" == Chris Friesen <[email protected]> writes:

Chris> Mike Waychison wrote:
>> This got me to thinking about how you can heuristically optimize
>> away coalescing support and still allow PAGE_SIZE bytes minimum in
>> the effective buffer.

Chris> While coalescing may be a win in some cases, there should also
Chris> be some way to tell the kernel to NOT coalesce, to handle the
Chris> case where you want minimum latency at the cost of some
Chris> throughput.

SysVr4 pipes used to have two modes: a `legacy' mode that coalesced
data, and a `message' mode that preserved message boundaries. Seems
to me that we could be about to have the latter in Linux...

--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*