2005-01-08 10:45:57

by George Spelvin

Subject: Re: Make pipe data structure be a circular list of pages, rather

> - add a "tee(in, out1, out2)" system call that duplicates the pages
> (again, incrementing their reference count, not copying the data) from
> one pipe to two other pipes.

H'm... the version that seemed natural to me was an asymmetrical one-way
tee, something like "tee(in, out, len, flags)", where the next
<len> bytes are *both* readable on fd "in" *and* copied to fd "out".

You can make it a two-way tee with an additional "splice(in, out2, len)"
call, so you haven't lost expressiveness, and it makes three-way and
higher tees easier to construct.
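
In code, that composition might look like this (purely illustrative --
neither syscall exists, and the prototypes are the ones proposed above):

	tee(in, out1, len, 0);    /* next <len> bytes: still readable on in, copied to out1 */
	tee(in, out2, len, 0);    /* ... and copied to out2 as well */
	splice(in, out3, len, 0); /* finally consume them into out3 */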


But then I realized that I might be thinking about a completely different
implementation than you... I was thinking asynchronous, while perhaps
you were thinking synchronous.

A simple example of the difference:

int
main(void)
{
	int dest = open("/dev/null", O_WRONLY);
	FILE *src = popen("/usr/bin/yes", "r");
	splice(fileno(src), dest, SPLICE_INFINITY, 0);
	return 0;
}

Will this process exit without being killed? I was imagining yes,
it would exit immediately, but perhaps "no" makes more sense.

Ding! Oh, of course, it couldn't exit, or cleaning up after the following
mess would be far more difficult:

int
main(void)
{
	int fd[2];
	pipe(fd);
	write(fd[1], "Hello, world!\n", 14);
	splice(fd[0], fd[1], SPLICE_INFINITY, 0);
	return 0;
}

With the synchronous model, the two-output tee() call makes more sense, too.
Still, it would be nice to be able to produce n identical output streams
without needing n-1 processes to do it. Any ideas? Perhaps

int
tee(int infd, int const *outfds, unsigned noutfds, loff_t size, unsigned flags)

As for the issue of coalescing:
> This is the main reason why I want to avoid coalescing if possible: if you
> never coalesce, then each "pipe_buffer" is complete in itself, and that
> simplifies locking enormously.
>
> (The other reason to potentially avoid coalescing is that I think it might
> be useful to allow the "sendmsg()/recvmsg()" interfaces that honour packet
> boundaries. The new pipe code _is_ internally "packetized" after all).

It is somewhat offensive that the minimum overhead for a 1-byte write
is a struct pipe_buffer plus a full page.

But yes, keeping a pipe_buffer simple is a BIG win. So how about the
following coalescing strategy, which complicates the reader not at all:

- Each pipe writing fd holds a reference to a page and an offset within
  that page.
- When writing some data, see if the data will fit in the remaining
  portion of the page.
- If it will not, then dump the page, allocate a fresh one, and set
  the offset to 0.
  - Possible optimization: If the page's refcount is 1 (nobody else
    still has a reference to the page, and we would free it if we
    dropped it), then just recycle it directly.
- Copy the written data (up to a maximum of 1 page) to the current write page.
- Bump the page's reference count (to account for the pipe_buffer pointer) and
  queue an appropriate pipe_buffer.
- Increment the offset by the amount of data written to the page.
- Decrement the amount of data remaining to be written and repeat if necessary.
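
A sketch of that write path in pseudo-kernel C (pipe_write_state and
queue_pipe_buffer() are invented names, and allocation-failure handling
is omitted):

	struct pipe_write_state {
		struct page *page;	/* current write page, always held */
		unsigned int offset;	/* next free byte in that page */
	};

	static ssize_t coalescing_write(struct pipe_write_state *ws,
	                                const char *data, size_t len)
	{
		ssize_t written = 0;

		while (len) {
			size_t chunk;

			if (ws->offset >= PAGE_SIZE) {
				/* Full: recycle if we hold the only reference,
				   otherwise drop it and start a fresh page. */
				if (page_count(ws->page) != 1) {
					put_page(ws->page);
					ws->page = alloc_page(GFP_KERNEL);
				}
				ws->offset = 0;
			}

			chunk = len < PAGE_SIZE - ws->offset
				? len : PAGE_SIZE - ws->offset;
			memcpy(page_address(ws->page) + ws->offset, data, chunk);

			/* The queued pipe_buffer takes its own page reference,
			   so readers never see the page change under them. */
			get_page(ws->page);
			queue_pipe_buffer(ws->page, ws->offset, chunk);

			ws->offset += chunk;
			data += chunk;
			len -= chunk;
			written += chunk;
		}
		return written;
	}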

This allocates one struct pipe_buffer (8 bytes) per write, but not a whole
page. And it does so by exploiting the exact "we don't care what the rest
of the page is used for" semantics that make the abstraction useful.

The only waste is that, as written, every pipe writing fd keeps a page
allocated even if the pipe is empty. Perhaps the vm could be taught to
reclaim those references if necessary?


It's also worth documenting atomicity guarantees and poll/select
semantics. The above algorithm is careful to always queue the first
PAGE_SIZE bytes of any write atomically, which I believe is what is
historically expected. It would be possible to have writes larger than
PIPE_BUF fill in the trailing end of any partial page as well.


2005-01-08 18:41:58

by Linus Torvalds

Subject: Re: Make pipe data structure be a circular list of pages, rather



On Sat, 8 Jan 2005 [email protected] wrote:
>
> H'm... the version that seemed natural to me was an asymmetrical one-way
> tee, something like "tee(in, out, len, flags)", where the next
> <len> bytes are *both* readable on fd "in" *and* copied to fd "out".

Yes, the implementation may end up being something like that, I haven't
decided. That would be more of a "peek()" kind of thing, not a "tee()",
but it has the advantage of getting rid of one unnecessary complexity of
"tee()" - dependence on three different fd's (in particular, if one of the
output fd's isn't writable, we can't do anything at all).

However, "peek()" has the problem that while it allows you to just run "n"
peek() calls for the "n" outputs, you now need to keep track of how much
each output has consumed (since they can be filled at different rates),
and then appropriately "eat" the proper amount of input when it has been
consumed by everybody. That's a very error-prone interface.

In contrast, with a plain "tee()", you'd just create a tree "log(n)"
deep of "tee()" actors, and you'd have "n" outputs from one input, and
each individual point would be very simple.
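
In code, a four-way fan-out with that three-fd tee() might look like this
(illustrative only: the syscall doesn't exist, in and out1..out4 are
assumed pipe fds, p1/p2 are intermediate pipes, and the two levels are
collapsed into one loop where each node would really be its own actor):

	int p1[2], p2[2];

	pipe(p1);
	pipe(p2);
	for (;;) {
		/* level 1: duplicate the input into two intermediate pipes */
		if (tee(in, p1[1], p2[1]) <= 0)
			break;
		/* level 2: each intermediate pipe feeds two real outputs */
		tee(p1[0], out1, out2);
		tee(p2[0], out3, out4);
	}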

NOTE! I don't think either "tee()" or "peek()" really works for very high
fan-out. I don't believe that you'd ever really have a "balanced" enough
output across many things, and since they all end up being synchronized to
the same input (you can make the buffers deeper, which will help a bit,
but..) I don't think it ends up being _that_ useful.

Where I foresee "tee()" is when there is some rate-limited input that you
can easily handle in two or three streams. That's what the example of a
digital TV input signal was kind of meant to be - the input is
rate-limited, which means that the fact that the outputs are
likely consumed at different paces (saving to disk might be fast, showing
it on the screen would be real-time) doesn't matter. But such a use ends
up breaking down when the load is too high - you're likely going to be
constrained in the number of streams by pure load issues.

(And if you are _not_ constrained by load issues, then you don't need
tee() at all - you could just have copied the things by hand ;)

> But then I realized that I might be thinking about a completely different
> implementation than you... I was thinking asynchronous, while perhaps
> you were thinking synchronous.

I _am_ thinking entirely synchronous. It would copy exactly what was in
the pipe, and nothing more. It would sleep if either end was full/empty
(modulo O_NDELAY, of course), but it would in no way be a long-term "set up
this copy mechanism" kind of thing.

None of the pipes would do anything on their own - they aren't "actors".
They'd be purely buffers. So if you want to have an asynchronous
long-running "tee()/copy()/peek()", you'd have a thread that does that for
you.

> With the synchronous model, the two-output tee() call makes more sense, too.
> Still, it would be nice to be able to produce n identical output streams
> without needing n-1 processes to do it. Any ideas?

I agree, although I don't know how common it would be for "n" to be very
big. And you don't need n-1 processes (or threads), you can certainly do
it with the normal select-loop approach too.

> As for the issue of coalescing:
> > This is the main reason why I want to avoid coalescing if possible: if you
> > never coalesce, then each "pipe_buffer" is complete in itself, and that
> > simplifies locking enormously.
> >
> > (The other reason to potentially avoid coalescing is that I think it might
> > be useful to allow the "sendmsg()/recvmsg()" interfaces that honour packet
> > boundaries. The new pipe code _is_ internally "packetized" after all).
>
> It is somewhat offensive that the minimum overhead for a 1-byte write
> is a struct pipe_buffer plus a full page.

Note that there is no reason why we couldn't use just "kmalloc()" too.

For the different input cases we'll need per-pipe_buffer operation
functions (very much including a "release()" function) _anyway_, which
means that while the current implementation always just does "alloc_page()
+ __free_page()" for it's buffer allocation, there is absolutely zero that
says that you couldn't allocate small buffers with "kmalloc()" and free
them with "kfree()".

And since we need to support per-buffer ops anyway (for people mixing
"write()" and "splice()" to set up the pipe contents), it works very
naturally for mixed size writes too (ie you can have a small write that is
buffered in a small kmalloc'ed buffer, followed by a big write that uses
the page allocator, and nobody will care - you just decide at buffer fill
time which one you need).

So yes, the current setup is wasteful, but coalescing isn't the only way
to fight that - you can just use smaller allocations too.
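
A sketch of what that could look like (the ops structure follows the
shape of the one in the demo patch later in this thread; the "private"
field for the small-buffer case is invented):

	static void page_buf_release(struct pipe_inode_info *info,
	                             struct pipe_buffer *buf)
	{
		__free_page(buf->page);	/* what the current code does */
	}

	static void kmalloc_buf_release(struct pipe_inode_info *info,
	                                struct pipe_buffer *buf)
	{
		kfree(buf->private);	/* hypothetical small-buffer field */
	}

	static struct pipe_buf_operations page_buf_ops = {
		.release = page_buf_release,
	};

	static struct pipe_buf_operations kmalloc_buf_ops = {
		.release = kmalloc_buf_release,
	};

	/* ... and at buffer fill time the writer simply picks one: */
	buf->ops = (len <= 256) ? &kmalloc_buf_ops : &page_buf_ops;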

So in my mind, the only reason for doing coalescing is if some silly
program deadlocks if it writes many small writes, and nobody is reading
it. Some people use pipes as buffers _within_ a program, ie use them to
send signals etc (common for signal-safe select handling - you make the
signal handler do a write() to a pipe, and the select loop has the pipe in
its FD-set).
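
For reference, that signal-handler-to-select pattern looks like this (a
standalone sketch, not from any particular program):

	#include <fcntl.h>
	#include <signal.h>
	#include <unistd.h>

	static int sigpipe_fds[2];	/* [0] goes into the select() FD-set */

	static void on_signal(int sig)
	{
		char c = sig;
		/* async-signal-safe; EAGAIN just means data is already queued */
		write(sigpipe_fds[1], &c, 1);
	}

	static int setup_signal_pipe(void)
	{
		if (pipe(sigpipe_fds) < 0)
			return -1;
		fcntl(sigpipe_fds[0], F_SETFL, O_NONBLOCK);
		fcntl(sigpipe_fds[1], F_SETFL, O_NONBLOCK);
		signal(SIGINT, on_signal);
		return sigpipe_fds[0];
	}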

For all of those that are valid (certainly the signal case), you need to
put the pipe into nonblocking mode anyway, so I don't think coalescing is
really needed. SuS certainly does _not_ seem to guarantee that you can do
many nonblocking writes (small _or_ big). Let's see if somebody reports
any trouble with the new world order.

Linus

2005-01-08 22:51:29

by Alan

Subject: Re: Make pipe data structure be a circular list of pages, rather

On Sad, 2005-01-08 at 18:41, Linus Torvalds wrote:
> For all of those that are valid (certainly the signal case), you need to
> put the pipe into nonblocking mode anyway, so I don't think coalescing is
> really needed. SuS certainly does _not_ seem guarantee that you can do
> many nonblocking writes (small _or_ big). Let's see if somebody reports
> any trouble with the new world order.

It breaks netscape 3 if I simulate it (but that's picking an app I know
makes silly assumptions).

From the rules pasted below I think the 1 byte at a time 4K write is
guaranteed or at least strongly implied. The single write atomicity is
guaranteed but that is if anything made easier by the changes.

(From pathconf)

fpathconf(_PC_PIPE_BUF)
       returns the size of the pipe buffer, where filedes must refer
       to a pipe or FIFO and path must refer to a FIFO. The
       corresponding macro is _POSIX_PIPE_BUF.

(From pwrite)

Write requests of {PIPE_BUF} bytes or less shall not be interleaved with
data from other processes doing writes on the same pipe. Writes of
greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary
boundaries, with writes by other processes, whether or not the
O_NONBLOCK flag of the file status flags is set.

[of O_NDELAY for pipe]
A write request for {PIPE_BUF} or fewer bytes shall have the following
effect: if there is sufficient space available in the pipe, write()
shall transfer all the data and return the number of bytes requested.
Otherwise, write() shall transfer no data and return -1 with errno set
to [EAGAIN].
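
In code, the nonblocking guarantee quoted above means a caller can rely
on all-or-nothing behaviour for small writes (a sketch of the rule, not
from any actual program):

	#include <limits.h>
	#include <unistd.h>

	/* For a pipe in O_NONBLOCK mode: either transfers all len bytes
	 * or fails with errno == EAGAIN; never a short count. */
	ssize_t atomic_pipe_write(int fd, const void *buf, size_t len)
	{
		if (len > PIPE_BUF)
			return -1;	/* no atomicity guarantee past PIPE_BUF */
		return write(fd, buf, len);
	}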


2005-01-13 21:51:53

by Ingo Oeser

Subject: Re: Make pipe data structure be a circular list of pages, rather

Hi,

Linus Torvalds wrote:
> None of the pipes would do anything on their own - they aren't "actors".
> They'd be purely buffers. So if you want to have an asynchronous
> long-running "tee()/copy()/peek()", you'd have a thread that does that for
> you.

Hmm, that's a pity, because it makes hardware support more difficult.

I thought you might consider a system call, which "wires up" fds.

Imagine a device fd, which gets lots of measuring data, wired through a
DSP pipe, spliced to realtime display fd and file storage fd.

How many buffers would you like to use here? How many unnecessary copies
are made by the CPU?

Any realtime mass data processing needing hardware support could benefit
from "active" pipes, where either end could be drivers "knowing each other"
and deciding to route the IO in hardware instead, avoiding going through the
overworked memory bus and CPU.

This kind of "wire me up to my next chip" is needed quite often in the
embedded world, where power matters and special purpose chips or even
DSPs rule. Every chipset vendor ships its own libraries for this now,
but a more generic approach might be better.

What do you think?

Regards

Ingo Oeser

2005-01-13 22:37:02

by Linus Torvalds

Subject: Re: Make pipe data structure be a circular list of pages, rather



On Thu, 13 Jan 2005, Ingo Oeser wrote:
>
> Hmm, that's a pity, because it makes hardware support more difficult.
>
> I thought you might consider a system call, which "wires up" fds.
>
> Imagine a device fd, which gets lots of measuring data, wired through a
> DSP pipe, spliced to realtime display fd and file storage fd.

I think that the solution to that is to make the pipe _be_ the driver
interface.

Remember: a pipe is just a set of buffers. If you have a hardware device
that could use the buffers, then there is nothing to say that the driver
couldn't be the "actor" that fills the buffers. So doing an "open()" on
the device would just create a pipe that gets filled by the hardware.

But that doesn't make the pipe an "actor". The pipe just remains a
standard way to encapsulate the notion of "set of buffers". It needs an
external actor to do something, but that actor can be a device driver
filling it up, or a system call that reads it or moves it to another
destination.

Linus

2005-01-14 21:08:22

by Ingo Oeser

Subject: Re: Make pipe data structure be a circular list of pages, rather

Hi Linus,

Linus Torvalds wrote:
> On Thu, 13 Jan 2005, Ingo Oeser wrote:
> > Hmm, that's a pity, because it makes hardware support more difficult.
> > I thought you might consider an system call, which "wires up" fds.
> I think that the solution to that is to make the pipe _be_ the driver
> interface.

But we have currently no suitable interface for this.

We have socketpair(), we have pipe(), but we don't have
dev_pipe() creating a pipe with the driver sitting in the middle.

Usage: "processing devices" like crypto processors, DSPs,
MPEG-Encoding/-Decoding hardware, smart cards and the like.

Synopsis: int dev_pipe(const char *dev_path, int fd_vec[2]);

We also don't have wire_fds(), which would wire up two fds by
connecting the underlying file pointers with each other and closing the fds.

That would be quite useful, because user space will not see the data anymore
anyway, if it travels just between some chips.

Usage: Chips from a chipset, which usually know each other quite well and
can talk to each other without CPU intervention.

Synopsis: int wire(int in_fd, int out_fd);

> But that doesn't make the pipe an "actor". The pipe just remains a
> standard way to encapsulate the notion of "set of buffers". It needs an
> external actor to do something, but that actor can be a device driver
> filling it up, or a system call that reads it or moves it to another
> destination.

I implemented this some years ago for the "Linux & DSP" project,
but I had to impose ordering restrictions on open() and close() on user space,
and this is not really elegant.

Many hardware vendors have implemented it somehow anyway.

So what about starting to do it seriously and implementing infrastructure for
both of them?

I could also do a simple example driver using both features, if you give me
some evenings of time and are interested in the idea.

Are you interested?


Regards

Ingo Oeser, still dreaming of sys_wire() and sys_dev_pipe()

PS: Syscall names are under discussion of course ;-)

2005-01-14 21:32:29

by Linus Torvalds

Subject: Re: Make pipe data structure be a circular list of pages, rather



On Fri, 14 Jan 2005, Ingo Oeser wrote:
>
> But we have currently no suitable interface for this.
>
> We have socketpair(), we have pipe(), but we don't have
> dev_pipe() creating a pipe with the driver sitting in the middle.

Not exactly that, no.

But realize that what you are asking for is actually a _much_ simpler
thing than even "fifo_open()". What you ask for (if somebody really starts
doing this, and I'd love to see it) is not much more difficult than

int device_open(struct inode *inode, struct file *filp)
{
	if (!pipe_new(inode))
		return -ENOMEM;

	/* The hardware is a writer */
	PIPE_WRITERS(*inode)++;

	/* The new fd is a reader of the data the hardware generates */
	filp->f_op = &read_fifo_fops;

	/*
	 * This starts the hardware capture - although in real
	 * life there might be a "start it" ioctl or something
	 * that actually does that together with the parameters
	 * for capture..
	 */
	start_capture_process(inode);

	return 0;
}

ie we're definitely not talking rocket science here.

(Yes, I'm sure there are some details missing in the above, but you get
the idea: the infrastructure really _is_ pretty much there, and any lack
comes from the fact that nobody has actually ever _used_ it).

> Usage: "processing devices" like crypto processors, DSPs,
> MPEG-Encoding/-Decoding hardware, smart cards and the like.
>
> Synopsis: int dev_pipe(const char *dev_path, int fd_vec[2]);

No, I do NOT think that's what you'd want.

What you want is more of a

fd = open("/dev/mpeginput", O_RDONLY);

and then that device driver open just does the thing outlined above.

> We also don't have wire_fds(), which would wire up two fds by
> connecting the underlying file pointers with each other and closing the fds.

But that is _exactly_ what splice() would do.

So you could have the above open of "/dev/mpeginput", and then you just
sit and splice the result into a file or whatever.
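
Concretely, the usage might be no more than this (the device path is
made up, and splice() here is the prototype from the demo patch later
in this thread):

	int src = open("/dev/mpeginput", O_RDONLY);	/* driver fills the pipe */
	int dst = open("capture.mpg", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	long n;

	while ((n = splice(src, dst, 64 * 1024, 0)) > 0)
		;	/* buffers move from the device pipe to the file */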

See?

(Or maybe it's me who doesn't see what it is you want).

Linus

2005-01-14 22:18:17

by Ingo Oeser

Subject: Re: Make pipe data structure be a circular list of pages, rather

Hi Linus,

Linus Torvalds wrote:
> On Fri, 14 Jan 2005, Ingo Oeser wrote:
> But realize that what you are asking for is actually a _much_ simpler
> thing than even "fifo_open()". What you ask for (if somebody really starts
> doing this, and I'd love to see it) is not much more difficult than
>
> int device_open(struct inode *inode, struct file *filp)
> {
> 	if (!pipe_new(inode))
> 		return -ENOMEM;
>
> 	/* The hardware is a writer */
> 	PIPE_WRITERS(*inode)++;
>
> 	/* The new fd is a reader of the data the hardware generates */
> 	filp->f_op = &read_fifo_fops;
>
> 	/*
> 	 * This starts the hardware capture - although in real
> 	 * life there might be a "start it" ioctl or something
> 	 * that actually does that together with the parameters
> 	 * for capture..
> 	 */
> 	start_capture_process(inode);
>
> 	return 0;
> }
>
> ie we're definitely not talking rocket science here.

Data sink/source is simple, indeed. You just implemented buffering
between a driver and its output filp, if I understand you correctly.

Now put both directions together and it gets a bit more difficult.
Driver writers basically reimplement fs/pipe.c over and over again.

encoded stream -> write() -> decoder hardware -> read() -> decoded stream

I'm trying to model the decoder as a device here.
Same goes for encoding.

> (Yes, I'm sure there are some details missing in the above, but you get
> the idea: the infrastructure really _is_ pretty much there, and any lack
> comes from the fact that nobody has actually ever _used_ it).

No, we have all the hardware actually doing this hidden behind other
interfaces. e.g. the DVB core, my LinuxDSP core, the upcoming hardware
crypto core and so on.

It's always the same basic operations, needing the same structure
to handle them.

Maybe the solution here is just using ONE fd and read/write to it.

But for identifying producer/consumer, you need two fds. And now we come to
wire() (or maybe we discover that splice() is really doing the same).

> > We also don't have wire_fds(), which would wire up two fds by
> > connecting the underlying file pointers with each other and closing the
> > fds.
>
> But that is _exactly_ what splice() would do.
>
> So you could have the above open of "/dev/mpeginput", and then you just
> sit and splice the result into a file or whatever.
>
> See?

But you are still pushing everything through the page cache.
I would like to see "fop->connect(producer, consumer)"

instead of

	while (producer->not_killed && consumer->not_killed) {
		producer->read(buffer);
		consumer->write(buffer);
	}

That's what the wire() syscall is about. If the splice() syscall is the same
and enables drivers to avoid the copying through the CPU, I'm perfectly happy.

> (Or maybe it's me who doesn't see what it is you want).

Imagine 3 chips. One is an encoder/decoder chip with 8 channels, the other
2 chips are a video DAC/ADC and an audio DAC/ADC with 4 channels each.

These chips can be wired directly by programming a wire matrix (much like a
dumb routing table). But you can also receive/send via all of these chips
to/from harddisk for recording/playback.

So you have to implement the drivers for these chips to provide one filp per
channel and one minor per chip.

If I can do this with splice, you've got me and I'm really looking forward to
your first commits/patches, since this itch is scratching me since long ;-)


Thanks and Regards

Ingo oeser

2005-01-14 22:44:33

by Linus Torvalds

Subject: Re: Make pipe data structure be a circular list of pages, rather



On Fri, 14 Jan 2005, Ingo Oeser wrote:
>
> Data sink/source is simple, indeed. You just implemented a buffering
> between a drivers output filp, if I understand you correctly.

Yes and no. It is indeed just buffering.

The thing I find most intriguing about it, and which is why I think it's
important, is that while it's "just" buffering, it is _standard_ buffering.
That's where it gets interesting. Everybody needs buffers, and they are
generally so easy to implement that there's no point in having much of a
"buffer library".

But if the standard way allows you to do interesting stuff _between_ two
things, then that's where it gets interesting. The combinations it allows.

> Now both directions together and it gets a bit more difficult.
> Driver writers basically reimplement fs/pipe.c over and over again.

But bidirectional is just two of those things.

> Always the same basic operations for that needing the same structure
> to handle it.

Yes.

> Maybe the solution here is just using ONE fd and read/write to it.

No. Use two totally separate fd's, and make it _cheap_ to move between
them. That's what "splice()" gives you - basically a very low-cost way to
move between two uni-directional things. No "memcpy()", because memcpy is
expensive for large streams of data.

> > > We also don't have wire_fds(), which would wire up two fds by
> > > connecting the underlying file pointers with each other and closing the
> > > fds.
> >
> > But that is _exactly_ what splice() would do.
> >
> > So you could have the above open of "/dev/mpeginput", and then you just
> > sit and splice the result into a file or whatever.
> >
> > See?
>
> But you are still pushing everything through the page cache.
> I would like to see "fop->connect(producer, consumer)"

No no. There's no buffer cache involved. There is just "buffers".

If you end up reading from a regular file (or writing to one), then yes,
the buffers end up being picked up from the buffer cache. But that's by no
means required. The buffers can be just anonymous pages (like the ones a
regular "write()" to a pipe generates), or they could be DMA pages
allocated for the data by a device driver. Or they could be the page that
contains a "skb" from a networking device.

I really think that splice() is what you want, and what you call "wire()".

> Imagine 3 chips. One is an encoder/decoder chip with 8 channels, the other
> 2 chips are a video DAC/ADC and an audio DAC/ADC with 4 channels each.
>
> These chips can be wired directly by programming a wire matrix (much like a
> dumb routing table). But you can also receive/send via all of these chips
> to/from harddisk for recording/playback.
>
> So you have to implement the drivers for these chips to provide one filp per
> channel and one minor per chip.
>
> If I can do this with splice, you've got me and I'm really looking forward to
> your first commits/patches, since this itch is scratching me since long ;-)

Yes, I believe that we're talking about the same thing. What you can do in
my vision is:

- create a pipe for feeding the audio decoder chip. This is just the
  sound driver interface to a pipe, and it's the "device_open()" code I
  gave as an example in my previous email, except going the other way (ie
  the writer is the 'fd', and the driver is the "reader" of the data).

  You'd do this with a simple 'fd = open("/dev/xxxx", O_WRONLY)',
  together with some ioctl (if necessary) to set up the actual
  _parameters_ for the piped device.

- do a "splice()" from a file to the pipe. A splice from a regular file
  is really nothing but a page cache lookup, and moving those page cache
  pages to the pipe.

- tell the receiving fd to start processing it (again, this might be an
  ioctl - some devices may need directions on how to interpret the data,
  or it might be implicit in the fact that the pipe got woken up by being
  filled)
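
Put together, those three steps might look like this (the device path,
the ioctl and its parameters, and the splice() prototype are all
placeholders):

	int dev = open("/dev/audiodec", O_WRONLY);	/* driver is the pipe's reader */
	int src = open("track.pcm", O_RDONLY);
	long n;

	ioctl(dev, AUDIODEC_SET_PARAMS, &params);	/* hypothetical setup */

	while ((n = splice(src, dev, 64 * 1024, 0)) > 0)
		;	/* page cache pages move straight into the device pipe */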

Going the other way (receiving data from a hardware device) is all the
same thing - and there "tee()" may be useful, since it would allow the
received data to be dup'ed to two different sinks.

Linus

2005-01-14 23:34:48

by Ingo Oeser

Subject: Re: Make pipe data structure be a circular list of pages, rather

Hi Linus,

Linus Torvalds wrote:
> That's where it gets interesting. Everybody needs buffers, and they are
> generally so easy to implement that there's no point in having much of a
> "buffer library".

But my hardware has the buffers on chip, if connected directly to each other.
So no system level buffering is needed.

> No. Use two totally separate fd's, and make it _cheap_ to move between
> them. That's what "splice()" gives you - basically a very low-cost way to
> move between two uni-directional things. No "memcpy()", because memcpy is
> expensive for large streams of data.

Here we agree already, so lets talk about the rest.

> If you end up reading from a regular file (or writing to one), then yes,
> the buffers end up being picked up from the buffer cache. But that's by no
> means required. The buffers can be just anonymous pages (like the ones a
> regular "write()" to a pipe generates), or they could be DMA pages
> allocated for the data by a device driver. Or they could be the page that
> contains a "skb" from a networking device.
>
> I really think that splice() is what you want, and you call "wire()".

That sounds really close to what I want.

Now imagine this (ASCII art in monospace font):

[ chip A ] ------(1)------ [ chip B ]   [ CPU ]   [memory]   [disk]
     |                         |           |         |         |
     +----------(2)------------+----(2)----+---(2)---+---------+

(1) Is a direct path between these chips.
(2) Is the system bus (e.g. PCI)

(1) works without any CPU intervention
(2) needs the buffering scheme you described.

I like to use (1) for obvious reasons, but still allow (2).
User space should decide which one, since it's a policy decision.

I need to open the chips defaulting to (2), because the driver
doesn't know in advance that it will be connected to a chip it can talk to
directly. Everything else will be quite ugly.

Please note that this is a simplified case; we actually have many chips of
type A/B, and (1) is more like a software programmable "patch field" you
might know from a server rack. I want user space to operate this "patch
field".

Maybe "io router" is a descriptive term. No?

> Yes, I believe that we're talking about the same thing. What you can do in
> my vision is:
>
> - create a pipe for feeding the audio decoder chip. This is just the
> sound driver interface to a pipe, and it's the "device_open()" code I
> gave as an example in my previous email, except going the other way (ie
> the writer is the 'fd', and the driver is the "reader" of the data).
>
> You'd do this with a simple 'fd = open("/dev/xxxx", O_WRONLY)",
> together with some ioctl (if necessary) to set up the actual
> _parameters_ for the piped device.
>
> - do a "splice()" from a file to the pipe. A splice from a regular file
> is really nothing but a page cache lookup, and moving those page cache
> pages to the pipe.
>
> - tell the receiving fd to start processing it (again, this might be an
> ioctl - some devices may need directions on how to interpret the data,
> or it might be implicit in the fact that the pipe got woken up by being
> filled)

Yes, I understand and support your vision. Now I would like to use path (1),
the direct one, for this. Possible?

Maybe by making drivers optionally aware of splice() and defaulting to a
regular pipe, if they don't care?


Thanks and Regards

Ingo Oeser

2005-01-15 00:16:18

by Linus Torvalds

Subject: Re: Make pipe data structure be a circular list of pages, rather



On Sat, 15 Jan 2005, Ingo Oeser wrote:
>
> Now imagine this (ASCII art in monospace font):
>
> [ chip A ] ------(1)------ [ chip B ]   [ CPU ]   [memory]   [disk]
>      |                         |           |         |         |
>      +----------(2)------------+----(2)----+---(2)---+---------+
>
> Yes, I understand and support your vision. Now I would like to use path (1),
> the direct one, for this. Possible?

Yes, if I understand what you're thinking right. I'm just assuming (for
the sake of simplicity) that "A" generates all data, and "B" is the one
that uses it. If both A and B can generate/use data, it still works, but
you'd need two pipes per endpoint, just because pipes are fundamentally
uni-directional.

Also, for this particular example, I'll also assume that whatever buffers
that chips A/B use are "mappable" to the CPU too: they could be either
regular memory pages (that A/B DMA to/from), or they can be IO buffers on
the card itself that can be mapped into the CPU address space and
memcpy'ed.

That second simplification is something that at least my current example
code (which I don't think I've sent to you) kind of assumes. It's not
really _forced_ onto the design, but it avoids the need for a separate
"accessor" function to copy things around.

[ If you cannot map them into CPU space, then each "struct pipe_buffer" op
structure would need operations to copy them to/from kernel memory and
to copy them to/from user memory, so the second simplification basically
avoids having to have four extra "accessor" functions. ]

What you'd do to get your topology above is:
- create a "pipe" for both A and B (let's call them "fdA" and "fdB"
respectively).
- the driver is responsible for creating the "struct pipe_buffer" for any
data generated by chip A. If it's regular memory, it's a page
allocation and the appropriate DMA setup, and if it's IO memory on the
card, you'd need to generate fake "struct page *" and a mapping
function for them.
- if you want to send the data that A generates to both fdB _and_ write
it to disk (file X), you'd also need one anonymous pipe (fdC), and open
a fdX for writing to the file, and then just in your process
effectively do:

for (;;) {
n = tee(fdA, fdB, fbC);
splice(fdC, fdX, n);
}

and you'd get what I think you are after. The reason you need the "fdC"
is simply because the "tee()" operation would only work on pipes: you
can not duplicate any other data structure with "tee()" than a pipe
buffer, so you couldn't do a direct "tee(fdA, fdB, fdX)" in my world.

(The above example is _extremely_ simplified: in real life, you'd have to
take care of partial results from "splice()", error conditions, etc., of
course).
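
A version of that loop with the partial-result and error handling filled
in might look like this (still using the hypothetical tee()/splice()
prototypes from this thread, and assuming <errno.h>):

	static int tee_loop(int fdA, int fdB, int fdC, int fdX)
	{
		for (;;) {
			long n = tee(fdA, fdB, fdC);

			if (n == 0)		/* writer side of fdA closed */
				return 0;
			if (n < 0) {
				if (errno == EINTR)
					continue;
				return -1;	/* error on one of the pipes */
			}
			/* splice() may move less than asked for */
			while (n > 0) {
				long m = splice(fdC, fdX, n);
				if (m < 0) {
					if (errno == EINTR)
						continue;
					return -1;	/* e.g. disk full */
				}
				n -= m;
			}
		}
	}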

NOTE! My first cut will _assume_ that all buffers are in RAM, so it
wouldn't support the notion of on-card memory. On-card memory does
complicate things - not just the accessor functions, but also the notion
of how to "splice()" (or "tee()" from one to the other). If you assume
buffers-in-RAM, then all pointers are the same, and a tee/splice really
just moves a pointer around. But if the pointer can have magic meaning
that depends on the source it came from, then splicing such a pointer to
another device must inevitably imply at least the possibility of some kind
of memory copy.

So I don't think you'll get _exactly_ what you want for a while, since
you'd have to go through in-memory buffers. But there's no huge conceptual
problem (just enough implementation issues to make it inconvenient) with
the concept of actually keeping the buffer on an external controller.

Linus

2005-01-16 03:00:59

by Linus Torvalds

Subject: Re: Make pipe data structure be a circular list of pages, rather



On Fri, 14 Jan 2005, Linus Torvalds wrote:
>
> So I don't think you'll get _exactly_ what you want for a while, since
> you'd have to go through in-memory buffers. But there's no huge conceptual
> problem (just enough implementation issues to make it inconvenient) with
> the concept of actually keeping the buffer on an external controller.

Here is, if you are interested, a slightly updated version of my
"work-in-progress" example patch. It:

- depends on a fairly recent -BK (since it uses the pipe buffer ops
  interfaces that I already checked in)
- is very ugly: the VFS interfaces to splice things really are very wrong
  indeed, and they _should_ be about passing "struct pipe_buffer" arrays
  around instead.
- it only does one specific case (splicing from the page cache directly
  into a pipe)
- only set up for ext3 right now (though the "generic_file_splice" thing
  should work for any page-cache user)
- and even that specific case it does wrong since it doesn't update the
  "f_pos" field right now
- error handling is broken - if an IO error happens on the splice, the
  "map()" operation will return NULL, and the current fs/pipe.c can't
  handle that and will oops ;)

so it really is _purely_ a demonstration vehicle, but it at least shows
the kind of operation that I'm thinking of.

An example program to show its use is a simple:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/unistd.h>

#define __NR_sys_splice 289

inline _syscall4(int, sys_splice, int, a, int, b, size_t, len, unsigned long, flags);

int main(int argc, char **argv)
{
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	write(1, "hello: ", 7);
	sys_splice(fd, 1, 1000, 0);
	return 0;
}

which demonstrates three things:

- it works on a perfectly regular pipe, with no special setup, so to use
  it you can just do

	./test-splice my-file | less

  and the normal pipe that the shell created for this will just work. No
  new magic fd's required.

- that you can combine different sources to the pipe (mixing a
  traditional static write with a "splice from a file").

- we start the IO when we do the splice, but the waiting for the result
  happens separately, when (or if) the data is used by the other end.

it will print out "hello: " followed by the first 1000 bytes of the file.

It's obviously not very useful for anything but demonstration, since
splicing the other way (from a pipe to another fd) isn't implemented by
this example, but it gives a more concrete example of what I'm thinking
of.

And shows that the splicing itself is pretty simple ("move_to_pipe()"
does all the moving of buffers to the pipe) - the rest is just regular
setup and finding of the pages in the page cache.

Linus

----
# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
# 2005/01/15 18:36:19-08:00 [email protected]
# sys_splice: take 3
#
# Still horribly bad VFS interface, still doing the pipe
# filling inside the per-fd callback instead of doing it
# in "generic code" in fs/splice.c.
#
# Need to pass in (and out) arrays of "struct pipe_buffer"
# on a VFS level instead.
#
# fs/splice.c
# 2005/01/15 18:36:07-08:00 [email protected] +229 -0
#
# include/linux/fs.h
# 2005/01/15 18:36:07-08:00 [email protected] +2 -0
# sys_splice: take 3
#
# Still horribly bad VFS interface, still doing the pipe
# filling inside the per-fd callback instead of doing it
# in "generic code" in fs/splice.c.
#
# Need to pass in (and out) arrays of "struct pipe_buffer"
# on a VFS level instead.
#
# include/asm-i386/unistd.h
# 2005/01/15 18:36:07-08:00 [email protected] +2 -1
# sys_splice: take 3
#
# Still horribly bad VFS interface, still doing the pipe
# filling inside the per-fd callback instead of doing it
# in "generic code" in fs/splice.c.
#
# Need to pass in (and out) arrays of "struct pipe_buffer"
# on a VFS level instead.
#
# fs/splice.c
# 2005/01/15 18:36:07-08:00 [email protected] +0 -0
# BitKeeper file /home/torvalds/v2.6/linux/fs/splice.c
#
# fs/ext3/file.c
# 2005/01/15 18:36:07-08:00 [email protected] +3 -0
# sys_splice: take 3
#
# Still horribly bad VFS interface, still doing the pipe
# filling inside the per-fd callback instead of doing it
# in "generic code" in fs/splice.c.
#
# Need to pass in (and out) arrays of "struct pipe_buffer"
# on a VFS level instead.
#
# fs/Makefile
# 2005/01/15 18:36:07-08:00 [email protected] +1 -0
# sys_splice: take 3
#
# Still horribly bad VFS interface, still doing the pipe
# filling inside the per-fd callback instead of doing it
# in "generic code" in fs/splice.c.
#
# Need to pass in (and out) arrays of "struct pipe_buffer"
# on a VFS level instead.
#
# arch/i386/kernel/entry.S
# 2005/01/15 18:36:07-08:00 [email protected] +1 -0
# sys_splice: take 3
#
# Still horribly bad VFS interface, still doing the pipe
# filling inside the per-fd callback instead of doing it
# in "generic code" in fs/splice.c.
#
# Need to pass in (and out) arrays of "struct pipe_buffer"
# on a VFS level instead.
#
diff -Nru a/arch/i386/kernel/entry.S b/arch/i386/kernel/entry.S
--- a/arch/i386/kernel/entry.S 2005-01-15 18:36:40 -08:00
+++ b/arch/i386/kernel/entry.S 2005-01-15 18:36:40 -08:00
@@ -864,5 +864,6 @@
.long sys_add_key
.long sys_request_key
.long sys_keyctl
+ .long sys_splice

syscall_table_size=(.-sys_call_table)
diff -Nru a/fs/Makefile b/fs/Makefile
--- a/fs/Makefile 2005-01-15 18:36:40 -08:00
+++ b/fs/Makefile 2005-01-15 18:36:40 -08:00
@@ -10,6 +10,7 @@
ioctl.o readdir.o select.o fifo.o locks.o dcache.o inode.o \
attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
seq_file.o xattr.o libfs.o fs-writeback.o mpage.o direct-io.o \
+ splice.o

obj-$(CONFIG_EPOLL) += eventpoll.o
obj-$(CONFIG_COMPAT) += compat.o
diff -Nru a/fs/ext3/file.c b/fs/ext3/file.c
--- a/fs/ext3/file.c 2005-01-15 18:36:40 -08:00
+++ b/fs/ext3/file.c 2005-01-15 18:36:40 -08:00
@@ -101,6 +101,8 @@
return ret;
}

+extern ssize_t generic_file_splice_read(struct file *in, struct inode *pipe, size_t len, unsigned long flags);
+
struct file_operations ext3_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
@@ -115,6 +117,7 @@
.release = ext3_release_file,
.fsync = ext3_sync_file,
.sendfile = generic_file_sendfile,
+ .splice_read = generic_file_splice_read,
};

struct inode_operations ext3_file_inode_operations = {
diff -Nru a/fs/splice.c b/fs/splice.c
--- /dev/null	Wed Dec 31 16:00:00 1969
+++ b/fs/splice.c 2005-01-15 18:36:40 -08:00
@@ -0,0 +1,229 @@
+/*
+ * linux/fs/splice.c
+ *
+ * "splice": joining two ropes together by interweaving their strands.
+ *
+ * This is the "extended pipe" functionality, where a pipe is used as
+ * an arbitrary in-memory buffer. Think of a pipe as a small kernel
+ * buffer that you can use to transfer data from one end to the other.
+ *
+ * The traditional unix read/write is extended with a "splice()" operation
+ * that transfers data buffers to or from a pipe buffer.
+ *
+ * Named by Larry McVoy.
+ */
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/pipe_fs_i.h>
+
+static void page_cache_pipe_buf_release(struct pipe_inode_info *info, struct pipe_buffer *buf)
+{
+ page_cache_release(buf->page);
+}
+
+static void *page_cache_pipe_buf_map(struct file *file, struct pipe_inode_info *info, struct pipe_buffer *buf)
+{
+ struct page *page = buf->page;
+
+ if (!PageUptodate(page)) {
+ wait_on_page_locked(page);
+ if (!PageUptodate(page))
+ return NULL;
+ }
+ return kmap(page);
+}
+
+static void page_cache_pipe_buf_unmap(struct pipe_inode_info *info, struct pipe_buffer *buf)
+{
+ kunmap(buf->page);
+}
+
+static struct pipe_buf_operations page_cache_pipe_buf_ops = {
+ .can_merge = 0,
+ .map = page_cache_pipe_buf_map,
+ .unmap = page_cache_pipe_buf_unmap,
+ .release = page_cache_pipe_buf_release,
+};
+
+static ssize_t move_to_pipe(struct inode *inode, struct page **array, int pages, unsigned long offset, unsigned long len)
+{
+ struct pipe_inode_info *info;
+ int ret, do_wakeup;
+
+ down(PIPE_SEM(*inode));
+ info = inode->i_pipe;
+ ret = 0;
+ do_wakeup = 0;
+ for (;;) {
+ int bufs;
+
+ if (!PIPE_READERS(*inode)) {
+ send_sig(SIGPIPE, current, 0);
+ if (!ret) ret = -EPIPE;
+ break;
+ }
+ bufs = info->nrbufs;
+ if (bufs < PIPE_BUFFERS) {
+ int newbuf = (info->curbuf + bufs) & (PIPE_BUFFERS-1);
+ struct pipe_buffer *buf = info->bufs + newbuf;
+ struct page *page = *array++;
+ unsigned long this_len;
+
+ this_len = PAGE_SIZE - offset;
+ if (this_len > len)
+ this_len = len;
+
+ buf->page = page;
+ buf->offset = offset;
+ buf->len = this_len;
+ buf->ops = &page_cache_pipe_buf_ops;
+ info->nrbufs = ++bufs;
+ do_wakeup = 1;
+
+ ret += this_len;
+ len -= this_len;
+ offset = 0;
+ if (!--pages)
+ break;
+ if (!len)
+ break;
+ if (bufs < PIPE_BUFFERS)
+ continue;
+ break;
+ }
+ if (signal_pending(current)) {
+ if (!ret) ret = -ERESTARTSYS;
+ break;
+ }
+ if (do_wakeup) {
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
+ kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
+ do_wakeup = 0;
+ }
+ PIPE_WAITING_WRITERS(*inode)++;
+ pipe_wait(inode);
+ PIPE_WAITING_WRITERS(*inode)--;
+ }
+ up(PIPE_SEM(*inode));
+ if (do_wakeup) {
+ wake_up_interruptible(PIPE_WAIT(*inode));
+ kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
+ }
+ while (--pages >= 0)
+ page_cache_release(*++array);
+ return ret;
+}
+
+ssize_t generic_file_splice_read(struct file *in, struct inode *pipe, size_t len, unsigned long flags)
+{
+ struct address_space *mapping = in->f_mapping;
+ unsigned long long pos, size, left;
+ unsigned long index, offset, pages;
+ struct page *array[PIPE_BUFFERS];
+ int i;
+
+ pos = in->f_pos;
+ size = i_size_read(mapping->host);
+ if (pos >= size)
+ return 0;
+ left = size - pos;
+ if (left < len)
+ len = left;
+ index = pos >> PAGE_CACHE_SHIFT;
+ offset = pos & ~PAGE_CACHE_MASK;
+ pages = (len + offset + PAGE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+
+ if (pages > PIPE_BUFFERS)
+ pages = PIPE_BUFFERS;
+
+ /*
+ * Get as many pages from the page cache as possible..
+ * Start IO on the page cache entries we create (we
+ * can assume that any pre-existing ones we find have
+ * already had IO started on them).
+ */
+ i = find_get_pages(mapping, index, pages, array);
+
+ /*
+ * If not all pages were in the page-cache, we'll
+ * just assume that the rest haven't been read in,
+ * so we'll get the rest locked and start IO on
+ * them if we can..
+ */
+ while (i < pages) {
+ struct page *page = find_or_create_page(mapping, index + i, GFP_USER);
+ int error;
+ if (!page)
+ break;
+ if (PageUptodate(page)) {
+ unlock_page(page);
+ } else {
+ error = mapping->a_ops->readpage(in, page);
+ if (unlikely(error)) {
+ page_cache_release(page);
+ break;
+ }
+ }
+ array[i++] = page;
+ }
+
+ if (!i)
+ return 0;
+
+ /*
+ * Now we splice them into the pipe..
+ */
+ return move_to_pipe(pipe, array, i, offset, len);
+}
+
+static long do_splice_from(struct inode *pipe, struct file *out, size_t len, unsigned long flags)
+{
+ if (out->f_op && out->f_op->splice_write)
+ return out->f_op->splice_write(pipe, out, len, flags);
+ return -EINVAL;
+}
+
+static long do_splice_to(struct file *in, struct inode *pipe, size_t len, unsigned long flags)
+{
+ if (in->f_op && in->f_op->splice_read)
+ return in->f_op->splice_read(in, pipe, len, flags);
+ return -EINVAL;
+}
+
+static long do_splice(struct file *in, struct file *out, size_t len, unsigned long flags)
+{
+ struct inode *pipe;
+
+ if (!len)
+ return 0;
+ pipe = in->f_dentry->d_inode;
+ if (pipe->i_pipe)
+ return do_splice_from(pipe, out, len, flags);
+ pipe = out->f_dentry->d_inode;
+ if (pipe->i_pipe)
+ return do_splice_to(in, pipe, len, flags);
+ return -EINVAL;
+}
+
+asmlinkage long sys_splice(int fdin, int fdout, size_t len, unsigned long flags)
+{
+ long error;
+ struct file *in, *out;
+ int fput_in, fput_out;
+
+ error = -EBADF;
+ in = fget_light(fdin, &fput_in);
+ if (in) {
+ if (in->f_mode & FMODE_READ) {
+ out = fget_light(fdout, &fput_out);
+ if (out) {
+ if (out->f_mode & FMODE_WRITE)
+ error = do_splice(in, out, len, flags);
+ fput_light(out, fput_out);
+ }
+ }
+ fput_light(in, fput_in);
+ }
+ return error;
+}
diff -Nru a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
--- a/include/asm-i386/unistd.h 2005-01-15 18:36:40 -08:00
+++ b/include/asm-i386/unistd.h 2005-01-15 18:36:40 -08:00
@@ -294,8 +294,9 @@
#define __NR_add_key 286
#define __NR_request_key 287
#define __NR_keyctl 288
+#define __NR_splice 289

-#define NR_syscalls 289
+#define NR_syscalls 290

/*
* user-visible error numbers are in the range -1 - -128: see
diff -Nru a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h 2005-01-15 18:36:40 -08:00
+++ b/include/linux/fs.h 2005-01-15 18:36:40 -08:00
@@ -944,6 +944,8 @@
int (*check_flags)(int);
int (*dir_notify)(struct file *filp, unsigned long arg);
int (*flock) (struct file *, int, struct file_lock *);
+ ssize_t (*splice_write)(struct inode *, struct file *, size_t, unsigned long);
+ ssize_t (*splice_read)(struct file *, struct inode *, size_t, unsigned long);
};

struct inode_operations {

2005-01-17 16:04:29

by Ingo Oeser

Subject: Re: Make pipe data structure be a circular list of pages, rather

Hi Linus,

Linus Torvalds wrote:
> +static long do_splice_from(struct inode *pipe, struct file *out, size_t len, unsigned long flags)
> +static long do_splice_to(struct file *in, struct inode *pipe, size_t len, unsigned long flags)
> +static long do_splice(struct file *in, struct file *out, size_t len, unsigned long flags)
> +asmlinkage long sys_splice(int fdin, int fdout, size_t len, unsigned long flags)

That part looks quite perfect. As long as they stay like this, I'm totally
happy. I even have no problem with limiting to a length, since I can use that
to measure progress (e.g. a simple progress bar).

This way I also keep the process as an "actor" like "[email protected]" pointed out.

It has unnecessary scheduling overhead, but the ability to stop/resume
the transfer by killing the process doing it is worth it, I agree.

So I would put a structure in the inode identifying the special device
and check whether the "in" and "out" parameters are from devices suitable
for a direct on-wire transfer. If they are, I just set up some registers
and wait for the transfer to happen.

Then I get an interrupt/wakeup when the requested amount has been streamed,
increment some user space pointers, switch to user space, and user space
tells me to abort or stream more, and I follow.

Continue until abort by user or streaming problems happen.

Just to give you an idea: I debugged such a machine and I had a hard hanging
kernel with interrupts disabled. It still got data from a tuner, through an
MPEG decoder, an MPEG demultiplexer and played it to the audio card.
Not just a buffer like ALSA/OSS, but for as long as I would like, and it's
end to end without any CPU intervention.

That behavior would be perfect, but I could also live with a "pushing process".


Regards

Ingo Oeser

2005-01-19 21:13:29

by George Spelvin

Subject: Re: Make pipe data structure be a circular list of pages, rather than

[email protected] wrote:
> You seem to have misunderstood the original proposal, it had little to do
> with file descriptors. The idea was that different subsystems in the OS
> export pull() and push() interfaces and you use them. The file descriptors
> are only involved if you provide them with those interfaces (which you
> would, it makes sense). You are hung up on the pipe idea, the idea I
> see in my head is far more generic. Anything can play and you don't
> need a pipe at all, you need

I was fantasizing about more generality as well. In particular, my
original fantasy allowed data to, in theory and with compatible devices,
be read from one PCI device, passed through a series of pipes, and
written to another without ever hitting main memory - only one PCI-PCI
DMA operation performed.

A slightly more common case would be zero-copy, where data gets DMAed
from the source into memory and from memory to the destination.
That's roughly Larry's pull/push model.

The direct DMA case requires buffer memory on one of the two cards.
(And would possibly be a fruitful source of hardware bugs, since I
suspect that Windows Doesn't Do That.)

Larry has the "full-page gift" optimization, which could in theory allow
data to be "renamed" straight into the page cache. However, the page also
has to be properly aligned and not in some awkward highmem address space.
I'm not currently convinced that this would happen often enough to be
worth the extra implementation hair, but feel free to argue otherwise.

(And Larry, what's the "loan" bit for? When is loan != !gift ?)


The big gotcha, as Larry's original paper properly points out, is handling
write errors. We need some sort of "unpull" operation to put data back
of the destination can't accept it. Otherwise, what do you return from
splice()? If the source is seekable, that's easy, and a pipe isn't much
harder, but for a general character device, we need a bit of help.

The way I handle this in user-level software, to connect modules that
provide data buffering, is to split "pull" into two operations:
"Show me some buffered data" and "Consume some buffered data".
The first returns a buffer pointer (to a const buffer) and length.
(The length must be non-zero except at EOF, but may be 1 byte.)

The second advances the buffer pointer. The advance distance must be
no more than the length returned previously, but may be less.

In typical single-threaded code, I allow not calling the advance function
or calling it multiple times, but they're typically called 1:1, and
requiring that would give you a good place to do locking. A character
device, network stream, or the like, would acquire an exclusive lock.
A block device or file would not need to (or could make it a shared lock
or refcount).


The same technique can be used when writing data to a module that does
buffering: "Give me some bufer space" and "Okay, I filled some part of
it in." In some devices, the latter call can fail, and the writer has
to be able to cope with that.
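
An interface sketch of those four operations (all names invented), which
also shows how the consume-after-commit ordering gives you the "unpull"
for free:

	#include <stddef.h>
	#include <string.h>

	struct instream {
		/* "Show me some buffered data": len == 0 only at EOF */
		const void *(*peek)(struct instream *s, size_t *len);
		/* "Consume some buffered data": n <= previously returned len */
		void (*consume)(struct instream *s, size_t n);
	};

	struct outstream {
		/* "Give me some buffer space" */
		void *(*reserve)(struct outstream *s, size_t *len);
		/* "Okay, I filled n bytes of it"; may fail */
		int (*commit)(struct outstream *s, size_t n);
	};

	static int copy_step(struct instream *in, struct outstream *out)
	{
		size_t ilen, olen, n;
		const void *src = in->peek(in, &ilen);
		void *dst;

		if (!ilen)
			return 0;		/* EOF */
		dst = out->reserve(out, &olen);
		n = ilen < olen ? ilen : olen;
		memcpy(dst, src, n);		/* or, ideally, a PCI-PCI DMA */
		if (out->commit(out, n) < 0)
			return -1;		/* nothing consumed yet: a free "unpull" */
		in->consume(in, n);
		return 1;
	}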


By allowing both of those (and, knowing that PCI writes are more efficient
than PCI reads, giving the latter preference if both are available),
you can do direct device-to-device copies on splice().


The problem with Larry's separate pull() and push() calls is that you
then need a user-visible abstraction for "pulled but not yet pushed"
data, which seems like an unnecessary abstraction violation.


The main infrastructure hassle you need to support this *universally*
is the unget() on "character devices" like pipes and network sockets.
Ideally, it would be some generic buffer front end that could be used
by the device for normal data as well as the special case.

Ooh. Need to think. If there's a -EIO problem on one of the file
descriptors, how does the caller know which one? That's an argument for
separate pull and push (although the splice() library routine still
has the problem). Any suggestions? Does userland need to fall
back on read()/write() for a byte?

2005-01-20 02:07:15

by Robert White

Subject: RE: Make pipe data structure be a circular list of pages, rather than

Howdy,

The below (sorry about the Outlook style includes, work _requires_ outlook 8-() is
why I previously mentioned the STREAMS implementation from SVR4 (and so Solaris etc).

Every interface point has a pointer to its peer, a pair of queues and a put()
routine. Every message is in two or more pieces, the message header and the message
buffer/buffers (read real data pages).

So the "participating interface" requirement is really that the device be able to
expose a data structure consisting of four pointers that their peer can point to.

The "message types" for the messages passing around are all contained in the message
header structures. These headers have all the reference counting and
disposal/cleanup hook procedures (pointers) along with the references to the actual
data. This means that you could have a stock pool (from the slab cache) of these
headers for moving plain data around, and "magical devices" could use specialty
message headers with the necessary disposal elements or necessary "extra data" in
them.

All the devices in any kind of chain/mux/pipe/whatever arrangement pass messages to
their peers by invoking peer->put(message) [in OOD parlance, really probably
(*peer->put)(peer, message) in straight C]. And modules can be induced into the
chain/mux/pipe/whatever at runtime. The target put() procedure decides to copy or
reference-increment the message and put it on its receive queue or process it
immediately. At no time does any member of a STREAMS structure need to know any
details about the workings of their peers.

In the abstract, most communications would have message headers referring to blocks
of normal memory. Some devices could send out messages that reference their PCI
mapped memory or whatever, and all _that_ device would have to do to mesh with the
system is to replace the cleanup function pointer with one appropriate to its need.
When the message has gone as far as it is going to go, its reference count drops to
zero and it gets cleaned up.
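
A minimal sketch of that arrangement (field and function names invented,
assuming <stddef.h>):

	struct stream_msg {
		int refcnt;
		void *data;
		size_t len;
		/* per-source disposal hook: free a page, unmap PCI memory, ... */
		void (*cleanup)(struct stream_msg *m);
	};

	struct stream_peer {
		struct stream_peer *peer;
		/* returns 0, or a no-space indication for flow control */
		int (*put)(struct stream_peer *self, struct stream_msg *m);
	};

	/* passing a message downstream: */
	static int pass_on(struct stream_peer *self, struct stream_msg *m)
	{
		return self->peer->put(self->peer, m);
	}

	static void msg_release(struct stream_msg *m)
	{
		if (--m->refcnt == 0)
			m->cleanup(m);
	}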

Unget can be done with some queue insertion/manipulation. There is no pull() as
every target is either hot-processed or queued. Flow control takes place when a
put() procedure returns a no-space-for-message indication.

If you absolutely _must_ know if the data was actually consumed by the target (and I
could argue that this is virtually never the case as things get lost in error-induced
pipe tear-down all the time) it would only be necessary to add a "cursor" member to
the message header and, at destruction time, see if the cursor had been advanced to
"use up" the data.

One could even have message headers that know they are dealing with

Now the message header is Linus' "impedance matcher" interface; transport is really
just about knowing your peer, calling put(), and doing reference counting.

There were a lot of (IMHO) mistakes in the STREAMS design, but the programming
document (previously referenced in this thread) covers a lot of the same message
typing and management questions that are being raised here.

While I can see that the fit between Linux and SVR4 STREAMS wouldn't be identical, a
good bit of the pipe-fitting (to pun 8-) and operation definitions would be.

(I'll go back to lurking now... 8-)

Rob White,
Casabyte, Inc.


-----Original Message-----
From: [email protected] [mailto:[email protected]]
On Behalf Of [email protected]
[email protected] wrote:

The main infrastructure hassle you need to support this *universally*
is the unget() on "character devices" like pipes and network sockets.
Ideally, it would be some generic buffer front end that could be used
by the device for normal data as well as the special case.

Ooh. Need to think. If there's a -EIO problem on one of the file
descriptors, how does the caller know which one? That's an argument for
separate pull and push (although the splice() library routine still
has the problem). Any suggestions? Does userland need to fall
back on read()/write() for a byte?