2001-03-07 05:20:24

by Jeremy Elson

[permalink] [raw]
Subject: Mapping a piece of one process' addrspace to another?

Greetings,

Is there some way to map a piece of process X's address space into
process Y, without X's knowledge or cooperation? (The non-cooperating
nature of process X is why I can't use plain old shared memory.)

Put another way, I need to grant Process Y permission to write into a
private buffer owned by process X. X doesn't know this is happening,
and I don't know where X's buffer even lives (X's stack, heap, etc.).
X just passes a pointer to the kernel via a system call (e.g., like
sys_read).

In the system I currently have implemented, Y writes a buffer to the
kernel, and the kernel relays it to X on behalf of Y. But, for
performance reasons, I want to try to get Y to write its data directly
to X instead. X might be any process calling read() (e.g., "cat"),
but Y is in collusion with the kernel.

If this is possible at all, is it also possible on a granularity down
to one byte (as opposed to an entire page)?

Sorry if this seems like a strange thing to do - but, I hope to
document and release my code soon, in which case its purpose should be
clearer :-).

Regards,
Jeremy


2001-03-07 05:56:25

by Alexander Viro

[permalink] [raw]
Subject: Re: Mapping a piece of one process' addrspace to another?



You are reinventing the wheel.
man ptrace (see PTRACE_{PEEK,POKE}{TEXT,DATA} and PTRACE_{ATTACH,CONT,DETACH})

Cheers,
Al

2001-03-07 07:48:25

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: Mapping a piece of one process' addrspace to another?



On Wed, 7 Mar 2001, Alexander Viro wrote:

>
>
> You are reinventing the wheel.
> man ptrace (see PTRACE_{PEEK,POKE}{TEXT,DATA} and PTRACE_{ATTACH,CONT,DETACH})

With ptrace data will be copied twice. As far as I understood, Jeremy
wants to avoid that.

2001-03-07 08:14:46

by Jeremy Elson

[permalink] [raw]
Subject: Re: Mapping a piece of one process' addrspace to another?

Marcelo Tosatti writes:
>On Wed, 7 Mar 2001, Alexander Viro wrote:
>> You are reinventing the wheel.
>> man ptrace (see PTRACE_{PEEK,POKE}{TEXT,DATA} and PTRACE_{ATTACH,CONT,DETACH})
>
>With ptrace data will be copied twice. As far as I understood, Jeremy
>wants to avoid that.

Yes - I've been looking at the sys_ptrace code and it seems that it
does two copies through the kernel (but, using access_process_vm
instead of copy_from_user -- I'm not sure how they differ). I'm
trying to reduce the number of copies by giving one process a pointer
into another process's address space.

Right now, my code looks something like this: (it might make more
sense if you know that I've written a framework for writing user-space
device drivers... I'm going to be releasing it soon, hopefully after I
resolve this performance problem. Or maybe before, if it's hard.)

Process X: (any arbitrary process)
read(fd, my_local_buffer);
Kernel:
[notifies process Y that X needs data]
Process Y: (the userspace device driver)
memcpy(my_local_buffer, data_destined_for_process_x);
write(fd, my_local_buffer);
Kernel (in sys_write:)
copy_from_user(kernel_buffer, Y's my_local_buffer);
copy_to_user(X's my_local_buffer, kernel_buffer);

Now, this all works fine, but it's slower than I'd like. I used SGI's
kernprof (oss.sgi.com/projects/kernprof) and found that the kernel was
spending most of its time (like 70%) in copy_to_user and
copy_from_user. So, I am hoping to make this all go faster, while
hopefully preserving correctness, by changing to something like this:

Process X:
read(fd, my_local_buffer);
Kernel:
[maps X's my_local_buffer to Y's address space]
[notifies process Y that X needs data, and sends along a
pointer to write data to, or a handle usable by shmat,
or something]
Process Y:
memcpy(X's my_local_buffer, data_destined_for_process_x);

If I'm lucky we can avoid both copy_to_user and copy_from_user.

This would be easy if X was a "cooperative" process because we could
just set up a shared memory segment between the two processes. The
problem is that X is not cooperative -- it can be any existing
program.

Regards,
Jeremy

2001-03-07 08:42:07

by Alexander Viro

[permalink] [raw]
Subject: Re: Mapping a piece of one process' addrspace to another?



On Wed, 7 Mar 2001, Jeremy Elson wrote:

> Right now, my code looks something like this: (it might make more
> sense if you know that I've written a framework for writing user-space
> device drivers... I'm going to be releasing it soon, hopefully after I
> resolve this performance problem. Or maybe before, if it's hard.)

Ugh. Why not make that a named pipe and use zerocopy stuff for pipes?
I.e. why bother with making it look like a character device rather than
a FIFO?
Cheers,
Al

2001-03-07 08:55:18

by Jeremy Elson

[permalink] [raw]
Subject: Re: Mapping a piece of one process' addrspace to another?

Alexander Viro writes:
>On Wed, 7 Mar 2001, Jeremy Elson wrote:
>
>> Right now, my code looks something like this: (it might make more
>> sense if you know that I've written a framework for writing user-space
>> device drivers... I'm going to be releasing it soon, hopefully after I
>> resolve this performance problem. Or maybe before, if it's hard.)
>
>Ugh. Why not make that a named pipe and use zerocopy stuff for pipes?
>I.e. why bother with making it look like a character device rather than
>a FIFO?

Well, because it's a character device :-). i.e,. the framework allows
you to write a userspace program that services callbacks for character
devices. Inside the kernel, all open()/release()/ioctl()/etc calls
for the device are proxied out to userspace where a library calls a
userspace callback, and the result goes back to the kernel where it is
then returned to the calling process.

The problem is just that to return data (instead of just a retval), as
is needed for read and some ioctls, it leads to 3 copies as I
described earlier.

BTW, where are the zerocopy patches for pipes? Maybe I'm missing
something but it seems that pipes inside the kernel are still
implememented by copying into the kernel and then copying out.
Whatever method the zerocopy pipes use is probably what I'm looking
for though.

-Jer

2001-03-07 08:59:48

by Abramo Bagnara

[permalink] [raw]
Subject: Re: Mapping a piece of one process' addrspace to another?

Jeremy Elson wrote:
>
> Marcelo Tosatti writes:
> >On Wed, 7 Mar 2001, Alexander Viro wrote:
> >> You are reinventing the wheel.
> >> man ptrace (see PTRACE_{PEEK,POKE}{TEXT,DATA} and PTRACE_{ATTACH,CONT,DETACH})
> >
> >With ptrace data will be copied twice. As far as I understood, Jeremy
> >wants to avoid that.
>
> Yes - I've been looking at the sys_ptrace code and it seems that it
> does two copies through the kernel (but, using access_process_vm
> instead of copy_from_user -- I'm not sure how they differ). I'm
> trying to reduce the number of copies by giving one process a pointer
> into another process's address space.

I've investigated this in past to do the same thing you're trying (in my
case I was interested in a full user space implementation of OSS
emulation layer for ALSA).

The other huge drawback for ptrace is the impossibility to use mmap.

With older versions of Linux it was feasible to mmap /proc/PID/mem and
this was a very elegant way to do the job.

Unfortunately this feature has been dropped, but I don't know for which
reasons.

--
Abramo Bagnara mailto:[email protected]

Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

ALSA project is http://www.alsa-project.org
sponsored by SuSE Linux http://www.suse.com

It sounds good!

2001-03-07 09:05:09

by Abramo Bagnara

[permalink] [raw]
Subject: Re: Mapping a piece of one process' addrspace to another?

Alexander Viro wrote:
>
> On Wed, 7 Mar 2001, Jeremy Elson wrote:
>
> > Right now, my code looks something like this: (it might make more
> > sense if you know that I've written a framework for writing user-space
> > device drivers... I'm going to be releasing it soon, hopefully after I
> > resolve this performance problem. Or maybe before, if it's hard.)
>
> Ugh. Why not make that a named pipe and use zerocopy stuff for pipes?
> I.e. why bother with making it look like a character device rather than
> a FIFO?

What about ioctl? Device drivers sometimes need it ;-)


--
Abramo Bagnara mailto:[email protected]

Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

ALSA project is http://www.alsa-project.org
sponsored by SuSE Linux http://www.suse.com

It sounds good!

2001-03-07 09:10:28

by Alexander Viro

[permalink] [raw]
Subject: Re: Mapping a piece of one process' addrspace to another?



On Wed, 7 Mar 2001, Jeremy Elson wrote:

> Well, because it's a character device :-). i.e,. the framework allows
> you to write a userspace program that services callbacks for character
> devices. Inside the kernel, all open()/release()/ioctl()/etc calls
> for the device are proxied out to userspace where a library calls a
> userspace callback, and the result goes back to the kernel where it is
> then returned to the calling process.

Double ugh. Why bother with ioctl() when you can just have a second
channel and do read()/write() on it?

> The problem is just that to return data (instead of just a retval), as
> is needed for read and some ioctls, it leads to 3 copies as I
> described earlier.
>
> BTW, where are the zerocopy patches for pipes? Maybe I'm missing
> something but it seems that pipes inside the kernel are still
> implememented by copying into the kernel and then copying out.
> Whatever method the zerocopy pipes use is probably what I'm looking
> for though.

Ask DaveM or look through l-k archives for URL of recent variant...
Cheers,
Al

2001-03-07 09:11:38

by Alexander Viro

[permalink] [raw]
Subject: Re: Mapping a piece of one process' addrspace to another?



On Wed, 7 Mar 2001, Abramo Bagnara wrote:

> Alexander Viro wrote:
> >
> > On Wed, 7 Mar 2001, Jeremy Elson wrote:
> >
> > > Right now, my code looks something like this: (it might make more
> > > sense if you know that I've written a framework for writing user-space
> > > device drivers... I'm going to be releasing it soon, hopefully after I
> > > resolve this performance problem. Or maybe before, if it's hard.)
> >
> > Ugh. Why not make that a named pipe and use zerocopy stuff for pipes?
> > I.e. why bother with making it look like a character device rather than
> > a FIFO?
>
> What about ioctl? Device drivers sometimes need it ;-)

No, they don't. OOB data is equivalent to data on parallel channel.

2001-03-07 09:21:18

by Abramo Bagnara

[permalink] [raw]
Subject: Re: Mapping a piece of one process' addrspace to another?

Alexander Viro wrote:
>
> On Wed, 7 Mar 2001, Abramo Bagnara wrote:
>
> > Alexander Viro wrote:
> > >
> > > On Wed, 7 Mar 2001, Jeremy Elson wrote:
> > >
> > > > Right now, my code looks something like this: (it might make more
> > > > sense if you know that I've written a framework for writing user-space
> > > > device drivers... I'm going to be releasing it soon, hopefully after I
> > > > resolve this performance problem. Or maybe before, if it's hard.)
> > >
> > > Ugh. Why not make that a named pipe and use zerocopy stuff for pipes?
> > > I.e. why bother with making it look like a character device rather than
> > > a FIFO?
> >
> > What about ioctl? Device drivers sometimes need it ;-)
>
> No, they don't. OOB data is equivalent to data on parallel channel.

Al, you're perfectly right in principle (although last time I've checked
pipe and unix socket did not support OOB data. Is this changed
recently?).

But you're forgetting that we need to cope with non collaborative
applications (that *use* ioctl).

--
Abramo Bagnara mailto:[email protected]

Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

ALSA project is http://www.alsa-project.org
sponsored by SuSE Linux http://www.suse.com

It sounds good!

2001-03-07 09:50:48

by Alexander Viro

[permalink] [raw]
Subject: Re: Mapping a piece of one process' addrspace to another?



On Wed, 7 Mar 2001, Abramo Bagnara wrote:

> > No, they don't. OOB data is equivalent to data on parallel channel.
>
> Al, you're perfectly right in principle (although last time I've checked
> pipe and unix socket did not support OOB data. Is this changed
> recently?).

ioctl() is OOB stuff. Replace with second pipe and you don't need it
anymore.

> But you're forgetting that we need to cope with non collaborative
> applications (that *use* ioctl).

Erm. If ioctls are device-specific - the program is already bound to
specific driver. If they are for class of devices (and if I guessed
right that's the case you are interested in - sound, isn't it?) we
could let the stub driver in kernel open two pipes and redirect
read()/write() on device to the first one turn ioctls into read()/write()
on the second. That way you can get userland programs serving that
stuff with no magic required.

I mean something along the lines of

foo_write(struct file *file, char *buf, size_t size, loff_t *ppos)
{
return f1->f_ops->write(file, buf, size, ppos);
}
foo_ioctl(struct file *file, int cmd, long arg)
{
loff_t dummy;
switch (cmd) {
case SNDCTL_DSP_RESET:
f2->f_ops->write(file, "dsp_reset", 10, &dummy);
return 0;
....
}
}
where f1 and f2 are named pipes opened at init time. Notice that
it doesn't introduce extra copying - just splits the stream with OOB data
into two normal byte streams that can be served by naive userland code.
After that any client that uses current sound API is able to talk to
userland drivers and said drivers don't have to pull any special tricks -
they just open two pipes and do usual select()/read()/write() loop.
Kernel side is also quite simple that way - plain redirect for read/write
and trivial decoder for the sound ioctls.
Cheers,
Al

2001-03-07 10:52:40

by Manfred Spraul

[permalink] [raw]
Subject: Re: Mapping a piece of one process' addrspace to another?

// $Header$
// Kernel Version:
// VERSION = 2
// PATCHLEVEL = 4
// SUBLEVEL = 1
// EXTRAVERSION =
--- 2.4/fs/pipe.c Wed Feb 7 20:02:07 2001
+++ build-2.4/fs/pipe.c Thu Feb 8 21:11:10 2001
@@ -2,6 +2,9 @@
* linux/fs/pipe.c
*
* Copyright (C) 1991, 1992, 1999 Linus Torvalds
+ *
+ * Major pipe_read() and pipe_write() cleanup, kiobuf based
+ * single copy Copyright (C) 2001 Manfred Spraul
*/

#include <linux/mm.h>
@@ -10,6 +13,8 @@
#include <linux/malloc.h>
#include <linux/module.h>
#include <linux/init.h>
+#include <linux/iobuf.h>
+#include <linux/highmem.h>

#include <asm/uaccess.h>

@@ -35,97 +40,149 @@
down(PIPE_SEM(*inode));
}

+struct pipe_pio {
+ int *pdone;
+ struct kiobuf iobuf;
+};
+
+static int
+pio_copy_to_user(struct kiobuf* iobuf, int offset, char* ubuf, int chars)
+{
+ int page_nr;
+ offset += iobuf->offset;
+ page_nr = offset/PAGE_SIZE;
+ offset %= PAGE_SIZE;
+ while(chars) {
+ int pcount = PAGE_SIZE-offset;
+ void *kaddr;
+ if (pcount > chars)
+ pcount = chars;
+ kaddr = kmap(iobuf->maplist[page_nr]);
+ if (copy_to_user(ubuf, kaddr+offset, pcount))
+ return 1;
+ kunmap(iobuf->maplist[page_nr]);
+ chars -= pcount;
+ ubuf += pcount;
+ offset = 0;
+ page_nr++;
+ }
+ return 0;
+}
+
static ssize_t
pipe_read(struct file *filp, char *buf, size_t count, loff_t *ppos)
{
struct inode *inode = filp->f_dentry->d_inode;
- ssize_t size, read, ret;
+ ssize_t read, ret;

- /* Seeks are not allowed on pipes. */
- ret = -ESPIPE;
- read = 0;
+ /* pread is not allowed on pipes. */
if (ppos != &filp->f_pos)
- goto out_nolock;
+ return -ESPIPE;

/* Always return 0 on null read. */
- ret = 0;
if (count == 0)
- goto out_nolock;
+ return 0;

- /* Get the pipe semaphore */
- ret = -ERESTARTSYS;
- if (down_interruptible(PIPE_SEM(*inode)))
- goto out_nolock;
+ down(PIPE_SEM(*inode));

- if (PIPE_EMPTY(*inode)) {
-do_more_read:
+ read = 0;
+ for (;;) {
+ /* read what data is available */
+ int chars = PIPE_LEN(*inode);
+ if (chars) {
+ char *pipebuf = PIPE_BASE(*inode);
+ int offset = PIPE_START(*inode);
+
+ if (chars > count)
+ chars = count;
+ ret = -EFAULT;
+ if(PIPE_IS_PIO(*inode)) {
+ struct pipe_pio* pio = ((struct pipe_pio*)pipebuf);
+ if(pio_copy_to_user(&pio->iobuf, offset, buf, chars))
+ goto out;
+
+ PIPE_LEN(*inode) -= chars;
+ if(!PIPE_LEN(*inode)) {
+ unmap_kiobuf(&pio->iobuf);
+ *pio->pdone = 1;
+ PIPE_IS_PIO(*inode) = 0;
+ PIPE_START(*inode) = 0;
+ } else {
+ PIPE_START(*inode) += chars;
+ }
+ } else {
+ if (chars > PIPE_SIZE-offset)
+ chars = PIPE_SIZE-offset;
+ if (copy_to_user(buf, pipebuf+offset, chars))
+ goto out;
+ PIPE_LEN(*inode) -= chars;
+ if (!PIPE_LEN(*inode)) {
+ /* Cache behaviour optimization */
+ PIPE_START(*inode) = 0;
+ } else {
+ PIPE_START(*inode) += chars;
+ PIPE_START(*inode) &= (PIPE_SIZE - 1);
+ }
+ }
+ read += chars;
+ count -= chars;
+ buf += chars;
+ }
ret = 0;
+ if (!count)
+ goto out;
+
+ /* Rare special case:
+ * The pipe buffer was really circular,
+ * the wrapped bytes must be read before sleeping.
+ */
+ if (PIPE_LEN(*inode))
+ continue;
+
+ /* Never sleep if no process has the pipe open
+ * for writing */
if (!PIPE_WRITERS(*inode))
goto out;

+ /* Never sleep if O_NONBLOCK is set */
ret = -EAGAIN;
if (filp->f_flags & O_NONBLOCK)
goto out;

- for (;;) {
- PIPE_WAITING_READERS(*inode)++;
- pipe_wait(inode);
- PIPE_WAITING_READERS(*inode)--;
- ret = -ERESTARTSYS;
- if (signal_pending(current))
- goto out;
+ /* optimization:
+ * pipe_read() should return even if only a single byte
+ * was read. (Posix Std. 6.4.1.2)
+ * But if another process is sleeping in pipe_write()
+ * then we wait for that data - it's invisible for user
+ * space programs.
+ */
+ if (PIPE_MORE_DATA_WAITING(*inode)) {
+ /*
+ * We know that we are going to sleep: signal
+ * writers synchronously that there is more
+ * room.
+ */
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
+ } else if (read) {
+ /* We know that there are no writers, no need
+ * for wake up.
+ */
ret = 0;
- if (!PIPE_EMPTY(*inode))
- break;
- if (!PIPE_WRITERS(*inode))
- goto out;
+ goto out;
}
- }

- /* Read what data is available. */
- ret = -EFAULT;
- while (count > 0 && (size = PIPE_LEN(*inode))) {
- char *pipebuf = PIPE_BASE(*inode) + PIPE_START(*inode);
- ssize_t chars = PIPE_MAX_RCHUNK(*inode);
-
- if (chars > count)
- chars = count;
- if (chars > size)
- chars = size;
-
- if (copy_to_user(buf, pipebuf, chars))
+ pipe_wait(inode);
+ ret = -ERESTARTSYS;
+ if (signal_pending(current))
goto out;
-
- read += chars;
- PIPE_START(*inode) += chars;
- PIPE_START(*inode) &= (PIPE_SIZE - 1);
- PIPE_LEN(*inode) -= chars;
- count -= chars;
- buf += chars;
}

- /* Cache behaviour optimization */
- if (!PIPE_LEN(*inode))
- PIPE_START(*inode) = 0;
-
- if (count && PIPE_WAITING_WRITERS(*inode) && !(filp->f_flags & O_NONBLOCK)) {
- /*
- * We know that we are going to sleep: signal
- * writers synchronously that there is more
- * room.
- */
- wake_up_interruptible_sync(PIPE_WAIT(*inode));
- if (!PIPE_EMPTY(*inode))
- BUG();
- goto do_more_read;
- }
+out:
/* Signal writers asynchronously that there is more room. */
- wake_up_interruptible(PIPE_WAIT(*inode));
+ if (read && !PIPE_IS_PIO(*inode))
+ wake_up_interruptible(PIPE_WAIT(*inode));

- ret = read;
-out:
up(PIPE_SEM(*inode));
-out_nolock:
if (read)
ret = read;
return ret;
@@ -136,113 +193,143 @@
{
struct inode *inode = filp->f_dentry->d_inode;
ssize_t free, written, ret;
+ int pio_done, do_wakeup;

- /* Seeks are not allowed on pipes. */
- ret = -ESPIPE;
- written = 0;
+ /* pwrite is not allowed on pipes. */
if (ppos != &filp->f_pos)
- goto out_nolock;
+ return -ESPIPE;

/* Null write succeeds. */
- ret = 0;
if (count == 0)
- goto out_nolock;
-
- ret = -ERESTARTSYS;
- if (down_interruptible(PIPE_SEM(*inode)))
- goto out_nolock;
+ return 0;

- /* No readers yields SIGPIPE. */
- if (!PIPE_READERS(*inode))
- goto sigpipe;
+ down(PIPE_SEM(*inode));

/* If count <= PIPE_BUF, we have to make it atomic. */
free = (count <= PIPE_BUF ? count : 1);

- /* Wait, or check for, available space. */
- if (filp->f_flags & O_NONBLOCK) {
- ret = -EAGAIN;
- if (PIPE_FREE(*inode) < free)
+ written = 0;
+ pio_done = 1;
+ do_wakeup = 0;
+ for(;;) {
+ /* No readers yields SIGPIPE. */
+ ret = -EPIPE;
+ if (!PIPE_READERS(*inode))
goto out;
- } else {
- while (PIPE_FREE(*inode) < free) {
- PIPE_WAITING_WRITERS(*inode)++;
- pipe_wait(inode);
- PIPE_WAITING_WRITERS(*inode)--;
- ret = -ERESTARTSYS;
- if (signal_pending(current))
- goto out;

- if (!PIPE_READERS(*inode))
- goto sigpipe;
- }
- }
-
- /* Copy into available space. */
- ret = -EFAULT;
- while (count > 0) {
- int space;
- char *pipebuf = PIPE_BASE(*inode) + PIPE_END(*inode);
- ssize_t chars = PIPE_MAX_WCHUNK(*inode);
-
- if ((space = PIPE_FREE(*inode)) != 0) {
- if (chars > count)
- chars = count;
- if (chars > space)
- chars = space;
-
- if (copy_from_user(pipebuf, buf, chars))
- goto out;
-
- written += chars;
- PIPE_LEN(*inode) += chars;
- count -= chars;
- buf += chars;
- space = PIPE_FREE(*inode);
- continue;
+ if(!PIPE_IS_PIO(*inode)) {
+ int chars;
+ /* Copy into available space. */
+ chars = PIPE_FREE(*inode);
+
+ /*
+ * Try zero-copy:
+ * - only possible if the normal pipe buffer
+ * is empty
+ * - only possible if we can block:
+ * a) O_NONBLOCK not set
+ * and
+ * b) request for more than PIPE_BUF bytes.
+ * No Unix version blocks in pipe write for
+ * <= PIPE_BUF bytes after poll() returned POLLOUT.
+ */
+ ret = -EFAULT;
+ if (count > PIPE_BUF && chars == PIPE_SIZE &&
+ (!(filp->f_flags & O_NONBLOCK))) {
+ struct pipe_pio* pio = (struct pipe_pio*)PIPE_BASE(*inode);
+ chars = KIO_MAX_ATOMIC_BYTES;
+ if (chars > count)
+ chars = count;
+ kiobuf_init(&pio->iobuf);
+ if(map_user_kiobuf(READ, &pio->iobuf, (unsigned long)buf, chars))
+ goto out;
+ PIPE_IS_PIO(*inode) = 1;
+ pio_done = 0;
+ pio->pdone = &pio_done;
+
+ written += chars;
+ PIPE_LEN(*inode) += chars;
+ count -= chars;
+ buf += chars;
+ do_wakeup = 1;
+ } else if (chars >= free) {
+ int offset;
+next_chunk:
+ offset = PIPE_END(*inode);
+
+ if (chars > count)
+ chars = count;
+ if (chars > PIPE_SIZE-offset)
+ chars = PIPE_SIZE-offset;
+ if (copy_from_user(PIPE_BASE(*inode)+offset, buf, chars))
+ goto out;
+
+ written += chars;
+ PIPE_LEN(*inode) += chars;
+ count -= chars;
+ buf += chars;
+ do_wakeup = 1;
+
+ if(!count)
+ break; /* DONE! */
+
+ /* special case: pipe buffer wrapped */
+ if(PIPE_LEN(*inode) != PIPE_SIZE) {
+ chars = PIPE_FREE(*inode);
+ goto next_chunk;
+ }
+ }
}

- ret = written;
+ ret = -EAGAIN;
if (filp->f_flags & O_NONBLOCK)
break;

- do {
+ /* Do not wakeup unless data was written, otherwise
+ * multiple writers can cause a wakeup storm
+ */
+ if(do_wakeup) {
/*
* Synchronous wake-up: it knows that this process
* is going to give up this CPU, so it doesnt have
* to do idle reschedules.
*/
wake_up_interruptible_sync(PIPE_WAIT(*inode));
- PIPE_WAITING_WRITERS(*inode)++;
- pipe_wait(inode);
- PIPE_WAITING_WRITERS(*inode)--;
- if (signal_pending(current))
- goto out;
- if (!PIPE_READERS(*inode))
- goto sigpipe;
- } while (!PIPE_FREE(*inode));
- ret = -EFAULT;
+ do_wakeup = 0;
+ }
+ if (count)
+ PIPE_MORE_DATA_WAITING(*inode)++;
+ pipe_wait(inode);
+ if (count)
+ PIPE_MORE_DATA_WAITING(*inode)--;
+ if (!count && pio_done)
+ break; /* DONE */
+ ret = -ERESTARTSYS;
+ if (signal_pending(current))
+ goto out;
}
-
- /* Signal readers asynchronously that there is more data. */
- wake_up_interruptible(PIPE_WAIT(*inode));
-
- inode->i_ctime = inode->i_mtime = CURRENT_TIME;
- mark_inode_dirty(inode);
-
out:
- up(PIPE_SEM(*inode));
-out_nolock:
- if (written)
- ret = written;
- return ret;
+ if(!pio_done) {
+ struct pipe_pio* pio = (struct pipe_pio*)PIPE_BASE(*inode);
+ PIPE_IS_PIO(*inode) = 0;
+ written -= PIPE_LEN(*inode);
+ PIPE_LEN(*inode) = 0;
+ unmap_kiobuf(&pio->iobuf);
+ wake_up_interruptible(PIPE_WAIT(*inode));
+ }
+ if (written) {
+ inode->i_ctime = inode->i_mtime = CURRENT_TIME;
+ mark_inode_dirty(inode);

-sigpipe:
- if (written)
- goto out;
+ ret = written;
+ }
up(PIPE_SEM(*inode));
- send_sig(SIGPIPE, current, 0);
- return -EPIPE;
+ /* Signal readers asynchronously that there is more data. */
+ if(do_wakeup)
+ wake_up_interruptible(PIPE_WAIT(*inode));
+ if (ret == -EPIPE)
+ send_sig(SIGPIPE, current, 0);
+ return ret;
}

static loff_t
@@ -453,9 +540,10 @@

init_waitqueue_head(PIPE_WAIT(*inode));
PIPE_BASE(*inode) = (char*) page;
+ PIPE_IS_PIO(*inode) = 0;
PIPE_START(*inode) = PIPE_LEN(*inode) = 0;
PIPE_READERS(*inode) = PIPE_WRITERS(*inode) = 0;
- PIPE_WAITING_READERS(*inode) = PIPE_WAITING_WRITERS(*inode) = 0;
+ PIPE_MORE_DATA_WAITING(*inode) = 0;
PIPE_RCOUNTER(*inode) = PIPE_WCOUNTER(*inode) = 1;

return inode;
--- 2.4/include/linux/pipe_fs_i.h Wed Feb 7 20:02:07 2001
+++ build-2.4/include/linux/pipe_fs_i.h Wed Feb 7 22:08:15 2001
@@ -5,11 +5,11 @@
struct pipe_inode_info {
wait_queue_head_t wait;
char *base;
+ unsigned int is_pio;
unsigned int start;
unsigned int readers;
unsigned int writers;
- unsigned int waiting_readers;
- unsigned int waiting_writers;
+ unsigned int more_data;
unsigned int r_counter;
unsigned int w_counter;
};
@@ -21,12 +21,12 @@
#define PIPE_SEM(inode) (&(inode).i_sem)
#define PIPE_WAIT(inode) (&(inode).i_pipe->wait)
#define PIPE_BASE(inode) ((inode).i_pipe->base)
+#define PIPE_IS_PIO(inode) ((inode).i_pipe->is_pio)
#define PIPE_START(inode) ((inode).i_pipe->start)
#define PIPE_LEN(inode) ((inode).i_size)
#define PIPE_READERS(inode) ((inode).i_pipe->readers)
#define PIPE_WRITERS(inode) ((inode).i_pipe->writers)
-#define PIPE_WAITING_READERS(inode) ((inode).i_pipe->waiting_readers)
-#define PIPE_WAITING_WRITERS(inode) ((inode).i_pipe->waiting_writers)
+#define PIPE_MORE_DATA_WAITING(inode) ((inode).i_pipe->more_data)
#define PIPE_RCOUNTER(inode) ((inode).i_pipe->r_counter)
#define PIPE_WCOUNTER(inode) ((inode).i_pipe->w_counter)

@@ -34,8 +34,6 @@
#define PIPE_FULL(inode) (PIPE_LEN(inode) == PIPE_SIZE)
#define PIPE_FREE(inode) (PIPE_SIZE - PIPE_LEN(inode))
#define PIPE_END(inode) ((PIPE_START(inode) + PIPE_LEN(inode)) & (PIPE_SIZE-1))
-#define PIPE_MAX_RCHUNK(inode) (PIPE_SIZE - PIPE_START(inode))
-#define PIPE_MAX_WCHUNK(inode) (PIPE_SIZE - PIPE_END(inode))

/* Drop the inode semaphore and wait for a pipe event, atomically */
void pipe_wait(struct inode * inode);


Attachments:
patch-kiopipe (12.73 kB)

2001-03-07 13:06:52

by Abramo Bagnara

[permalink] [raw]
Subject: Re: Mapping a piece of one process' addrspace to another?

Alexander Viro wrote:
>
>
> > But you're forgetting that we need to cope with non collaborative
> > applications (that *use* ioctl).
>
> Erm. If ioctls are device-specific - the program is already bound to
> specific driver. If they are for class of devices (and if I guessed
> right that's the case you are interested in - sound, isn't it?) we
> could let the stub driver in kernel open two pipes and redirect
> read()/write() on device to the first one turn ioctls into read()/write()
> on the second. That way you can get userland programs serving that
> stuff with no magic required.
>
> I mean something along the lines of
>
> foo_write(struct file *file, char *buf, size_t size, loff_t *ppos)
> {
> return f1->f_ops->write(file, buf, size, ppos);
> }
> foo_ioctl(struct file *file, int cmd, long arg)
> {
> loff_t dummy;
> switch (cmd) {
> case SNDCTL_DSP_RESET:
> f2->f_ops->write(file, "dsp_reset", 10, &dummy);
> return 0;
> ....
> }
> }
> where f1 and f2 are named pipes opened at init time. Notice that
> it doesn't introduce extra copying - just splits the stream with OOB data
> into two normal byte streams that can be served by naive userland code.
> After that any client that uses current sound API is able to talk to
> userland drivers and said drivers don't have to pull any special tricks -
> they just open two pipes and do usual select()/read()/write() loop.
> Kernel side is also quite simple that way - plain redirect for read/write
> and trivial decoder for the sound ioctls.

I agree, and this is how I wanted to implement it (apart that I thought
to packetize read/write/ioctl with an header and then to avoid to have
two pipes).

This has two major drawbacks (that have discouraged me):
a) no mmap support
b) the kernel side need to know the details of ioctl contents

a) bring to have an incomplete solution
b) bring the kernel space to be OSS dependent then giving a not general
solution for user space drivers.

The only thing I see that would cope with these drawbacks is to allow a
privileged server application to mmap the address space of another
client application. Then:

a) is solved trivially
b) is ok because server would read/write directly ioctl arguments from
client address space.

Security is a non issue here as also for kernel space driver
applications expose all their address space to driver.

--
Abramo Bagnara mailto:[email protected]

Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

ALSA project is http://www.alsa-project.org
sponsored by SuSE Linux http://www.suse.com

It sounds good!

2001-03-07 13:59:13

by Jesse Pollard

[permalink] [raw]
Subject: Re: Mapping a piece of one process' addrspace to another?

Marcelo Tosatti <[email protected]>:
> On Wed, 7 Mar 2001, Alexander Viro wrote:
>
> >
> >
> > You are reinventing the wheel.
> > man ptrace (see PTRACE_{PEEK,POKE}{TEXT,DATA} and PTRACE_{ATTACH,CONT,DETACH})
>
> With ptrace data will be copied twice. As far as I understood, Jeremy
> wants to avoid that.

The the only way left would be to mmap a file. The second process could
mmap the same file to put data.I believe the buffers holding the data
would be shared between the two processes.

How the first process detects that I don't know (semaphore? signal?).

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]

Any opinions expressed are solely my own.

2001-03-07 21:03:26

by Jeremy Elson

[permalink] [raw]
Subject: Re: Mapping a piece of one process' addrspace to another?

Alexander Viro writes:
>Erm. If ioctls are device-specific - the program is already bound to
>specific driver. If they are for class of devices (and if I guessed
>right that's the case you are interested in - sound, isn't it?) we
>could let the stub driver in kernel open two pipes and redirect
>read()/write() on device to the first one turn ioctls into read()/write()
>on the second. That way you can get userland programs serving that
>stuff with no magic required.

One problem, in addition to the points Abramo made, is that there's no
metadata that goes along with the data you're writing to the pipe.
This is fine as long as there's only one process reading, but in the
case of a single driver serving multiple instances of an open file,
there has to be some way for the userspace driver to communicate to
the kernel which open file needs to get the data.

A similar problem is that metadata needs to accompany the read request
when the userspace driver gets it (i.e., current file position, file
flags, size of the read that was requested, etc.) Although, that data
might be transmitted on the OOB channel.

Also, even though my framework right now only does character devices,
it's a pretty simple extension to support userspace block devices and
network devices as well. In these cases I think using a pipe gets
more and more complicated.

>From reading all the responses to the thread (thanks, everyone, for
your useful comments), it sounds like the best thing to do is use
Manfred's zerocopy pipe code as a model and implement something
similar in my framework. It's not really that much code anyway so
it's not much of a waste reimplementing it. And, I think
reimplementing those 30 lines will be much simpler than trying to
manage 2 parallel channels. If the zerocopy patch becomes part of the
kernel anyway, it gets even simpler, since I can probably just call
pio_copy_to_user instead of copying it.

Regards,
Jer

2001-03-07 23:23:10

by James Cloos

[permalink] [raw]
Subject: Re: Mapping a piece of one process' addrspace to another?

>>>>> "AV" == Alexander Viro <[email protected]> writes:

AV> Double ugh. Why bother with ioctl() when you can just have a
AV> second channel and do read()/write() on it?

Because you cannot rewrite -- or even re-compile -- every app this
should support. OSS emulation by ALSA is a great example, given
how many binary-only apps already exist which need OSS emulation
on an ALSA box.

I'm sure sound is not the only real application for this.

-JimC
--
James H. Cloos, Jr. <http://jhcloos.com/public_key> 1024D/ED7DAEA6
<[email protected]> E9E9 F828 61A4 6EA9 0F2B 63E7 997A 9F17 ED7D AEA6

2001-03-08 08:45:57

by Pavel Machek

[permalink] [raw]
Subject: Re: Mapping a piece of one process' addrspace to another?

Hi!

> > You are reinventing the wheel.
> > man ptrace (see PTRACE_{PEEK,POKE}{TEXT,DATA} and PTRACE_{ATTACH,CONT,DETACH})
>
> With ptrace data will be copied twice. As far as I understood, Jeremy
> wants to avoid that.

mmap /proc/pid/memory?
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.