2005-12-20 16:17:38

by Badari Pulavarty

[permalink] [raw]
Subject: [RFC][PATCH] New iovec support & VFS changes

Hi,

I was trying to add support for preadv()/pwritev() for threaded
databases. Currently the patch is in -mm tree.

http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.15-
rc5/2.6.15-rc5-mm3/broken-out/support-for-preadv-pwritev.patch

This needs a new set of system calls. Ulrich Drepper pointed out
that, instead of adding a system call for the limited functionality
it provides, why not we add new iovec interface as follows (offset-per-
segment) which provides greater functionality & flexibility.

+struct niovec
+{
+ void __user *iov_base;
+ __kernel_size_t iov_len;
+ __kernel_loff_t iov_off; /* NEW */
+};

In order to support this, we need to change all the file_operations
(readv/writev) and its helper functions to take this new structure.

I took a stab at doing it and I want feedback on whether this is
acceptable. All the patch does - is to make kernel use new structure,
but the existing syscalls like readv()/writev() still deals with
original one to keep the compatibility. (pipes and sockets need
changing too - which I have not addressed yet).

Is this the right approach ?

Please DO NOT apply (I haven't tested it at all).

Thanks,
Badari



Attachments:
niovec.patch (17.99 kB)

2005-12-20 17:00:06

by Jamie Lokier

[permalink] [raw]
Subject: Re: [RFC][PATCH] New iovec support & VFS changes

Badari Pulavarty wrote:
> I was trying to add support for preadv()/pwritev() for threaded
> databases. Currently the patch is in -mm tree.
>
> http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.15-
> rc5/2.6.15-rc5-mm3/broken-out/support-for-preadv-pwritev.patch
>
> This needs a new set of system calls. Ulrich Drepper pointed out
> that, instead of adding a system call for the limited functionality
> it provides, why not we add new iovec interface as follows (offset-per-
> segment) which provides greater functionality & flexibility.
>
> +struct niovec
> +{
> + void __user *iov_base;
> + __kernel_size_t iov_len;
> + __kernel_loff_t iov_off; /* NEW */
> +};

For a database, it's also helpful to know when an operation is going
to block on I/O (i.e. because the data isn't cached, or write buffers
full) and if that's going to happen, move it to another thread, or
move other operations to another thread. This allows a program to
continue to work on other things concurrently with I/O more
effectively than thread pool guesswork.

So if you add these new syscalls, it would be helpful to add a "flags"
argument to each of them, and define one flag: "don't block on I/O".
When the flag is set, the syscalls should do as much reading or
writing as they can without blocking, and then return the count, or
EAGAIN.

(FreeBSD's sendfile() has an SF_NODISKIO flag which means this, and it
is used in exactly that way: so a program can move the sendfile() to
another thread iff that is necessary to avoid blocking the program.)

There's also a case for making these into async I/O operations.
However, if there is any possibility of async I/O blocking a task for
a long time (which there is with Linux async I/O apparently), that is
not half as useful as a flag to stop I/O when it would block, and let
the program decide what to do.

I mention this precisely because it's relevant to I/O performance of
databases and similar programs, and therefore a reason to have a
"flags" argument to these new syscalls, even if no flags are defined
at first.

-- Jamie

2005-12-20 17:26:24

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [RFC][PATCH] New iovec support & VFS changes

On Tue, 2005-12-20 at 16:59 +0000, Jamie Lokier wrote:
> Badari Pulavarty wrote:
> > I was trying to add support for preadv()/pwritev() for threaded
> > databases. Currently the patch is in -mm tree.
> >
> > http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.15-
> > rc5/2.6.15-rc5-mm3/broken-out/support-for-preadv-pwritev.patch
> >
> > This needs a new set of system calls. Ulrich Drepper pointed out
> > that, instead of adding a system call for the limited functionality
> > it provides, why not we add new iovec interface as follows (offset-per-
> > segment) which provides greater functionality & flexibility.
> >
> > +struct niovec
> > +{
> > + void __user *iov_base;
> > + __kernel_size_t iov_len;
> > + __kernel_loff_t iov_off; /* NEW */
> > +};
>
> For a database, it's also helpful to know when an operation is going
> to block on I/O (i.e. because the data isn't cached, or write buffers
> full) and if that's going to happen, move it to another thread, or
> move other operations to another thread. This allows a program to
> continue to work on other things concurrently with I/O more
> effectively than thread pool guesswork.
>
> So if you add these new syscalls, it would be helpful to add a "flags"
> argument to each of them, and define one flag: "don't block on I/O".
> When the flag is set, the syscalls should do as much reading or
> writing as they can without blocking, and then return the count, or
> EAGAIN.
>
> (FreeBSD's sendfile() has an SF_NODISKIO flag which means this, and it
> is used in exactly that way: so a program can move the sendfile() to
> another thread iff that is necessary to avoid blocking the program.)
>
> There's also a case for making these into async I/O operations.
> However, if there is any possibility of async I/O blocking a task for
> a long time (which there is with Linux async I/O apparently), that is
> not half as useful as a flag to stop I/O when it would block, and let
> the program decide what to do.
>
> I mention this precisely because it's relevant to I/O performance of
> databases and similar programs, and therefore a reason to have a
> "flags" argument to these new syscalls, even if no flags are defined
> at first.

Wow !! More requirements :(

I was hoping for "well, nice but don't need it - drop it" kind of an
answer - so I can sleep better :)

Yes. flags can be added. But my main concern was, its going to change
all the VFS interfaces & helper routines. Is that okay ? This also
means that, its going to break out-of-tree filesystems (which I don't
care much).

Thanks,
Badari

2005-12-20 18:00:08

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC][PATCH] New iovec support & VFS changes

Badari Pulavarty wrote:

>I was trying to add support for preadv()/pwritev() for threaded
>databases. Currently the patch is in -mm tree.
>
>http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.15-
>rc5/2.6.15-rc5-mm3/broken-out/support-for-preadv-pwritev.patch
>
>This needs a new set of system calls. Ulrich Drepper pointed out
>that, instead of adding a system call for the limited functionality
>it provides, why not we add new iovec interface as follows (offset-per-
>segment) which provides greater functionality & flexibility.
>
>+struct niovec
>+{
>+ void __user *iov_base;
>+ __kernel_size_t iov_len;
>+ __kernel_loff_t iov_off; /* NEW */
>+};
>
>In order to support this, we need to change all the file_operations
>(readv/writev) and its helper functions to take this new structure.
>
>I took a stab at doing it and I want feedback on whether this is
>acceptable. All the patch does - is to make kernel use new structure,
>but the existing syscalls like readv()/writev() still deals with
>original one to keep the compatibility. (pipes and sockets need
>changing too - which I have not addressed yet).
>
>Is this the right approach ?
>
>
>
You can io_submit() a list of IO_CMD_PREAD[V]s and immediately
io_getevents() them. In addition to specifying different file offsets
you can mix reads and writes, mix file descriptors, and reap nonblocking
events quickly (by specifying a timeout of zero).

Sure, it's two syscalls instead of one, but it's much more flexibles,
and databases should be using aio anyway. Oh, and no kernel changes
needed, apart from merging vectored aio.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2005-12-20 18:08:05

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [RFC][PATCH] New iovec support & VFS changes

On Tue, 2005-12-20 at 20:00 +0200, Avi Kivity wrote:
> Badari Pulavarty wrote:
>
> >I was trying to add support for preadv()/pwritev() for threaded
> >databases. Currently the patch is in -mm tree.
> >
> >http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.15-
> >rc5/2.6.15-rc5-mm3/broken-out/support-for-preadv-pwritev.patch
> >
> >This needs a new set of system calls. Ulrich Drepper pointed out
> >that, instead of adding a system call for the limited functionality
> >it provides, why not we add new iovec interface as follows (offset-per-
> >segment) which provides greater functionality & flexibility.
> >
> >+struct niovec
> >+{
> >+ void __user *iov_base;
> >+ __kernel_size_t iov_len;
> >+ __kernel_loff_t iov_off; /* NEW */
> >+};
> >
> >In order to support this, we need to change all the file_operations
> >(readv/writev) and its helper functions to take this new structure.
> >
> >I took a stab at doing it and I want feedback on whether this is
> >acceptable. All the patch does - is to make kernel use new structure,
> >but the existing syscalls like readv()/writev() still deals with
> >original one to keep the compatibility. (pipes and sockets need
> >changing too - which I have not addressed yet).
> >
> >Is this the right approach ?
> >
> >
> >
> You can io_submit() a list of IO_CMD_PREAD[V]s and immediately
> io_getevents() them. In addition to specifying different file offsets
> you can mix reads and writes, mix file descriptors, and reap nonblocking
> events quickly (by specifying a timeout of zero).
>
> Sure, it's two syscalls instead of one, but it's much more flexibles,
> and databases should be using aio anyway. Oh, and no kernel changes
> needed, apart from merging vectored aio.


Yes. We discussed this also earlier. Using AIO is the alternative.
But using AIO is not simple as doing preadv()/pwritev() for the
applications doesn't care about using AIO. AIO needs extra coding
to setup context, iocb, submits and getevents etc..

And also, inside the kernel - AIO requests go through lots of
code/routines -- before coming to ->aio_read() -- which I was
planning to avoid by having a direct syscall to do preadv/pwritev.

BTW, we still don't have vectored AIO support in the kernel.
Zack is working on it - which would add another set of
file operations aio_readv/aio_writev.

Thanks,
Badari

2005-12-20 18:20:37

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC][PATCH] New iovec support & VFS changes

Badari Pulavarty wrote:

>On Tue, 2005-12-20 at 20:00 +0200, Avi Kivity wrote:
>
>
>>You can io_submit() a list of IO_CMD_PREAD[V]s and immediately
>>io_getevents() them. In addition to specifying different file offsets
>>you can mix reads and writes, mix file descriptors, and reap nonblocking
>>events quickly (by specifying a timeout of zero).
>>
>>Sure, it's two syscalls instead of one, but it's much more flexibles,
>>and databases should be using aio anyway. Oh, and no kernel changes
>>needed, apart from merging vectored aio.
>>
>>
>
>
>Yes. We discussed this also earlier. Using AIO is the alternative.
>But using AIO is not simple as doing preadv()/pwritev() for the
>applications doesn't care about using AIO. AIO needs extra coding
>to setup context, iocb, submits and getevents etc..
>
>
Possibly a library can do that (placing the context in thread local
storage), but I see your point.

>And also, inside the kernel - AIO requests go through lots of
>code/routines -- before coming to ->aio_read() -- which I was
>planning to avoid by having a direct syscall to do preadv/pwritev.
>
>
I'd be surprised if this isn't dominated by the cost of serving the
request, even from cache.

If we can persuade the aio maintainers to add some flag to io_submit()
which makes the request synchronous, it would reduce the overhead
somewhat, but it is against the spirit of aio.

>BTW, we still don't have vectored AIO support in the kernel.
>Zack is working on it - which would add another set of
>file operations aio_readv/aio_writev.
>
>
Hopefully they will supercede aio_read/aio_write?

BTW, don't databases like using many many files for their backing store?
The ability to write to many fds in one call should be attractive to them.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.