2009-01-16 16:46:50

by Gerd Hoffmann

[permalink] [raw]
Subject: [PATCH v6 0/5] Add preadv & pwritev system calls.

Hi folks,

Next round of the preadv & pwritev patch series. Had no review comments
to fix. That means it is ready to be merged, right?

Changes:
- compat_sys_{read,write}v bugfix patch dropped (merged).
- rebase to latest git, adapt to CVE-2009-0029 changes.

How to proceed now? Is there a syscall maintainer where I could queue
up the patches? If not, anyone (akpm?) willng to pick this up? Should
I try to send to Linus directly?

What is the usual way to handle the arch-specific syscall windup? I'd
prefer to leave that to the arch maintainers as they know best what
needs to be done, is that ok? Right now only x86 (/me) and mips (patch
from Ralf Baechle) is covered ...

cheers,
Gerd


2009-01-16 16:46:26

by Gerd Hoffmann

[permalink] [raw]
Subject: [PATCH 2/5] create compat_writev()

Factor out some code from compat_sys_writev() which can be shared with the
upcoming compat_sys_pwritev().

Signed-off-by: Gerd Hoffmann <[email protected]>
---
fs/compat.c | 25 +++++++++++++++++--------
1 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 63738cc..9f20d10 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1202,15 +1202,11 @@ compat_sys_readv(unsigned long fd, const struct compat_iovec __user *vec, unsign
return ret;
}

-asmlinkage ssize_t
-compat_sys_writev(unsigned long fd, const struct compat_iovec __user *vec, unsigned long vlen)
+static size_t compat_writev(struct file *file, const struct compat_iovec __user *vec,
+ unsigned long vlen, loff_t *pos)
{
- struct file *file;
ssize_t ret = -EBADF;

- file = fget(fd);
- if (!file)
- return -EBADF;
if (!(file->f_mode & FMODE_WRITE))
goto out;

@@ -1218,16 +1214,29 @@ compat_sys_writev(unsigned long fd, const struct compat_iovec __user *vec, unsig
if (!file->f_op || (!file->f_op->aio_write && !file->f_op->write))
goto out;

- ret = compat_do_readv_writev(WRITE, file, vec, vlen, &file->f_pos);
+ ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos);

out:
if (ret > 0)
add_wchar(current, ret);
inc_syscw(current);
- fput(file);
return ret;
}

+asmlinkage ssize_t
+compat_sys_writev(unsigned long fd, const struct compat_iovec __user *vec, unsigned long vlen)
+{
+ struct file *file;
+ ssize_t ret;
+
+ file = fget(fd);
+ if (!file)
+ return -EBADF;
+ ret = compat_writev(file, vec, vlen, &file->f_pos);
+ fput(file);
+ return ret;
+}
+
asmlinkage long
compat_sys_vmsplice(int fd, const struct compat_iovec __user *iov32,
unsigned int nr_segs, unsigned int flags)
--
1.6.1

2009-01-16 16:47:27

by Gerd Hoffmann

[permalink] [raw]
Subject: [PATCH 1/5] create compat_readv()

Factor out some code from compat_sys_readv() which can be shared with the
upcoming compat_sys_preadv().

Signed-off-by: Gerd Hoffmann <[email protected]>
---
fs/compat.c | 24 ++++++++++++++++--------
1 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 65a070e..63738cc 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1167,16 +1167,11 @@ out:
return ret;
}

-asmlinkage ssize_t
-compat_sys_readv(unsigned long fd, const struct compat_iovec __user *vec, unsigned long vlen)
+static size_t compat_readv(struct file *file, const struct compat_iovec __user *vec,
+ unsigned long vlen, loff_t *pos)
{
- struct file *file;
ssize_t ret = -EBADF;

- file = fget(fd);
- if (!file)
- return -EBADF;
-
if (!(file->f_mode & FMODE_READ))
goto out;

@@ -1184,12 +1179,25 @@ compat_sys_readv(unsigned long fd, const struct compat_iovec __user *vec, unsign
if (!file->f_op || (!file->f_op->aio_read && !file->f_op->read))
goto out;

- ret = compat_do_readv_writev(READ, file, vec, vlen, &file->f_pos);
+ ret = compat_do_readv_writev(READ, file, vec, vlen, pos);

out:
if (ret > 0)
add_rchar(current, ret);
inc_syscr(current);
+ return ret;
+}
+
+asmlinkage ssize_t
+compat_sys_readv(unsigned long fd, const struct compat_iovec __user *vec, unsigned long vlen)
+{
+ struct file *file;
+ ssize_t ret;
+
+ file = fget(fd);
+ if (!file)
+ return -EBADF;
+ ret = compat_readv(file, vec, vlen, &file->f_pos);
fput(file);
return ret;
}
--
1.6.1

2009-01-16 16:48:00

by Gerd Hoffmann

[permalink] [raw]
Subject: [PATCH 3/5] Add preadv and pwritev system calls.

This patch adds preadv and pwritev system calls. These syscalls are a
pretty straightforward combination of pread and readv (same for write).
They are quite useful for doing vectored I/O in threaded applications.
Using lseek+readv instead opens race windows you'll have to plug with
locking.

Other systems have such system calls too, for example NetBSD, check
here: http://www.daemon-systems.org/man/preadv.2.html

The application-visible interface provided by glibc should look like
this to be compatible to the existing implementations in the *BSD family:

ssize_t preadv(int d, const struct iovec *iov, int iovcnt, off_t offset);
ssize_t pwritev(int d, const struct iovec *iov, int iovcnt, off_t offset);

This prototype has one problem though: On 32bit archs is the (64bit)
offset argument unaligned, which the syscall ABI of several archs
doesn't allow to do. At least s390 needs a wrapper in glibc to handle
this. As we'll need a wrappers in glibc anyway I've decided to push
problem to glibc entriely and use a syscall prototype which works
without arch-specific wrappers inside the kernel: The offset argument
is explicitly splitted into two 32bit values.

The patch sports the actual system call implementation and the windup in
the x86 system call tables. Other archs follow as separate patches.

Signed-off-by: Gerd Hoffmann <[email protected]>
---
arch/x86/ia32/ia32entry.S | 2 +
arch/x86/include/asm/unistd_32.h | 2 +
arch/x86/include/asm/unistd_64.h | 4 +++
arch/x86/kernel/syscall_table_32.S | 2 +
fs/compat.c | 36 ++++++++++++++++++++++++++
fs/read_write.c | 50 ++++++++++++++++++++++++++++++++++++
include/linux/compat.h | 6 ++++
include/linux/syscalls.h | 4 +++
8 files changed, 106 insertions(+), 0 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 256b00b..9a8501b 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -826,4 +826,6 @@ ia32_sys_call_table:
.quad sys_dup3 /* 330 */
.quad sys_pipe2
.quad sys_inotify_init1
+ .quad compat_sys_preadv
+ .quad compat_sys_pwritev
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index f2bba78..6e72d74 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -338,6 +338,8 @@
#define __NR_dup3 330
#define __NR_pipe2 331
#define __NR_inotify_init1 332
+#define __NR_preadv 333
+#define __NR_pwritev 334

#ifdef __KERNEL__

diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index d2e415e..f818294 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -653,6 +653,10 @@ __SYSCALL(__NR_dup3, sys_dup3)
__SYSCALL(__NR_pipe2, sys_pipe2)
#define __NR_inotify_init1 294
__SYSCALL(__NR_inotify_init1, sys_inotify_init1)
+#define __NR_preadv 295
+__SYSCALL(__NR_preadv, sys_preadv)
+#define __NR_pwritev 296
+__SYSCALL(__NR_pwritev, sys_pwritev)


#ifndef __NO_STUBS
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index e2e86a0..106204c 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,5 @@ ENTRY(sys_call_table)
.long sys_dup3 /* 330 */
.long sys_pipe2
.long sys_inotify_init1
+ .long sys_preadv
+ .long sys_pwritev
diff --git a/fs/compat.c b/fs/compat.c
index 9f20d10..20fec0e 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1202,6 +1202,24 @@ compat_sys_readv(unsigned long fd, const struct compat_iovec __user *vec, unsign
return ret;
}

+asmlinkage ssize_t
+compat_sys_preadv(unsigned long fd, const struct compat_iovec __user *vec,
+ unsigned long vlen, u32 pos_high, u32 pos_low)
+{
+ loff_t pos = ((loff_t)pos_high << 32) | pos_low;
+ struct file *file;
+ ssize_t ret;
+
+ if (pos < 0)
+ return -EINVAL;
+ file = fget(fd);
+ if (!file)
+ return -EBADF;
+ ret = compat_readv(file, vec, vlen, &pos);
+ fput(file);
+ return ret;
+}
+
static size_t compat_writev(struct file *file, const struct compat_iovec __user *vec,
unsigned long vlen, loff_t *pos)
{
@@ -1237,6 +1255,24 @@ compat_sys_writev(unsigned long fd, const struct compat_iovec __user *vec, unsig
return ret;
}

+asmlinkage ssize_t
+compat_sys_pwritev(unsigned long fd, const struct compat_iovec __user *vec,
+ unsigned long vlen, u32 pos_high, u32 pos_low)
+{
+ loff_t pos = ((loff_t)pos_high << 32) | pos_low;
+ struct file *file;
+ ssize_t ret;
+
+ if (pos < 0)
+ return -EINVAL;
+ file = fget(fd);
+ if (!file)
+ return -EBADF;
+ ret = compat_writev(file, vec, vlen, &pos);
+ fput(file);
+ return ret;
+}
+
asmlinkage long
compat_sys_vmsplice(int fd, const struct compat_iovec __user *iov32,
unsigned int nr_segs, unsigned int flags)
diff --git a/fs/read_write.c b/fs/read_write.c
index 400fe81..39de95c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -731,6 +731,56 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
return ret;
}

+SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen, u32, pos_high, u32, pos_low)
+{
+ loff_t pos = ((loff_t)pos_high << 32) | pos_low;
+ struct file *file;
+ ssize_t ret = -EBADF;
+ int fput_needed;
+
+ if (pos < 0)
+ return -EINVAL;
+
+ file = fget_light(fd, &fput_needed);
+ if (file) {
+ ret = -ESPIPE;
+ if (file->f_mode & FMODE_PREAD)
+ ret = vfs_readv(file, vec, vlen, &pos);
+ fput_light(file, fput_needed);
+ }
+
+ if (ret > 0)
+ add_rchar(current, ret);
+ inc_syscr(current);
+ return ret;
+}
+
+SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen, u32, pos_high, u32, pos_low)
+{
+ loff_t pos = ((loff_t)pos_high << 32) | pos_low;
+ struct file *file;
+ ssize_t ret = -EBADF;
+ int fput_needed;
+
+ if (pos < 0)
+ return -EINVAL;
+
+ file = fget_light(fd, &fput_needed);
+ if (file) {
+ ret = -ESPIPE;
+ if (file->f_mode & FMODE_PWRITE)
+ ret = vfs_writev(file, vec, vlen, &pos);
+ fput_light(file, fput_needed);
+ }
+
+ if (ret > 0)
+ add_wchar(current, ret);
+ inc_syscw(current);
+ return ret;
+}
+
static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
size_t count, loff_t max)
{
diff --git a/include/linux/compat.h b/include/linux/compat.h
index 3fd2194..79dba49 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -183,6 +183,12 @@ asmlinkage ssize_t compat_sys_readv(unsigned long fd,
const struct compat_iovec __user *vec, unsigned long vlen);
asmlinkage ssize_t compat_sys_writev(unsigned long fd,
const struct compat_iovec __user *vec, unsigned long vlen);
+asmlinkage ssize_t compat_sys_preadv(unsigned long fd,
+ const struct compat_iovec __user *vec,
+ unsigned long vlen, u32 pos_high, u32 pos_low);
+asmlinkage ssize_t compat_sys_pwritev(unsigned long fd,
+ const struct compat_iovec __user *vec,
+ unsigned long vlen, u32 pos_high, u32 pos_low);

int compat_do_execve(char * filename, compat_uptr_t __user *argv,
compat_uptr_t __user *envp, struct pt_regs * regs);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 16875f8..333377e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -456,6 +456,10 @@ asmlinkage long sys_pread64(unsigned int fd, char __user *buf,
size_t count, loff_t pos);
asmlinkage long sys_pwrite64(unsigned int fd, const char __user *buf,
size_t count, loff_t pos);
+asmlinkage long sys_preadv(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, u32 pos_high, u32 pos_low);
+asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, u32 pos_high, u32 pos_low);
asmlinkage long sys_getcwd(char __user *buf, unsigned long size);
asmlinkage long sys_mkdir(const char __user *pathname, int mode);
asmlinkage long sys_chdir(const char __user *filename);
--
1.6.1

2009-01-16 16:48:27

by Gerd Hoffmann

[permalink] [raw]
Subject: [PATCH 5/5] switch compat readv/preadv/writev/pwritev from fget to fget_light


Signed-off-by: Gerd Hoffmann <[email protected]>
---
fs/compat.c | 20 ++++++++++++--------
1 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 20fec0e..90a10e9 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1192,13 +1192,14 @@ asmlinkage ssize_t
compat_sys_readv(unsigned long fd, const struct compat_iovec __user *vec, unsigned long vlen)
{
struct file *file;
+ int fput_needed;
ssize_t ret;

- file = fget(fd);
+ file = fget_light(fd, &fput_needed);
if (!file)
return -EBADF;
ret = compat_readv(file, vec, vlen, &file->f_pos);
- fput(file);
+ fput_light(file, fput_needed);
return ret;
}

@@ -1208,15 +1209,16 @@ compat_sys_preadv(unsigned long fd, const struct compat_iovec __user *vec,
{
loff_t pos = ((loff_t)pos_high << 32) | pos_low;
struct file *file;
+ int fput_needed;
ssize_t ret;

if (pos < 0)
return -EINVAL;
- file = fget(fd);
+ file = fget_light(fd, &fput_needed);
if (!file)
return -EBADF;
ret = compat_readv(file, vec, vlen, &pos);
- fput(file);
+ fput_light(file, fput_needed);
return ret;
}

@@ -1245,13 +1247,14 @@ asmlinkage ssize_t
compat_sys_writev(unsigned long fd, const struct compat_iovec __user *vec, unsigned long vlen)
{
struct file *file;
+ int fput_needed;
ssize_t ret;

- file = fget(fd);
+ file = fget_light(fd, &fput_needed);
if (!file)
return -EBADF;
ret = compat_writev(file, vec, vlen, &file->f_pos);
- fput(file);
+ fput_light(file, fput_needed);
return ret;
}

@@ -1261,15 +1264,16 @@ compat_sys_pwritev(unsigned long fd, const struct compat_iovec __user *vec,
{
loff_t pos = ((loff_t)pos_high << 32) | pos_low;
struct file *file;
+ int fput_needed;
ssize_t ret;

if (pos < 0)
return -EINVAL;
- file = fget(fd);
+ file = fget_light(fd, &fput_needed);
if (!file)
return -EBADF;
ret = compat_writev(file, vec, vlen, &pos);
- fput(file);
+ fput_light(file, fput_needed);
return ret;
}

--
1.6.1

2009-01-16 16:48:51

by Gerd Hoffmann

[permalink] [raw]
Subject: [PATCH 4/5] MIPS: Add preadv(2) and pwritev(2) syscalls.

From: Ralf Baechle <[email protected]>

From: Ralf Baechle <[email protected]>
Signed-off-by: Ralf Baechle <[email protected]>
Signed-off-by: Gerd Hoffmann <[email protected]>
---
arch/mips/include/asm/unistd.h | 18 ++++++++++++------
arch/mips/kernel/scall32-o32.S | 2 ++
arch/mips/kernel/scall64-64.S | 2 ++
arch/mips/kernel/scall64-n32.S | 2 ++
arch/mips/kernel/scall64-o32.S | 2 ++
5 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/mips/include/asm/unistd.h b/arch/mips/include/asm/unistd.h
index a73e153..4000501 100644
--- a/arch/mips/include/asm/unistd.h
+++ b/arch/mips/include/asm/unistd.h
@@ -350,16 +350,18 @@
#define __NR_dup3 (__NR_Linux + 327)
#define __NR_pipe2 (__NR_Linux + 328)
#define __NR_inotify_init1 (__NR_Linux + 329)
+#define __NR_preadv (__NR_Linux + 330)
+#define __NR_pwritev (__NR_Linux + 331)

/*
* Offset of the last Linux o32 flavoured syscall
*/
-#define __NR_Linux_syscalls 329
+#define __NR_Linux_syscalls 331

#endif /* _MIPS_SIM == _MIPS_SIM_ABI32 */

#define __NR_O32_Linux 4000
-#define __NR_O32_Linux_syscalls 329
+#define __NR_O32_Linux_syscalls 331

#if _MIPS_SIM == _MIPS_SIM_ABI64

@@ -656,16 +658,18 @@
#define __NR_dup3 (__NR_Linux + 286)
#define __NR_pipe2 (__NR_Linux + 287)
#define __NR_inotify_init1 (__NR_Linux + 288)
+#define __NR_preadv (__NR_Linux + 289)
+#define __NR_pwritev (__NR_Linux + 290)

/*
* Offset of the last Linux 64-bit flavoured syscall
*/
-#define __NR_Linux_syscalls 288
+#define __NR_Linux_syscalls 290

#endif /* _MIPS_SIM == _MIPS_SIM_ABI64 */

#define __NR_64_Linux 5000
-#define __NR_64_Linux_syscalls 288
+#define __NR_64_Linux_syscalls 290

#if _MIPS_SIM == _MIPS_SIM_NABI32

@@ -966,16 +970,18 @@
#define __NR_dup3 (__NR_Linux + 290)
#define __NR_pipe2 (__NR_Linux + 291)
#define __NR_inotify_init1 (__NR_Linux + 292)
+#define __NR_preadv (__NR_Linux + 293)
+#define __NR_pwritev (__NR_Linux + 294)

/*
* Offset of the last N32 flavoured syscall
*/
-#define __NR_Linux_syscalls 292
+#define __NR_Linux_syscalls 294

#endif /* _MIPS_SIM == _MIPS_SIM_NABI32 */

#define __NR_N32_Linux 6000
-#define __NR_N32_Linux_syscalls 292
+#define __NR_N32_Linux_syscalls 294

#ifdef __KERNEL__

diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
index 51d1ba4..0198a9c 100644
--- a/arch/mips/kernel/scall32-o32.S
+++ b/arch/mips/kernel/scall32-o32.S
@@ -650,6 +650,8 @@ einval: li v0, -ENOSYS
sys sys_dup3 3
sys sys_pipe2 2
sys sys_inotify_init1 1
+ sys sys_preadv 6 /* 4330 */
+ sys sys_pwritev 6
.endm

/* We pre-compute the number of _instruction_ bytes needed to
diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
index a9e1716..217e3ce 100644
--- a/arch/mips/kernel/scall64-64.S
+++ b/arch/mips/kernel/scall64-64.S
@@ -487,4 +487,6 @@ sys_call_table:
PTR sys_dup3
PTR sys_pipe2
PTR sys_inotify_init1
+ PTR sys_preadv
+ PTR sys_pwritev /* 5390 */
.size sys_call_table,.-sys_call_table
diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
index 30f3b63..f340963 100644
--- a/arch/mips/kernel/scall64-n32.S
+++ b/arch/mips/kernel/scall64-n32.S
@@ -413,4 +413,6 @@ EXPORT(sysn32_call_table)
PTR sys_dup3 /* 5290 */
PTR sys_pipe2
PTR sys_inotify_init1
+ PTR sys_preadv
+ PTR sys_pwritev
.size sysn32_call_table,.-sysn32_call_table
diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
index fefef4a..b1d281a 100644
--- a/arch/mips/kernel/scall64-o32.S
+++ b/arch/mips/kernel/scall64-o32.S
@@ -533,4 +533,6 @@ sys_call_table:
PTR sys_dup3
PTR sys_pipe2
PTR sys_inotify_init1
+ PTR compat_sys_preadv /* 4330 */
+ PTR compat_sys_pwritev
.size sys_call_table,.-sys_call_table
--
1.6.1

2009-01-16 16:53:44

by Michael Kerrisk

[permalink] [raw]
Subject: Re: [PATCH v6 0/5] Add preadv & pwritev system calls.

Gerd,

On Sat, Jan 17, 2009 at 5:45 AM, Gerd Hoffmann <[email protected]> wrote:
> Hi folks,
>
> Next round of the preadv & pwritev patch series.

Have you somewhere along the way posted some userland example/test
programs for these syscalls? If not, could you provide some?

Cheers,

Michael

> Had no review comments
> to fix. That means it is ready to be merged, right?
>
> Changes:
> - compat_sys_{read,write}v bugfix patch dropped (merged).
> - rebase to latest git, adapt to CVE-2009-0029 changes.
>
> How to proceed now? Is there a syscall maintainer where I could queue
> up the patches? If not, anyone (akpm?) willng to pick this up? Should
> I try to send to Linus directly?
>
> What is the usual way to handle the arch-specific syscall windup? I'd
> prefer to leave that to the arch maintainers as they know best what
> needs to be done, is that ok? Right now only x86 (/me) and mips (patch
> from Ralf Baechle) is covered ...
>
> cheers,
> Gerd
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
git://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
man-pages online: http://www.kernel.org/doc/man-pages/online_pages.html
Found a bug? http://www.kernel.org/doc/man-pages/reporting_bugs.html

2009-01-16 17:28:24

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 2/5] create compat_writev()

On Friday 16 January 2009, Gerd Hoffmann wrote:
> +asmlinkage ssize_t
> +compat_sys_writev(unsigned long fd, const struct compat_iovec __user *vec, unsigned long vlen)
> +{
> +???????struct file *file;
> +???????ssize_t ret;
> +
> +???????file = fget(fd);
> +???????if (!file)
> +???????????????return -EBADF;
> +???????ret = compat_writev(file, vec, vlen, &file->f_pos);
> + ? ? ? ?fput(file);
> + ? ? ? ?return ret;
> +}

This one still looks whitespace damaged, did you run the latest version
through checkpatch.pl?

Arnd <><

2009-01-16 17:34:18

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 3/5] Add preadv and pwritev system calls.

On Friday 16 January 2009, Gerd Hoffmann wrote:
> +asmlinkage ssize_t compat_sys_preadv(unsigned long fd,
> +???????????????const struct compat_iovec __user *vec,
> +???????????????unsigned long vlen, u32 pos_high, u32 pos_low);
> +asmlinkage ssize_t compat_sys_pwritev(unsigned long fd,
> +???????????????const struct compat_iovec __user *vec,
> +???????????????unsigned long vlen, u32 pos_high, u32 pos_low);
> ?
> ?int compat_do_execve(char * filename, compat_uptr_t __user *argv,
> ???????? ? ? ? ?compat_uptr_t __user *envp, struct pt_regs * regs);
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 16875f8..333377e 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -456,6 +456,10 @@ asmlinkage long sys_pread64(unsigned int fd, char __user *buf,
> ???????????????????????? ? ?size_t count, loff_t pos);
> ?asmlinkage long sys_pwrite64(unsigned int fd, const char __user *buf,
> ???????????????????????? ? ? size_t count, loff_t pos);
> +asmlinkage long sys_preadv(unsigned long fd, const struct iovec __user *vec,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long vlen, u32 pos_high, u32 pos_low);
> +asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned long vlen, u32 pos_high, u32 pos_low);

Conventionally, the 'fd' argument has type 'int', not 'unsigned long', but you
evidently copied this from readv/writev, so you can't really be blamed for it.
Not sure what the right thing to do here is.

Arnd <><

2009-01-16 17:53:21

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH v6 0/5] Add preadv & pwritev system calls.

On Friday 16 January 2009, Gerd Hoffmann wrote:
> Next round of the preadv & pwritev patch series. ?Had no review comments
> to fix. ?That means it is ready to be merged, right?

In general, getting no feedback does not mean something's ready ;-)

Your patches look pretty polished though, and IIRC the only
real objection was to whether or not the syscalls should be
there in the first place. Did you get any feedback from Ulrich
Drepper as to whether he plans to add support to glibc?

> Changes:
> ?- compat_sys_{read,write}v bugfix patch dropped (merged).
> ?- rebase to latest git, adapt to CVE-2009-0029 changes.
>
> How to proceed now? ?Is there a syscall maintainer where I could queue
> up the patches? ?If not, anyone (akpm?) willng to pick this up? ?Should
> I try to send to Linus directly?
>
> What is the usual way to handle the arch-specific syscall windup? ?I'd
> prefer to leave that to the arch maintainers as they know best what
> needs to be done, is that ok? ?Right now only x86 (/me) and mips (patch
> from Ralf Baechle) is covered ...

I'd say get it into linux-next as a git tree, then let the arch maintainers
send you the missing patches to hook up the syscalls so that it can go
in as one chunk.

Have you done the glibc patch already? You probably also need to provide
an alternative user space implementation based on a pread/pwrite loop for
older kernels.

Arnd <><

2009-01-16 19:20:50

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH v6 0/5] Add preadv & pwritev system calls.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Arnd Bergmann wrote:
> Did you get any feedback from Ulrich
> Drepper as to whether he plans to add support to glibc?

If they are in the kernel there is no reason not to export them from
glibc. But I have a general comment about all kinds of read syscalls.
If think they have been misdesigned from day one and if we are going to
add new ones we might want to fix them.

The problem is that they don't allow for zero-copy operations in enough
cases. The kernel is not free to store the data wherever it wants even
if the userlevel code is fine with that. Ideally the program would tell
the kernel that it is fine with any addressable address and provides a
buffer for the kernel to use in case zero-copy into that buffer is
possible or no zero-copy is possible at all. An interface could look
like this:

ssize_t readz (int fd, void *buf, size_t len, void **res)

(and accordingly for similar calls). The application will then use the
pointer stored at the address pointed to by the fourth parameter instead
of unconditionally using the buffer pointed to by the second parameter.
For res==NULL the semantics could be the same as the normal read().

This is not the only interface needed to make this work. Somehow the
memory used for the zero-copy buffers has to be administrated. At the
very least an interface to mark the buffer returned by readz() as unused
is needed.

There is a lot to think about before this can be done (something I
started back in my 2006 OLS paper [1]). But I wonder whether it's worth
preparing for it and not add yet more interfaces which aren't ready for
this type of I/O.


[1] http://people.redhat.com/drepper/newni.pdf

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAklw3bIACgkQ2ijCOnn/RHSutgCgvIZki4gZfuLzwCOGkZqOf97v
1LYAn3fQj0C8CabsfvaYonFTZQ3oUtSn
=EDYF
-----END PGP SIGNATURE-----

2009-01-19 14:21:15

by Gerd Hoffmann

[permalink] [raw]
Subject: Re: [PATCH v6 0/5] Add preadv & pwritev system calls.

Ulrich Drepper wrote:
> If they are in the kernel there is no reason not to export them from
> glibc.

Great.

> But I have a general comment about all kinds of read syscalls.
> If think they have been misdesigned from day one and if we are going to
> add new ones we might want to fix them.
>
> The problem is that they don't allow for zero-copy operations in enough
> cases. The kernel is not free to store the data wherever it wants even
> if the userlevel code is fine with that.

[ ... more text snipped ... ]

I do see the point in adding a interface like this ...

> ssize_t readz (int fd, void *buf, size_t len, void **res)

... to help the kernel do zero-copy I/O.

I think system calls for vector I/O are *not* the right place for that
though. Usually applications use vectored I/O because they *do* care
about the place the data is stored, because vectored I/O allows them to
avoid copying data within the application.

cheers,
Gerd

2009-01-21 00:19:24

by Petr Baudis

[permalink] [raw]
Subject: Re: [PATCH v6 0/5] Add preadv & pwritev system calls.

On Mon, Jan 19, 2009 at 03:20:11PM +0100, Gerd Hoffmann wrote:
> I do see the point in adding a interface like this ...
>
> > ssize_t readz (int fd, void *buf, size_t len, void **res)
>
> ... to help the kernel do zero-copy I/O.
>
> I think system calls for vector I/O are *not* the right place for that
> though. Usually applications use vectored I/O because they *do* care
> about the place the data is stored, because vectored I/O allows them to
> avoid copying data within the application.

Can you elaborate on this? An application would have to have quite a
contrived design if its pointers simply cannot be updated according
to what the kernel returns.

Then again, I'm not sure why wouldn't readv() actually be
zerocopy-ready. Just make sure you handle iov_base being NULL gracefully
now (EINVAL, with the remark that the kernel can write to the iovec
memory area in the future) and later the kernel can in that case set
iov_base to the buffer location?

--
Petr "Pasky" Baudis
The average, healthy, well-adjusted adult gets up at seven-thirty
in the morning feeling just terrible. -- Jean Kerr

2009-01-21 09:32:28

by Gerd Hoffmann

[permalink] [raw]
Subject: Re: [PATCH v6 0/5] Add preadv & pwritev system calls.

Petr Baudis wrote:
> On Mon, Jan 19, 2009 at 03:20:11PM +0100, Gerd Hoffmann wrote:
>> I do see the point in adding a interface like this ...
>>
>>> ssize_t readz (int fd, void *buf, size_t len, void **res)
>> ... to help the kernel do zero-copy I/O.
>>
>> I think system calls for vector I/O are *not* the right place for that
>> though. Usually applications use vectored I/O because they *do* care
>> about the place the data is stored, because vectored I/O allows them to
>> avoid copying data within the application.
>
> Can you elaborate on this? An application would have to have quite a
> contrived design if its pointers simply cannot be updated according
> to what the kernel returns.

Well. The "just update the pointers" argument is bogous. It simply
doesn't work in general for a number of reasons:

First, it assumes that there is a pointer you can update in the first place.

Second, it assumes you can easily update all pointer instances pointing
to your data. Finding all places which need updating might be
non-trivial or impossible. This tends to be true for refcounted data
structures.

Third, updating pointers can have extra costs, such as locking
requirements in threaded applications.

Of course there are also tons of applications where it is absolutely no
problem to just update the pointer.

But I think that applications using the vectored I/O very likely belong
to the group which can't. Otherwise there would be little reason to use
the vectored API in the first place.

> Then again, I'm not sure why wouldn't readv() actually be
> zerocopy-ready. Just make sure you handle iov_base being NULL gracefully
> now (EINVAL, with the remark that the kernel can write to the iovec
> memory area in the future) and later the kernel can in that case set
> iov_base to the buffer location?

You could (assuming you don't break POSIX along the way). I don't think
apps would use such an interface though for the reasons outlined above.
The readz() prototype by Ullrich (without iovecs being involved) looks
much more reasonable to me.

cheers,
Gerd