2014-11-10 16:40:41

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

This patchset introduces the ability to perform a non-blocking read from
regular files in buffered IO mode. The read only succeeds for data that is
already in the page cache; otherwise it fails with -EAGAIN.

It does this by introducing two new syscalls, preadv2/pwritev2. These
new syscalls behave like the network sendmsg/recvmsg syscalls in that they
accept an extra flags argument (RWF_NONBLOCK).

It's a very common pattern today (samba, libuv, etc.) to use a large
threadpool to perform buffered IO operations. The work is submitted from
another thread that performs network IO and epoll, or from other threads that
perform CPU work. This leads to increased latency for processing, especially
for data that's already cached in the page cache.

With the new interface, applications will now be able to fetch the data in
their network / cpu bound thread(s) and only defer to a threadpool if it's not
there. In our own application (VLDB) we've observed a decrease in latency for
"fast" requests by avoiding unnecessary queuing and the swapping out of
current tasks in IO bound worker threads.

Version 6 highlight:
- Compat syscall flag checks, per Jeff.
- Minor stylistic suggestions.

Version 5 highlight:
- XFS support for RWF_NONBLOCK, from Christoph.
- RWF_DSYNC flag and support for pwritev2, from Christoph.
- Implemented compat syscalls, per Jeff.
- Missing nfs, ceph changes from older patchset.

Version 4 highlight:
- Updated for 3.18-rc1.
- Performance data from our application.
- First stab at a man page with Jeff's help (patch posted in-reply to this).

RFC Version 3 highlights:
- Down to 2 syscalls from 4; can use fp or the argument position.
- RWF_NONBLOCK flag value is not the same as O_NONBLOCK, per Jeff.

RFC Version 2 highlights:
- Put the flags argument into kiocb (less noise), per Al Viro
- O_DIRECT checking early in the process, per Jeff Moyer
- Resolved duplicate (c&p) code in syscall code, per Jeff
- Included perf data in thread cover letter, per Jeff
- Created a new flag (not O_NONBLOCK) for readv2, per Jeff


Some perf data generated using fio, comparing the posix aio engine to a
version of the posix aio engine that attempts to perform "fast" reads before
submitting the operations to the queue. This workload is on an ext4 partition
on raid0 (test / build-rig), simulating our database access pattern using
16kb read accesses. Our database uses a home-spun posix aio like queue (samba
does the same thing.)

f1: ~73% rand read over mostly cached data (zipf med-size dataset)
f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
f3: ~9% seq-read over large dataset

before:

f1:
bw (KB /s): min= 11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
f2:
bw (KB /s): min= 2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
lat (msec) : >=2000=4.33%
f3:
bw (KB /s): min= 0, max=265568, per=99.95%, avg=174575.10,
stdev=34526.89
lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
total:
READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
mint=600001msec, maxt=600113msec

after (with fast read using preadv2 before submit):

f1:
bw (KB /s): min= 3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
lat (usec) : 2=70.63%, 4=0.01%
lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
f2:
bw (KB /s): min= 2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
lat (msec) : >=2000=9.99%
f3:
bw (KB /s): min= 1, max=245448, per=100.00%, avg=177366.50,
stdev=35995.60
lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
lat (msec) : 100=0.05%, 250=0.02%
total:
READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
mint=600020msec, maxt=600178msec

Interpreting the results, you can see that total bandwidth stays the same but
overall request latency is decreased in the f1 (random, mostly cached) and f3
(sequential) workloads. There is a slight bump in latency for f2 since it's
random data that's unlikely to be cached, but we're always trying the "fast
read" first.

In our application we have started keeping track of "fast read" hits/misses,
and for files / requests that have a low hit ratio we don't do "fast reads",
mostly getting rid of the extra latency in the uncached cases. In our real
world workload we were able to reduce average response time by 20 to 30%
(depending on the amount of IO done by the request).

I've performed other benchmarks and I have not observed any perf regressions
in any of the normal (old) code paths.

I have co-developed these changes with Christoph Hellwig.

Christoph Hellwig (3):
xfs: add RWF_NONBLOCK support
fs: pass iocb to generic_write_sync
fs: add a flag for per-operation O_DSYNC semantics

Milosz Tanski (4):
vfs: Prepare for adding a new preadv/pwritev with user flags.
vfs: Define new syscalls preadv2,pwritev2
x86: wire up preadv2 and pwritev2
vfs: RWF_NONBLOCK flag for preadv2

arch/x86/syscalls/syscall_32.tbl | 2 +
arch/x86/syscalls/syscall_64.tbl | 2 +
drivers/target/target_core_file.c | 6 +-
fs/block_dev.c | 8 +-
fs/btrfs/file.c | 7 +-
fs/ceph/file.c | 6 +-
fs/cifs/file.c | 14 +--
fs/direct-io.c | 8 +-
fs/ext4/file.c | 8 +-
fs/fuse/file.c | 2 +
fs/gfs2/file.c | 9 +-
fs/nfs/file.c | 15 ++-
fs/nfsd/vfs.c | 4 +-
fs/ntfs/file.c | 8 +-
fs/ocfs2/file.c | 12 +-
fs/pipe.c | 3 +-
fs/read_write.c | 233 +++++++++++++++++++++++++++++---------
fs/splice.c | 2 +-
fs/udf/file.c | 11 +-
fs/xfs/xfs_file.c | 36 ++++--
include/linux/aio.h | 2 +
include/linux/compat.h | 6 +
include/linux/fs.h | 16 ++-
include/linux/syscalls.h | 6 +
include/uapi/asm-generic/unistd.h | 6 +-
mm/filemap.c | 56 +++++++--
mm/shmem.c | 4 +
27 files changed, 343 insertions(+), 149 deletions(-)

--
1.9.1


2014-11-10 16:40:43

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v6 1/7] vfs: Prepare for adding a new preadv/pwritev with user flags.

Plumb the flags argument through the vfs code so it can be passed down to the
__generic_file_(read/write)_iter functions that do the actual work.

Signed-off-by: Milosz Tanski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jeff Moyer <[email protected]>
---
drivers/target/target_core_file.c | 6 +++---
fs/nfsd/vfs.c | 4 ++--
fs/read_write.c | 27 +++++++++++++++------------
fs/splice.c | 2 +-
include/linux/aio.h | 2 ++
include/linux/fs.h | 4 ++--
6 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
index 7d6cdda..58d9a6d 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -350,9 +350,9 @@ static int fd_do_rw(struct se_cmd *cmd, struct scatterlist *sgl,
set_fs(get_ds());

if (is_write)
- ret = vfs_writev(fd, &iov[0], sgl_nents, &pos);
+ ret = vfs_writev(fd, &iov[0], sgl_nents, &pos, 0);
else
- ret = vfs_readv(fd, &iov[0], sgl_nents, &pos);
+ ret = vfs_readv(fd, &iov[0], sgl_nents, &pos, 0);

set_fs(old_fs);

@@ -528,7 +528,7 @@ fd_execute_write_same(struct se_cmd *cmd)

old_fs = get_fs();
set_fs(get_ds());
- rc = vfs_writev(f, &iov[0], iov_num, &pos);
+ rc = vfs_writev(f, &iov[0], iov_num, &pos, 0);
set_fs(old_fs);

vfree(iov);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 989129e..ef01c78 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -872,7 +872,7 @@ __be32 nfsd_readv(struct file *file, loff_t offset, struct kvec *vec, int vlen,

oldfs = get_fs();
set_fs(KERNEL_DS);
- host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset);
+ host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset, 0);
set_fs(oldfs);
return nfsd_finish_read(file, count, host_err);
}
@@ -960,7 +960,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,

/* Write the data. */
oldfs = get_fs(); set_fs(KERNEL_DS);
- host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos);
+ host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos, 0);
set_fs(oldfs);
if (host_err < 0)
goto out_nfserr;
diff --git a/fs/read_write.c b/fs/read_write.c
index 7d9318c..94b2d34 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -653,7 +653,8 @@ unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to)
EXPORT_SYMBOL(iov_shorten);

static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iovec *iov,
- unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn)
+ unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn,
+ int flags)
{
struct kiocb kiocb;
struct iov_iter iter;
@@ -662,6 +663,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iove
init_sync_kiocb(&kiocb, filp);
kiocb.ki_pos = *ppos;
kiocb.ki_nbytes = len;
+ kiocb.ki_rwflags = flags;

iov_iter_init(&iter, rw, iov, nr_segs, len);
ret = fn(&kiocb, &iter);
@@ -800,7 +802,8 @@ out:

static ssize_t do_readv_writev(int type, struct file *file,
const struct iovec __user * uvector,
- unsigned long nr_segs, loff_t *pos)
+ unsigned long nr_segs, loff_t *pos,
+ int flags)
{
size_t tot_len;
struct iovec iovstack[UIO_FASTIOV];
@@ -834,7 +837,7 @@ static ssize_t do_readv_writev(int type, struct file *file,

if (iter_fn)
ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
- pos, iter_fn);
+ pos, iter_fn, flags);
else if (fnv)
ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
pos, fnv);
@@ -857,27 +860,27 @@ out:
}

ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
- unsigned long vlen, loff_t *pos)
+ unsigned long vlen, loff_t *pos, int flags)
{
if (!(file->f_mode & FMODE_READ))
return -EBADF;
if (!(file->f_mode & FMODE_CAN_READ))
return -EINVAL;

- return do_readv_writev(READ, file, vec, vlen, pos);
+ return do_readv_writev(READ, file, vec, vlen, pos, flags);
}

EXPORT_SYMBOL(vfs_readv);

ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
- unsigned long vlen, loff_t *pos)
+ unsigned long vlen, loff_t *pos, int flags)
{
if (!(file->f_mode & FMODE_WRITE))
return -EBADF;
if (!(file->f_mode & FMODE_CAN_WRITE))
return -EINVAL;

- return do_readv_writev(WRITE, file, vec, vlen, pos);
+ return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
}

EXPORT_SYMBOL(vfs_writev);
@@ -890,7 +893,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,

if (f.file) {
loff_t pos = file_pos_read(f.file);
- ret = vfs_readv(f.file, vec, vlen, &pos);
+ ret = vfs_readv(f.file, vec, vlen, &pos, 0);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
@@ -910,7 +913,7 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,

if (f.file) {
loff_t pos = file_pos_read(f.file);
- ret = vfs_writev(f.file, vec, vlen, &pos);
+ ret = vfs_writev(f.file, vec, vlen, &pos, 0);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
@@ -942,7 +945,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
if (f.file) {
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PREAD)
- ret = vfs_readv(f.file, vec, vlen, &pos);
+ ret = vfs_readv(f.file, vec, vlen, &pos, 0);
fdput(f);
}

@@ -966,7 +969,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
if (f.file) {
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PWRITE)
- ret = vfs_writev(f.file, vec, vlen, &pos);
+ ret = vfs_writev(f.file, vec, vlen, &pos, 0);
fdput(f);
}

@@ -1014,7 +1017,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,

if (iter_fn)
ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
- pos, iter_fn);
+ pos, iter_fn, 0);
else if (fnv)
ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
pos, fnv);
diff --git a/fs/splice.c b/fs/splice.c
index f5cb9ba..9591b9f 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -576,7 +576,7 @@ static ssize_t kernel_readv(struct file *file, const struct iovec *vec,
old_fs = get_fs();
set_fs(get_ds());
/* The cast to a user pointer is valid due to the set_fs() */
- res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos);
+ res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
set_fs(old_fs);

return res;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index d9c92da..9c1d499 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -52,6 +52,8 @@ struct kiocb {
* this is the underlying eventfd context to deliver events to.
*/
struct eventfd_ctx *ki_eventfd;
+
+ int ki_rwflags;
};

static inline bool is_sync_kiocb(struct kiocb *kiocb)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a957d43..9ed5711 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1538,9 +1538,9 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
- unsigned long, loff_t *);
+ unsigned long, loff_t *, int);
extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
- unsigned long, loff_t *);
+ unsigned long, loff_t *, int);

struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
--
1.9.1

2014-11-10 16:40:48

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v6 2/7] vfs: Define new syscalls preadv2,pwritev2

New syscalls that take a flags argument. This change does not add any
specific flags.

Signed-off-by: Milosz Tanski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
fs/read_write.c | 172 ++++++++++++++++++++++++++++++--------
include/linux/compat.h | 6 ++
include/linux/syscalls.h | 6 ++
include/uapi/asm-generic/unistd.h | 6 +-
mm/filemap.c | 5 +-
5 files changed, 156 insertions(+), 39 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 94b2d34..b1b4bc8 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -866,6 +866,8 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
return -EBADF;
if (!(file->f_mode & FMODE_CAN_READ))
return -EINVAL;
+ if (flags & ~0)
+ return -EINVAL;

return do_readv_writev(READ, file, vec, vlen, pos, flags);
}
@@ -879,21 +881,23 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
return -EBADF;
if (!(file->f_mode & FMODE_CAN_WRITE))
return -EINVAL;
+ if (flags & ~0)
+ return -EINVAL;

return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
}

EXPORT_SYMBOL(vfs_writev);

-SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
- unsigned long, vlen)
+static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, int flags)
{
struct fd f = fdget_pos(fd);
ssize_t ret = -EBADF;

if (f.file) {
loff_t pos = file_pos_read(f.file);
- ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+ ret = vfs_readv(f.file, vec, vlen, &pos, flags);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
@@ -905,15 +909,15 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
return ret;
}

-SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
- unsigned long, vlen)
+static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, int flags)
{
struct fd f = fdget_pos(fd);
ssize_t ret = -EBADF;

if (f.file) {
loff_t pos = file_pos_read(f.file);
- ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+ ret = vfs_writev(f.file, vec, vlen, &pos, flags);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
@@ -931,10 +935,9 @@ static inline loff_t pos_from_hilo(unsigned long high, unsigned long low)
return (((loff_t)high << HALF_LONG_BITS) << HALF_LONG_BITS) | low;
}

-SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
- unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_preadv(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, loff_t pos, int flags)
{
- loff_t pos = pos_from_hilo(pos_h, pos_l);
struct fd f;
ssize_t ret = -EBADF;

@@ -945,7 +948,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
if (f.file) {
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PREAD)
- ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+ ret = vfs_readv(f.file, vec, vlen, &pos, flags);
fdput(f);
}

@@ -955,10 +958,9 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
return ret;
}

-SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
- unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_pwritev(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, loff_t pos, int flags)
{
- loff_t pos = pos_from_hilo(pos_h, pos_l);
struct fd f;
ssize_t ret = -EBADF;

@@ -969,7 +971,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
if (f.file) {
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PWRITE)
- ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+ ret = vfs_writev(f.file, vec, vlen, &pos, flags);
fdput(f);
}

@@ -979,11 +981,63 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
return ret;
}

+SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen)
+{
+ return do_readv(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen)
+{
+ return do_writev(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+ loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+ return do_preadv(fd, vec, vlen, pos, 0);
+}
+
+SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+ int, flags)
+{
+ loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+ if (pos == -1)
+ return do_readv(fd, vec, vlen, flags);
+
+ return do_preadv(fd, vec, vlen, pos, flags);
+}
+
+SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+ loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+ return do_pwritev(fd, vec, vlen, pos, 0);
+}
+
+SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+ int, flags)
+{
+ loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+ if (pos == -1)
+ return do_writev(fd, vec, vlen, flags);
+
+ return do_pwritev(fd, vec, vlen, pos, flags);
+}
+
#ifdef CONFIG_COMPAT

static ssize_t compat_do_readv_writev(int type, struct file *file,
const struct compat_iovec __user *uvector,
- unsigned long nr_segs, loff_t *pos)
+ unsigned long nr_segs, loff_t *pos, int flags)
{
compat_ssize_t tot_len;
struct iovec iovstack[UIO_FASTIOV];
@@ -1017,7 +1071,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,

if (iter_fn)
ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
- pos, iter_fn, 0);
+ pos, iter_fn, flags);
else if (fnv)
ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
pos, fnv);
@@ -1041,7 +1095,7 @@ out:

static size_t compat_readv(struct file *file,
const struct compat_iovec __user *vec,
- unsigned long vlen, loff_t *pos)
+ unsigned long vlen, loff_t *pos, int flags)
{
ssize_t ret = -EBADF;

@@ -1051,8 +1105,10 @@ static size_t compat_readv(struct file *file,
ret = -EINVAL;
if (!(file->f_mode & FMODE_CAN_READ))
goto out;
+ if (flags & ~0)
+ goto out;

- ret = compat_do_readv_writev(READ, file, vec, vlen, pos);
+ ret = compat_do_readv_writev(READ, file, vec, vlen, pos, flags);

out:
if (ret > 0)
@@ -1061,9 +1117,9 @@ out:
return ret;
}

-COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
- const struct compat_iovec __user *,vec,
- compat_ulong_t, vlen)
+static size_t __compat_sys_readv(compat_ulong_t fd,
+ const struct compat_iovec __user *vec,
+ compat_ulong_t vlen, int flags)
{
struct fd f = fdget_pos(fd);
ssize_t ret;
@@ -1072,16 +1128,24 @@ COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
if (!f.file)
return -EBADF;
pos = f.file->f_pos;
- ret = compat_readv(f.file, vec, vlen, &pos);
+ ret = compat_readv(f.file, vec, vlen, &pos, flags);
if (ret >= 0)
f.file->f_pos = pos;
fdput_pos(f);
return ret;
+
+}
+
+COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
+ const struct compat_iovec __user *,vec,
+ compat_ulong_t, vlen)
+{
+ return __compat_sys_readv(fd, vec, vlen, 0);
}

static long __compat_sys_preadv64(unsigned long fd,
const struct compat_iovec __user *vec,
- unsigned long vlen, loff_t pos)
+ unsigned long vlen, loff_t pos, int flags)
{
struct fd f;
ssize_t ret;
@@ -1093,7 +1157,7 @@ static long __compat_sys_preadv64(unsigned long fd,
return -EBADF;
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PREAD)
- ret = compat_readv(f.file, vec, vlen, &pos);
+ ret = compat_readv(f.file, vec, vlen, &pos, flags);
fdput(f);
return ret;
}
@@ -1103,7 +1167,7 @@ COMPAT_SYSCALL_DEFINE4(preadv64, unsigned long, fd,
const struct compat_iovec __user *,vec,
unsigned long, vlen, loff_t, pos)
{
- return __compat_sys_preadv64(fd, vec, vlen, pos);
+ return __compat_sys_preadv64(fd, vec, vlen, pos, 0);
}
#endif

@@ -1113,12 +1177,25 @@ COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd,
{
loff_t pos = ((loff_t)pos_high << 32) | pos_low;

- return __compat_sys_preadv64(fd, vec, vlen, pos);
+ return __compat_sys_preadv64(fd, vec, vlen, pos, 0);
+}
+
+COMPAT_SYSCALL_DEFINE6(preadv2, compat_ulong_t, fd,
+ const struct compat_iovec __user *,vec,
+ compat_ulong_t, vlen, u32, pos_low, u32, pos_high,
+ int, flags)
+{
+ loff_t pos = ((loff_t)pos_high << 32) | pos_low;
+
+ if (pos == -1)
+ return __compat_sys_readv(fd, vec, vlen, flags);
+
+ return __compat_sys_preadv64(fd, vec, vlen, pos, flags);
}

static size_t compat_writev(struct file *file,
const struct compat_iovec __user *vec,
- unsigned long vlen, loff_t *pos)
+ unsigned long vlen, loff_t *pos, int flags)
{
ssize_t ret = -EBADF;

@@ -1128,8 +1205,10 @@ static size_t compat_writev(struct file *file,
ret = -EINVAL;
if (!(file->f_mode & FMODE_CAN_WRITE))
goto out;
+ if (flags & ~0)
+ goto out;

- ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos);
+ ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos, flags);

out:
if (ret > 0)
@@ -1138,9 +1217,9 @@ out:
return ret;
}

-COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
- const struct compat_iovec __user *, vec,
- compat_ulong_t, vlen)
+static size_t __compat_sys_writev(compat_ulong_t fd,
+ const struct compat_iovec __user* vec,
+ compat_ulong_t vlen, int flags)
{
struct fd f = fdget_pos(fd);
ssize_t ret;
@@ -1149,28 +1228,36 @@ COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
if (!f.file)
return -EBADF;
pos = f.file->f_pos;
- ret = compat_writev(f.file, vec, vlen, &pos);
+ ret = compat_writev(f.file, vec, vlen, &pos, flags);
if (ret >= 0)
f.file->f_pos = pos;
fdput_pos(f);
return ret;
}

+COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
+ const struct compat_iovec __user *, vec,
+ compat_ulong_t, vlen)
+{
+ return __compat_sys_writev(fd, vec, vlen, 0);
+}
+
static long __compat_sys_pwritev64(unsigned long fd,
const struct compat_iovec __user *vec,
- unsigned long vlen, loff_t pos)
+ unsigned long vlen, loff_t pos, int flags)
{
struct fd f;
ssize_t ret;

if (pos < 0)
return -EINVAL;
+
f = fdget(fd);
if (!f.file)
return -EBADF;
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PWRITE)
- ret = compat_writev(f.file, vec, vlen, &pos);
+ ret = compat_writev(f.file, vec, vlen, &pos, flags);
fdput(f);
return ret;
}
@@ -1180,7 +1267,7 @@ COMPAT_SYSCALL_DEFINE4(pwritev64, unsigned long, fd,
const struct compat_iovec __user *,vec,
unsigned long, vlen, loff_t, pos)
{
- return __compat_sys_pwritev64(fd, vec, vlen, pos);
+ return __compat_sys_pwritev64(fd, vec, vlen, pos, 0);
}
#endif

@@ -1190,8 +1277,21 @@ COMPAT_SYSCALL_DEFINE5(pwritev, compat_ulong_t, fd,
{
loff_t pos = ((loff_t)pos_high << 32) | pos_low;

- return __compat_sys_pwritev64(fd, vec, vlen, pos);
+ return __compat_sys_pwritev64(fd, vec, vlen, pos, 0);
+}
+
+COMPAT_SYSCALL_DEFINE6(pwritev2, compat_ulong_t, fd,
+ const struct compat_iovec __user *,vec,
+ compat_ulong_t, vlen, u32, pos_low, u32, pos_high, int, flags)
+{
+ loff_t pos = ((loff_t)pos_high << 32) | pos_low;
+
+ if (pos == -1)
+ return __compat_sys_writev(fd, vec, vlen, flags);
+
+ return __compat_sys_pwritev64(fd, vec, vlen, pos, flags);
}
+
#endif

static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
diff --git a/include/linux/compat.h b/include/linux/compat.h
index e649426..63a94e2 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -340,6 +340,12 @@ asmlinkage ssize_t compat_sys_preadv(compat_ulong_t fd,
asmlinkage ssize_t compat_sys_pwritev(compat_ulong_t fd,
const struct compat_iovec __user *vec,
compat_ulong_t vlen, u32 pos_low, u32 pos_high);
+asmlinkage ssize_t compat_sys_preadv2(compat_ulong_t fd,
+ const struct compat_iovec __user *vec,
+ compat_ulong_t vlen, u32 pos_low, u32 pos_high, int flags);
+asmlinkage ssize_t compat_sys_pwritev2(compat_ulong_t fd,
+ const struct compat_iovec __user *vec,
+ compat_ulong_t vlen, u32 pos_low, u32 pos_high, int flags);

#ifdef __ARCH_WANT_COMPAT_SYS_PREADV64
asmlinkage long compat_sys_preadv64(unsigned long fd,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index bda9b81..cedc22e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -571,8 +571,14 @@ asmlinkage long sys_pwrite64(unsigned int fd, const char __user *buf,
size_t count, loff_t pos);
asmlinkage long sys_preadv(unsigned long fd, const struct iovec __user *vec,
unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_preadv2(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+ int flags);
asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec,
unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+ int flags);
asmlinkage long sys_getcwd(char __user *buf, unsigned long size);
asmlinkage long sys_mkdir(const char __user *pathname, umode_t mode);
asmlinkage long sys_chdir(const char __user *filename);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 22749c1..9406018 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -213,6 +213,10 @@ __SC_COMP(__NR_pwrite64, sys_pwrite64, compat_sys_pwrite64)
__SC_COMP(__NR_preadv, sys_preadv, compat_sys_preadv)
#define __NR_pwritev 70
__SC_COMP(__NR_pwritev, sys_pwritev, compat_sys_pwritev)
+#define __NR_preadv2 281
+__SC_COMP(__NR_preadv2, sys_preadv2, compat_sys_preadv2)
+#define __NR_pwritev2 282
+__SC_COMP(__NR_pwritev2, sys_pwritev2, compat_sys_pwritev2)

/* fs/sendfile.c */
#define __NR3264_sendfile 71
@@ -709,7 +713,7 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
__SYSCALL(__NR_bpf, sys_bpf)

#undef __NR_syscalls
-#define __NR_syscalls 281
+#define __NR_syscalls 283

/*
* All syscalls below here should go away really,
diff --git a/mm/filemap.c b/mm/filemap.c
index 14b4642..530c263 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1457,6 +1457,7 @@ static void shrink_readahead_size_eio(struct file *filp,
* @ppos: current file position
* @iter: data destination
* @written: already copied
+ * @flags: optional flags
*
* This is a generic file read routine, and uses the
* mapping->a_ops->readpage() function for the actual low-level stuff.
@@ -1465,7 +1466,7 @@ static void shrink_readahead_size_eio(struct file *filp,
* of the logic when it comes to error handling etc.
*/
static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
- struct iov_iter *iter, ssize_t written)
+ struct iov_iter *iter, ssize_t written, int flags)
{
struct address_space *mapping = filp->f_mapping;
struct inode *inode = mapping->host;
@@ -1735,7 +1736,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
}
}

- retval = do_generic_file_read(file, ppos, iter, retval);
+ retval = do_generic_file_read(file, ppos, iter, retval, iocb->ki_rwflags);
out:
return retval;
}
--
1.9.1

2014-11-10 16:40:59

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v6 7/7] fs: add a flag for per-operation O_DSYNC semantics

From: Christoph Hellwig <[email protected]>

With the new read/write with flags syscalls we can support a flag
to enable O_DSYNC semantics on a per-operation basis. This is
useful to implement protocols like SMB, NFS or SCSI that have such
per-operation flags.

Example program below:

cat > pwritev2.c << EOF
/* __NR_pwritev2 and RWF_DSYNC come from the patched kernel headers. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <sys/uio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define LO_HI_LONG(val) \
(off_t) val, \
(off_t) ((((uint64_t) (val)) >> (sizeof (long) * 4)) >> (sizeof (long) * 4))

static ssize_t
pwritev2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags)
{
return syscall(__NR_pwritev2, fd, iov, iovcnt, LO_HI_LONG(offset),
flags);
}

int main(int argc, char **argv)
{
int fd = open(argv[1], O_WRONLY|O_CREAT|O_TRUNC, 0666);
char buf[1024];
struct iovec iov = { .iov_base = buf, .iov_len = 1024 };
int ret;

if (fd < 0) {
perror("open");
return 0;
}

memset(buf, 0xfe, sizeof(buf));

ret = pwritev2(fd, &iov, 1, 0, RWF_DSYNC);
if (ret < 0)
perror("pwritev2");
else
printf("ret = %d\n", ret);

return 0;
}
EOF

Signed-off-by: Christoph Hellwig <[email protected]>
[[email protected]: compat syscall changes for RWF_DSYNC]
Signed-off-by: Milosz Tanski <[email protected]>
Reviewed-by: Jeff Moyer <[email protected]>
Acked-by: Sage Weil <[email protected]>
---
fs/ceph/file.c | 4 +++-
fs/fuse/file.c | 2 ++
fs/nfs/file.c | 10 ++++++----
fs/ocfs2/file.c | 6 ++++--
fs/read_write.c | 8 ++++++--
include/linux/fs.h | 3 ++-
mm/filemap.c | 4 +++-
7 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index b798b5c..2d4e15a 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -983,7 +983,9 @@ retry_snap:
ceph_put_cap_refs(ci, got);

if (written >= 0 &&
- ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host) ||
+ ((file->f_flags & O_SYNC) ||
+ IS_SYNC(file->f_mapping->host) ||
+ (iocb->ki_rwflags & RWF_DSYNC) ||
ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
err = vfs_fsync_range(file, pos, pos + written - 1, 1);
if (err < 0)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index caa8d95..bb4fb23 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1248,6 +1248,8 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
written += written_buffered;
iocb->ki_pos = pos + written_buffered;
} else {
+ if (iocb->ki_rwflags & RWF_DSYNC)
+ return -EINVAL;
written = fuse_perform_write(file, mapping, from, pos);
if (written >= 0)
iocb->ki_pos = pos + written;
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index aa9046f..c59b0b7 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -652,13 +652,15 @@ static const struct vm_operations_struct nfs_file_vm_ops = {
.remap_pages = generic_file_remap_pages,
};

-static int nfs_need_sync_write(struct file *filp, struct inode *inode)
+static int nfs_need_sync_write(struct kiocb *iocb, struct inode *inode)
{
struct nfs_open_context *ctx;

- if (IS_SYNC(inode) || (filp->f_flags & O_DSYNC))
+ if (IS_SYNC(inode) ||
+ (iocb->ki_filp->f_flags & O_DSYNC) ||
+ (iocb->ki_rwflags & RWF_DSYNC))
return 1;
- ctx = nfs_file_open_context(filp);
+ ctx = nfs_file_open_context(iocb->ki_filp);
if (test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags) ||
nfs_ctx_key_to_expire(ctx))
return 1;
@@ -705,7 +707,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
written = result;

/* Return error values for O_DSYNC and IS_SYNC() */
- if (result >= 0 && nfs_need_sync_write(file, inode)) {
+ if (result >= 0 && nfs_need_sync_write(iocb, inode)) {
int err = vfs_fsync(file, 0);
if (err < 0)
result = err;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index bb66ca4..8f9a86b 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2374,8 +2374,10 @@ out_dio:
/* buffered aio wouldn't have proper lock coverage today */
BUG_ON(ret == -EIOCBQUEUED && !(file->f_flags & O_DIRECT));

- if (((file->f_flags & O_DSYNC) && !direct_io) || IS_SYNC(inode) ||
- ((file->f_flags & O_DIRECT) && !direct_io)) {
+ if (((file->f_flags & O_DSYNC) && !direct_io) ||
+ IS_SYNC(inode) ||
+ ((file->f_flags & O_DIRECT) && !direct_io) ||
+ (iocb->ki_rwflags & RWF_DSYNC)) {
ret = filemap_fdatawrite_range(file->f_mapping, *ppos,
*ppos + count - 1);
if (ret < 0)
diff --git a/fs/read_write.c b/fs/read_write.c
index adf85ab..c2e3c0a 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -841,6 +841,8 @@ static ssize_t do_readv_writev(int type, struct file *file,
} else {
if (type == READ && (flags & RWF_NONBLOCK))
return -EAGAIN;
+ if (type == WRITE && (flags & RWF_DSYNC))
+ return -EINVAL;

if (fnv)
ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
@@ -888,7 +890,7 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
return -EBADF;
if (!(file->f_mode & FMODE_CAN_WRITE))
return -EINVAL;
- if (flags & ~0)
+ if (flags & ~RWF_DSYNC)
return -EINVAL;

return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
@@ -1082,6 +1084,8 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
} else {
if (type == READ && (flags & RWF_NONBLOCK))
return -EAGAIN;
+ if (type == WRITE && (flags & RWF_DSYNC))
+ return -EINVAL;

if (fnv)
ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
@@ -1221,7 +1225,7 @@ static size_t compat_writev(struct file *file,
ret = -EINVAL;
if (!(file->f_mode & FMODE_CAN_WRITE))
goto out;
- if (flags & ~0)
+ if (flags & ~RWF_DSYNC)
goto out;

ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos, flags);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7d0e116..7786b88 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1460,7 +1460,8 @@ struct block_device_operations;
#define HAVE_UNLOCKED_IOCTL 1

/* These flags are used for the readv/writev syscalls with flags. */
-#define RWF_NONBLOCK 0x00000001
+#define RWF_NONBLOCK 0x00000001
+#define RWF_DSYNC 0x00000002

struct iov_iter;

diff --git a/mm/filemap.c b/mm/filemap.c
index 535967b..8c50d35 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2669,7 +2669,9 @@ int generic_write_sync(struct kiocb *iocb, loff_t count)
struct file *file = iocb->ki_filp;

if (count > 0 &&
- ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))) {
+ ((file->f_flags & O_DSYNC) ||
+ (iocb->ki_rwflags & RWF_DSYNC) ||
+ IS_SYNC(file->f_mapping->host))) {
bool fdatasync = !(file->f_flags & __O_SYNC);
ssize_t ret;

--
1.9.1

2014-11-10 16:41:32

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v6 6/7] fs: pass iocb to generic_write_sync

From: Christoph Hellwig <[email protected]>

Clean up generic_write_sync() by just passing an iocb and a bytes
written / negative errno argument. In addition to simplifying the
callers, this also prepares for passing a per-operation O_DSYNC
flag. Two callers didn't quite fit that scheme:

- dio_complete didn't bother to update ki_pos, since it isn't otherwise
needed on an iocb that is about to be freed, so that update had to be
added. Additionally, it used to sync out written data even in the error
case; it has been changed to operate like the other callers.
- gfs2 also used generic_write_sync to implement a crude version
of fallocate. It has been switched to use an open-coded variant
instead.

Signed-off-by: Christoph Hellwig <[email protected]>
[Small change in generic_write_sync() suggested by Anton Altaparmakov]
Signed-off-by: Milosz Tanski <[email protected]>
Acked-by: Steven Whitehouse <[email protected]>
Acked-by: Anton Altaparmakov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/block_dev.c | 8 +-------
fs/btrfs/file.c | 7 ++-----
fs/cifs/file.c | 8 +-------
fs/direct-io.c | 8 ++------
fs/ext4/file.c | 8 +-------
fs/gfs2/file.c | 9 +++++++--
fs/ntfs/file.c | 8 ++------
fs/udf/file.c | 11 ++---------
fs/xfs/xfs_file.c | 8 +-------
include/linux/fs.h | 8 +-------
mm/filemap.c | 31 +++++++++++++++++++++----------
11 files changed, 41 insertions(+), 73 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index cc9d411..c529b1c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1568,18 +1568,12 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
*/
ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
- struct file *file = iocb->ki_filp;
struct blk_plug plug;
ssize_t ret;

blk_start_plug(&plug);
ret = __generic_file_write_iter(iocb, from);
- if (ret > 0) {
- ssize_t err;
- err = generic_write_sync(file, iocb->ki_pos - ret, ret);
- if (err < 0)
- ret = err;
- }
+ ret = generic_write_sync(iocb, ret);
blk_finish_plug(&plug);
return ret;
}
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a18ceab..4f4a6f7 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1820,11 +1820,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
*/
BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
BTRFS_I(inode)->last_sub_trans = root->log_transid;
- if (num_written > 0) {
- err = generic_write_sync(file, pos, num_written);
- if (err < 0)
- num_written = err;
- }
+
+ num_written = generic_write_sync(iocb, num_written);

if (sync)
atomic_dec(&BTRFS_I(inode)->sync_writers);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index c485afa..32359de 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2706,13 +2706,7 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
rc = __generic_file_write_iter(iocb, from);
mutex_unlock(&inode->i_mutex);

- if (rc > 0) {
- ssize_t err;
-
- err = generic_write_sync(file, iocb->ki_pos - rc, rc);
- if (err < 0)
- rc = err;
- }
+ rc = generic_write_sync(iocb, rc);
} else {
mutex_unlock(&inode->i_mutex);
}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index e181b6b..b72ac83 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -257,12 +257,8 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
inode_dio_done(dio->inode);
if (is_async) {
if (dio->rw & WRITE) {
- int err;
-
- err = generic_write_sync(dio->iocb->ki_filp, offset,
- transferred);
- if (err < 0 && ret > 0)
- ret = err;
+ dio->iocb->ki_pos = offset + transferred;
+ ret = generic_write_sync(dio->iocb, ret);
}

aio_complete(dio->iocb, ret, 0);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index aca7b24..79b000c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -175,13 +175,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
ret = __generic_file_write_iter(iocb, from);
mutex_unlock(&inode->i_mutex);

- if (ret > 0) {
- ssize_t err;
-
- err = generic_write_sync(file, iocb->ki_pos - ret, ret);
- if (err < 0)
- ret = err;
- }
+ ret = generic_write_sync(iocb, ret);
if (o_direct)
blk_finish_plug(&plug);

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 80dd44d..3fafeca 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -895,8 +895,13 @@ retry:
gfs2_quota_unlock(ip);
}

- if (error == 0)
- error = generic_write_sync(file, pos, count);
+ if (error)
+ goto out_unlock;
+
+ if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host)) {
+ error = vfs_fsync_range(file, pos, pos + count - 1,
+ (file->f_flags & __O_SYNC) ? 0 : 1);
+ }
goto out_unlock;

out_trans_fail:
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 643faa4..4f3d664 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2127,12 +2127,8 @@ static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
mutex_lock(&inode->i_mutex);
ret = ntfs_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
mutex_unlock(&inode->i_mutex);
- if (ret > 0) {
- int err = generic_write_sync(file, iocb->ki_pos - ret, ret);
- if (err < 0)
- ret = err;
- }
- return ret;
+
+ return generic_write_sync(iocb, ret);
}

/**
diff --git a/fs/udf/file.c b/fs/udf/file.c
index bb15771..1cdabd0 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -155,16 +155,9 @@ static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
retval = __generic_file_write_iter(iocb, from);
mutex_unlock(&inode->i_mutex);

- if (retval > 0) {
- ssize_t err;
-
+ if (retval > 0)
mark_inode_dirty(inode);
- err = generic_write_sync(file, iocb->ki_pos - retval, retval);
- if (err < 0)
- retval = err;
- }
-
- return retval;
+ return generic_write_sync(iocb, retval);
}

long udf_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 0655915..a8cab66 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -792,14 +792,8 @@ xfs_file_write_iter(
ret = xfs_file_buffered_aio_write(iocb, from);

if (ret > 0) {
- ssize_t err;
-
XFS_STATS_ADD(xs_write_bytes, ret);
-
- /* Handle various SYNC-type writes */
- err = generic_write_sync(file, iocb->ki_pos - ret, ret);
- if (err < 0)
- ret = err;
+ ret = generic_write_sync(iocb, ret);
}
return ret;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index eaebd99..7d0e116 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2242,13 +2242,7 @@ extern int filemap_fdatawrite_range(struct address_space *mapping,
extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
int datasync);
extern int vfs_fsync(struct file *file, int datasync);
-static inline int generic_write_sync(struct file *file, loff_t pos, loff_t count)
-{
- if (!(file->f_flags & O_DSYNC) && !IS_SYNC(file->f_mapping->host))
- return 0;
- return vfs_fsync_range(file, pos, pos + count - 1,
- (file->f_flags & __O_SYNC) ? 0 : 1);
-}
+extern int generic_write_sync(struct kiocb *iocb, loff_t count);
extern void emergency_sync(void);
extern void emergency_remount(void);
#ifdef CONFIG_BLOCK
diff --git a/mm/filemap.c b/mm/filemap.c
index 09d3af3..535967b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2664,6 +2664,25 @@ out:
}
EXPORT_SYMBOL(__generic_file_write_iter);

+int generic_write_sync(struct kiocb *iocb, loff_t count)
+{
+ struct file *file = iocb->ki_filp;
+
+ if (count > 0 &&
+ ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))) {
+ bool fdatasync = !(file->f_flags & __O_SYNC);
+ ssize_t ret;
+
+ ret = vfs_fsync_range(file, iocb->ki_pos - count,
+ iocb->ki_pos - 1, fdatasync);
+ if (ret < 0)
+ return ret;
+ }
+
+ return count;
+}
+EXPORT_SYMBOL(generic_write_sync);
+
/**
* generic_file_write_iter - write data to a file
* @iocb: IO state structure
@@ -2675,22 +2694,14 @@ EXPORT_SYMBOL(__generic_file_write_iter);
*/
ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
+ struct inode *inode = iocb->ki_filp->f_mapping->host;
ssize_t ret;

mutex_lock(&inode->i_mutex);
ret = __generic_file_write_iter(iocb, from);
mutex_unlock(&inode->i_mutex);

- if (ret > 0) {
- ssize_t err;
-
- err = generic_write_sync(file, iocb->ki_pos - ret, ret);
- if (err < 0)
- ret = err;
- }
- return ret;
+ return generic_write_sync(iocb, ret);
}
EXPORT_SYMBOL(generic_file_write_iter);

--
1.9.1

2014-11-10 16:41:59

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v6 5/7] xfs: add RWF_NONBLOCK support

From: Christoph Hellwig <[email protected]>

Add support for non-blocking reads. The guts are handled by the generic
code; the only addition is a non-blocking variant of xfs_rw_ilock.

Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/xfs/xfs_file.c | 32 +++++++++++++++++++++++++++-----
1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b1f6334..0655915 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -61,6 +61,25 @@ xfs_rw_ilock(
xfs_ilock(ip, type);
}

+static inline bool
+xfs_rw_ilock_nowait(
+ struct xfs_inode *ip,
+ int type)
+{
+ if (type & XFS_IOLOCK_EXCL) {
+ if (!mutex_trylock(&VFS_I(ip)->i_mutex))
+ return false;
+ if (!xfs_ilock_nowait(ip, type)) {
+ mutex_unlock(&VFS_I(ip)->i_mutex);
+ return false;
+ }
+ } else {
+ if (!xfs_ilock_nowait(ip, type))
+ return false;
+ }
+ return true;
+}
+
static inline void
xfs_rw_iunlock(
struct xfs_inode *ip,
@@ -246,10 +265,6 @@ xfs_file_read_iter(

XFS_STATS_INC(xs_read_calls);

- /* XXX: need a non-blocking iolock helper, shouldn't be too hard */
- if (iocb->ki_rwflags & RWF_NONBLOCK)
- return -EAGAIN;
-
if (unlikely(file->f_flags & O_DIRECT))
ioflags |= XFS_IO_ISDIRECT;
if (file->f_mode & FMODE_NOCMTIME)
@@ -287,7 +302,14 @@ xfs_file_read_iter(
* This allows the normal direct IO case of no page cache pages to
* proceeed concurrently without serialisation.
*/
- xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+ if (iocb->ki_rwflags & RWF_NONBLOCK) {
+ if (ioflags & XFS_IO_ISDIRECT)
+ return -EAGAIN;
+ if (!xfs_rw_ilock_nowait(ip, XFS_IOLOCK_SHARED))
+ return -EAGAIN;
+ } else {
+ xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+ }
if ((ioflags & XFS_IO_ISDIRECT) && inode->i_mapping->nrpages) {
xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
--
1.9.1

2014-11-10 16:42:27

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v6 4/7] vfs: RWF_NONBLOCK flag for preadv2

generic_file_read_iter() supports a new flag, RWF_NONBLOCK, which says that
we only want to read the data if it's already in the page cache.

Additionally, a few filesystems have to specifically bail out early when
RWF_NONBLOCK is set because the operation would otherwise block. Christoph
Hellwig contributed that code.

Signed-off-by: Milosz Tanski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jeff Moyer <[email protected]>
Acked-by: Sage Weil <[email protected]>
---
fs/ceph/file.c | 2 ++
fs/cifs/file.c | 6 ++++++
fs/nfs/file.c | 5 ++++-
fs/ocfs2/file.c | 6 ++++++
fs/pipe.c | 3 ++-
fs/read_write.c | 44 ++++++++++++++++++++++++++++++--------------
fs/xfs/xfs_file.c | 4 ++++
include/linux/fs.h | 3 +++
mm/filemap.c | 18 ++++++++++++++++++
mm/shmem.c | 4 ++++
10 files changed, 79 insertions(+), 16 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index d7e0da8..b798b5c 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -822,6 +822,8 @@ again:
if ((got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) == 0 ||
(iocb->ki_filp->f_flags & O_DIRECT) ||
(fi->flags & CEPH_F_SYNC)) {
+ if (iocb->ki_rwflags & RWF_NONBLOCK)
+ return -EAGAIN;

dout("aio_sync_read %p %llx.%llx %llu~%u got cap refs on %s\n",
inode, ceph_vinop(inode), iocb->ki_pos, (unsigned)len,
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 3e4d00a..c485afa 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3005,6 +3005,9 @@ ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to)
struct cifs_readdata *rdata, *tmp;
struct list_head rdata_list;

+ if (iocb->ki_rwflags & RWF_NONBLOCK)
+ return -EAGAIN;
+
len = iov_iter_count(to);
if (!len)
return 0;
@@ -3123,6 +3126,9 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NOPOSIXBRL) == 0))
return generic_file_read_iter(iocb, to);

+ if (iocb->ki_rwflags & RWF_NONBLOCK)
+ return -EAGAIN;
+
/*
* We need to hold the sem to be sure nobody modifies lock list
* with a brlock that prevents reading.
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 2ab6f00..aa9046f 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -171,8 +171,11 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t result;

- if (iocb->ki_filp->f_flags & O_DIRECT)
+ if (iocb->ki_filp->f_flags & O_DIRECT) {
+ if (iocb->ki_rwflags & RWF_NONBLOCK)
+ return -EAGAIN;
return nfs_file_direct_read(iocb, to, iocb->ki_pos);
+ }

dprintk("NFS: read(%pD2, %zu@%lu)\n",
iocb->ki_filp,
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 324dc93..bb66ca4 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2472,6 +2472,12 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
filp->f_path.dentry->d_name.name,
to->nr_segs); /* GRRRRR */

+ /*
+ * No non-blocking reads for ocfs2 for now. Might be doable with
+ * non-blocking cluster lock helpers.
+ */
+ if (iocb->ki_rwflags & RWF_NONBLOCK)
+ return -EAGAIN;

if (!inode) {
ret = -EINVAL;
diff --git a/fs/pipe.c b/fs/pipe.c
index 21981e5..212bf68 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -302,7 +302,8 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
*/
if (ret)
break;
- if (filp->f_flags & O_NONBLOCK) {
+ if ((filp->f_flags & O_NONBLOCK) ||
+ (iocb->ki_rwflags & RWF_NONBLOCK)) {
ret = -EAGAIN;
break;
}
diff --git a/fs/read_write.c b/fs/read_write.c
index b1b4bc8..adf85ab 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -835,14 +835,19 @@ static ssize_t do_readv_writev(int type, struct file *file,
file_start_write(file);
}

- if (iter_fn)
+ if (iter_fn) {
ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
pos, iter_fn, flags);
- else if (fnv)
- ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
- pos, fnv);
- else
- ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+ } else {
+ if (type == READ && (flags & RWF_NONBLOCK))
+ return -EAGAIN;
+
+ if (fnv)
+ ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
+ pos, fnv);
+ else
+ ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+ }

if (type != READ)
file_end_write(file);
@@ -866,8 +871,10 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
return -EBADF;
if (!(file->f_mode & FMODE_CAN_READ))
return -EINVAL;
- if (flags & ~0)
+ if (flags & ~RWF_NONBLOCK)
return -EINVAL;
+ if ((file->f_flags & O_DIRECT) && (flags & RWF_NONBLOCK))
+ return -EAGAIN;

return do_readv_writev(READ, file, vec, vlen, pos, flags);
}
@@ -1069,14 +1076,19 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
file_start_write(file);
}

- if (iter_fn)
+ if (iter_fn) {
ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
pos, iter_fn, flags);
- else if (fnv)
- ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
- pos, fnv);
- else
- ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+ } else {
+ if (type == READ && (flags & RWF_NONBLOCK))
+ return -EAGAIN;
+
+ if (fnv)
+ ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
+ pos, fnv);
+ else
+ ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+ }

if (type != READ)
file_end_write(file);
@@ -1105,7 +1117,11 @@ static size_t compat_readv(struct file *file,
ret = -EINVAL;
if (!(file->f_mode & FMODE_CAN_READ))
goto out;
- if (flags & ~0)
+ if (flags & ~RWF_NONBLOCK)
+ goto out;
+
+ ret = -EAGAIN;
+ if ((file->f_flags & O_DIRECT) && (flags & RWF_NONBLOCK))
goto out;

ret = compat_do_readv_writev(READ, file, vec, vlen, pos, flags);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index eb596b4..b1f6334 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -246,6 +246,10 @@ xfs_file_read_iter(

XFS_STATS_INC(xs_read_calls);

+ /* XXX: need a non-blocking iolock helper, shouldn't be too hard */
+ if (iocb->ki_rwflags & RWF_NONBLOCK)
+ return -EAGAIN;
+
if (unlikely(file->f_flags & O_DIRECT))
ioflags |= XFS_IO_ISDIRECT;
if (file->f_mode & FMODE_NOCMTIME)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ed5711..eaebd99 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1459,6 +1459,9 @@ struct block_device_operations;
#define HAVE_COMPAT_IOCTL 1
#define HAVE_UNLOCKED_IOCTL 1

+/* These flags are used for the readv/writev syscalls with flags. */
+#define RWF_NONBLOCK 0x00000001
+
struct iov_iter;

struct file_operations {
diff --git a/mm/filemap.c b/mm/filemap.c
index 530c263..09d3af3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1494,6 +1494,8 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
find_page:
page = find_get_page(mapping, index);
if (!page) {
+ if (flags & RWF_NONBLOCK)
+ goto would_block;
page_cache_sync_readahead(mapping,
ra, filp,
index, last_index - index);
@@ -1585,6 +1587,11 @@ page_ok:
continue;

page_not_up_to_date:
+ if (flags & RWF_NONBLOCK) {
+ page_cache_release(page);
+ goto would_block;
+ }
+
/* Get exclusive access to the page ... */
error = lock_page_killable(page);
if (unlikely(error))
@@ -1604,6 +1611,12 @@ page_not_up_to_date_locked:
goto page_ok;
}

+ if (flags & RWF_NONBLOCK) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto would_block;
+ }
+
readpage:
/*
* A previous I/O error may have been due to temporary
@@ -1674,6 +1687,8 @@ no_cached_page:
goto readpage;
}

+would_block:
+ error = -EAGAIN;
out:
ra->prev_pos = prev_index;
ra->prev_pos <<= PAGE_CACHE_SHIFT;
@@ -1707,6 +1722,9 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
size_t count = iov_iter_count(iter);
loff_t size;

+ if (iocb->ki_rwflags & RWF_NONBLOCK)
+ return -EAGAIN;
+
if (!count)
goto out; /* skip atime */
size = i_size_read(inode);
diff --git a/mm/shmem.c b/mm/shmem.c
index cd6fc75..5c30f04 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1531,6 +1531,10 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
ssize_t retval = 0;
loff_t *ppos = &iocb->ki_pos;

+ /* XXX: should be easily supportable */
+ if (iocb->ki_rwflags & RWF_NONBLOCK)
+ return -EAGAIN;
+
/*
* Might this read be for a stacking filesystem? Then when reading
* holes of a sparse file, we actually need to allocate those pages,
--
1.9.1

2014-11-10 16:43:31

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v6 3/7] x86: wire up preadv2 and pwritev2

Signed-off-by: Milosz Tanski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
arch/x86/syscalls/syscall_32.tbl | 2 ++
arch/x86/syscalls/syscall_64.tbl | 2 ++
2 files changed, 4 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 9fe1b5d..d592d87 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -364,3 +364,5 @@
355 i386 getrandom sys_getrandom
356 i386 memfd_create sys_memfd_create
357 i386 bpf sys_bpf
+358 i386 preadv2 sys_preadv2
+359 i386 pwritev2 sys_pwritev2
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 281150b..7be2447 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -328,6 +328,8 @@
319 common memfd_create sys_memfd_create
320 common kexec_file_load sys_kexec_file_load
321 common bpf sys_bpf
+322 64 preadv2 sys_preadv2
+323 64 pwritev2 sys_pwritev2

#
# x32-specific system call numbers start at 512 to avoid cache impact
--
1.9.1

2014-11-11 06:44:23

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] vfs: Non-blocking buffered fs read (page cache only)

On Mon, Nov 10, 2014 at 11:40:23AM -0500, Milosz Tanski wrote:
> This patchset introduces the ability to perform a non-blocking read from
> regular files in buffered IO mode. The read only succeeds for data that
> is already in the page cache.
>
> It does this by introducing new syscalls, preadv2/pwritev2. These new
> syscalls behave like the network sendmsg/recvmsg syscalls in that they
> accept an extra flag argument (RWF_NONBLOCK).
>
> It's a very common pattern today (samba, libuv, etc.) to use a large
> threadpool to perform buffered IO operations. The work is submitted from
> another thread that performs network IO and epoll, or from other threads
> that perform CPU work. This leads to increased processing latency,
> especially for data that's already cached in the page cache.
>
> With the new interface, applications will be able to fetch the data in
> their network / CPU bound thread(s) and only defer to a threadpool if it's
> not there. In our own application (VLDB) we've observed a decrease in
> latency for "fast" requests by avoiding unnecessary queuing and swapping
> out of current tasks in IO bound worker threads.

Can you write a test (or set of) for fstests that exercises this new
functionality? I'm not worried about performance, just
correctness....

Cheers,

Dave.

--
Dave Chinner
[email protected]

2014-11-11 16:02:07

by Milosz Tanski

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] vfs: Non-blocking buffered fs read (page cache only)

On Tue, Nov 11, 2014 at 1:44 AM, Dave Chinner <[email protected]> wrote:
> On Mon, Nov 10, 2014 at 11:40:23AM -0500, Milosz Tanski wrote:
>> This patchset introduces the ability to perform a non-blocking read from
>> regular files in buffered IO mode. The read only succeeds for data that
>> is already in the page cache.
>>
>> It does this by introducing new syscalls, preadv2/pwritev2. These new
>> syscalls behave like the network sendmsg/recvmsg syscalls in that they
>> accept an extra flag argument (RWF_NONBLOCK).
>>
>> It's a very common pattern today (samba, libuv, etc.) to use a large
>> threadpool to perform buffered IO operations. The work is submitted from
>> another thread that performs network IO and epoll, or from other threads
>> that perform CPU work. This leads to increased processing latency,
>> especially for data that's already cached in the page cache.
>>
>> With the new interface, applications will be able to fetch the data in
>> their network / CPU bound thread(s) and only defer to a threadpool if it's
>> not there. In our own application (VLDB) we've observed a decrease in
>> latency for "fast" requests by avoiding unnecessary queuing and swapping
>> out of current tasks in IO bound worker threads.
>
> Can you write a test (or set of) for fstests that exercises this new
> functionality? I'm not worried about performance, just
> correctness....

Sure thing. Can you point me at the fstests repo? A quick google
search reveals lots of projects named fstests, most of them abandoned.

>
> Cheers,
>
> Dave.
>
> --
> Dave Chinner
> [email protected]



--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: [email protected]

2014-11-11 17:04:11

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] vfs: Non-blocking buffered fs read (page cache only)

Milosz Tanski <[email protected]> writes:

>> Can you write a test (or set of) for fstests that exercises this new
>> functionality? I'm not worried about performance, just
>> correctness....
>
> Sure thing. Can you point me at the fstests repo? A quick google
> search reveals lots of projects named fstests, most of them abandoned.

I think he's referring to xfstests. Still, I think that's the wrong
place for functional testing. ltp would be better, imo.

Cheers,
Jeff

2014-11-11 21:10:03

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH v6 2/7] vfs: Define new syscalls preadv2,pwritev2

Milosz Tanski <[email protected]> writes:

> New syscalls that take a flag argument. This change does not add any specific
> flags.

Looks good.

Reviewed-by: Jeff Moyer <[email protected]>

>
> Signed-off-by: Milosz Tanski <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> ---
> fs/read_write.c | 172 ++++++++++++++++++++++++++++++--------
> include/linux/compat.h | 6 ++
> include/linux/syscalls.h | 6 ++
> include/uapi/asm-generic/unistd.h | 6 +-
> mm/filemap.c | 5 +-
> 5 files changed, 156 insertions(+), 39 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 94b2d34..b1b4bc8 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -866,6 +866,8 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
> return -EBADF;
> if (!(file->f_mode & FMODE_CAN_READ))
> return -EINVAL;
> + if (flags & ~0)
> + return -EINVAL;
>
> return do_readv_writev(READ, file, vec, vlen, pos, flags);
> }
> @@ -879,21 +881,23 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
> return -EBADF;
> if (!(file->f_mode & FMODE_CAN_WRITE))
> return -EINVAL;
> + if (flags & ~0)
> + return -EINVAL;
>
> return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
> }
>
> EXPORT_SYMBOL(vfs_writev);
>
> -SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
> - unsigned long, vlen)
> +static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
> + unsigned long vlen, int flags)
> {
> struct fd f = fdget_pos(fd);
> ssize_t ret = -EBADF;
>
> if (f.file) {
> loff_t pos = file_pos_read(f.file);
> - ret = vfs_readv(f.file, vec, vlen, &pos, 0);
> + ret = vfs_readv(f.file, vec, vlen, &pos, flags);
> if (ret >= 0)
> file_pos_write(f.file, pos);
> fdput_pos(f);
> @@ -905,15 +909,15 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
> return ret;
> }
>
> -SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
> - unsigned long, vlen)
> +static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
> + unsigned long vlen, int flags)
> {
> struct fd f = fdget_pos(fd);
> ssize_t ret = -EBADF;
>
> if (f.file) {
> loff_t pos = file_pos_read(f.file);
> - ret = vfs_writev(f.file, vec, vlen, &pos, 0);
> + ret = vfs_writev(f.file, vec, vlen, &pos, flags);
> if (ret >= 0)
> file_pos_write(f.file, pos);
> fdput_pos(f);
> @@ -931,10 +935,9 @@ static inline loff_t pos_from_hilo(unsigned long high, unsigned long low)
> return (((loff_t)high << HALF_LONG_BITS) << HALF_LONG_BITS) | low;
> }
>
> -SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
> - unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
> +static ssize_t do_preadv(unsigned long fd, const struct iovec __user *vec,
> + unsigned long vlen, loff_t pos, int flags)
> {
> - loff_t pos = pos_from_hilo(pos_h, pos_l);
> struct fd f;
> ssize_t ret = -EBADF;
>
> @@ -945,7 +948,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
> if (f.file) {
> ret = -ESPIPE;
> if (f.file->f_mode & FMODE_PREAD)
> - ret = vfs_readv(f.file, vec, vlen, &pos, 0);
> + ret = vfs_readv(f.file, vec, vlen, &pos, flags);
> fdput(f);
> }
>
> @@ -955,10 +958,9 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
> return ret;
> }
>
> -SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
> - unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
> +static ssize_t do_pwritev(unsigned long fd, const struct iovec __user *vec,
> + unsigned long vlen, loff_t pos, int flags)
> {
> - loff_t pos = pos_from_hilo(pos_h, pos_l);
> struct fd f;
> ssize_t ret = -EBADF;
>
> @@ -969,7 +971,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
> if (f.file) {
> ret = -ESPIPE;
> if (f.file->f_mode & FMODE_PWRITE)
> - ret = vfs_writev(f.file, vec, vlen, &pos, 0);
> + ret = vfs_writev(f.file, vec, vlen, &pos, flags);
> fdput(f);
> }
>
> @@ -979,11 +981,63 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
> return ret;
> }
>
> +SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
> + unsigned long, vlen)
> +{
> + return do_readv(fd, vec, vlen, 0);
> +}
> +
> +SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
> + unsigned long, vlen)
> +{
> + return do_writev(fd, vec, vlen, 0);
> +}
> +
> +SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
> + unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
> +{
> + loff_t pos = pos_from_hilo(pos_h, pos_l);
> +
> + return do_preadv(fd, vec, vlen, pos, 0);
> +}
> +
> +SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
> + unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
> + int, flags)
> +{
> + loff_t pos = pos_from_hilo(pos_h, pos_l);
> +
> + if (pos == -1)
> + return do_readv(fd, vec, vlen, flags);
> +
> + return do_preadv(fd, vec, vlen, pos, flags);
> +}
> +
> +SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
> + unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
> +{
> + loff_t pos = pos_from_hilo(pos_h, pos_l);
> +
> + return do_pwritev(fd, vec, vlen, pos, 0);
> +}
> +
> +SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
> + unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
> + int, flags)
> +{
> + loff_t pos = pos_from_hilo(pos_h, pos_l);
> +
> + if (pos == -1)
> + return do_writev(fd, vec, vlen, flags);
> +
> + return do_pwritev(fd, vec, vlen, pos, flags);
> +}
> +
> #ifdef CONFIG_COMPAT
>
> static ssize_t compat_do_readv_writev(int type, struct file *file,
> const struct compat_iovec __user *uvector,
> - unsigned long nr_segs, loff_t *pos)
> + unsigned long nr_segs, loff_t *pos, int flags)
> {
> compat_ssize_t tot_len;
> struct iovec iovstack[UIO_FASTIOV];
> @@ -1017,7 +1071,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
>
> if (iter_fn)
> ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
> - pos, iter_fn, 0);
> + pos, iter_fn, flags);
> else if (fnv)
> ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
> pos, fnv);
> @@ -1041,7 +1095,7 @@ out:
>
> static size_t compat_readv(struct file *file,
> const struct compat_iovec __user *vec,
> - unsigned long vlen, loff_t *pos)
> + unsigned long vlen, loff_t *pos, int flags)
> {
> ssize_t ret = -EBADF;
>
> @@ -1051,8 +1105,10 @@ static size_t compat_readv(struct file *file,
> ret = -EINVAL;
> if (!(file->f_mode & FMODE_CAN_READ))
> goto out;
> + if (flags & ~0)
> + goto out;
>
> - ret = compat_do_readv_writev(READ, file, vec, vlen, pos);
> + ret = compat_do_readv_writev(READ, file, vec, vlen, pos, flags);
>
> out:
> if (ret > 0)
> @@ -1061,9 +1117,9 @@ out:
> return ret;
> }
>
> -COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
> - const struct compat_iovec __user *,vec,
> - compat_ulong_t, vlen)
> +static size_t __compat_sys_readv(compat_ulong_t fd,
> + const struct compat_iovec __user *vec,
> + compat_ulong_t vlen, int flags)
> {
> struct fd f = fdget_pos(fd);
> ssize_t ret;
> @@ -1072,16 +1128,24 @@ COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
> if (!f.file)
> return -EBADF;
> pos = f.file->f_pos;
> - ret = compat_readv(f.file, vec, vlen, &pos);
> + ret = compat_readv(f.file, vec, vlen, &pos, flags);
> if (ret >= 0)
> f.file->f_pos = pos;
> fdput_pos(f);
> return ret;
> +
> +}
> +
> +COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
> + const struct compat_iovec __user *,vec,
> + compat_ulong_t, vlen)
> +{
> + return __compat_sys_readv(fd, vec, vlen, 0);
> }
>
> static long __compat_sys_preadv64(unsigned long fd,
> const struct compat_iovec __user *vec,
> - unsigned long vlen, loff_t pos)
> + unsigned long vlen, loff_t pos, int flags)
> {
> struct fd f;
> ssize_t ret;
> @@ -1093,7 +1157,7 @@ static long __compat_sys_preadv64(unsigned long fd,
> return -EBADF;
> ret = -ESPIPE;
> if (f.file->f_mode & FMODE_PREAD)
> - ret = compat_readv(f.file, vec, vlen, &pos);
> + ret = compat_readv(f.file, vec, vlen, &pos, flags);
> fdput(f);
> return ret;
> }
> @@ -1103,7 +1167,7 @@ COMPAT_SYSCALL_DEFINE4(preadv64, unsigned long, fd,
> const struct compat_iovec __user *,vec,
> unsigned long, vlen, loff_t, pos)
> {
> - return __compat_sys_preadv64(fd, vec, vlen, pos);
> + return __compat_sys_preadv64(fd, vec, vlen, pos, 0);
> }
> #endif
>
> @@ -1113,12 +1177,25 @@ COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd,
> {
> loff_t pos = ((loff_t)pos_high << 32) | pos_low;
>
> - return __compat_sys_preadv64(fd, vec, vlen, pos);
> + return __compat_sys_preadv64(fd, vec, vlen, pos, 0);
> +}
> +
> +COMPAT_SYSCALL_DEFINE6(preadv2, compat_ulong_t, fd,
> + const struct compat_iovec __user *,vec,
> + compat_ulong_t, vlen, u32, pos_low, u32, pos_high,
> + int, flags)
> +{
> + loff_t pos = ((loff_t)pos_high << 32) | pos_low;
> +
> + if (pos == -1)
> + return __compat_sys_readv(fd, vec, vlen, flags);
> +
> + return __compat_sys_preadv64(fd, vec, vlen, pos, flags);
> }
>
> static size_t compat_writev(struct file *file,
> const struct compat_iovec __user *vec,
> - unsigned long vlen, loff_t *pos)
> + unsigned long vlen, loff_t *pos, int flags)
> {
> ssize_t ret = -EBADF;
>
> @@ -1128,8 +1205,10 @@ static size_t compat_writev(struct file *file,
> ret = -EINVAL;
> if (!(file->f_mode & FMODE_CAN_WRITE))
> goto out;
> + if (flags & ~0)
> + goto out;
>
> - ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos);
> + ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos, flags);
>
> out:
> if (ret > 0)
> @@ -1138,9 +1217,9 @@ out:
> return ret;
> }
>
> -COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
> - const struct compat_iovec __user *, vec,
> - compat_ulong_t, vlen)
> +static size_t __compat_sys_writev(compat_ulong_t fd,
> + const struct compat_iovec __user* vec,
> + compat_ulong_t vlen, int flags)
> {
> struct fd f = fdget_pos(fd);
> ssize_t ret;
> @@ -1149,28 +1228,36 @@ COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
> if (!f.file)
> return -EBADF;
> pos = f.file->f_pos;
> - ret = compat_writev(f.file, vec, vlen, &pos);
> + ret = compat_writev(f.file, vec, vlen, &pos, flags);
> if (ret >= 0)
> f.file->f_pos = pos;
> fdput_pos(f);
> return ret;
> }
>
> +COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
> + const struct compat_iovec __user *, vec,
> + compat_ulong_t, vlen)
> +{
> + return __compat_sys_writev(fd, vec, vlen, 0);
> +}
> +
> static long __compat_sys_pwritev64(unsigned long fd,
> const struct compat_iovec __user *vec,
> - unsigned long vlen, loff_t pos)
> + unsigned long vlen, loff_t pos, int flags)
> {
> struct fd f;
> ssize_t ret;
>
> if (pos < 0)
> return -EINVAL;
> +
> f = fdget(fd);
> if (!f.file)
> return -EBADF;
> ret = -ESPIPE;
> if (f.file->f_mode & FMODE_PWRITE)
> - ret = compat_writev(f.file, vec, vlen, &pos);
> + ret = compat_writev(f.file, vec, vlen, &pos, flags);
> fdput(f);
> return ret;
> }
> @@ -1180,7 +1267,7 @@ COMPAT_SYSCALL_DEFINE4(pwritev64, unsigned long, fd,
> const struct compat_iovec __user *,vec,
> unsigned long, vlen, loff_t, pos)
> {
> - return __compat_sys_pwritev64(fd, vec, vlen, pos);
> + return __compat_sys_pwritev64(fd, vec, vlen, pos, 0);
> }
> #endif
>
> @@ -1190,8 +1277,21 @@ COMPAT_SYSCALL_DEFINE5(pwritev, compat_ulong_t, fd,
> {
> loff_t pos = ((loff_t)pos_high << 32) | pos_low;
>
> - return __compat_sys_pwritev64(fd, vec, vlen, pos);
> + return __compat_sys_pwritev64(fd, vec, vlen, pos, 0);
> +}
> +
> +COMPAT_SYSCALL_DEFINE6(pwritev2, compat_ulong_t, fd,
> + const struct compat_iovec __user *,vec,
> + compat_ulong_t, vlen, u32, pos_low, u32, pos_high, int, flags)
> +{
> + loff_t pos = ((loff_t)pos_high << 32) | pos_low;
> +
> + if (pos == -1)
> + return __compat_sys_writev(fd, vec, vlen, flags);
> +
> + return __compat_sys_pwritev64(fd, vec, vlen, pos, flags);
> }
> +
> #endif
>
> static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
> diff --git a/include/linux/compat.h b/include/linux/compat.h
> index e649426..63a94e2 100644
> --- a/include/linux/compat.h
> +++ b/include/linux/compat.h
> @@ -340,6 +340,12 @@ asmlinkage ssize_t compat_sys_preadv(compat_ulong_t fd,
> asmlinkage ssize_t compat_sys_pwritev(compat_ulong_t fd,
> const struct compat_iovec __user *vec,
> compat_ulong_t vlen, u32 pos_low, u32 pos_high);
> +asmlinkage ssize_t compat_sys_preadv2(compat_ulong_t fd,
> + const struct compat_iovec __user *vec,
> + compat_ulong_t vlen, u32 pos_low, u32 pos_high, int flags);
> +asmlinkage ssize_t compat_sys_pwritev2(compat_ulong_t fd,
> + const struct compat_iovec __user *vec,
> + compat_ulong_t vlen, u32 pos_low, u32 pos_high, int flags);
>
> #ifdef __ARCH_WANT_COMPAT_SYS_PREADV64
> asmlinkage long compat_sys_preadv64(unsigned long fd,
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index bda9b81..cedc22e 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -571,8 +571,14 @@ asmlinkage long sys_pwrite64(unsigned int fd, const char __user *buf,
> size_t count, loff_t pos);
> asmlinkage long sys_preadv(unsigned long fd, const struct iovec __user *vec,
> unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
> +asmlinkage long sys_preadv2(unsigned long fd, const struct iovec __user *vec,
> + unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
> + int flags);
> asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec,
> unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
> +asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec,
> + unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
> + int flags);
> asmlinkage long sys_getcwd(char __user *buf, unsigned long size);
> asmlinkage long sys_mkdir(const char __user *pathname, umode_t mode);
> asmlinkage long sys_chdir(const char __user *filename);
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 22749c1..9406018 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -213,6 +213,10 @@ __SC_COMP(__NR_pwrite64, sys_pwrite64, compat_sys_pwrite64)
> __SC_COMP(__NR_preadv, sys_preadv, compat_sys_preadv)
> #define __NR_pwritev 70
> __SC_COMP(__NR_pwritev, sys_pwritev, compat_sys_pwritev)
> +#define __NR_preadv2 281
> +__SC_COMP(__NR_preadv2, sys_preadv2, compat_sys_preadv2)
> +#define __NR_pwritev2 282
> +__SC_COMP(__NR_pwritev2, sys_pwritev2, compat_sys_pwritev2)
>
> /* fs/sendfile.c */
> #define __NR3264_sendfile 71
> @@ -709,7 +713,7 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
> __SYSCALL(__NR_bpf, sys_bpf)
>
> #undef __NR_syscalls
> -#define __NR_syscalls 281
> +#define __NR_syscalls 283
>
> /*
> * All syscalls below here should go away really,
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 14b4642..530c263 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1457,6 +1457,7 @@ static void shrink_readahead_size_eio(struct file *filp,
> * @ppos: current file position
> * @iter: data destination
> * @written: already copied
> + * @flags: optional flags
> *
> * This is a generic file read routine, and uses the
> * mapping->a_ops->readpage() function for the actual low-level stuff.
> @@ -1465,7 +1466,7 @@ static void shrink_readahead_size_eio(struct file *filp,
> * of the logic when it comes to error handling etc.
> */
> static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
> - struct iov_iter *iter, ssize_t written)
> + struct iov_iter *iter, ssize_t written, int flags)
> {
> struct address_space *mapping = filp->f_mapping;
> struct inode *inode = mapping->host;
> @@ -1735,7 +1736,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> }
> }
>
> - retval = do_generic_file_read(file, ppos, iter, retval);
> + retval = do_generic_file_read(file, ppos, iter, retval, iocb->ki_rwflags);
> out:
> return retval;
> }

2014-11-11 21:40:23

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

On Tue, Nov 11, 2014 at 11:02:00AM -0500, Milosz Tanski wrote:
> On Tue, Nov 11, 2014 at 1:44 AM, Dave Chinner <[email protected]> wrote:
> > On Mon, Nov 10, 2014 at 11:40:23AM -0500, Milosz Tanski wrote:
> >> This patcheset introduces an ability to perform a non-blocking read from
> >> regular files in buffered IO mode. This works by only for those filesystems
> >> that have data in the page cache.
> >>
> >> It does this by introducing new syscalls new syscalls preadv2/pwritev2. These
> >> new syscalls behave like the network sendmsg, recvmsg syscalls that accept an
> >> extra flag argument (RWF_NONBLOCK).
> >>
> >> It's a very common patern today (samba, libuv, etc..) use a large threadpool to
> >> perform buffered IO operations. They submit the work form another thread
> >> that performs network IO and epoll or other threads that perform CPU work. This
> >> leads to increased latency for processing, esp. in the case of data that's
> >> already cached in the page cache.
> >>
> >> With the new interface the applications will now be able to fetch the data in
> >> their network / cpu bound thread(s) and only defer to a threadpool if it's not
> >> there. In our own application (VLDB) we've observed a decrease in latency for
> >> "fast" request by avoiding unnecessary queuing and having to swap out current
> >> tasks in IO bound work threads.
> >
> > Can you write a test (or set of) for fstests that exercises this new
> > functionality? I'm not worried about performance, just
> > correctness....
>
> Sure thing. Can you point me at the fstests repo? A quick google
> search reveals lots of projects named fstests, most of them abandoned.

git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-11-11 21:43:12

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

On Tue, Nov 11, 2014 at 12:03:14PM -0500, Jeff Moyer wrote:
> Milosz Tanski <[email protected]> writes:
>
> >> Can you write a test (or set of) for fstests that exercises this new
> >> functionality? I'm not worried about performance, just
> >> correctness....
> >
> > Sure thing. Can you point me at the fstests repo? A quick google
> > search reveals lots of projects named fstests, most of them abandoned.
>
> I think he's referring to xfstests. Still, I think that's the wrong
> place for functional testing. ltp would be better, imo.

I don't follow. Can you explain why xfstests is the wrong place
to exercise this functionality and what makes ltp a better choice?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-11-11 22:50:30

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

On Tue, Nov 11, 2014 at 12:03:14PM -0500, Jeff Moyer wrote:
>
> I think he's referring to xfstests. Still, I think that's the wrong
> place for functional testing. ltp would be better, imo.

I disagree; xfstests is the right place for adding these tests, because
these patches seem to require fs-specific support, and file system
developers are already using xfstests when checking for regressions.
Using xfstests is already part of most of the file system developers'
workflow; ltp is not.

So if you want to make sure we notice regressions, it really needs to
go into xfstests. If you insist on putting it in ltp, then one of us
will then have to make a copy of the tests and put it in xfstests.

Cheers,

- Ted

2014-11-11 23:22:10

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

Dave Chinner <[email protected]> writes:

> On Tue, Nov 11, 2014 at 12:03:14PM -0500, Jeff Moyer wrote:
>> Milosz Tanski <[email protected]> writes:
>>
>> >> Can you write a test (or set of) for fstests that exercises this new
>> >> functionality? I'm not worried about performance, just
>> >> correctness....
>> >
>> > Sure thing. Can you point me at the fstests repo? A quick google
>> > search reveals lots of projects named fstests, most of them abandoned.
>>
>> I think he's referring to xfstests. Still, I think that's the wrong
>> place for functional testing. ltp would be better, imo.
>
> I don't follow. Can you explain why is xfstests be the wrong place
> to exercise this functionality and what makes ltp a better choice?

Right, I should have made a case for that. ltp already has test cases
for system calls such as readv/writev (though they are woefully
inadequate). It simply looked like a better fit to me. For some reason
I view xfstests as a regression test suite, but I know that isn't
strictly true.

If you feel xfstests is a better place, and Ted makes a good case for
that choice, then that's fine with me. I'm not, as Ted worried,
insisting on putting test cases into ltp. :) I was expressing my
opinion, and am happy for the dialog.

Cheers,
Jeff

2014-11-11 23:28:42

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

On Tue, 11 Nov 2014, Theodore Ts'o wrote:
> On Tue, Nov 11, 2014 at 12:03:14PM -0500, Jeff Moyer wrote:
> >
> > I think he's referring to xfstests. Still, I think that's the wrong
> > place for functional testing. ltp would be better, imo.
>
> I disagree; xfstests is the right place for adding these tests, because
> these patches seem to require fs-specific support, and file system
> developers are already using xfstests when checking for regressions.
> Using xfstests is already part of most of the file system developers'
> workflow; ltp is not.
>
> So if you want to make sure we notice regressions, it really needs to
> go into xfstests. If you insist on putting it in ltp, then one of us
> will then have to make a copy of the tests and put it in xfstests.

No. LTP needs to pull in the latest changes from xfstests simply
because LTP is a conglomerate of domain specific tests. And it never
can become more than a conglomerate for obvious reasons.

xfstests is what the FS developers use and update. If LTP has some
extra magic tests developed then it should send patches against
xfstests and reintegrate the result.

Thanks,

tglx

2014-11-14 16:33:50

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

Dave Chinner <[email protected]> writes:

> On Mon, Nov 10, 2014 at 11:40:23AM -0500, Milosz Tanski wrote:
>> This patcheset introduces an ability to perform a non-blocking read from
>> regular files in buffered IO mode. This works by only for those filesystems
>> that have data in the page cache.
>>
>> It does this by introducing new syscalls new syscalls preadv2/pwritev2. These
>> new syscalls behave like the network sendmsg, recvmsg syscalls that accept an
>> extra flag argument (RWF_NONBLOCK).
>>
>> It's a very common patern today (samba, libuv, etc..) use a large threadpool to
>> perform buffered IO operations. They submit the work form another thread
>> that performs network IO and epoll or other threads that perform CPU work. This
>> leads to increased latency for processing, esp. in the case of data that's
>> already cached in the page cache.
>>
>> With the new interface the applications will now be able to fetch the data in
>> their network / cpu bound thread(s) and only defer to a threadpool if it's not
>> there. In our own application (VLDB) we've observed a decrease in latency for
>> "fast" request by avoiding unnecessary queuing and having to swap out current
>> tasks in IO bound work threads.
>
> Can you write a test (or set of) for fstests that exercises this new
> functionality? I'm not worried about performance, just
> correctness....

On the subject of testing, I added support to trinity (attached,
untested). That did raise one question. Do we expect applications to
#include <linux/fs.h> to get the RWF_NONBLOCK definition?

Cheers,
Jeff

diff --git a/include/syscalls-i386.h b/include/syscalls-i386.h
index 767be6e..3125064 100644
--- a/include/syscalls-i386.h
+++ b/include/syscalls-i386.h
@@ -365,4 +365,6 @@ struct syscalltable syscalls_i386[] = {
{ .entry = &syscall_getrandom },
{ .entry = &syscall_memfd_create },
{ .entry = &syscall_bpf },
+ { .entry = &syscall_preadv2 },
+ { .entry = &syscall_pwritev2 },
};
diff --git a/include/syscalls-x86_64.h b/include/syscalls-x86_64.h
index cb609ad..8d32571 100644
--- a/include/syscalls-x86_64.h
+++ b/include/syscalls-x86_64.h
@@ -329,4 +329,6 @@ struct syscalltable syscalls_x86_64[] = {
{ .entry = &syscall_memfd_create },
{ .entry = &syscall_kexec_file_load },
{ .entry = &syscall_bpf },
+ { .entry = &syscall_preadv2 },
+ { .entry = &syscall_pwritev2 },
};
diff --git a/syscalls/read.c b/syscalls/read.c
index e0948a2..adbf146 100644
--- a/syscalls/read.c
+++ b/syscalls/read.c
@@ -3,6 +3,7 @@
*/
#include <stdlib.h>
#include <string.h>
+#include <linux/fs.h>
#include "arch.h"
#include "maps.h"
#include "random.h"
@@ -94,3 +95,29 @@ struct syscallentry syscall_preadv = {
.arg5name = "pos_h",
.flags = NEED_ALARM,
};
+
+/*
+ * SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+ int, flags)
+ */
+
+struct syscallentry syscall_preadv2 = {
+ .name = "preadv2",
+ .num_args = 6,
+ .arg1name = "fd",
+ .arg1type = ARG_FD,
+ .arg2name = "vec",
+ .arg2type = ARG_IOVEC,
+ .arg3name = "vlen",
+ .arg3type = ARG_IOVECLEN,
+ .arg4name = "pos_l",
+ .arg5name = "pos_h",
+ .arg6name = "flags",
+ .arg6type = ARG_OP,
+ .arg6list = {
+ .num = 1,
+ .values = { RWF_NONBLOCK, },
+ },
+ .flags = NEED_ALARM,
+};
diff --git a/syscalls/syscalls.h b/syscalls/syscalls.h
index 5a7748b..04400dd 100644
--- a/syscalls/syscalls.h
+++ b/syscalls/syscalls.h
@@ -375,5 +375,7 @@ extern struct syscallentry syscall_seccomp;
extern struct syscallentry syscall_memfd_create;
extern struct syscallentry syscall_kexec_file_load;
extern struct syscallentry syscall_bpf;
+extern struct syscallentry syscall_preadv2;
+extern struct syscallentry syscall_pwritev2;

unsigned int random_fcntl_setfl_flags(void);
diff --git a/syscalls/write.c b/syscalls/write.c
index f37e760..4218ccc 100644
--- a/syscalls/write.c
+++ b/syscalls/write.c
@@ -2,6 +2,7 @@
* SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, size_t, count)
*/
#include <stdlib.h>
+#include <linux/fs.h>
#include "arch.h" // page_size
#include "maps.h"
#include "random.h"
@@ -95,3 +96,30 @@ struct syscallentry syscall_pwritev = {
.arg5name = "pos_h",
.flags = NEED_ALARM,
};
+
+
+/*
+ * SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+ int, flags)
+ */
+
+struct syscallentry syscall_pwritev2 = {
+ .name = "pwritev2",
+ .num_args = 6,
+ .arg1name = "fd",
+ .arg1type = ARG_FD,
+ .arg2name = "vec",
+ .arg2type = ARG_IOVEC,
+ .arg3name = "vlen",
+ .arg3type = ARG_IOVECLEN,
+ .arg4name = "pos_l",
+ .arg5name = "pos_h",
+ .arg6name = "flags",
+ .arg6type = ARG_OP,
+ .arg6list = {
+ .num = 1,
+ .values = { RWF_NONBLOCK, },
+ },
+ .flags = NEED_ALARM,
+};

2014-11-14 16:40:06

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

On Fri, Nov 14, 2014 at 11:32:53AM -0500, Jeff Moyer wrote:

> > Can you write a test (or set of) for fstests that exercises this new
> > functionality? I'm not worried about performance, just
> > correctness....
>
> On the subject of testing, I added support to trinity (attached,
> untested). That did raise one question. Do we expect applications to
> #include <linux/fs.h> to get the RWF_NONBLOCK definition?

Trinity will at least need an addition to include/compat.h for
older headers that won't have the definition. Looks ok otherwise.

Also, I usually sit on stuff like this until the syscall numbers are
in Linus tree. This is 3.19 stuff I presume ?
istr akpm picked up execveat recently, so if that goes in first, we'll
need to respin this anyway..

Dave

2014-11-14 16:52:20

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

Dave Jones <[email protected]> writes:

> On Fri, Nov 14, 2014 at 11:32:53AM -0500, Jeff Moyer wrote:
>
> > > Can you write a test (or set of) for fstests that exercises this new
> > > functionality? I'm not worried about performance, just
> > > correctness....
> >
> > On the subject of testing, I added support to trinity (attached,
> > untested). That did raise one question. Do we expect applications to
> > #include <linux/fs.h> to get the RWF_NONBLOCK definition?
>
> Trinity will at least need an addition to include/compat.h for
> older headers that won't have the definition. Looks ok otherwise.

OK, I'll add that.

> Also, I usually sit on stuff like this until the syscall numbers are
> in Linus tree. This is 3.19 stuff I presume ?
> istr akpm picked up execveat recently, so if that goes in first, we'll
> need to respin this anyway..

Sure. I just wanted to test with trinity *before* it makes it into the
kernel. Crazy, I know. ;-)

Cheers,
Jeff

2014-11-14 18:45:09

by Milosz Tanski

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

On Fri, Nov 14, 2014 at 11:39 AM, Dave Jones <[email protected]> wrote:
> On Fri, Nov 14, 2014 at 11:32:53AM -0500, Jeff Moyer wrote:
>
> > > Can you write a test (or set of) for fstests that exercises this new
> > > functionality? I'm not worried about performance, just
> > > correctness....
> >
> > On the subject of testing, I added support to trinity (attached,
> > untested). That did raise one question. Do we expect applications to
> > #include <linux/fs.h> to get the RWF_NONBLOCK definition?
>
> Trinity will at least need an addition to include/compat.h for
> older headers that won't have the definition. Looks ok otherwise.
>
> Also, I usually sit on stuff like this until the syscall numbers are
> in Linus tree. This is 3.19 stuff I presume ?
> istr akpm picked up execveat recently, so if that goes in first, we'll
> need to respin this anyway..

Yes, I am hoping to get it into 3.19. It's a large pain having to deal
with other changes to the syscall code.

On an unrelated note, I just got back to figuring out how to add this to
xfstests. I got busy with other things the last few days. I'm still
not quite sure how to write a test using the framework; the
documentation (README) seems very XFS specific, and otherwise the tests
seem to be split between many different files / directories / C
code / shell code. It might be me being slow... but it's just not
obvious to me how to glue the whole thing together.

--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: [email protected]

2014-11-14 18:46:42

by Milosz Tanski

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

On Fri, Nov 14, 2014 at 11:51 AM, Jeff Moyer <[email protected]> wrote:
> Dave Jones <[email protected]> writes:
>
>> On Fri, Nov 14, 2014 at 11:32:53AM -0500, Jeff Moyer wrote:
>>
>> > > Can you write a test (or set of) for fstests that exercises this new
>> > > functionality? I'm not worried about performance, just
>> > > correctness....
>> >
>> > On the subject of testing, I added support to trinity (attached,
>> > untested). That did raise one question. Do we expect applications to
>> > #include <linux/fs.h> to get the RWF_NONBLOCK definition?
>>
>> Trinity will at least need an addition to include/compat.h for
>> older headers that won't have the definition. Looks ok otherwise.
>
> OK, I'll add that.
>
>> Also, I usually sit on stuff like this until the syscall numbers are
>> in Linus tree. This is 3.19 stuff I presume ?
>> istr akpm picked up execveat recently, so if that goes in first, we'll
>> need to respin this anyway..
>
> Sure. I just wanted to test with trinity *before* it makes it into the
> kernel. Crazy, I know. ;-)

I am happy to help out to make sure it's solid... although deep down
inside I secretly wish that now wasn't the time we decided to start
doing it :)

>
> Cheers,
> Jeff



--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: [email protected]

2014-11-14 18:53:32

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

Milosz Tanski <[email protected]> writes:

> On a unrelated note I just back to figuring out how to add this to
> xfstests. I got busy with other things the last few days. I'm still
> not quite sure how to write a test using the framework, the
> documentation (README) seams very XFS specific and otherwise the test
> seam to be be split between many different files / directories / C
> code / shell code. I might be me being slow... but it's just not
> obvious for me how to glue the whole thing together.

I can help you get started with this. I'll send email off-list.

Cheers,
Jeff

2014-11-24 09:53:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)


Al or Andrew,

can someone pick this up? The end of the 3.19 merge window is near,
and I'd really love to get this series in.

2014-11-25 23:00:45

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

On Mon, 10 Nov 2014 11:40:23 -0500 Milosz Tanski <[email protected]> wrote:

> This patcheset introduces an ability to perform a non-blocking read from
> regular files in buffered IO mode. This works by only for those filesystems
> that have data in the page cache.
>
> It does this by introducing new syscalls new syscalls preadv2/pwritev2. These
> new syscalls behave like the network sendmsg, recvmsg syscalls that accept an
> extra flag argument (RWF_NONBLOCK).
>
> It's a very common patern today (samba, libuv, etc..) use a large threadpool to
> perform buffered IO operations. They submit the work form another thread
> that performs network IO and epoll or other threads that perform CPU work. This
> leads to increased latency for processing, esp. in the case of data that's
> already cached in the page cache.

It would be extremely useful if we could get input from the developers
of "samba, libuv, etc.." about this. Do they think it will be useful,
will they actually use it, can they identify any shortcomings, etc.

Because it would be terrible if we were to merge this then discover
that major applications either don't use it, or require
userspace-visible changes.

Ideally, someone would whip up preadv2() support into those apps and
report on the result.

> With the new interface the applications will now be able to fetch the data in
> their network / cpu bound thread(s) and only defer to a threadpool if it's not
> there. In our own application (VLDB) we've observed a decrease in latency for
> "fast" request by avoiding unnecessary queuing and having to swap out current
> tasks in IO bound work threads.

I haven't read the patches yet, but I'm scratching my head over
pwritev2(). There's much talk and testing results here about
preadv2(), but nothing about how pwritev2() works, what its semantics
are, testing results, etc.

> Version 6 highlight:
> - Compat syscall flag checks, per. Jeff.
> - Minor stylistic suggestions.
>
> Version 5 highlight:
> - XFS support for RWF_NONBLOCK. from Christoph.
> - RWF_DSYNC flag and support for pwritev2, from Christoph.
> - Implemented compat syscalls, per. Jeff.
> - Missing nfs, ceph changes from older patchset.
>
> Version 4 highlight:
> - Updated for 3.18-rc1.
> - Performance data from our application.
> - First stab at man page with Jeff's help. Patch is in-reply to.

I can't find that manpage. It is important. Please include it in the
patch series.

I'm particularly interested in details regarding

- behaviour and userspace return values when data is not found in pagecache

- how it handles partially uptodate pages (blocksize < pagesize).
For both reads and writes. This sort of thing gets intricate so
let's spell the design out with great specificity.

- behaviour at EOF.

- details regarding handling of file holes.

> RFC Version 3 highlights:
> - Down to 2 syscalls from 4; can user fp or argument position.
> - RWF_NONBLOCK value flag is not the same O_NONBLOCK, per Jeff.
>
> RFC Version 2 highlights:
> - Put the flags argument into kiocb (less noise), per. Al Viro
> - O_DIRECT checking early in the process, per. Jeff Moyer
> - Resolved duplicate (c&p) code in syscall code, per. Jeff
> - Included perf data in thread cover letter, per. Jeff
> - Created a new flag (not O_NONBLOCK) for readv2, perf Jeff
>
>
> Some perf data generated using fio comparing the posix aio engine to a version
> of the posix AIO engine that attempts to performs "fast" reads before
> submitting the operations to the queue. This workflow is on ext4 partition on
> raid0 (test / build-rig.) Simulating our database access patern workload using
> 16kb read accesses. Our database uses a home-spun posix aio like queue (samba
> does the same thing.)
>
> f1: ~73% rand read over mostly cached data (zipf med-size dataset)
> f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
> f3: ~9% seq-read over large dataset
>
> before:
>
> f1:
> bw (KB /s): min= 11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
> lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
> lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
> f2:
> bw (KB /s): min= 2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
> lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
> lat (msec) : >=2000=4.33%
> f3:
> bw (KB /s): min= 0, max=265568, per=99.95%, avg=174575.10,
> stdev=34526.89
> lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
> lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
> lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
> lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
> total:
> READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
> mint=600001msec, maxt=600113msec
>
> after (with fast read using preadv2 before submit):
>
> f1:
> bw (KB /s): min= 3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
> lat (usec) : 2=70.63%, 4=0.01%
> lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
> f2:
> bw (KB /s): min= 2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
> lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
> lat (msec) : >=2000=9.99%
> f3:
> bw (KB /s): min= 1, max=245448, per=100.00%, avg=177366.50,
> stdev=35995.60
> lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
> lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
> lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
> lat (msec) : 100=0.05%, 250=0.02%
> total:
> READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
> mint=600020msec, maxt=600178msec
>
> Interpreting the results you can see total bandwidth stays the same but overall
> request latency is decreased in f1 (random, mostly cached) and f3 (sequential)
> workloads. There is a slight bump in latency for since it's random data that's

s/for/for f2/

> unlikely to be cached but we're always trying "fast read".
>
> In our application we have starting keeping track of "fast read" hits/misses
> and for files / requests that have a lot hit ratio we don't do "fast reads"
> mostly getting rid of extra latency in the uncached cases. In our real world
> work load we were able to reduce average response time by 20 to 30% (depends
> on amount of IO done by request).
>
> I've performed other benchmarks and I have no observed any perf regressions in
> any of the normal (old) code paths.
>
> I have co-developed these changes with Christoph Hellwig.
>

There have been several incomplete attempts to implement fincore(). If
we were to complete those attempts, preadv2() could be implemented
using fincore()+pread(). Plus we get fincore(), which is useful for
other (but probably similar) reasons. Probably fincore()+pwrite() could
be used to implement pwritev2(), but I don't know what pwritev2() does
yet.

Implementing fincore() is more flexible, requires less code and is less
likely to have bugs. So why not go that way? Yes, it's more CPU
intensive, but how much? Is the difference sufficient to justify the
preadv2()/pwritev2() approach?

2014-12-02 22:17:50

by Milosz Tanski

Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

On Tue, Nov 25, 2014 at 6:01 PM, Andrew Morton
<[email protected]> wrote:
>
> On Mon, 10 Nov 2014 11:40:23 -0500 Milosz Tanski <[email protected]> wrote:
>
> > This patcheset introduces an ability to perform a non-blocking read from
> > regular files in buffered IO mode. This works by only for those filesystems
> > that have data in the page cache.
> >
> > It does this by introducing new syscalls new syscalls preadv2/pwritev2. These
> > new syscalls behave like the network sendmsg, recvmsg syscalls that accept an
> > extra flag argument (RWF_NONBLOCK).
> >
> > It's a very common patern today (samba, libuv, etc..) use a large threadpool to
> > perform buffered IO operations. They submit the work form another thread
> > that performs network IO and epoll or other threads that perform CPU work. This
> > leads to increased latency for processing, esp. in the case of data that's
> > already cached in the page cache.
>
> It would be extremely useful if we could get input from the developers
> of "samba, libuv, etc.." about this. Do they think it will be useful,
> will they actually use it, can they identify any shortcomings, etc.
>
> Because it would be terrible if we were to merge this then discover
> that major applications either don't use it, or require
> userspace-visible changes.
>
> Ideally, someone would whip up pread2() support into those apps and
> report on the result.

The Samba folks did express an interest in this functionality when I
originally brought up the idea of a non-blocking, page-cache-only read
while I was still getting my mind around the concept. This was
unsolicited on my part: https://lkml.org/lkml/2014/9/7/103. It should
be good enough at this point to enable a "fast path" read without
deferring to their AIO pool.

>
> > With the new interface the applications will now be able to fetch the data in
> > their network / cpu bound thread(s) and only defer to a threadpool if it's not
> > there. In our own application (VLDB) we've observed a decrease in latency for
> > "fast" request by avoiding unnecessary queuing and having to swap out current
> > tasks in IO bound work threads.
>
> I haven't read the patches yet, but I'm scratching my head over
> pwritev2(). There's much talk and testing results here about
> preadv2(), but nothing about how pwritev() works, what its semantics
> are, testing results, etc.

Essentially preadv2 and pwritev2 are the same syscalls as preadv/pwritev
but with support for an extra flags argument. For preadv2 the only flag
implemented right now is RWF_NONBLOCK, which allows performing a page
cache only read on a per-call basis. Christoph implemented the
RWF_DSYNC flag for pwritev2, which has the same effect as O_DSYNC but
on a per-write-call basis.

Christoph has included an example of the usage of pwritev2 with
RWF_DSYNC in the commit message of patch #7. I am currently working on
wiring up test cases as part of xfstests for both preadv2/pwritev2
functionality.
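As a concrete illustration of those semantics, here is a small userspace sketch using the interface as it eventually landed in mainline (Python exposes it as os.pwritev()/os.preadv() with a flags argument; note the merged read flag ended up named RWF_NOWAIT rather than this patchset's RWF_NONBLOCK). The helper names are mine, not part of the patchset:

```python
import os
import tempfile

def durable_write(fd, data, offset):
    # RWF_DSYNC: O_DSYNC semantics for this one write call only (pwritev2).
    return os.pwritev(fd, [data], offset, getattr(os, "RWF_DSYNC", 0))

def nonblocking_read(fd, nbytes, offset):
    # Page-cache-only read (preadv2): returns cached bytes, or None when
    # the kernel would have to go to disk.
    buf = bytearray(nbytes)
    try:
        n = os.preadv(fd, [buf], offset, getattr(os, "RWF_NOWAIT", 0))
    except BlockingIOError:       # EAGAIN: data not resident at this offset
        return None
    return bytes(buf[:n])

fd, path = tempfile.mkstemp()
try:
    assert durable_write(fd, b"hello", 0) == 5
    # Freshly written data sits in the page cache, so this normally
    # succeeds without blocking; a cold read would return None instead.
    assert nonblocking_read(fd, 5, 0) in (b"hello", None)
finally:
    os.close(fd)
    os.unlink(path)
```

(Requires Linux 4.14+ and Python 3.7+; on older setups the getattr() fallbacks degrade to plain preadv/pwritev.)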

>
> > Version 6 highlight:
> > - Compat syscall flag checks, per. Jeff.
> > - Minor stylistic suggestions.
> >
> > Version 5 highlight:
> > - XFS support for RWF_NONBLOCK. from Christoph.
> > - RWF_DSYNC flag and support for pwritev2, from Christoph.
> > - Implemented compat syscalls, per. Jeff.
> > - Missing nfs, ceph changes from older patchset.
> >
> > Version 4 highlight:
> > - Updated for 3.18-rc1.
> > - Performance data from our application.
> > - First stab at man page with Jeff's help. Patch is in-reply to.
>
> I can't find that manpage. It is important. Please include it in the
> patch series.
>
> I'm particularly interested in details regarding
>
> - behaviour and userspace return values when data is not found in pagecache
>
> - how it handles partially uptodate pages (blocksize < pagesize).
> For both reads and writes. This sort of thing gets intricate so
> let's spell the design out with great specificity.
>
> - behaviour at EOF.
>
> - details regarding handling of file holes.

I have replied with the man page update patches to the last two
submissions. Here's an archive link:
https://lkml.org/lkml/2014/11/6/447. I'll re-reply it to the parent
thread of the latest submission as well. The man page updates cover
preadv2/pwritev2 and their new flags RWF_NONBLOCK/RWF_DSYNC
respectively.

- Behavior when data is not in the page cache is documented in the man
page (EAGAIN).
- Since we defer to normal preadv (and thus read) behavior, things like
end of file (0 length return value), partially up-to-date pages, and
hole handling are the same as in those calls.

Further, the logic for RWF_NONBLOCK is mostly contained in
do_generic_file_read() in filemap.c. What it does is bail early
whenever we would have to call aops->readpage(): it returns a full or
partial read if there's data in the page cache, and EAGAIN if there's
nothing in the page cache starting at the offset. That also makes it
clear why it behaves like regular preadv at end of file, over holes,
etc.
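A toy userspace model (the names and data structures here are mine, not the kernel's code) may help pin those semantics down: copy from resident pages, bail at the first page that would require aops->readpage(), and treat EOF exactly like a normal read:

```python
PAGE = 4096

class WouldBlock(Exception):
    """Stands in for the -EAGAIN the syscall returns to userspace."""

def page_cache_only_read(cached_pages, file_size, offset, count):
    # cached_pages: {page_index: PAGE-sized bytes} -- a toy page cache.
    if offset >= file_size:
        return b""                       # EOF: same 0-byte result as preadv
    out = bytearray()
    pos, end = offset, min(offset + count, file_size)
    while pos < end:
        page = cached_pages.get(pos // PAGE)
        if page is None:                 # would need aops->readpage(): bail early
            if not out:
                raise WouldBlock()       # nothing cached at offset -> EAGAIN
            return bytes(out)            # partial read up to the uncached page
        start = pos % PAGE
        take = min(PAGE - start, end - pos)
        out += page[start:start + take]
        pos += take
    return bytes(out)

# Pages 0 and 1 cached, page 2 missing: a 3-page read comes back partial.
pages = {0: b"a" * PAGE, 1: b"b" * PAGE}
assert len(page_cache_only_read(pages, 3 * PAGE, 0, 3 * PAGE)) == 2 * PAGE
```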

>
> > RFC Version 3 highlights:
> > - Down to 2 syscalls from 4; can user fp or argument position.
> > - RWF_NONBLOCK value flag is not the same O_NONBLOCK, per Jeff.
> >
> > RFC Version 2 highlights:
> > - Put the flags argument into kiocb (less noise), per. Al Viro
> > - O_DIRECT checking early in the process, per. Jeff Moyer
> > - Resolved duplicate (c&p) code in syscall code, per. Jeff
> > - Included perf data in thread cover letter, per. Jeff
> > - Created a new flag (not O_NONBLOCK) for readv2, perf Jeff
> >
> >
> > Some perf data generated using fio comparing the posix aio engine to a version
> > of the posix AIO engine that attempts to performs "fast" reads before
> > submitting the operations to the queue. This workflow is on ext4 partition on
> > raid0 (test / build-rig.) Simulating our database access patern workload using
> > 16kb read accesses. Our database uses a home-spun posix aio like queue (samba
> > does the same thing.)
> >
> > f1: ~73% rand read over mostly cached data (zipf med-size dataset)
> > f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
> > f3: ~9% seq-read over large dataset
> >
> > before:
> >
> > f1:
> > bw (KB /s): min= 11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
> > lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
> > lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
> > f2:
> > bw (KB /s): min= 2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
> > lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
> > lat (msec) : >=2000=4.33%
> > f3:
> > bw (KB /s): min= 0, max=265568, per=99.95%, avg=174575.10,
> > stdev=34526.89
> > lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
> > lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
> > lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
> > lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
> > total:
> > READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
> > mint=600001msec, maxt=600113msec
> >
> > after (with fast read using preadv2 before submit):
> >
> > f1:
> > bw (KB /s): min= 3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
> > lat (usec) : 2=70.63%, 4=0.01%
> > lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
> > f2:
> > bw (KB /s): min= 2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
> > lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
> > lat (msec) : >=2000=9.99%
> > f3:
> > bw (KB /s): min= 1, max=245448, per=100.00%, avg=177366.50,
> > stdev=35995.60
> > lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
> > lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
> > lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
> > lat (msec) : 100=0.05%, 250=0.02%
> > total:
> > READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
> > mint=600020msec, maxt=600178msec
> >
> > Interpreting the results you can see total bandwidth stays the same but overall
> > request latency is decreased in f1 (random, mostly cached) and f3 (sequential)
> > workloads. There is a slight bump in latency for since it's random data that's
>
> s/for/for f2/
>
> > unlikely to be cached but we're always trying "fast read".
> >
> > In our application we have starting keeping track of "fast read" hits/misses
> > and for files / requests that have a lot hit ratio we don't do "fast reads"
> > mostly getting rid of extra latency in the uncached cases. In our real world
> > work load we were able to reduce average response time by 20 to 30% (depends
> > on amount of IO done by request).
> >
> > I've performed other benchmarks and I have no observed any perf regressions in
> > any of the normal (old) code paths.
> >
> > I have co-developed these changes with Christoph Hellwig.
> >
>
> There have been several incomplete attempts to implement fincore(). If
> we were to complete those attempts, preadv2() could be implemented
> using fincore()+pread(). Plus we get fincore(), which is useful for
> other (but probably similar) reasons. Probably fincore()+pwrite() could
> be used to implement pwritev2(), but I don't know what pwritev2() does
> yet.
>
> Implementing fincore() is more flexible, requires less code and is less
> likely to have bugs. So why not go that way? Yes, it's more CPU
> intensive, but how much? Is the difference sufficient to justify the
> preadv2()/pwritev2() approach?

I would like to see fincore() functionality (for other reasons), but I
don't think it does the job here. fincore() + preadv() is inherently
racy as there's no guarantee that the data won't be evicted between
the two calls. This may not matter in some cases, but in others (ones
that I'm trying to solve) it will introduce unexpected latency.
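The race can be sketched with an in-process stand-in for the page cache (everything here is illustrative): the residency check passes, the data is evicted in the gap between the two "syscalls", and the subsequent read misses:

```python
import threading
import time

cache = {"page0": b"data"}            # toy stand-in for the page cache

def fincore_then_read(key, window=0.2):
    # Two separate steps, like fincore() followed by pread(): the data
    # can be evicted in the window between them.
    if key in cache:                  # "fincore says it's resident"
        time.sleep(window)            # gap between the two syscalls
        return cache.get(key)         # "pread": may now miss and hit disk
    return None

def evictor():
    time.sleep(0.05)                  # evict while the reader is in the gap
    cache.pop("page0", None)

t = threading.Thread(target=evictor)
t.start()
result = fincore_then_read("page0")
t.join()
assert result is None                 # the check passed, yet the read missed
```

A single preadv2(RWF_NONBLOCK) call performs the residency check and the copy as one step, so there is no such window.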

There's no overlap between pwritev2 and fincore() functionality.

Sorry that this took longer than expected to reply; I got busy with
holidays / unrelated things. Let me know if I missed anything.

--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: [email protected]

2014-12-02 22:42:05

by Andrew Morton

Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

On Tue, 2 Dec 2014 17:17:42 -0500 Milosz Tanski <[email protected]> wrote:

> > There have been several incomplete attempts to implement fincore(). If
> > we were to complete those attempts, preadv2() could be implemented
> > using fincore()+pread(). Plus we get fincore(), which is useful for
> > other (but probably similar) reasons. Probably fincore()+pwrite() could
> > be used to implement pwritev2(), but I don't know what pwritev2() does
> > yet.
> >
> > Implementing fincore() is more flexible, requires less code and is less
> > likely to have bugs. So why not go that way? Yes, it's more CPU
> > intensive, but how much? Is the difference sufficient to justify the
> > preadv2()/pwritev2() approach?
>
> I would like to see fincore() functionality (for other reasons), but I
> don't think it does the job here. fincore() + preadv() is inherently
> racy as there's no guarantee that the data won't be evicted between
> the two calls.

There will always be holes. For example find_get_page() could block on
lock_page() while some other process is doing IO.
page_cache_async_readahead() does lots of memory allocation which can
get blocked for long periods in the page allocator.
page_cache_async_readahead() can block on synchronous metadata reads,
etc.

The question is whether a simpler approach such as fincore() will be
sufficient.

> This may not matter in some cases, but in others (ones
> that I'm trying to solve) it will introduce unexpected latency.

Details?

> There's no overlap between pwritev2 and fincore() functionality.

Do we actually need pwritev2()? What's the justification for that?


Please let's examine the alternative(s) seriously. It would be a mistake
to add preadv2/pwritev2 if fincore+pread would have sufficed.

2014-12-03 09:44:59

by Volker Lendecke

Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

On Tue, Dec 02, 2014 at 02:42:00PM -0800, Andrew Morton wrote:
> The question is whether a simpler approach such as fincore() will be
> sufficient.

For many use cases in Samba, fincore will probably be
enough. But Windows clients are becoming more and more
multi-threaded, so Samba sees multiple parallel active read
requests. Samba's core SMB processing engine is single
threaded, and if we get blocked due to that race, more than
one data stream will be affected. We might make it an option
to use the fincore alternative, but I don't see it as a
default. The default will be the strict threadpool.

With best regards,

Volker Lendecke

--
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:[email protected]

2014-12-03 16:48:33

by Milosz Tanski

Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

On Tue, Dec 2, 2014 at 5:42 PM, Andrew Morton <[email protected]> wrote:
>
> On Tue, 2 Dec 2014 17:17:42 -0500 Milosz Tanski <[email protected]> wrote:
>
> > > There have been several incomplete attempts to implement fincore(). If
> > > we were to complete those attempts, preadv2() could be implemented
> > > using fincore()+pread(). Plus we get fincore(), which is useful for
> > > other (but probably similar) reasons. Probably fincore()+pwrite() could
> > > be used to implement pwritev2(), but I don't know what pwritev2() does
> > > yet.
> > >
> > > Implementing fincore() is more flexible, requires less code and is less
> > > likely to have bugs. So why not go that way? Yes, it's more CPU
> > > intensive, but how much? Is the difference sufficient to justify the
> > > preadv2()/pwritev2() approach?
> >
> > I would like to see fincore() functionality (for other reasons), but I
> > don't think it does the job here. fincore() + preadv() is inherently
> > racy as there's no guarantee that the data won't be evicted between
> > the two calls.
>
> There will always be holes. For example find_get_page() could block on
> lock_page() while some other process is doing IO.
> page_cache_async_readahead() does lots of memory allocation which can
> get blocked for long periods in the page allocator.
> page_cache_async_readahead() can block on synchronous metadata reads,
> etc.

Andrew, I think it would be helpful if you read through the patches.
The first 3 are somewhat uninteresting as they just wire up the new
syscalls and plumb the flag argument through. The core of
RWF_NONBLOCK is patch 4: https://lkml.org/lkml/2014/11/10/463, and if
you strip away the fs-specific changes the core of it is very simple.

The core is mostly contained in do_generic_file_read() in filemap.c,
and is very short and easy to understand. It boils down to: we read as
much data as we can given what's in the page cache. There's no
fallback to disk IO via readpage() in the case of missing pages, and we
bail before any calls to page_cache_async_readahead(). And to the best
of my knowledge we never take the page lock: all we do is call
pagecache_get_page() without the FGP_LOCK flag.

I've spent a decent amount of time looking at this to make sure
we cover all our major bases. It's possible I missed something, but
the biggest offenders should be covered, and if I missed something I'd
love to cover that as well.

>
> The question is whether a simpler approach such as fincore() will be
> sufficient.
>
> > This may not matter in some cases, but in others (ones
> > that I'm trying to solve) it will introduce unexpected latency.
>
> Details?


Please read my points below.

>
>
> > There's no overlap between pwritev2 and fincore() functionality.
>
> Do we actually need pwritev2()? What's the justification for that?


I'm okay with splitting up the pwritev2 and preadv2 into two
independent patchsets to be considered on their own merits.

>
>
>
> > Please let's examine the alternative(s) seriously. It would be a mistake
> > to add preadv2/pwritev2 if fincore+pread would have sufficed.


The motivation for my change, and for this approach, is a very common
pattern for async buffered disk IO in userspace server applications. It
comes down to having one thread handle the network and a thread
pool perform IO requests. Why a threadpool and not something like
sendfile() for reads? Many non-trivial applications perform additional
processing (ssl, checksumming, transformation). Unfortunately this has
an inherent increase in average latency due to increased
synchronization penalties (enqueue and notify), but primarily due to
fast requests (already in cache) getting stuck behind slow requests.

Here's the illustration of the common architecture:
http://i.imgur.com/f8Pla7j.png. In fact, most apps are even simpler
where they replace the request queue, task worker with a single thread
doing network IO using epoll or such.

preadv2 with RWF_NONBLOCK is analogous to the kernel's recvmsg() with
the MSG_DONTWAIT flag. It's really frustrating that such a capability
doesn't exist today. As with the userspace application design, we can
skip the IO threadpool and decrease average request latency in many
common workloads (linear reads or zipf data accesses).

preadv2 with RWF_NONBLOCK as implemented does not suffer the same
eviction races as fincore + pread because it's not implemented as two
syscalls. It also has a much smaller surface of possible blocking /
locking than fincore + pread because it cannot fall back to reading
from disk, does not trigger read-ahead, and does not wait for the page
lock.
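A minimal sketch of this fast-path-first design, again using the flag as it later landed in mainline (os.RWF_NOWAIT; the patchset calls it RWF_NONBLOCK, and the helper names are mine): try the page-cache-only read right on the network/event thread, and submit to the IO threadpool only on a miss:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)      # the IO threadpool from the diagram

def read_fast_or_defer(fd, nbytes, offset, on_done):
    # Fast path: page-cache-only read on the calling (event-loop) thread.
    buf = bytearray(nbytes)
    try:
        n = os.preadv(fd, [buf], offset, getattr(os, "RWF_NOWAIT", 0))
        on_done(bytes(buf[:n]))               # no queueing, no thread wakeup
        return True
    except OSError:                           # EAGAIN (or unsupported): defer
        def slow_read():
            b = bytearray(nbytes)
            n = os.preadv(fd, [b], offset)    # ordinary blocking read
            on_done(bytes(b[:n]))
        pool.submit(slow_read)
        return False

fd, path = tempfile.mkstemp()
results = []
try:
    os.pwrite(fd, b"cached bytes", 0)         # data is now in the page cache
    read_fast_or_defer(fd, 12, 0, results.append)
    pool.shutdown(wait=True)                  # either path has completed now
    assert results == [b"cached bytes"]
finally:
    os.close(fd)
    os.unlink(path)
```

Either path delivers the same bytes to the callback; the fast path simply avoids the enqueue/notify round trip when the data is resident.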

--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: [email protected]

2014-12-04 23:11:08

by Andrew Morton

Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

On Wed, 3 Dec 2014 11:48:28 -0500 Milosz Tanski <[email protected]> wrote:

> On Tue, Dec 2, 2014 at 5:42 PM, Andrew Morton <[email protected]> wrote:
> >
> > On Tue, 2 Dec 2014 17:17:42 -0500 Milosz Tanski <[email protected]> wrote:
> >
> > > > There have been several incomplete attempts to implement fincore(). If
> > > > we were to complete those attempts, preadv2() could be implemented
> > > > using fincore()+pread(). Plus we get fincore(), which is useful for
> > > > other (but probably similar) reasons. Probably fincore()+pwrite() could
> > > > be used to implement pwritev2(), but I don't know what pwritev2() does
> > > > yet.
> > > >
> > > > Implementing fincore() is more flexible, requires less code and is less
> > > > likely to have bugs. So why not go that way? Yes, it's more CPU
> > > > intensive, but how much? Is the difference sufficient to justify the
> > > > preadv2()/pwritev2() approach?
> > >
> > > I would like to see fincore() functionality (for other reasons), but I
> > > don't think it does the job here. fincore() + preadv() is inherently
> > > racy as there's no guarantee that the data won't be evicted between
> > > the two calls.
> >
> > There will always be holes. For example find_get_page() could block on
> > lock_page() while some other process is doing IO.
> > page_cache_async_readahead() does lots of memory allocation which can
> > get blocked for long periods in the page allocator.
> > page_cache_async_readahead() can block on synchronous metadata reads,
> > etc.
>
> Andrew, I think it would be helpful if you read through the patches.
> The first 3 are somewhat uninteresting as they just wire up the new
> syscalls and plumb the flag argument through. The core of
> RWF_NONBLOCK is patch 4: https://lkml.org/lkml/2014/11/10/463, and if
> you strip away the fs-specific changes the core of it is very simple.
>
> The core is mostly contained in do_generic_file_read() in filemap.c,
> and is very short and easy to understand. It boils down to: we read as
> much data as we can given what's in the page cache. There's no
> fallback to disk IO via readpage() in the case of missing pages, and we
> bail before any calls to page_cache_async_readahead(). And to the best
> of my knowledge we never take the page lock: all we do is call
> pagecache_get_page() without the FGP_LOCK flag.
>
> I've spent a decent amount of time looking at this to make sure
> we cover all our major bases. It's possible I missed something, but
> the biggest offenders should be covered, and if I missed something I'd
> love to cover that as well.

OK.

> >
> >
> > > There's no overlap between pwritev2 and fincore() functionality.
> >
> > Do we actually need pwritev2()? What's the justification for that?
>
>
> I'm okay with splitting up the pwritev2 and preadv2 into two
> independent patchsets to be considered on their own merits.

Well, we can do both together if both are wanted. The changelogs are
very skimpy on pwritev2(). A full description and careful
justification in the [0/n] changelog would be useful - something that
tells us "what's wrong with O_DSYNC+pwrite".


> >
> >
> >
> > Please let's examine the alternative(s) seriously. It would be a mistake
> > to add preadv2/pwritev2 if fincore+pread would have sufficed.
>
>
> The motivation for my change, and for this approach, is a very common
> pattern for async buffered disk IO in userspace server applications. It
> comes down to having one thread handle the network and a thread
> pool perform IO requests. Why a threadpool and not something like
> sendfile() for reads? Many non-trivial applications perform additional
> processing (ssl, checksumming, transformation). Unfortunately this has
> an inherent increase in average latency due to increased
> synchronization penalties (enqueue and notify), but primarily due to
> fast requests (already in cache) getting stuck behind slow requests.
>
> Here's the illustration of the common architecture:
> http://i.imgur.com/f8Pla7j.png. In fact, most apps are even simpler
> where they replace the request queue, task worker with a single thread
> doing network IO using epoll or such.
>
> preadv2 with RWF_NONBLOCK is analogous to the kernel's recvmsg() with
> the MSG_DONTWAIT flag. It's really frustrating that such a capability
> doesn't exist today. As with the userspace application design, we can
> skip the IO threadpool and decrease average request latency in many
> common workloads (linear reads or zipf data accesses).
>
> preadv2 with RWF_NONBLOCK as implemented does not suffer the same
> eviction races as fincore + pread because it's not implemented as two
> syscalls. It also has a much smaller surface of possible blocking /
> locking than fincore + pread because it cannot fall back to reading
> from disk, does not trigger read-ahead, and does not wait for the page
> lock.

I can see all that, but it's handwaving. Yes, preadv2() will perform
better in some circumstances than fincore+pread. But how much better?
Enough to justify this approach, or not?

Alas, the only way to really settle that is to implement fincore() and
to subject it to a decent amount of realistic quantitative testing.

Ho hum.

Could you please hunt down some libuv developers, see if we can solicit
some quality input from them? As I said, we really don't want to merge
this then find that people don't use it for some reason, or that it
needs changes.

2014-12-05 08:19:03

by Volker Lendecke

Subject: Re: [PATCH v6 0/7] vfs: Non-blockling buffered fs read (page cache only)

On Thu, Dec 04, 2014 at 03:11:02PM -0800, Andrew Morton wrote:
> I can see all that, but it's handwaving. Yes, preadv2() will perform
> better in some circumstances than fincore+pread. But how much better?
> Enough to justify this approach, or not?
>
> Alas, the only way to really settle that is to implement fincore() and
> to subject it to a decent amount of realistic quantitative testing.
>
> Ho hum.
>
> Could you please hunt down some libuv developers, see if we can solicit
> some quality input from them? As I said, we really don't want to merge
> this then find that people don't use it for some reason, or that it
> needs changes.

All I can say from a Samba perspective is that none of the ARM-based
storage boxes I have seen so far do AIO, because of the base footprint
for every read. For sequential reads kernel-level readahead can kick
in properly, and we should be able to give them the best of both worlds:
no context switches in the default case, but also good parallel behaviour
for other workloads. The most important benchmark for those guys is to
read a DVD image, whether it makes sense or not.

Volker

--
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:[email protected]