2014-11-05 21:15:03

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v5 0/7] vfs: Non-blockling buffered fs read (page cache only)

This patcheset introduces an ability to perform a non-blocking read from
regular files in buffered IO mode. This works by only for those filesystems
that have data in the page cache.

It does this by introducing new syscalls new syscalls preadv2/pwritev2. These
new syscalls behave like the network sendmsg, recvmsg syscalls that accept an
extra flag argument (RWF_NONBLOCK).

It's a very common patern today (samba, libuv, etc..) use a large threadpool to
perform buffered IO operations. They submit the work form another thread
that performs network IO and epoll or other threads that perform CPU work. This
leads to increased latency for processing, esp. in the case of data that's
already cached in the page cache.

With the new interface the applications will now be able to fetch the data in
their network / cpu bound thread(s) and only defer to a threadpool if it's not
there. In our own application (VLDB) we've observed a decrease in latency for
"fast" request by avoiding unnecessary queuing and having to swap out current
tasks in IO bound work threads.

Version 5 highlight:
- XFS support for RWF_NONBLOCK. from Christoph.
- RWF_DSYNC flag and support for pwritev2, from Christoph.
- Implemented compat syscalls, per. Jeff.
- Missing nfs, ceph changes from older patchset.

Version 4 highlight:
- Updated for 3.18-rc1.
- Performance data from our application.
- First stab at man page with Jeff's help. Patch is in-reply to.

RFC Version 3 highlights:
- Down to 2 syscalls from 4; can user fp or argument position.
- RWF_NONBLOCK value flag is not the same O_NONBLOCK, per Jeff.

RFC Version 2 highlights:
- Put the flags argument into kiocb (less noise), per. Al Viro
- O_DIRECT checking early in the process, per. Jeff Moyer
- Resolved duplicate (c&p) code in syscall code, per. Jeff
- Included perf data in thread cover letter, per. Jeff
- Created a new flag (not O_NONBLOCK) for readv2, perf Jeff


Some perf data generated using fio comparing the posix aio engine to a version
of the posix AIO engine that attempts to performs "fast" reads before
submitting the operations to the queue. This workflow is on ext4 partition on
raid0 (test / build-rig.) Simulating our database access patern workload using
16kb read accesses. Our database uses a home-spun posix aio like queue (samba
does the same thing.)

f1: ~73% rand read over mostly cached data (zipf med-size dataset)
f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
f3: ~9% seq-read over large dataset

before:

f1:
bw (KB /s): min= 11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
f2:
bw (KB /s): min= 2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
lat (msec) : >=2000=4.33%
f3:
bw (KB /s): min= 0, max=265568, per=99.95%, avg=174575.10,
stdev=34526.89
lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
total:
READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
mint=600001msec, maxt=600113msec

after (with fast read using preadv2 before submit):

f1:
bw (KB /s): min= 3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
lat (usec) : 2=70.63%, 4=0.01%
lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
f2:
bw (KB /s): min= 2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
lat (msec) : >=2000=9.99%
f3:
bw (KB /s): min= 1, max=245448, per=100.00%, avg=177366.50,
stdev=35995.60
lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
lat (msec) : 100=0.05%, 250=0.02%
total:
READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
mint=600020msec, maxt=600178msec

Interpreting the results you can see total bandwidth stays the same but overall
request latency is decreased in f1 (random, mostly cached) and f3 (sequential)
workloads. There is a slight bump in latency for since it's random data that's
unlikely to be cached but we're always trying "fast read".

In our application we have starting keeping track of "fast read" hits/misses
and for files / requests that have a lot hit ratio we don't do "fast reads"
mostly getting rid of extra latency in the uncached cases. In our real world
work load we were able to reduce average response time by 20 to 30% (depends
on amount of IO done by request).

I've performed other benchmarks and I have no observed any perf regressions in
any of the normal (old) code paths.

I have co-developed these changes with Christoph Hellwig.

Christoph Hellwig (3):
xfs: add RWF_NONBLOCK support
fs: pass iocb to generic_write_sync
fs: add a flag for per-operation O_DSYNC semantics

Milosz Tanski (4):
vfs: Prepare for adding a new preadv/pwritev with user flags.
vfs: Define new syscalls preadv2,pwritev2
x86: wire up preadv2 and pwritev2
vfs: RWF_NONBLOCK flag for preadv2

arch/x86/syscalls/syscall_32.tbl | 2 +
arch/x86/syscalls/syscall_64.tbl | 2 +
drivers/target/target_core_file.c | 6 +-
fs/block_dev.c | 8 +-
fs/btrfs/file.c | 7 +-
fs/ceph/file.c | 6 +-
fs/cifs/file.c | 14 +--
fs/direct-io.c | 8 +-
fs/ext4/file.c | 8 +-
fs/fuse/file.c | 2 +
fs/gfs2/file.c | 9 +-
fs/nfs/file.c | 15 ++-
fs/nfsd/vfs.c | 4 +-
fs/ntfs/file.c | 8 +-
fs/ocfs2/file.c | 12 +-
fs/pipe.c | 3 +-
fs/read_write.c | 239 +++++++++++++++++++++++++++++---------
fs/splice.c | 2 +-
fs/udf/file.c | 11 +-
fs/xfs/xfs_file.c | 36 ++++--
include/linux/aio.h | 2 +
include/linux/compat.h | 6 +
include/linux/fs.h | 16 ++-
include/linux/syscalls.h | 6 +
include/uapi/asm-generic/unistd.h | 6 +-
mm/filemap.c | 55 +++++++--
mm/shmem.c | 4 +
27 files changed, 346 insertions(+), 151 deletions(-)

--
1.9.1


2014-11-05 21:15:14

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v5 3/7] x86: wire up preadv2 and pwritev2

Signed-off-by: Milosz Tanski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
arch/x86/syscalls/syscall_32.tbl | 2 ++
arch/x86/syscalls/syscall_64.tbl | 2 ++
2 files changed, 4 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 9fe1b5d..d592d87 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -364,3 +364,5 @@
355 i386 getrandom sys_getrandom
356 i386 memfd_create sys_memfd_create
357 i386 bpf sys_bpf
+358 i386 preadv2 sys_preadv2
+359 i386 pwritev2 sys_pwritev2
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 281150b..7be2447 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -328,6 +328,8 @@
319 common memfd_create sys_memfd_create
320 common kexec_file_load sys_kexec_file_load
321 common bpf sys_bpf
+322 64 preadv2 sys_preadv2
+323 64 pwritev2 sys_pwritev2

#
# x32-specific system call numbers start at 512 to avoid cache impact
--
1.9.1

2014-11-05 21:15:20

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v5 2/7] vfs: Define new syscalls preadv2,pwritev2

New syscalls that take an flag argument. This change does not add any specific
flags.

Signed-off-by: Milosz Tanski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
fs/read_write.c | 176 ++++++++++++++++++++++++++++++--------
include/linux/compat.h | 6 ++
include/linux/syscalls.h | 6 ++
include/uapi/asm-generic/unistd.h | 6 +-
mm/filemap.c | 5 +-
5 files changed, 158 insertions(+), 41 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 94b2d34..907735c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -866,6 +866,8 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
return -EBADF;
if (!(file->f_mode & FMODE_CAN_READ))
return -EINVAL;
+ if (flags & ~0)
+ return -EINVAL;

return do_readv_writev(READ, file, vec, vlen, pos, flags);
}
@@ -879,21 +881,23 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
return -EBADF;
if (!(file->f_mode & FMODE_CAN_WRITE))
return -EINVAL;
+ if (flags & ~0)
+ return -EINVAL;

return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
}

EXPORT_SYMBOL(vfs_writev);

-SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
- unsigned long, vlen)
+static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, int flags)
{
struct fd f = fdget_pos(fd);
ssize_t ret = -EBADF;

if (f.file) {
loff_t pos = file_pos_read(f.file);
- ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+ ret = vfs_readv(f.file, vec, vlen, &pos, flags);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
@@ -905,15 +909,15 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
return ret;
}

-SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
- unsigned long, vlen)
+static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, int flags)
{
struct fd f = fdget_pos(fd);
ssize_t ret = -EBADF;

if (f.file) {
loff_t pos = file_pos_read(f.file);
- ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+ ret = vfs_writev(f.file, vec, vlen, &pos, flags);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
@@ -931,10 +935,9 @@ static inline loff_t pos_from_hilo(unsigned long high, unsigned long low)
return (((loff_t)high << HALF_LONG_BITS) << HALF_LONG_BITS) | low;
}

-SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
- unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_preadv(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, loff_t pos, int flags)
{
- loff_t pos = pos_from_hilo(pos_h, pos_l);
struct fd f;
ssize_t ret = -EBADF;

@@ -945,7 +948,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
if (f.file) {
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PREAD)
- ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+ ret = vfs_readv(f.file, vec, vlen, &pos, flags);
fdput(f);
}

@@ -955,10 +958,9 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
return ret;
}

-SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
- unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_pwritev(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, loff_t pos, int flags)
{
- loff_t pos = pos_from_hilo(pos_h, pos_l);
struct fd f;
ssize_t ret = -EBADF;

@@ -969,7 +971,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
if (f.file) {
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PWRITE)
- ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+ ret = vfs_writev(f.file, vec, vlen, &pos, flags);
fdput(f);
}

@@ -979,11 +981,63 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
return ret;
}

+SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen)
+{
+ return do_readv(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen)
+{
+ return do_writev(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+ loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+ return do_preadv(fd, vec, vlen, pos, 0);
+}
+
+SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+ int, flags)
+{
+ loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+ if (pos == -1)
+ return do_readv(fd, vec, vlen, flags);
+
+ return do_preadv(fd, vec, vlen, pos, flags);
+}
+
+SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+ loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+ return do_pwritev(fd, vec, vlen, pos, 0);
+}
+
+SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+ int, flags)
+{
+ loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+ if (pos == -1)
+ return do_writev(fd, vec, vlen, flags);
+
+ return do_pwritev(fd, vec, vlen, pos, flags);
+}
+
#ifdef CONFIG_COMPAT

static ssize_t compat_do_readv_writev(int type, struct file *file,
const struct compat_iovec __user *uvector,
- unsigned long nr_segs, loff_t *pos)
+ unsigned long nr_segs, loff_t *pos, int flags)
{
compat_ssize_t tot_len;
struct iovec iovstack[UIO_FASTIOV];
@@ -1017,7 +1071,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,

if (iter_fn)
ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
- pos, iter_fn, 0);
+ pos, iter_fn, flags);
else if (fnv)
ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
pos, fnv);
@@ -1041,7 +1095,7 @@ out:

static size_t compat_readv(struct file *file,
const struct compat_iovec __user *vec,
- unsigned long vlen, loff_t *pos)
+ unsigned long vlen, loff_t *pos, int flags)
{
ssize_t ret = -EBADF;

@@ -1052,7 +1106,7 @@ static size_t compat_readv(struct file *file,
if (!(file->f_mode & FMODE_CAN_READ))
goto out;

- ret = compat_do_readv_writev(READ, file, vec, vlen, pos);
+ ret = compat_do_readv_writev(READ, file, vec, vlen, pos, flags);

out:
if (ret > 0)
@@ -1061,9 +1115,9 @@ out:
return ret;
}

-COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
- const struct compat_iovec __user *,vec,
- compat_ulong_t, vlen)
+static size_t __compat_sys_readv(compat_ulong_t fd,
+ const struct compat_iovec __user *vec,
+ compat_ulong_t vlen, int flags)
{
struct fd f = fdget_pos(fd);
ssize_t ret;
@@ -1072,28 +1126,34 @@ COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
if (!f.file)
return -EBADF;
pos = f.file->f_pos;
- ret = compat_readv(f.file, vec, vlen, &pos);
+ ret = compat_readv(f.file, vec, vlen, &pos, flags);
if (ret >= 0)
f.file->f_pos = pos;
fdput_pos(f);
return ret;
+
+}
+
+COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
+ const struct compat_iovec __user *,vec,
+ compat_ulong_t, vlen)
+{
+ return __compat_sys_readv(fd, vec, vlen, 0);
}

static long __compat_sys_preadv64(unsigned long fd,
const struct compat_iovec __user *vec,
- unsigned long vlen, loff_t pos)
+ unsigned long vlen, loff_t pos, int flags)
{
struct fd f;
ssize_t ret;

- if (pos < 0)
- return -EINVAL;
f = fdget(fd);
if (!f.file)
return -EBADF;
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PREAD)
- ret = compat_readv(f.file, vec, vlen, &pos);
+ ret = compat_readv(f.file, vec, vlen, &pos, flags);
fdput(f);
return ret;
}
@@ -1103,7 +1163,10 @@ COMPAT_SYSCALL_DEFINE4(preadv64, unsigned long, fd,
const struct compat_iovec __user *,vec,
unsigned long, vlen, loff_t, pos)
{
- return __compat_sys_preadv64(fd, vec, vlen, pos);
+ if (pos < 0)
+ return -EINVAL;
+
+ return __compat_sys_preadv64(fd, vec, vlen, pos, 0);
}
#endif

@@ -1113,12 +1176,28 @@ COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd,
{
loff_t pos = ((loff_t)pos_high << 32) | pos_low;

- return __compat_sys_preadv64(fd, vec, vlen, pos);
+ if (pos < 0)
+ return -EINVAL;
+
+ return __compat_sys_preadv64(fd, vec, vlen, pos, 0);
+}
+
+COMPAT_SYSCALL_DEFINE6(preadv2, compat_ulong_t, fd,
+ const struct compat_iovec __user *,vec,
+ compat_ulong_t, vlen, u32, pos_low, u32, pos_high,
+ int, flags)
+{
+ loff_t pos = ((loff_t)pos_high << 32) | pos_low;
+
+ if (pos == -1)
+ return __compat_sys_readv(fd, vec, vlen, flags);
+
+ return __compat_sys_preadv64(fd, vec, vlen, pos, flags);
}

static size_t compat_writev(struct file *file,
const struct compat_iovec __user *vec,
- unsigned long vlen, loff_t *pos)
+ unsigned long vlen, loff_t *pos, int flags)
{
ssize_t ret = -EBADF;

@@ -1129,7 +1208,7 @@ static size_t compat_writev(struct file *file,
if (!(file->f_mode & FMODE_CAN_WRITE))
goto out;

- ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos);
+ ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos, flags);

out:
if (ret > 0)
@@ -1138,9 +1217,9 @@ out:
return ret;
}

-COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
- const struct compat_iovec __user *, vec,
- compat_ulong_t, vlen)
+static size_t __compat_sys_writev(compat_ulong_t fd,
+ const struct compat_iovec __user* vec,
+ compat_ulong_t vlen, int flags)
{
struct fd f = fdget_pos(fd);
ssize_t ret;
@@ -1149,28 +1228,36 @@ COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
if (!f.file)
return -EBADF;
pos = f.file->f_pos;
- ret = compat_writev(f.file, vec, vlen, &pos);
+ ret = compat_writev(f.file, vec, vlen, &pos, flags);
if (ret >= 0)
f.file->f_pos = pos;
fdput_pos(f);
return ret;
}

+COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
+ const struct compat_iovec __user *, vec,
+ compat_ulong_t, vlen)
+{
+ return __compat_sys_writev(fd, vec, vlen, 0);
+}
+
static long __compat_sys_pwritev64(unsigned long fd,
const struct compat_iovec __user *vec,
- unsigned long vlen, loff_t pos)
+ unsigned long vlen, loff_t pos, int flags)
{
struct fd f;
ssize_t ret;

if (pos < 0)
return -EINVAL;
+
f = fdget(fd);
if (!f.file)
return -EBADF;
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PWRITE)
- ret = compat_writev(f.file, vec, vlen, &pos);
+ ret = compat_writev(f.file, vec, vlen, &pos, flags);
fdput(f);
return ret;
}
@@ -1180,7 +1267,7 @@ COMPAT_SYSCALL_DEFINE4(pwritev64, unsigned long, fd,
const struct compat_iovec __user *,vec,
unsigned long, vlen, loff_t, pos)
{
- return __compat_sys_pwritev64(fd, vec, vlen, pos);
+ return __compat_sys_pwritev64(fd, vec, vlen, pos, 0);
}
#endif

@@ -1190,8 +1277,21 @@ COMPAT_SYSCALL_DEFINE5(pwritev, compat_ulong_t, fd,
{
loff_t pos = ((loff_t)pos_high << 32) | pos_low;

- return __compat_sys_pwritev64(fd, vec, vlen, pos);
+ return __compat_sys_pwritev64(fd, vec, vlen, pos, 0);
}
+
+COMPAT_SYSCALL_DEFINE6(pwritev2, compat_ulong_t, fd,
+ const struct compat_iovec __user *,vec,
+ compat_ulong_t, vlen, u32, pos_low, u32, pos_high, int, flags)
+{
+ loff_t pos = ((loff_t)pos_high << 32) | pos_low;
+
+ if (pos == -1)
+ return __compat_sys_writev(fd, vec, vlen, flags);
+
+ return __compat_sys_pwritev64(fd, vec, vlen, pos, flags);
+}
+
#endif

static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
diff --git a/include/linux/compat.h b/include/linux/compat.h
index e649426..63a94e2 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -340,6 +340,12 @@ asmlinkage ssize_t compat_sys_preadv(compat_ulong_t fd,
asmlinkage ssize_t compat_sys_pwritev(compat_ulong_t fd,
const struct compat_iovec __user *vec,
compat_ulong_t vlen, u32 pos_low, u32 pos_high);
+asmlinkage ssize_t compat_sys_preadv2(compat_ulong_t fd,
+ const struct compat_iovec __user *vec,
+ compat_ulong_t vlen, u32 pos_low, u32 pos_high, int flags);
+asmlinkage ssize_t compat_sys_pwritev2(compat_ulong_t fd,
+ const struct compat_iovec __user *vec,
+ compat_ulong_t vlen, u32 pos_low, u32 pos_high, int flags);

#ifdef __ARCH_WANT_COMPAT_SYS_PREADV64
asmlinkage long compat_sys_preadv64(unsigned long fd,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index bda9b81..cedc22e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -571,8 +571,14 @@ asmlinkage long sys_pwrite64(unsigned int fd, const char __user *buf,
size_t count, loff_t pos);
asmlinkage long sys_preadv(unsigned long fd, const struct iovec __user *vec,
unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_preadv2(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+ int flags);
asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec,
unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+ int flags);
asmlinkage long sys_getcwd(char __user *buf, unsigned long size);
asmlinkage long sys_mkdir(const char __user *pathname, umode_t mode);
asmlinkage long sys_chdir(const char __user *filename);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 22749c1..9406018 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -213,6 +213,10 @@ __SC_COMP(__NR_pwrite64, sys_pwrite64, compat_sys_pwrite64)
__SC_COMP(__NR_preadv, sys_preadv, compat_sys_preadv)
#define __NR_pwritev 70
__SC_COMP(__NR_pwritev, sys_pwritev, compat_sys_pwritev)
+#define __NR_preadv2 281
+__SC_COMP(__NR_preadv2, sys_preadv2, compat_sys_preadv2)
+#define __NR_pwritev2 282
+__SC_COMP(__NR_pwritev2, sys_pwritev2, compat_sys_pwritev2)

/* fs/sendfile.c */
#define __NR3264_sendfile 71
@@ -709,7 +713,7 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
__SYSCALL(__NR_bpf, sys_bpf)

#undef __NR_syscalls
-#define __NR_syscalls 281
+#define __NR_syscalls 283

/*
* All syscalls below here should go away really,
diff --git a/mm/filemap.c b/mm/filemap.c
index 14b4642..530c263 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1457,6 +1457,7 @@ static void shrink_readahead_size_eio(struct file *filp,
* @ppos: current file position
* @iter: data destination
* @written: already copied
+ * @flags: optional flags
*
* This is a generic file read routine, and uses the
* mapping->a_ops->readpage() function for the actual low-level stuff.
@@ -1465,7 +1466,7 @@ static void shrink_readahead_size_eio(struct file *filp,
* of the logic when it comes to error handling etc.
*/
static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
- struct iov_iter *iter, ssize_t written)
+ struct iov_iter *iter, ssize_t written, int flags)
{
struct address_space *mapping = filp->f_mapping;
struct inode *inode = mapping->host;
@@ -1735,7 +1736,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
}
}

- retval = do_generic_file_read(file, ppos, iter, retval);
+ retval = do_generic_file_read(file, ppos, iter, retval, iocb->ki_rwflags);
out:
return retval;
}
--
1.9.1

2014-11-05 21:15:23

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v5 4/7] vfs: RWF_NONBLOCK flag for preadv2

generic_file_read_iter() supports a new flag RWF_NONBLOCK which says that we
only want to read the data if it's already in the page cache.

Additionally, there are a few filesystems that we have to specifically
bail early if RWF_NONBLOCK because the op would block. Christoph Hellwig
contributed this code.

Signed-off-by: Milosz Tanski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jeff Moyer <[email protected]>
---
fs/ceph/file.c | 2 ++
fs/cifs/file.c | 6 ++++++
fs/nfs/file.c | 5 ++++-
fs/ocfs2/file.c | 6 ++++++
fs/pipe.c | 3 ++-
fs/read_write.c | 38 +++++++++++++++++++++++++-------------
fs/xfs/xfs_file.c | 4 ++++
include/linux/fs.h | 3 +++
mm/filemap.c | 18 ++++++++++++++++++
mm/shmem.c | 4 ++++
10 files changed, 74 insertions(+), 15 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index d7e0da8..b798b5c 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -822,6 +822,8 @@ again:
if ((got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) == 0 ||
(iocb->ki_filp->f_flags & O_DIRECT) ||
(fi->flags & CEPH_F_SYNC)) {
+ if (iocb->ki_rwflags & O_NONBLOCK)
+ return -EAGAIN;

dout("aio_sync_read %p %llx.%llx %llu~%u got cap refs on %s\n",
inode, ceph_vinop(inode), iocb->ki_pos, (unsigned)len,
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 3e4d00a..c485afa 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3005,6 +3005,9 @@ ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to)
struct cifs_readdata *rdata, *tmp;
struct list_head rdata_list;

+ if (iocb->ki_rwflags & RWF_NONBLOCK)
+ return -EAGAIN;
+
len = iov_iter_count(to);
if (!len)
return 0;
@@ -3123,6 +3126,9 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NOPOSIXBRL) == 0))
return generic_file_read_iter(iocb, to);

+ if (iocb->ki_rwflags & RWF_NONBLOCK)
+ return -EAGAIN;
+
/*
* We need to hold the sem to be sure nobody modifies lock list
* with a brlock that prevents reading.
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 2ab6f00..aa9046f 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -171,8 +171,11 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t result;

- if (iocb->ki_filp->f_flags & O_DIRECT)
+ if (iocb->ki_filp->f_flags & O_DIRECT) {
+ if (iocb->ki_rwflags & O_NONBLOCK)
+ return -EAGAIN;
return nfs_file_direct_read(iocb, to, iocb->ki_pos);
+ }

dprintk("NFS: read(%pD2, %zu@%lu)\n",
iocb->ki_filp,
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 324dc93..bb66ca4 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2472,6 +2472,12 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
filp->f_path.dentry->d_name.name,
to->nr_segs); /* GRRRRR */

+ /*
+ * No non-blocking reads for ocfs2 for now. Might be doable with
+ * non-blocking cluster lock helpers.
+ */
+ if (iocb->ki_rwflags & RWF_NONBLOCK)
+ return -EAGAIN;

if (!inode) {
ret = -EINVAL;
diff --git a/fs/pipe.c b/fs/pipe.c
index 21981e5..212bf68 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -302,7 +302,8 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
*/
if (ret)
break;
- if (filp->f_flags & O_NONBLOCK) {
+ if ((filp->f_flags & O_NONBLOCK) ||
+ (iocb->ki_rwflags & RWF_NONBLOCK)) {
ret = -EAGAIN;
break;
}
diff --git a/fs/read_write.c b/fs/read_write.c
index 907735c..cba7d4c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -835,14 +835,19 @@ static ssize_t do_readv_writev(int type, struct file *file,
file_start_write(file);
}

- if (iter_fn)
+ if (iter_fn) {
ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
pos, iter_fn, flags);
- else if (fnv)
- ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
- pos, fnv);
- else
- ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+ } else {
+ if (type == READ && (flags & RWF_NONBLOCK))
+ return -EAGAIN;
+
+ if (fnv)
+ ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
+ pos, fnv);
+ else
+ ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+ }

if (type != READ)
file_end_write(file);
@@ -866,8 +871,10 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
return -EBADF;
if (!(file->f_mode & FMODE_CAN_READ))
return -EINVAL;
- if (flags & ~0)
+ if (flags & ~RWF_NONBLOCK)
return -EINVAL;
+ if ((file->f_flags & O_DIRECT) && (flags & RWF_NONBLOCK))
+ return -EAGAIN;

return do_readv_writev(READ, file, vec, vlen, pos, flags);
}
@@ -1069,14 +1076,19 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
file_start_write(file);
}

- if (iter_fn)
+ if (iter_fn) {
ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
pos, iter_fn, flags);
- else if (fnv)
- ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
- pos, fnv);
- else
- ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+ } else {
+ if (type == READ && (flags & RWF_NONBLOCK))
+ return -EAGAIN;
+
+ if (fnv)
+ ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
+ pos, fnv);
+ else
+ ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+ }

if (type != READ)
file_end_write(file);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index eb596b4..b1f6334 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -246,6 +246,10 @@ xfs_file_read_iter(

XFS_STATS_INC(xs_read_calls);

+ /* XXX: need a non-blocking iolock helper, shouldn't be too hard */
+ if (iocb->ki_rwflags & RWF_NONBLOCK)
+ return -EAGAIN;
+
if (unlikely(file->f_flags & O_DIRECT))
ioflags |= XFS_IO_ISDIRECT;
if (file->f_mode & FMODE_NOCMTIME)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ed5711..eaebd99 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1459,6 +1459,9 @@ struct block_device_operations;
#define HAVE_COMPAT_IOCTL 1
#define HAVE_UNLOCKED_IOCTL 1

+/* These flags are used for the readv/writev syscalls with flags. */
+#define RWF_NONBLOCK 0x00000001
+
struct iov_iter;

struct file_operations {
diff --git a/mm/filemap.c b/mm/filemap.c
index 530c263..09d3af3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1494,6 +1494,8 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
find_page:
page = find_get_page(mapping, index);
if (!page) {
+ if (flags & RWF_NONBLOCK)
+ goto would_block;
page_cache_sync_readahead(mapping,
ra, filp,
index, last_index - index);
@@ -1585,6 +1587,11 @@ page_ok:
continue;

page_not_up_to_date:
+ if (flags & RWF_NONBLOCK) {
+ page_cache_release(page);
+ goto would_block;
+ }
+
/* Get exclusive access to the page ... */
error = lock_page_killable(page);
if (unlikely(error))
@@ -1604,6 +1611,12 @@ page_not_up_to_date_locked:
goto page_ok;
}

+ if (flags & RWF_NONBLOCK) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto would_block;
+ }
+
readpage:
/*
* A previous I/O error may have been due to temporary
@@ -1674,6 +1687,8 @@ no_cached_page:
goto readpage;
}

+would_block:
+ error = -EAGAIN;
out:
ra->prev_pos = prev_index;
ra->prev_pos <<= PAGE_CACHE_SHIFT;
@@ -1707,6 +1722,9 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
size_t count = iov_iter_count(iter);
loff_t size;

+ if (iocb->ki_rwflags & RWF_NONBLOCK)
+ return -EAGAIN;
+
if (!count)
goto out; /* skip atime */
size = i_size_read(inode);
diff --git a/mm/shmem.c b/mm/shmem.c
index cd6fc75..5c30f04 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1531,6 +1531,10 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
ssize_t retval = 0;
loff_t *ppos = &iocb->ki_pos;

+ /* XXX: should be easily supportable */
+ if (iocb->ki_rwflags & RWF_NONBLOCK)
+ return -EAGAIN;
+
/*
* Might this read be for a stacking filesystem? Then when reading
* holes of a sparse file, we actually need to allocate those pages,
--
1.9.1

2014-11-05 21:16:21

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v5 7/7] fs: add a flag for per-operation O_DSYNC semantics

From: Christoph Hellwig <[email protected]>

With the new read/write with flags syscalls we can support a flag
to enable O_DSYNC semantics on a per-operation basis. This іs
useful to implement protocols like SMB, NFS or SCSI that have such
per-operation flags.

Example program below:

cat > pwritev2.c << EOF

(off_t) val, \
(off_t) ((((uint64_t) (val)) >> (sizeof (long) * 4)) >> (sizeof (long) * 4))

static ssize_t
pwritev2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags)
{
return syscall(__NR_pwritev2, fd, iov, iovcnt, LO_HI_LONG(offset),
flags);
}

int main(int argc, char **argv)
{
int fd = open(argv[1], O_WRONLY|O_CREAT|O_TRUNC, 0666);
char buf[1024];
struct iovec iov = { .iov_base = buf, .iov_len = 1024 };
int ret;

if (fd < 0) {
perror("open");
return 0;
}

memset(buf, 0xfe, sizeof(buf));

ret = pwritev2(fd, &iov, 1, 0, RWF_DSYNC);
if (ret < 0)
perror("pwritev2");
else
printf("ret = %d\n", ret);

return 0;
}
EOF

Signed-off-by: Christoph Hellwig <[email protected]>
[[email protected]: added flag check to compat_do_readv_writev()]
Signed-off-by: Milosz Tanski <[email protected]>
---
fs/ceph/file.c | 4 +++-
fs/fuse/file.c | 2 ++
fs/nfs/file.c | 10 ++++++----
fs/ocfs2/file.c | 6 ++++--
fs/read_write.c | 20 +++++++++++++++-----
include/linux/fs.h | 3 ++-
mm/filemap.c | 4 +++-
7 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index b798b5c..2d4e15a 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -983,7 +983,9 @@ retry_snap:
ceph_put_cap_refs(ci, got);

if (written >= 0 &&
- ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host) ||
+ ((file->f_flags & O_SYNC) ||
+ IS_SYNC(file->f_mapping->host) ||
+ (iocb->ki_rwflags & RWF_DSYNC) ||
ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
err = vfs_fsync_range(file, pos, pos + written - 1, 1);
if (err < 0)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index caa8d95..bb4fb23 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1248,6 +1248,8 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
written += written_buffered;
iocb->ki_pos = pos + written_buffered;
} else {
+ if (iocb->ki_rwflags & RWF_DSYNC)
+ return -EINVAL;
written = fuse_perform_write(file, mapping, from, pos);
if (written >= 0)
iocb->ki_pos = pos + written;
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index aa9046f..c59b0b7 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -652,13 +652,15 @@ static const struct vm_operations_struct nfs_file_vm_ops = {
.remap_pages = generic_file_remap_pages,
};

-static int nfs_need_sync_write(struct file *filp, struct inode *inode)
+static int nfs_need_sync_write(struct kiocb *iocb, struct inode *inode)
{
struct nfs_open_context *ctx;

- if (IS_SYNC(inode) || (filp->f_flags & O_DSYNC))
+ if (IS_SYNC(inode) ||
+ (iocb->ki_filp->f_flags & O_DSYNC) ||
+ (iocb->ki_rwflags & RWF_DSYNC))
return 1;
- ctx = nfs_file_open_context(filp);
+ ctx = nfs_file_open_context(iocb->ki_filp);
if (test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags) ||
nfs_ctx_key_to_expire(ctx))
return 1;
@@ -705,7 +707,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
written = result;

/* Return error values for O_DSYNC and IS_SYNC() */
- if (result >= 0 && nfs_need_sync_write(file, inode)) {
+ if (result >= 0 && nfs_need_sync_write(iocb, inode)) {
int err = vfs_fsync(file, 0);
if (err < 0)
result = err;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index bb66ca4..8f9a86b 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2374,8 +2374,10 @@ out_dio:
/* buffered aio wouldn't have proper lock coverage today */
BUG_ON(ret == -EIOCBQUEUED && !(file->f_flags & O_DIRECT));

- if (((file->f_flags & O_DSYNC) && !direct_io) || IS_SYNC(inode) ||
- ((file->f_flags & O_DIRECT) && !direct_io)) {
+ if (((file->f_flags & O_DSYNC) && !direct_io) ||
+ IS_SYNC(inode) ||
+ ((file->f_flags & O_DIRECT) && !direct_io) ||
+ (iocb->ki_rwflags & RWF_DSYNC)) {
ret = filemap_fdatawrite_range(file->f_mapping, *ppos,
*ppos + count - 1);
if (ret < 0)
diff --git a/fs/read_write.c b/fs/read_write.c
index cba7d4c..3443265 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -839,8 +839,13 @@ static ssize_t do_readv_writev(int type, struct file *file,
ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
pos, iter_fn, flags);
} else {
- if (type == READ && (flags & RWF_NONBLOCK))
- return -EAGAIN;
+ if (type == READ) {
+ if (flags & RWF_NONBLOCK)
+ return -EAGAIN;
+ } else {
+ if (flags & RWF_DSYNC)
+ return -EINVAL;
+ }

if (fnv)
ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
@@ -888,7 +893,7 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
return -EBADF;
if (!(file->f_mode & FMODE_CAN_WRITE))
return -EINVAL;
- if (flags & ~0)
+ if (flags & ~RWF_DSYNC)
return -EINVAL;

return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
@@ -1080,8 +1085,13 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
pos, iter_fn, flags);
} else {
- if (type == READ && (flags & RWF_NONBLOCK))
- return -EAGAIN;
+ if (type == READ) {
+ if (flags & RWF_NONBLOCK)
+ return -EAGAIN;
+ } else {
+ if (flags & RWF_DSYNC)
+ return -EINVAL;
+ }

if (fnv)
ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7d0e116..7786b88 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1460,7 +1460,8 @@ struct block_device_operations;
#define HAVE_UNLOCKED_IOCTL 1

/* These flags are used for the readv/writev syscalls with flags. */
-#define RWF_NONBLOCK 0x00000001
+#define RWF_NONBLOCK 0x00000001
+#define RWF_DSYNC 0x00000002

struct iov_iter;

diff --git a/mm/filemap.c b/mm/filemap.c
index 6107058..4fbef99 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2669,7 +2669,9 @@ int generic_write_sync(struct kiocb *iocb, loff_t count)
struct file *file = iocb->ki_filp;

if (count > 0 &&
- ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))) {
+ ((file->f_flags & O_DSYNC) ||
+ (iocb->ki_rwflags & RWF_DSYNC) ||
+ IS_SYNC(file->f_mapping->host))) {
bool fdatasync = !(file->f_flags & __O_SYNC);
ssize_t ret = 0;

--
1.9.1

2014-11-05 21:16:41

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v5 6/7] fs: pass iocb to generic_write_sync

From: Christoph Hellwig <[email protected]>

Clean up the generic_write_sync by just passing an iocb and a bytes
written / negative errno argument. In addition to simplifying the
callers this also prepares for passing a per-operation O_DSYNC
flag. Two callers didn't quite fit that scheme:

- dio_complete didn't both to update ki_pos as we don't need it
on a iocb that is about to be freed, so we had to add it. Additionally
it also synced out written data in the error case, which has been
changed to operate like the other callers.
- gfs2 also used generic_write_sync to implement a crude version
of fallocate. It has been switched to use an open coded variant
instead.

Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/block_dev.c | 8 +-------
fs/btrfs/file.c | 7 ++-----
fs/cifs/file.c | 8 +-------
fs/direct-io.c | 8 ++------
fs/ext4/file.c | 8 +-------
fs/gfs2/file.c | 9 +++++++--
fs/ntfs/file.c | 8 ++------
fs/udf/file.c | 11 ++---------
fs/xfs/xfs_file.c | 8 +-------
include/linux/fs.h | 8 +-------
mm/filemap.c | 30 ++++++++++++++++++++----------
11 files changed, 40 insertions(+), 73 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index cc9d411..c529b1c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1568,18 +1568,12 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
*/
ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
- struct file *file = iocb->ki_filp;
struct blk_plug plug;
ssize_t ret;

blk_start_plug(&plug);
ret = __generic_file_write_iter(iocb, from);
- if (ret > 0) {
- ssize_t err;
- err = generic_write_sync(file, iocb->ki_pos - ret, ret);
- if (err < 0)
- ret = err;
- }
+ ret = generic_write_sync(iocb, ret);
blk_finish_plug(&plug);
return ret;
}
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a18ceab..4f4a6f7 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1820,11 +1820,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
*/
BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
BTRFS_I(inode)->last_sub_trans = root->log_transid;
- if (num_written > 0) {
- err = generic_write_sync(file, pos, num_written);
- if (err < 0)
- num_written = err;
- }
+
+ num_written = generic_write_sync(iocb, num_written);

if (sync)
atomic_dec(&BTRFS_I(inode)->sync_writers);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index c485afa..32359de 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2706,13 +2706,7 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
rc = __generic_file_write_iter(iocb, from);
mutex_unlock(&inode->i_mutex);

- if (rc > 0) {
- ssize_t err;
-
- err = generic_write_sync(file, iocb->ki_pos - rc, rc);
- if (err < 0)
- rc = err;
- }
+ rc = generic_write_sync(iocb, rc);
} else {
mutex_unlock(&inode->i_mutex);
}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index e181b6b..b72ac83 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -257,12 +257,8 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
inode_dio_done(dio->inode);
if (is_async) {
if (dio->rw & WRITE) {
- int err;
-
- err = generic_write_sync(dio->iocb->ki_filp, offset,
- transferred);
- if (err < 0 && ret > 0)
- ret = err;
+ dio->iocb->ki_pos = offset + transferred;
+ ret = generic_write_sync(dio->iocb, ret);
}

aio_complete(dio->iocb, ret, 0);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index aca7b24..79b000c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -175,13 +175,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
ret = __generic_file_write_iter(iocb, from);
mutex_unlock(&inode->i_mutex);

- if (ret > 0) {
- ssize_t err;
-
- err = generic_write_sync(file, iocb->ki_pos - ret, ret);
- if (err < 0)
- ret = err;
- }
+ ret = generic_write_sync(iocb, ret);
if (o_direct)
blk_finish_plug(&plug);

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 80dd44d..3fafeca 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -895,8 +895,13 @@ retry:
gfs2_quota_unlock(ip);
}

- if (error == 0)
- error = generic_write_sync(file, pos, count);
+ if (error)
+ goto out_unlock;
+
+ if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host)) {
+ error = vfs_fsync_range(file, pos, pos + count - 1,
+ (file->f_flags & __O_SYNC) ? 0 : 1);
+ }
goto out_unlock;

out_trans_fail:
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 643faa4..4f3d664 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2127,12 +2127,8 @@ static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
mutex_lock(&inode->i_mutex);
ret = ntfs_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
mutex_unlock(&inode->i_mutex);
- if (ret > 0) {
- int err = generic_write_sync(file, iocb->ki_pos - ret, ret);
- if (err < 0)
- ret = err;
- }
- return ret;
+
+ return generic_write_sync(iocb, ret);
}

/**
diff --git a/fs/udf/file.c b/fs/udf/file.c
index bb15771..1cdabd0 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -155,16 +155,9 @@ static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
retval = __generic_file_write_iter(iocb, from);
mutex_unlock(&inode->i_mutex);

- if (retval > 0) {
- ssize_t err;
-
+ if (retval > 0)
mark_inode_dirty(inode);
- err = generic_write_sync(file, iocb->ki_pos - retval, retval);
- if (err < 0)
- retval = err;
- }
-
- return retval;
+ return generic_write_sync(iocb, retval);
}

long udf_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 0655915..a8cab66 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -792,14 +792,8 @@ xfs_file_write_iter(
ret = xfs_file_buffered_aio_write(iocb, from);

if (ret > 0) {
- ssize_t err;
-
XFS_STATS_ADD(xs_write_bytes, ret);
-
- /* Handle various SYNC-type writes */
- err = generic_write_sync(file, iocb->ki_pos - ret, ret);
- if (err < 0)
- ret = err;
+ ret = generic_write_sync(iocb, ret);
}
return ret;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index eaebd99..7d0e116 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2242,13 +2242,7 @@ extern int filemap_fdatawrite_range(struct address_space *mapping,
extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
int datasync);
extern int vfs_fsync(struct file *file, int datasync);
-static inline int generic_write_sync(struct file *file, loff_t pos, loff_t count)
-{
- if (!(file->f_flags & O_DSYNC) && !IS_SYNC(file->f_mapping->host))
- return 0;
- return vfs_fsync_range(file, pos, pos + count - 1,
- (file->f_flags & __O_SYNC) ? 0 : 1);
-}
+extern int generic_write_sync(struct kiocb *iocb, loff_t count);
extern void emergency_sync(void);
extern void emergency_remount(void);
#ifdef CONFIG_BLOCK
diff --git a/mm/filemap.c b/mm/filemap.c
index 09d3af3..6107058 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2664,6 +2664,24 @@ out:
}
EXPORT_SYMBOL(__generic_file_write_iter);

+int generic_write_sync(struct kiocb *iocb, loff_t count)
+{
+ struct file *file = iocb->ki_filp;
+
+ if (count > 0 &&
+ ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))) {
+ bool fdatasync = !(file->f_flags & __O_SYNC);
+ ssize_t ret = 0;
+
+ ret = vfs_fsync_range(file, iocb->ki_pos - count,
+ iocb->ki_pos - 1, fdatasync);
+ if (ret < 0)
+ return ret;
+ }
+ return count;
+}
+EXPORT_SYMBOL(generic_write_sync);
+
/**
* generic_file_write_iter - write data to a file
* @iocb: IO state structure
@@ -2675,22 +2693,14 @@ EXPORT_SYMBOL(__generic_file_write_iter);
*/
ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
+ struct inode *inode = iocb->ki_filp->f_mapping->host;
ssize_t ret;

mutex_lock(&inode->i_mutex);
ret = __generic_file_write_iter(iocb, from);
mutex_unlock(&inode->i_mutex);

- if (ret > 0) {
- ssize_t err;
-
- err = generic_write_sync(file, iocb->ki_pos - ret, ret);
- if (err < 0)
- ret = err;
- }
- return ret;
+ return generic_write_sync(iocb, ret);
}
EXPORT_SYMBOL(generic_file_write_iter);

--
1.9.1

2014-11-05 21:17:12

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v5 5/7] xfs: add RWF_NONBLOCK support

From: Christoph Hellwig <[email protected]>

Add support for non-blocking reads. The guts are handled by the generic
code, the only addition is a non-blocking variant of xfs_rw_ilock.

Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/xfs/xfs_file.c | 32 +++++++++++++++++++++++++++-----
1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b1f6334..0655915 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -61,6 +61,25 @@ xfs_rw_ilock(
xfs_ilock(ip, type);
}

+static inline bool
+xfs_rw_ilock_nowait(
+ struct xfs_inode *ip,
+ int type)
+{
+ if (type & XFS_IOLOCK_EXCL) {
+ if (!mutex_trylock(&VFS_I(ip)->i_mutex))
+ return false;
+ if (!xfs_ilock_nowait(ip, type)) {
+ mutex_unlock(&VFS_I(ip)->i_mutex);
+ return false;
+ }
+ } else {
+ if (!xfs_ilock_nowait(ip, type))
+ return false;
+ }
+ return true;
+}
+
static inline void
xfs_rw_iunlock(
struct xfs_inode *ip,
@@ -246,10 +265,6 @@ xfs_file_read_iter(

XFS_STATS_INC(xs_read_calls);

- /* XXX: need a non-blocking iolock helper, shouldn't be too hard */
- if (iocb->ki_rwflags & RWF_NONBLOCK)
- return -EAGAIN;
-
if (unlikely(file->f_flags & O_DIRECT))
ioflags |= XFS_IO_ISDIRECT;
if (file->f_mode & FMODE_NOCMTIME)
@@ -287,7 +302,14 @@ xfs_file_read_iter(
* This allows the normal direct IO case of no page cache pages to
* proceeed concurrently without serialisation.
*/
- xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+ if (iocb->ki_rwflags & RWF_NONBLOCK) {
+ if (ioflags & XFS_IO_ISDIRECT)
+ return -EAGAIN;
+ if (!xfs_rw_ilock_nowait(ip, XFS_IOLOCK_SHARED))
+ return -EAGAIN;
+ } else {
+ xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+ }
if ((ioflags & XFS_IO_ISDIRECT) && inode->i_mapping->nrpages) {
xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
--
1.9.1

2014-11-05 21:18:13

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v5 1/7] vfs: Prepare for adding a new preadv/pwritev with user flags.

Plumbing the flags argument through the vfs code so they can be passed down to
__generic_file_(read/write)_iter function that do the acctual work.

Signed-off-by: Milosz Tanski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jeff Moyer <[email protected]>
---
drivers/target/target_core_file.c | 6 +++---
fs/nfsd/vfs.c | 4 ++--
fs/read_write.c | 27 +++++++++++++++------------
fs/splice.c | 2 +-
include/linux/aio.h | 2 ++
include/linux/fs.h | 4 ++--
6 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
index 7d6cdda..58d9a6d 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -350,9 +350,9 @@ static int fd_do_rw(struct se_cmd *cmd, struct scatterlist *sgl,
set_fs(get_ds());

if (is_write)
- ret = vfs_writev(fd, &iov[0], sgl_nents, &pos);
+ ret = vfs_writev(fd, &iov[0], sgl_nents, &pos, 0);
else
- ret = vfs_readv(fd, &iov[0], sgl_nents, &pos);
+ ret = vfs_readv(fd, &iov[0], sgl_nents, &pos, 0);

set_fs(old_fs);

@@ -528,7 +528,7 @@ fd_execute_write_same(struct se_cmd *cmd)

old_fs = get_fs();
set_fs(get_ds());
- rc = vfs_writev(f, &iov[0], iov_num, &pos);
+ rc = vfs_writev(f, &iov[0], iov_num, &pos, 0);
set_fs(old_fs);

vfree(iov);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 989129e..ef01c78 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -872,7 +872,7 @@ __be32 nfsd_readv(struct file *file, loff_t offset, struct kvec *vec, int vlen,

oldfs = get_fs();
set_fs(KERNEL_DS);
- host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset);
+ host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset, 0);
set_fs(oldfs);
return nfsd_finish_read(file, count, host_err);
}
@@ -960,7 +960,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,

/* Write the data. */
oldfs = get_fs(); set_fs(KERNEL_DS);
- host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos);
+ host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos, 0);
set_fs(oldfs);
if (host_err < 0)
goto out_nfserr;
diff --git a/fs/read_write.c b/fs/read_write.c
index 7d9318c..94b2d34 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -653,7 +653,8 @@ unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to)
EXPORT_SYMBOL(iov_shorten);

static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iovec *iov,
- unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn)
+ unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn,
+ int flags)
{
struct kiocb kiocb;
struct iov_iter iter;
@@ -662,6 +663,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iove
init_sync_kiocb(&kiocb, filp);
kiocb.ki_pos = *ppos;
kiocb.ki_nbytes = len;
+ kiocb.ki_rwflags = flags;

iov_iter_init(&iter, rw, iov, nr_segs, len);
ret = fn(&kiocb, &iter);
@@ -800,7 +802,8 @@ out:

static ssize_t do_readv_writev(int type, struct file *file,
const struct iovec __user * uvector,
- unsigned long nr_segs, loff_t *pos)
+ unsigned long nr_segs, loff_t *pos,
+ int flags)
{
size_t tot_len;
struct iovec iovstack[UIO_FASTIOV];
@@ -834,7 +837,7 @@ static ssize_t do_readv_writev(int type, struct file *file,

if (iter_fn)
ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
- pos, iter_fn);
+ pos, iter_fn, flags);
else if (fnv)
ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
pos, fnv);
@@ -857,27 +860,27 @@ out:
}

ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
- unsigned long vlen, loff_t *pos)
+ unsigned long vlen, loff_t *pos, int flags)
{
if (!(file->f_mode & FMODE_READ))
return -EBADF;
if (!(file->f_mode & FMODE_CAN_READ))
return -EINVAL;

- return do_readv_writev(READ, file, vec, vlen, pos);
+ return do_readv_writev(READ, file, vec, vlen, pos, flags);
}

EXPORT_SYMBOL(vfs_readv);

ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
- unsigned long vlen, loff_t *pos)
+ unsigned long vlen, loff_t *pos, int flags)
{
if (!(file->f_mode & FMODE_WRITE))
return -EBADF;
if (!(file->f_mode & FMODE_CAN_WRITE))
return -EINVAL;

- return do_readv_writev(WRITE, file, vec, vlen, pos);
+ return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
}

EXPORT_SYMBOL(vfs_writev);
@@ -890,7 +893,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,

if (f.file) {
loff_t pos = file_pos_read(f.file);
- ret = vfs_readv(f.file, vec, vlen, &pos);
+ ret = vfs_readv(f.file, vec, vlen, &pos, 0);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
@@ -910,7 +913,7 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,

if (f.file) {
loff_t pos = file_pos_read(f.file);
- ret = vfs_writev(f.file, vec, vlen, &pos);
+ ret = vfs_writev(f.file, vec, vlen, &pos, 0);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
@@ -942,7 +945,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
if (f.file) {
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PREAD)
- ret = vfs_readv(f.file, vec, vlen, &pos);
+ ret = vfs_readv(f.file, vec, vlen, &pos, 0);
fdput(f);
}

@@ -966,7 +969,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
if (f.file) {
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PWRITE)
- ret = vfs_writev(f.file, vec, vlen, &pos);
+ ret = vfs_writev(f.file, vec, vlen, &pos, 0);
fdput(f);
}

@@ -1014,7 +1017,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,

if (iter_fn)
ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
- pos, iter_fn);
+ pos, iter_fn, 0);
else if (fnv)
ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
pos, fnv);
diff --git a/fs/splice.c b/fs/splice.c
index f5cb9ba..9591b9f 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -576,7 +576,7 @@ static ssize_t kernel_readv(struct file *file, const struct iovec *vec,
old_fs = get_fs();
set_fs(get_ds());
/* The cast to a user pointer is valid due to the set_fs() */
- res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos);
+ res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
set_fs(old_fs);

return res;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index d9c92da..9c1d499 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -52,6 +52,8 @@ struct kiocb {
* this is the underlying eventfd context to deliver events to.
*/
struct eventfd_ctx *ki_eventfd;
+
+ int ki_rwflags;
};

static inline bool is_sync_kiocb(struct kiocb *kiocb)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a957d43..9ed5711 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1538,9 +1538,9 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
- unsigned long, loff_t *);
+ unsigned long, loff_t *, int);
extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
- unsigned long, loff_t *);
+ unsigned long, loff_t *, int);

struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
--
1.9.1

2014-11-06 07:56:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v5 0/7] vfs: Non-blockling buffered fs read (page cache only)

This series looks good, do you also have a man page sniplet to document
the new syscalls?

2014-11-06 10:19:30

by Steven Whitehouse

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH v5 6/7] fs: pass iocb to generic_write_sync

Hi,

On 05/11/14 21:14, Milosz Tanski wrote:
> From: Christoph Hellwig <[email protected]>
>
> Clean up the generic_write_sync by just passing an iocb and a bytes
> written / negative errno argument. In addition to simplifying the
> callers this also prepares for passing a per-operation O_DSYNC
> flag. Two callers didn't quite fit that scheme:
>
> - dio_complete didn't both to update ki_pos as we don't need it
> on a iocb that is about to be freed, so we had to add it. Additionally
> it also synced out written data in the error case, which has been
> changed to operate like the other callers.
> - gfs2 also used generic_write_sync to implement a crude version
> of fallocate. It has been switched to use an open coded variant
> instead.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
GFS2 bits:

Acked-by: Steven Whitehouse <[email protected]>

I know that Andy Price has some work in this area too, so in due course
we'll have to be careful not to create a merge conflict here. Copying in
Andy so he can see the changes,

Steve.

> ---
> fs/block_dev.c | 8 +-------
> fs/btrfs/file.c | 7 ++-----
> fs/cifs/file.c | 8 +-------
> fs/direct-io.c | 8 ++------
> fs/ext4/file.c | 8 +-------
> fs/gfs2/file.c | 9 +++++++--
> fs/ntfs/file.c | 8 ++------
> fs/udf/file.c | 11 ++---------
> fs/xfs/xfs_file.c | 8 +-------
> include/linux/fs.h | 8 +-------
> mm/filemap.c | 30 ++++++++++++++++++++----------
> 11 files changed, 40 insertions(+), 73 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index cc9d411..c529b1c 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1568,18 +1568,12 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
> */
> ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> - struct file *file = iocb->ki_filp;
> struct blk_plug plug;
> ssize_t ret;
>
> blk_start_plug(&plug);
> ret = __generic_file_write_iter(iocb, from);
> - if (ret > 0) {
> - ssize_t err;
> - err = generic_write_sync(file, iocb->ki_pos - ret, ret);
> - if (err < 0)
> - ret = err;
> - }
> + ret = generic_write_sync(iocb, ret);
> blk_finish_plug(&plug);
> return ret;
> }
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index a18ceab..4f4a6f7 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1820,11 +1820,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> */
> BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
> BTRFS_I(inode)->last_sub_trans = root->log_transid;
> - if (num_written > 0) {
> - err = generic_write_sync(file, pos, num_written);
> - if (err < 0)
> - num_written = err;
> - }
> +
> + num_written = generic_write_sync(iocb, num_written);
>
> if (sync)
> atomic_dec(&BTRFS_I(inode)->sync_writers);
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index c485afa..32359de 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -2706,13 +2706,7 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
> rc = __generic_file_write_iter(iocb, from);
> mutex_unlock(&inode->i_mutex);
>
> - if (rc > 0) {
> - ssize_t err;
> -
> - err = generic_write_sync(file, iocb->ki_pos - rc, rc);
> - if (err < 0)
> - rc = err;
> - }
> + rc = generic_write_sync(iocb, rc);
> } else {
> mutex_unlock(&inode->i_mutex);
> }
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index e181b6b..b72ac83 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -257,12 +257,8 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
> inode_dio_done(dio->inode);
> if (is_async) {
> if (dio->rw & WRITE) {
> - int err;
> -
> - err = generic_write_sync(dio->iocb->ki_filp, offset,
> - transferred);
> - if (err < 0 && ret > 0)
> - ret = err;
> + dio->iocb->ki_pos = offset + transferred;
> + ret = generic_write_sync(dio->iocb, ret);
> }
>
> aio_complete(dio->iocb, ret, 0);
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index aca7b24..79b000c 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -175,13 +175,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> ret = __generic_file_write_iter(iocb, from);
> mutex_unlock(&inode->i_mutex);
>
> - if (ret > 0) {
> - ssize_t err;
> -
> - err = generic_write_sync(file, iocb->ki_pos - ret, ret);
> - if (err < 0)
> - ret = err;
> - }
> + ret = generic_write_sync(iocb, ret);
> if (o_direct)
> blk_finish_plug(&plug);
>
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 80dd44d..3fafeca 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -895,8 +895,13 @@ retry:
> gfs2_quota_unlock(ip);
> }
>
> - if (error == 0)
> - error = generic_write_sync(file, pos, count);
> + if (error)
> + goto out_unlock;
> +
> + if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host)) {
> + error = vfs_fsync_range(file, pos, pos + count - 1,
> + (file->f_flags & __O_SYNC) ? 0 : 1);
> + }
> goto out_unlock;
>
> out_trans_fail:
> diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
> index 643faa4..4f3d664 100644
> --- a/fs/ntfs/file.c
> +++ b/fs/ntfs/file.c
> @@ -2127,12 +2127,8 @@ static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
> mutex_lock(&inode->i_mutex);
> ret = ntfs_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
> mutex_unlock(&inode->i_mutex);
> - if (ret > 0) {
> - int err = generic_write_sync(file, iocb->ki_pos - ret, ret);
> - if (err < 0)
> - ret = err;
> - }
> - return ret;
> +
> + return generic_write_sync(iocb, ret);
> }
>
> /**
> diff --git a/fs/udf/file.c b/fs/udf/file.c
> index bb15771..1cdabd0 100644
> --- a/fs/udf/file.c
> +++ b/fs/udf/file.c
> @@ -155,16 +155,9 @@ static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> retval = __generic_file_write_iter(iocb, from);
> mutex_unlock(&inode->i_mutex);
>
> - if (retval > 0) {
> - ssize_t err;
> -
> + if (retval > 0)
> mark_inode_dirty(inode);
> - err = generic_write_sync(file, iocb->ki_pos - retval, retval);
> - if (err < 0)
> - retval = err;
> - }
> -
> - return retval;
> + return generic_write_sync(iocb, retval);
> }
>
> long udf_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 0655915..a8cab66 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -792,14 +792,8 @@ xfs_file_write_iter(
> ret = xfs_file_buffered_aio_write(iocb, from);
>
> if (ret > 0) {
> - ssize_t err;
> -
> XFS_STATS_ADD(xs_write_bytes, ret);
> -
> - /* Handle various SYNC-type writes */
> - err = generic_write_sync(file, iocb->ki_pos - ret, ret);
> - if (err < 0)
> - ret = err;
> + ret = generic_write_sync(iocb, ret);
> }
> return ret;
> }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index eaebd99..7d0e116 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2242,13 +2242,7 @@ extern int filemap_fdatawrite_range(struct address_space *mapping,
> extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
> int datasync);
> extern int vfs_fsync(struct file *file, int datasync);
> -static inline int generic_write_sync(struct file *file, loff_t pos, loff_t count)
> -{
> - if (!(file->f_flags & O_DSYNC) && !IS_SYNC(file->f_mapping->host))
> - return 0;
> - return vfs_fsync_range(file, pos, pos + count - 1,
> - (file->f_flags & __O_SYNC) ? 0 : 1);
> -}
> +extern int generic_write_sync(struct kiocb *iocb, loff_t count);
> extern void emergency_sync(void);
> extern void emergency_remount(void);
> #ifdef CONFIG_BLOCK
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 09d3af3..6107058 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2664,6 +2664,24 @@ out:
> }
> EXPORT_SYMBOL(__generic_file_write_iter);
>
> +int generic_write_sync(struct kiocb *iocb, loff_t count)
> +{
> + struct file *file = iocb->ki_filp;
> +
> + if (count > 0 &&
> + ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))) {
> + bool fdatasync = !(file->f_flags & __O_SYNC);
> + ssize_t ret = 0;
> +
> + ret = vfs_fsync_range(file, iocb->ki_pos - count,
> + iocb->ki_pos - 1, fdatasync);
> + if (ret < 0)
> + return ret;
> + }
> + return count;
> +}
> +EXPORT_SYMBOL(generic_write_sync);
> +
> /**
> * generic_file_write_iter - write data to a file
> * @iocb: IO state structure
> @@ -2675,22 +2693,14 @@ EXPORT_SYMBOL(__generic_file_write_iter);
> */
> ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> - struct file *file = iocb->ki_filp;
> - struct inode *inode = file->f_mapping->host;
> + struct inode *inode = iocb->ki_filp->f_mapping->host;
> ssize_t ret;
>
> mutex_lock(&inode->i_mutex);
> ret = __generic_file_write_iter(iocb, from);
> mutex_unlock(&inode->i_mutex);
>
> - if (ret > 0) {
> - ssize_t err;
> -
> - err = generic_write_sync(file, iocb->ki_pos - ret, ret);
> - if (err < 0)
> - ret = err;
> - }
> - return ret;
> + return generic_write_sync(iocb, ret);
> }
> EXPORT_SYMBOL(generic_file_write_iter);
>

2014-11-06 10:53:15

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: [Linux-NTFS-Dev] [PATCH v5 6/7] fs: pass iocb to generic_write_sync

Hi,

> On 5 Nov 2014, at 23:14, Milosz Tanski <[email protected]> wrote:
>
> From: Christoph Hellwig <[email protected]>
>
> Clean up the generic_write_sync by just passing an iocb and a bytes
> written / negative errno argument. In addition to simplifying the
> callers this also prepares for passing a per-operation O_DSYNC
> flag. Two callers didn't quite fit that scheme:
>
> - dio_complete didn't both to update ki_pos as we don't need it
> on a iocb that is about to be freed, so we had to add it. Additionally
> it also synced out written data in the error case, which has been
> changed to operate like the other callers.
> - gfs2 also used generic_write_sync to implement a crude version
> of fallocate. It has been switched to use an open coded variant
> instead.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> fs/block_dev.c | 8 +-------
> fs/btrfs/file.c | 7 ++-----
> fs/cifs/file.c | 8 +-------
> fs/direct-io.c | 8 ++------
> fs/ext4/file.c | 8 +-------
> fs/gfs2/file.c | 9 +++++++--
> fs/ntfs/file.c | 8 ++------
> fs/udf/file.c | 11 ++---------
> fs/xfs/xfs_file.c | 8 +-------
> include/linux/fs.h | 8 +-------
> mm/filemap.c | 30 ++++++++++++++++++++----------
> 11 files changed, 40 insertions(+), 73 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index cc9d411..c529b1c 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1568,18 +1568,12 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
> */
> ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> - struct file *file = iocb->ki_filp;
> struct blk_plug plug;
> ssize_t ret;
>
> blk_start_plug(&plug);
> ret = __generic_file_write_iter(iocb, from);
> - if (ret > 0) {
> - ssize_t err;
> - err = generic_write_sync(file, iocb->ki_pos - ret, ret);
> - if (err < 0)
> - ret = err;
> - }
> + ret = generic_write_sync(iocb, ret);
> blk_finish_plug(&plug);
> return ret;
> }
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index a18ceab..4f4a6f7 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1820,11 +1820,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> */
> BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
> BTRFS_I(inode)->last_sub_trans = root->log_transid;
> - if (num_written > 0) {
> - err = generic_write_sync(file, pos, num_written);
> - if (err < 0)
> - num_written = err;
> - }
> +
> + num_written = generic_write_sync(iocb, num_written);
>
> if (sync)
> atomic_dec(&BTRFS_I(inode)->sync_writers);
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index c485afa..32359de 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -2706,13 +2706,7 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
> rc = __generic_file_write_iter(iocb, from);
> mutex_unlock(&inode->i_mutex);
>
> - if (rc > 0) {
> - ssize_t err;
> -
> - err = generic_write_sync(file, iocb->ki_pos - rc, rc);
> - if (err < 0)
> - rc = err;
> - }
> + rc = generic_write_sync(iocb, rc);
> } else {
> mutex_unlock(&inode->i_mutex);
> }
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index e181b6b..b72ac83 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -257,12 +257,8 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
> inode_dio_done(dio->inode);
> if (is_async) {
> if (dio->rw & WRITE) {
> - int err;
> -
> - err = generic_write_sync(dio->iocb->ki_filp, offset,
> - transferred);
> - if (err < 0 && ret > 0)
> - ret = err;
> + dio->iocb->ki_pos = offset + transferred;
> + ret = generic_write_sync(dio->iocb, ret);
> }
>
> aio_complete(dio->iocb, ret, 0);
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index aca7b24..79b000c 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -175,13 +175,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> ret = __generic_file_write_iter(iocb, from);
> mutex_unlock(&inode->i_mutex);
>
> - if (ret > 0) {
> - ssize_t err;
> -
> - err = generic_write_sync(file, iocb->ki_pos - ret, ret);
> - if (err < 0)
> - ret = err;
> - }
> + ret = generic_write_sync(iocb, ret);
> if (o_direct)
> blk_finish_plug(&plug);
>
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 80dd44d..3fafeca 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -895,8 +895,13 @@ retry:
> gfs2_quota_unlock(ip);
> }
>
> - if (error == 0)
> - error = generic_write_sync(file, pos, count);
> + if (error)
> + goto out_unlock;
> +
> + if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host)) {
> + error = vfs_fsync_range(file, pos, pos + count - 1,
> + (file->f_flags & __O_SYNC) ? 0 : 1);
> + }
> goto out_unlock;
>
> out_trans_fail:
> diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
> index 643faa4..4f3d664 100644
> --- a/fs/ntfs/file.c
> +++ b/fs/ntfs/file.c
> @@ -2127,12 +2127,8 @@ static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
> mutex_lock(&inode->i_mutex);
> ret = ntfs_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
> mutex_unlock(&inode->i_mutex);
> - if (ret > 0) {
> - int err = generic_write_sync(file, iocb->ki_pos - ret, ret);
> - if (err < 0)
> - ret = err;
> - }
> - return ret;
> +
> + return generic_write_sync(iocb, ret);
> }
>
> /**
> diff --git a/fs/udf/file.c b/fs/udf/file.c
> index bb15771..1cdabd0 100644
> --- a/fs/udf/file.c
> +++ b/fs/udf/file.c
> @@ -155,16 +155,9 @@ static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> retval = __generic_file_write_iter(iocb, from);
> mutex_unlock(&inode->i_mutex);
>
> - if (retval > 0) {
> - ssize_t err;
> -
> + if (retval > 0)
> mark_inode_dirty(inode);
> - err = generic_write_sync(file, iocb->ki_pos - retval, retval);
> - if (err < 0)
> - retval = err;
> - }
> -
> - return retval;
> + return generic_write_sync(iocb, retval);
> }
>
> long udf_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 0655915..a8cab66 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -792,14 +792,8 @@ xfs_file_write_iter(
> ret = xfs_file_buffered_aio_write(iocb, from);
>
> if (ret > 0) {
> - ssize_t err;
> -
> XFS_STATS_ADD(xs_write_bytes, ret);
> -
> - /* Handle various SYNC-type writes */
> - err = generic_write_sync(file, iocb->ki_pos - ret, ret);
> - if (err < 0)
> - ret = err;
> + ret = generic_write_sync(iocb, ret);
> }
> return ret;
> }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index eaebd99..7d0e116 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2242,13 +2242,7 @@ extern int filemap_fdatawrite_range(struct address_space *mapping,
> extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
> int datasync);
> extern int vfs_fsync(struct file *file, int datasync);
> -static inline int generic_write_sync(struct file *file, loff_t pos, loff_t count)
> -{
> - if (!(file->f_flags & O_DSYNC) && !IS_SYNC(file->f_mapping->host))
> - return 0;
> - return vfs_fsync_range(file, pos, pos + count - 1,
> - (file->f_flags & __O_SYNC) ? 0 : 1);
> -}
> +extern int generic_write_sync(struct kiocb *iocb, loff_t count);
> extern void emergency_sync(void);
> extern void emergency_remount(void);
> #ifdef CONFIG_BLOCK
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 09d3af3..6107058 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2664,6 +2664,24 @@ out:
> }
> EXPORT_SYMBOL(__generic_file_write_iter);
>
> +int generic_write_sync(struct kiocb *iocb, loff_t count)
> +{
> + struct file *file = iocb->ki_filp;
> +
> + if (count > 0 &&
> + ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))) {
> + bool fdatasync = !(file->f_flags & __O_SYNC);
> + ssize_t ret = 0;

That "= 0" is pointless. "ret" is overwritten unconditionally on the following line...

Other than that the NTFS bits are:

Acked-by: Anton Altaparmakov <[email protected]>

Best regards,

Anton

> +
> + ret = vfs_fsync_range(file, iocb->ki_pos - count,
> + iocb->ki_pos - 1, fdatasync);
> + if (ret < 0)
> + return ret;
> + }
> + return count;
> +}
> +EXPORT_SYMBOL(generic_write_sync);
> +
> /**
> * generic_file_write_iter - write data to a file
> * @iocb: IO state structure
> @@ -2675,22 +2693,14 @@ EXPORT_SYMBOL(__generic_file_write_iter);
> */
> ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> - struct file *file = iocb->ki_filp;
> - struct inode *inode = file->f_mapping->host;
> + struct inode *inode = iocb->ki_filp->f_mapping->host;
> ssize_t ret;
>
> mutex_lock(&inode->i_mutex);
> ret = __generic_file_write_iter(iocb, from);
> mutex_unlock(&inode->i_mutex);
>
> - if (ret > 0) {
> - ssize_t err;
> -
> - err = generic_write_sync(file, iocb->ki_pos - ret, ret);
> - if (err < 0)
> - ret = err;
> - }
> - return ret;
> + return generic_write_sync(iocb, ret);
> }
> EXPORT_SYMBOL(generic_file_write_iter);

--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
University of Cambridge Information Services, Roger Needham Building
7 JJ Thomson Avenue, Cambridge, CB3 0RB, UK

2014-11-06 12:04:18

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v5 6/7] fs: pass iocb to generic_write_sync

On Wed 05-11-14 16:14:52, Milosz Tanski wrote:
> From: Christoph Hellwig <[email protected]>
>
> Clean up the generic_write_sync by just passing an iocb and a bytes
> written / negative errno argument. In addition to simplifying the
> callers this also prepares for passing a per-operation O_DSYNC
> flag. Two callers didn't quite fit that scheme:
>
> - dio_complete didn't both to update ki_pos as we don't need it
> on a iocb that is about to be freed, so we had to add it. Additionally
> it also synced out written data in the error case, which has been
> changed to operate like the other callers.
> - gfs2 also used generic_write_sync to implement a crude version
> of fallocate. It has been switched to use an open coded variant
> instead.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
Looks good. You can add:
Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/block_dev.c | 8 +-------
> fs/btrfs/file.c | 7 ++-----
> fs/cifs/file.c | 8 +-------
> fs/direct-io.c | 8 ++------
> fs/ext4/file.c | 8 +-------
> fs/gfs2/file.c | 9 +++++++--
> fs/ntfs/file.c | 8 ++------
> fs/udf/file.c | 11 ++---------
> fs/xfs/xfs_file.c | 8 +-------
> include/linux/fs.h | 8 +-------
> mm/filemap.c | 30 ++++++++++++++++++++----------
> 11 files changed, 40 insertions(+), 73 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index cc9d411..c529b1c 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1568,18 +1568,12 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
> */
> ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> - struct file *file = iocb->ki_filp;
> struct blk_plug plug;
> ssize_t ret;
>
> blk_start_plug(&plug);
> ret = __generic_file_write_iter(iocb, from);
> - if (ret > 0) {
> - ssize_t err;
> - err = generic_write_sync(file, iocb->ki_pos - ret, ret);
> - if (err < 0)
> - ret = err;
> - }
> + ret = generic_write_sync(iocb, ret);
> blk_finish_plug(&plug);
> return ret;
> }
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index a18ceab..4f4a6f7 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1820,11 +1820,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> */
> BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
> BTRFS_I(inode)->last_sub_trans = root->log_transid;
> - if (num_written > 0) {
> - err = generic_write_sync(file, pos, num_written);
> - if (err < 0)
> - num_written = err;
> - }
> +
> + num_written = generic_write_sync(iocb, num_written);
>
> if (sync)
> atomic_dec(&BTRFS_I(inode)->sync_writers);
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index c485afa..32359de 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -2706,13 +2706,7 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
> rc = __generic_file_write_iter(iocb, from);
> mutex_unlock(&inode->i_mutex);
>
> - if (rc > 0) {
> - ssize_t err;
> -
> - err = generic_write_sync(file, iocb->ki_pos - rc, rc);
> - if (err < 0)
> - rc = err;
> - }
> + rc = generic_write_sync(iocb, rc);
> } else {
> mutex_unlock(&inode->i_mutex);
> }
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index e181b6b..b72ac83 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -257,12 +257,8 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
> inode_dio_done(dio->inode);
> if (is_async) {
> if (dio->rw & WRITE) {
> - int err;
> -
> - err = generic_write_sync(dio->iocb->ki_filp, offset,
> - transferred);
> - if (err < 0 && ret > 0)
> - ret = err;
> + dio->iocb->ki_pos = offset + transferred;
> + ret = generic_write_sync(dio->iocb, ret);
> }
>
> aio_complete(dio->iocb, ret, 0);
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index aca7b24..79b000c 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -175,13 +175,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> ret = __generic_file_write_iter(iocb, from);
> mutex_unlock(&inode->i_mutex);
>
> - if (ret > 0) {
> - ssize_t err;
> -
> - err = generic_write_sync(file, iocb->ki_pos - ret, ret);
> - if (err < 0)
> - ret = err;
> - }
> + ret = generic_write_sync(iocb, ret);
> if (o_direct)
> blk_finish_plug(&plug);
>
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 80dd44d..3fafeca 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -895,8 +895,13 @@ retry:
> gfs2_quota_unlock(ip);
> }
>
> - if (error == 0)
> - error = generic_write_sync(file, pos, count);
> + if (error)
> + goto out_unlock;
> +
> + if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host)) {
> + error = vfs_fsync_range(file, pos, pos + count - 1,
> + (file->f_flags & __O_SYNC) ? 0 : 1);
> + }
> goto out_unlock;
>
> out_trans_fail:
> diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
> index 643faa4..4f3d664 100644
> --- a/fs/ntfs/file.c
> +++ b/fs/ntfs/file.c
> @@ -2127,12 +2127,8 @@ static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
> mutex_lock(&inode->i_mutex);
> ret = ntfs_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
> mutex_unlock(&inode->i_mutex);
> - if (ret > 0) {
> - int err = generic_write_sync(file, iocb->ki_pos - ret, ret);
> - if (err < 0)
> - ret = err;
> - }
> - return ret;
> +
> + return generic_write_sync(iocb, ret);
> }
>
> /**
> diff --git a/fs/udf/file.c b/fs/udf/file.c
> index bb15771..1cdabd0 100644
> --- a/fs/udf/file.c
> +++ b/fs/udf/file.c
> @@ -155,16 +155,9 @@ static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> retval = __generic_file_write_iter(iocb, from);
> mutex_unlock(&inode->i_mutex);
>
> - if (retval > 0) {
> - ssize_t err;
> -
> + if (retval > 0)
> mark_inode_dirty(inode);
> - err = generic_write_sync(file, iocb->ki_pos - retval, retval);
> - if (err < 0)
> - retval = err;
> - }
> -
> - return retval;
> + return generic_write_sync(iocb, retval);
> }
>
> long udf_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 0655915..a8cab66 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -792,14 +792,8 @@ xfs_file_write_iter(
> ret = xfs_file_buffered_aio_write(iocb, from);
>
> if (ret > 0) {
> - ssize_t err;
> -
> XFS_STATS_ADD(xs_write_bytes, ret);
> -
> - /* Handle various SYNC-type writes */
> - err = generic_write_sync(file, iocb->ki_pos - ret, ret);
> - if (err < 0)
> - ret = err;
> + ret = generic_write_sync(iocb, ret);
> }
> return ret;
> }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index eaebd99..7d0e116 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2242,13 +2242,7 @@ extern int filemap_fdatawrite_range(struct address_space *mapping,
> extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
> int datasync);
> extern int vfs_fsync(struct file *file, int datasync);
> -static inline int generic_write_sync(struct file *file, loff_t pos, loff_t count)
> -{
> - if (!(file->f_flags & O_DSYNC) && !IS_SYNC(file->f_mapping->host))
> - return 0;
> - return vfs_fsync_range(file, pos, pos + count - 1,
> - (file->f_flags & __O_SYNC) ? 0 : 1);
> -}
> +extern int generic_write_sync(struct kiocb *iocb, loff_t count);
> extern void emergency_sync(void);
> extern void emergency_remount(void);
> #ifdef CONFIG_BLOCK
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 09d3af3..6107058 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2664,6 +2664,24 @@ out:
> }
> EXPORT_SYMBOL(__generic_file_write_iter);
>
> +int generic_write_sync(struct kiocb *iocb, loff_t count)
> +{
> + struct file *file = iocb->ki_filp;
> +
> + if (count > 0 &&
> + ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))) {
> + bool fdatasync = !(file->f_flags & __O_SYNC);
> + ssize_t ret = 0;
> +
> + ret = vfs_fsync_range(file, iocb->ki_pos - count,
> + iocb->ki_pos - 1, fdatasync);
> + if (ret < 0)
> + return ret;
> + }
> + return count;
> +}
> +EXPORT_SYMBOL(generic_write_sync);
> +
> /**
> * generic_file_write_iter - write data to a file
> * @iocb: IO state structure
> @@ -2675,22 +2693,14 @@ EXPORT_SYMBOL(__generic_file_write_iter);
> */
> ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> - struct file *file = iocb->ki_filp;
> - struct inode *inode = file->f_mapping->host;
> + struct inode *inode = iocb->ki_filp->f_mapping->host;
> ssize_t ret;
>
> mutex_lock(&inode->i_mutex);
> ret = __generic_file_write_iter(iocb, from);
> mutex_unlock(&inode->i_mutex);
>
> - if (ret > 0) {
> - ssize_t err;
> -
> - err = generic_write_sync(file, iocb->ki_pos - ret, ret);
> - if (err < 0)
> - ret = err;
> - }
> - return ret;
> + return generic_write_sync(iocb, ret);
> }
> EXPORT_SYMBOL(generic_file_write_iter);
>
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-11-06 15:44:36

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v2 1/2] Add preadv2/pwritev2 documentation.

New syscalls that are a variation on the preadv/pwritev but support an extra
flag argument.

Signed-off-by: Milosz Tanski <[email protected]>
Suggested-by: Jeff Moyer <[email protected]>
Fixes: Jeff Moyer <[email protected]>
---
man2/readv.2 | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 61 insertions(+), 10 deletions(-)

diff --git a/man2/readv.2 b/man2/readv.2
index 8748efa..31b3870 100644
--- a/man2/readv.2
+++ b/man2/readv.2
@@ -45,6 +45,12 @@ readv, writev, preadv, pwritev \- read or write data into multiple buffers
.sp
.BI "ssize_t pwritev(int " fd ", const struct iovec *" iov ", int " iovcnt ,
.BI " off_t " offset );
+.sp
+.BI "ssize_t preadv2(int " fd ", const struct iovec *" iov ", int " iovcnt ,
+.BI " off_t " offset ", int " flags );
+.sp
+.BI "ssize_t pwritev2(int " fd ", const struct iovec *" iov ", int " iovcnt ,
+.BI " off_t " offset ", int " flags );
.fi
.sp
.in -4n
@@ -162,9 +168,9 @@ The
system call combines the functionality of
.BR writev ()
and
-.BR pwrite (2).
+.BR pwrite (2) "."
It performs the same task as
-.BR writev (),
+.BR writev () ","
but adds a fourth argument,
.IR offset ,
which specifies the file offset at which the output operation
@@ -174,15 +180,41 @@ The file offset is not changed by these system calls.
The file referred to by
.I fd
must be capable of seeking.
+.SS preadv2() and pwritev2()
+
+This pair of system calls has similar functionality to the
+.BR preadv ()
+and
+.BR pwritev ()
+calls, but adds a fifth argument, \fIflags\fP, which modifies the behavior on a per call basis.
+
+Like the
+.BR preadv ()
+and
+.BR pwritev ()
+calls, they accept an \fIoffset\fP argument. Unlike those calls, if the \fIoffset\fP argument is set to -1 then the current file offset is used and updated.
+
+The \fIflags\fP arguments to
+.BR preadv2 ()
+and
+.BR pwritev2 ()
+contains a bitwise OR of one or more of the following flags:
+.TP
+.BR RWF_NONBLOCK " (only " preadv2() " since Linux 3.19)"
+Performs a non-blocking operation for regular files (not sockets) opened in buffered mode (not
+.BR O_DIRECT ")."
+
.SH RETURN VALUE
On success,
-.BR readv ()
-and
+.BR readv () ","
.BR preadv ()
-return the number of bytes read;
-.BR writev ()
and
+.BR preadv2 ()
+return the number of bytes read;
+.BR writev () ","
.BR pwritev ()
+and
+.BR pwritev2 ()
return the number of bytes written.
On error, \-1 is returned, and \fIerrno\fP is set appropriately.
.SH ERRORS
@@ -191,12 +223,22 @@ The errors are as given for
and
.BR write (2).
Furthermore,
-.BR preadv ()
-and
+.BR preadv () ","
+.BR preadv2 () ","
.BR pwritev ()
+and
+.BR pwritev2 ()
can also fail for the same reasons as
.BR lseek (2).
-Additionally, the following error is defined:
+Additionally, the following errors are defined:
+.TP
+.B EAGAIN
+The operation would block. This is possible if the file descriptor \fIfd\fP refers to a socket and has been marked nonblocking
+.RB ( O_NONBLOCK ),
+or the operation is a
+.BR preadv2
+and the \fIflags\fP argument is set to
+.BR RWF_NONBLOCK.
.TP
.B EINVAL
The sum of the
@@ -205,12 +247,17 @@ values overflows an
.I ssize_t
value.
Or, the vector count \fIiovcnt\fP is less than zero or greater than the
-permitted maximum.
+permitted maximum. Or, an unknown flag is specified in \fIflags\fP.
.SH VERSIONS
.BR preadv ()
and
.BR pwritev ()
first appeared in Linux 2.6.30; library support was added in glibc 2.10.
+.sp
+.BR preadv2 ()
+and
+.BR pwritev2 ()
+first appeared in Linux 3.19 (if we're lucky);
.SH CONFORMING TO
.BR readv (),
.BR writev ():
@@ -223,6 +270,10 @@ first appeared in Linux 2.6.30; library support was added in glibc 2.10.
.BR preadv (),
.BR pwritev ():
nonstandard, but present also on the modern BSDs.
+.sp
+.BR preadv2 (),
+.BR pwritev2 ():
+nonstandard, Linux extension.
.SH NOTES
.SS C library/kernel ABI differences
POSIX.1-2001 allows an implementation to place a limit on
--
1.9.1

2014-11-06 15:44:57

by Milosz Tanski

[permalink] [raw]
Subject: [PATCH v2 2/2] RWF_ODSYNC flag for pwritev2

Document RWF_ODSYNC flag for pwritev2 as implemented by Christoph Hellwig.

Signed-off-by: Milosz Tanski <[email protected]>
---
man2/readv.2 | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/man2/readv.2 b/man2/readv.2
index 31b3870..ff1405c 100644
--- a/man2/readv.2
+++ b/man2/readv.2
@@ -203,6 +203,15 @@ contains a bitwise OR of one or more of the following flags:
.BR RWF_NONBLOCK " (only " preadv2() " since Linux 3.19)"
Performs a non-blocking operation for regular files (not sockets) opened in buffered mode (not
.BR O_DIRECT ")."
+.TP
+.BR RWF_DSYNC " (only " pwritev2() " since Linux 3.19)"
+Write operation will complete according to the requirements of synchronized I/O
+.I data
+integrity completion. This has the same effect on the operation as if the file handle was created by
+.BR open(2)
+with the
+.BR O_DSYNC
+flag.

.SH RETURN VALUE
On success,
@@ -333,6 +342,7 @@ nwritten = writev(STDOUT_FILENO, iov, 2);
.fi
.in
.SH SEE ALSO
+.BR open(2),
.BR pread (2),
.BR read (2),
.BR write (2)
--
1.9.1

2014-11-06 15:46:19

by Milosz Tanski

[permalink] [raw]
Subject: Re: [PATCH v5 0/7] vfs: Non-blockling buffered fs read (page cache only)

On Thu, Nov 6, 2014 at 2:56 AM, Christoph Hellwig <[email protected]> wrote:
> This series looks good, do you also have a man page sniplet to document
> the new syscalls?
>

I just send out the two patches for the man pages, ran out of time
yesterday to update it for RWF_ODSYNC.

--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: [email protected]

2014-11-06 16:14:40

by Milosz Tanski

[permalink] [raw]
Subject: Re: [Linux-NTFS-Dev] [PATCH v5 6/7] fs: pass iocb to generic_write_sync

On Thu, Nov 6, 2014 at 5:52 AM, Anton Altaparmakov <[email protected]> wrote:
> Hi,
>
>> On 5 Nov 2014, at 23:14, Milosz Tanski <[email protected]> wrote:
>>
>> From: Christoph Hellwig <[email protected]>
>>
>> Clean up the generic_write_sync by just passing an iocb and a bytes
>> written / negative errno argument. In addition to simplifying the
>> callers this also prepares for passing a per-operation O_DSYNC
>> flag. Two callers didn't quite fit that scheme:
>>
>> - dio_complete didn't both to update ki_pos as we don't need it
>> on a iocb that is about to be freed, so we had to add it. Additionally
>> it also synced out written data in the error case, which has been
>> changed to operate like the other callers.
>> - gfs2 also used generic_write_sync to implement a crude version
>> of fallocate. It has been switched to use an open coded variant
>> instead.
>>
>> Signed-off-by: Christoph Hellwig <[email protected]>
>> ---
>> fs/block_dev.c | 8 +-------
>> fs/btrfs/file.c | 7 ++-----
>> fs/cifs/file.c | 8 +-------
>> fs/direct-io.c | 8 ++------
>> fs/ext4/file.c | 8 +-------
>> fs/gfs2/file.c | 9 +++++++--
>> fs/ntfs/file.c | 8 ++------
>> fs/udf/file.c | 11 ++---------
>> fs/xfs/xfs_file.c | 8 +-------
>> include/linux/fs.h | 8 +-------
>> mm/filemap.c | 30 ++++++++++++++++++++----------
>> 11 files changed, 40 insertions(+), 73 deletions(-)
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index cc9d411..c529b1c 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -1568,18 +1568,12 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
>> */
>> ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> {
>> - struct file *file = iocb->ki_filp;
>> struct blk_plug plug;
>> ssize_t ret;
>>
>> blk_start_plug(&plug);
>> ret = __generic_file_write_iter(iocb, from);
>> - if (ret > 0) {
>> - ssize_t err;
>> - err = generic_write_sync(file, iocb->ki_pos - ret, ret);
>> - if (err < 0)
>> - ret = err;
>> - }
>> + ret = generic_write_sync(iocb, ret);
>> blk_finish_plug(&plug);
>> return ret;
>> }
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index a18ceab..4f4a6f7 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -1820,11 +1820,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>> */
>> BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
>> BTRFS_I(inode)->last_sub_trans = root->log_transid;
>> - if (num_written > 0) {
>> - err = generic_write_sync(file, pos, num_written);
>> - if (err < 0)
>> - num_written = err;
>> - }
>> +
>> + num_written = generic_write_sync(iocb, num_written);
>>
>> if (sync)
>> atomic_dec(&BTRFS_I(inode)->sync_writers);
>> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
>> index c485afa..32359de 100644
>> --- a/fs/cifs/file.c
>> +++ b/fs/cifs/file.c
>> @@ -2706,13 +2706,7 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
>> rc = __generic_file_write_iter(iocb, from);
>> mutex_unlock(&inode->i_mutex);
>>
>> - if (rc > 0) {
>> - ssize_t err;
>> -
>> - err = generic_write_sync(file, iocb->ki_pos - rc, rc);
>> - if (err < 0)
>> - rc = err;
>> - }
>> + rc = generic_write_sync(iocb, rc);
>> } else {
>> mutex_unlock(&inode->i_mutex);
>> }
>> diff --git a/fs/direct-io.c b/fs/direct-io.c
>> index e181b6b..b72ac83 100644
>> --- a/fs/direct-io.c
>> +++ b/fs/direct-io.c
>> @@ -257,12 +257,8 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
>> inode_dio_done(dio->inode);
>> if (is_async) {
>> if (dio->rw & WRITE) {
>> - int err;
>> -
>> - err = generic_write_sync(dio->iocb->ki_filp, offset,
>> - transferred);
>> - if (err < 0 && ret > 0)
>> - ret = err;
>> + dio->iocb->ki_pos = offset + transferred;
>> + ret = generic_write_sync(dio->iocb, ret);
>> }
>>
>> aio_complete(dio->iocb, ret, 0);
>> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> index aca7b24..79b000c 100644
>> --- a/fs/ext4/file.c
>> +++ b/fs/ext4/file.c
>> @@ -175,13 +175,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> ret = __generic_file_write_iter(iocb, from);
>> mutex_unlock(&inode->i_mutex);
>>
>> - if (ret > 0) {
>> - ssize_t err;
>> -
>> - err = generic_write_sync(file, iocb->ki_pos - ret, ret);
>> - if (err < 0)
>> - ret = err;
>> - }
>> + ret = generic_write_sync(iocb, ret);
>> if (o_direct)
>> blk_finish_plug(&plug);
>>
>> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
>> index 80dd44d..3fafeca 100644
>> --- a/fs/gfs2/file.c
>> +++ b/fs/gfs2/file.c
>> @@ -895,8 +895,13 @@ retry:
>> gfs2_quota_unlock(ip);
>> }
>>
>> - if (error == 0)
>> - error = generic_write_sync(file, pos, count);
>> + if (error)
>> + goto out_unlock;
>> +
>> + if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host)) {
>> + error = vfs_fsync_range(file, pos, pos + count - 1,
>> + (file->f_flags & __O_SYNC) ? 0 : 1);
>> + }
>> goto out_unlock;
>>
>> out_trans_fail:
>> diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
>> index 643faa4..4f3d664 100644
>> --- a/fs/ntfs/file.c
>> +++ b/fs/ntfs/file.c
>> @@ -2127,12 +2127,8 @@ static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
>> mutex_lock(&inode->i_mutex);
>> ret = ntfs_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
>> mutex_unlock(&inode->i_mutex);
>> - if (ret > 0) {
>> - int err = generic_write_sync(file, iocb->ki_pos - ret, ret);
>> - if (err < 0)
>> - ret = err;
>> - }
>> - return ret;
>> +
>> + return generic_write_sync(iocb, ret);
>> }
>>
>> /**
>> diff --git a/fs/udf/file.c b/fs/udf/file.c
>> index bb15771..1cdabd0 100644
>> --- a/fs/udf/file.c
>> +++ b/fs/udf/file.c
>> @@ -155,16 +155,9 @@ static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> retval = __generic_file_write_iter(iocb, from);
>> mutex_unlock(&inode->i_mutex);
>>
>> - if (retval > 0) {
>> - ssize_t err;
>> -
>> + if (retval > 0)
>> mark_inode_dirty(inode);
>> - err = generic_write_sync(file, iocb->ki_pos - retval, retval);
>> - if (err < 0)
>> - retval = err;
>> - }
>> -
>> - return retval;
>> + return generic_write_sync(iocb, retval);
>> }
>>
>> long udf_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index 0655915..a8cab66 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -792,14 +792,8 @@ xfs_file_write_iter(
>> ret = xfs_file_buffered_aio_write(iocb, from);
>>
>> if (ret > 0) {
>> - ssize_t err;
>> -
>> XFS_STATS_ADD(xs_write_bytes, ret);
>> -
>> - /* Handle various SYNC-type writes */
>> - err = generic_write_sync(file, iocb->ki_pos - ret, ret);
>> - if (err < 0)
>> - ret = err;
>> + ret = generic_write_sync(iocb, ret);
>> }
>> return ret;
>> }
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index eaebd99..7d0e116 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -2242,13 +2242,7 @@ extern int filemap_fdatawrite_range(struct address_space *mapping,
>> extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
>> int datasync);
>> extern int vfs_fsync(struct file *file, int datasync);
>> -static inline int generic_write_sync(struct file *file, loff_t pos, loff_t count)
>> -{
>> - if (!(file->f_flags & O_DSYNC) && !IS_SYNC(file->f_mapping->host))
>> - return 0;
>> - return vfs_fsync_range(file, pos, pos + count - 1,
>> - (file->f_flags & __O_SYNC) ? 0 : 1);
>> -}
>> +extern int generic_write_sync(struct kiocb *iocb, loff_t count);
>> extern void emergency_sync(void);
>> extern void emergency_remount(void);
>> #ifdef CONFIG_BLOCK
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 09d3af3..6107058 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2664,6 +2664,24 @@ out:
>> }
>> EXPORT_SYMBOL(__generic_file_write_iter);
>>
>> +int generic_write_sync(struct kiocb *iocb, loff_t count)
>> +{
>> + struct file *file = iocb->ki_filp;
>> +
>> + if (count > 0 &&
>> + ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))) {
>> + bool fdatasync = !(file->f_flags & __O_SYNC);
>> + ssize_t ret = 0;
>
> That "= 0" is pointless. "ret" is overwritten unconditionally on the following line...

I have fixed this change; it will be in the next patch series / pull
request. The branch for the pull is at:
https://bitbucket.org/adfin/linux-fs.git read_call_6

>
> Other than that the NTFS bits are:
>
> Acked-by: Anton Altaparmakov <[email protected]>
>
> Best regards,
>
> Anton
>
>> +
>> + ret = vfs_fsync_range(file, iocb->ki_pos - count,
>> + iocb->ki_pos - 1, fdatasync);
>> + if (ret < 0)
>> + return ret;
>> + }
>> + return count;
>> +}
>> +EXPORT_SYMBOL(generic_write_sync);
>> +
>> /**
>> * generic_file_write_iter - write data to a file
>> * @iocb: IO state structure
>> @@ -2675,22 +2693,14 @@ EXPORT_SYMBOL(__generic_file_write_iter);
>> */
>> ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> {
>> - struct file *file = iocb->ki_filp;
>> - struct inode *inode = file->f_mapping->host;
>> + struct inode *inode = iocb->ki_filp->f_mapping->host;
>> ssize_t ret;
>>
>> mutex_lock(&inode->i_mutex);
>> ret = __generic_file_write_iter(iocb, from);
>> mutex_unlock(&inode->i_mutex);
>>
>> - if (ret > 0) {
>> - ssize_t err;
>> -
>> - err = generic_write_sync(file, iocb->ki_pos - ret, ret);
>> - if (err < 0)
>> - ret = err;
>> - }
>> - return ret;
>> + return generic_write_sync(iocb, ret);
>> }
>> EXPORT_SYMBOL(generic_file_write_iter);
>
> --
> Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
> University of Cambridge Information Services, Roger Needham Building
> 7 JJ Thomson Avenue, Cambridge, CB3 0RB, UK
>



--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: [email protected]

2014-11-06 16:16:32

by Milosz Tanski

[permalink] [raw]
Subject: Re: [PATCH v5 0/7] vfs: Non-blockling buffered fs read (page cache only)

The pull request for these changes is at:

https://bitbucket.org/adfin/linux-fs.git read_call_6

I've updated it so far with various Ack-by from different maintainers
(and a small stylistic fix).

On Wed, Nov 5, 2014 at 4:14 PM, Milosz Tanski <[email protected]> wrote:
> This patcheset introduces an ability to perform a non-blocking read from
> regular files in buffered IO mode. This works by only for those filesystems
> that have data in the page cache.
>
> It does this by introducing new syscalls new syscalls preadv2/pwritev2. These
> new syscalls behave like the network sendmsg, recvmsg syscalls that accept an
> extra flag argument (RWF_NONBLOCK).
>
> It's a very common patern today (samba, libuv, etc..) use a large threadpool to
> perform buffered IO operations. They submit the work form another thread
> that performs network IO and epoll or other threads that perform CPU work. This
> leads to increased latency for processing, esp. in the case of data that's
> already cached in the page cache.
>
> With the new interface the applications will now be able to fetch the data in
> their network / cpu bound thread(s) and only defer to a threadpool if it's not
> there. In our own application (VLDB) we've observed a decrease in latency for
> "fast" request by avoiding unnecessary queuing and having to swap out current
> tasks in IO bound work threads.
>
> Version 5 highlight:
> - XFS support for RWF_NONBLOCK. from Christoph.
> - RWF_DSYNC flag and support for pwritev2, from Christoph.
> - Implemented compat syscalls, per. Jeff.
> - Missing nfs, ceph changes from older patchset.
>
> Version 4 highlight:
> - Updated for 3.18-rc1.
> - Performance data from our application.
> - First stab at man page with Jeff's help. Patch is in-reply to.
>
> RFC Version 3 highlights:
> - Down to 2 syscalls from 4; can user fp or argument position.
> - RWF_NONBLOCK value flag is not the same O_NONBLOCK, per Jeff.
>
> RFC Version 2 highlights:
> - Put the flags argument into kiocb (less noise), per. Al Viro
> - O_DIRECT checking early in the process, per. Jeff Moyer
> - Resolved duplicate (c&p) code in syscall code, per. Jeff
> - Included perf data in thread cover letter, per. Jeff
> - Created a new flag (not O_NONBLOCK) for readv2, perf Jeff
>
>
> Some perf data generated using fio comparing the posix aio engine to a version
> of the posix AIO engine that attempts to performs "fast" reads before
> submitting the operations to the queue. This workflow is on ext4 partition on
> raid0 (test / build-rig.) Simulating our database access patern workload using
> 16kb read accesses. Our database uses a home-spun posix aio like queue (samba
> does the same thing.)
>
> f1: ~73% rand read over mostly cached data (zipf med-size dataset)
> f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
> f3: ~9% seq-read over large dataset
>
> before:
>
> f1:
> bw (KB /s): min= 11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
> lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
> lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
> f2:
> bw (KB /s): min= 2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
> lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
> lat (msec) : >=2000=4.33%
> f3:
> bw (KB /s): min= 0, max=265568, per=99.95%, avg=174575.10,
> stdev=34526.89
> lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
> lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
> lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
> lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
> total:
> READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
> mint=600001msec, maxt=600113msec
>
> after (with fast read using preadv2 before submit):
>
> f1:
> bw (KB /s): min= 3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
> lat (usec) : 2=70.63%, 4=0.01%
> lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
> f2:
> bw (KB /s): min= 2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
> lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
> lat (msec) : >=2000=9.99%
> f3:
> bw (KB /s): min= 1, max=245448, per=100.00%, avg=177366.50,
> stdev=35995.60
> lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
> lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
> lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
> lat (msec) : 100=0.05%, 250=0.02%
> total:
> READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
> mint=600020msec, maxt=600178msec
>
> Interpreting the results you can see total bandwidth stays the same but overall
> request latency is decreased in f1 (random, mostly cached) and f3 (sequential)
> workloads. There is a slight bump in latency for since it's random data that's
> unlikely to be cached but we're always trying "fast read".
>
> In our application we have starting keeping track of "fast read" hits/misses
> and for files / requests that have a lot hit ratio we don't do "fast reads"
> mostly getting rid of extra latency in the uncached cases. In our real world
> work load we were able to reduce average response time by 20 to 30% (depends
> on amount of IO done by request).
>
> I've performed other benchmarks and I have no observed any perf regressions in
> any of the normal (old) code paths.
>
> I have co-developed these changes with Christoph Hellwig.
>
> Christoph Hellwig (3):
> xfs: add RWF_NONBLOCK support
> fs: pass iocb to generic_write_sync
> fs: add a flag for per-operation O_DSYNC semantics
>
> Milosz Tanski (4):
> vfs: Prepare for adding a new preadv/pwritev with user flags.
> vfs: Define new syscalls preadv2,pwritev2
> x86: wire up preadv2 and pwritev2
> vfs: RWF_NONBLOCK flag for preadv2
>
> arch/x86/syscalls/syscall_32.tbl | 2 +
> arch/x86/syscalls/syscall_64.tbl | 2 +
> drivers/target/target_core_file.c | 6 +-
> fs/block_dev.c | 8 +-
> fs/btrfs/file.c | 7 +-
> fs/ceph/file.c | 6 +-
> fs/cifs/file.c | 14 +--
> fs/direct-io.c | 8 +-
> fs/ext4/file.c | 8 +-
> fs/fuse/file.c | 2 +
> fs/gfs2/file.c | 9 +-
> fs/nfs/file.c | 15 ++-
> fs/nfsd/vfs.c | 4 +-
> fs/ntfs/file.c | 8 +-
> fs/ocfs2/file.c | 12 +-
> fs/pipe.c | 3 +-
> fs/read_write.c | 239 +++++++++++++++++++++++++++++---------
> fs/splice.c | 2 +-
> fs/udf/file.c | 11 +-
> fs/xfs/xfs_file.c | 36 ++++--
> include/linux/aio.h | 2 +
> include/linux/compat.h | 6 +
> include/linux/fs.h | 16 ++-
> include/linux/syscalls.h | 6 +
> include/uapi/asm-generic/unistd.h | 6 +-
> mm/filemap.c | 55 +++++++--
> mm/shmem.c | 4 +
> 27 files changed, 346 insertions(+), 151 deletions(-)
>
> --
> 1.9.1
>



--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: [email protected]

2014-11-06 23:26:39

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH v5 2/7] vfs: Define new syscalls preadv2,pwritev2

Milosz Tanski <[email protected]> writes:

> New syscalls that take an flag argument. This change does not add any specific
> flags.
>
> Signed-off-by: Milosz Tanski <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> ---
> fs/read_write.c | 176 ++++++++++++++++++++++++++++++--------
> include/linux/compat.h | 6 ++
> include/linux/syscalls.h | 6 ++
> include/uapi/asm-generic/unistd.h | 6 +-
> mm/filemap.c | 5 +-
> 5 files changed, 158 insertions(+), 41 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 94b2d34..907735c 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -866,6 +866,8 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
> return -EBADF;
> if (!(file->f_mode & FMODE_CAN_READ))
> return -EINVAL;
> + if (flags & ~0)
> + return -EINVAL;
>
> return do_readv_writev(READ, file, vec, vlen, pos, flags);
> }
> @@ -879,21 +881,23 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
> return -EBADF;
> if (!(file->f_mode & FMODE_CAN_WRITE))
> return -EINVAL;
> + if (flags & ~0)
> + return -EINVAL;
>
> return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
> }

Hi, Milosz,

You've checked for invalid flags for the normal system calls, but not
for the compat variants. Can you add that in, please?

Thanks!
Jeff

2014-11-06 23:47:07

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] fs: add a flag for per-operation O_DSYNC semantics

Milosz Tanski <[email protected]> writes:

> - if (type == READ && (flags & RWF_NONBLOCK))
> - return -EAGAIN;
> + if (type == READ) {
> + if (flags & RWF_NONBLOCK)
> + return -EAGAIN;
> + } else {
> + if (flags & RWF_DSYNC)
> + return -EINVAL;
> + }

Minor nit, but I'd rather read something that looks like this:

if (type == READ && (flags & RWF_NONBLOCK))
return -EAGAIN;
else if (type == WRITE && (flags & RWF_DSYNC))
return -EINVAL;

I won't lose sleep over it, though.

Reviewed-by: Jeff Moyer <[email protected]>

2014-11-07 04:23:07

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] add a flag for per-operation O_DSYNC semantics

Hi Jeff,

> On 7 Nov 2014, at 01:46, Jeff Moyer <[email protected]> wrote:
>
> Milosz Tanski <[email protected]> writes:
>
>> - if (type == READ && (flags & RWF_NONBLOCK))
>> - return -EAGAIN;
>> + if (type == READ) {
>> + if (flags & RWF_NONBLOCK)
>> + return -EAGAIN;
>> + } else {
>> + if (flags & RWF_DSYNC)
>> + return -EINVAL;
>> + }
>
> Minor nit, but I'd rather read something that looks like this:
>
> if (type == READ && (flags & RWF_NONBLOCK))
> return -EAGAIN;
> else if (type == WRITE && (flags & RWF_DSYNC))
> return -EINVAL;

But your version is less logically efficient for the case where "type == READ" is true and "flags & RWF_NONBLOCK" is false because your version then has to do the "if (type == WRITE" check before discovering it does not need to take that branch either, whilst the original version does not have to do such a test at all.

Best regards,

Anton

> I won't lose sleep over it, though.
>
> Reviewed-by: Jeff Moyer <[email protected]>

--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
University of Cambridge Information Services, Roger Needham Building
7 JJ Thomson Avenue, Cambridge, CB3 0RB, UK

2014-11-07 16:28:22

by Milosz Tanski

[permalink] [raw]
Subject: Re: [PATCH v5 2/7] vfs: Define new syscalls preadv2,pwritev2

On Thu, Nov 6, 2014 at 6:25 PM, Jeff Moyer <[email protected]> wrote:
> Milosz Tanski <[email protected]> writes:
>
>> New syscalls that take an flag argument. This change does not add any specific
>> flags.
>>
>> Signed-off-by: Milosz Tanski <[email protected]>
>> Reviewed-by: Christoph Hellwig <[email protected]>
>> ---
>> fs/read_write.c | 176 ++++++++++++++++++++++++++++++--------
>> include/linux/compat.h | 6 ++
>> include/linux/syscalls.h | 6 ++
>> include/uapi/asm-generic/unistd.h | 6 +-
>> mm/filemap.c | 5 +-
>> 5 files changed, 158 insertions(+), 41 deletions(-)
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 94b2d34..907735c 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -866,6 +866,8 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
>> return -EBADF;
>> if (!(file->f_mode & FMODE_CAN_READ))
>> return -EINVAL;
>> + if (flags & ~0)
>> + return -EINVAL;
>>
>> return do_readv_writev(READ, file, vec, vlen, pos, flags);
>> }
>> @@ -879,21 +881,23 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
>> return -EBADF;
>> if (!(file->f_mode & FMODE_CAN_WRITE))
>> return -EINVAL;
>> + if (flags & ~0)
>> + return -EINVAL;
>>
>> return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
>> }
>
> Hi, Milosz,
>
> You've checked for invalid flags for the normal system calls, but not
> for the compat variants. Can you add that in, please?
>
> Thanks!
> Jeff

That's a good catch Jeff I'll fix this and it'll be in the next
version of the patch series.

- M



--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: [email protected]

2014-11-10 16:07:17

by Sage Weil

[permalink] [raw]
Subject: Re: [PATCH v5 4/7] vfs: RWF_NONBLOCK flag for preadv2

On Wed, 5 Nov 2014, Milosz Tanski wrote:
> generic_file_read_iter() supports a new flag RWF_NONBLOCK which says that we
> only want to read the data if it's already in the page cache.
>
> Additionally, there are a few filesystems that we have to specifically
> bail early if RWF_NONBLOCK because the op would block. Christoph Hellwig
> contributed this code.
>
> Signed-off-by: Milosz Tanski <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: Jeff Moyer <[email protected]>

Ceph bits

Acked-by: Sage Weil <[email protected]>


> ---
> fs/ceph/file.c | 2 ++
> fs/cifs/file.c | 6 ++++++
> fs/nfs/file.c | 5 ++++-
> fs/ocfs2/file.c | 6 ++++++
> fs/pipe.c | 3 ++-
> fs/read_write.c | 38 +++++++++++++++++++++++++-------------
> fs/xfs/xfs_file.c | 4 ++++
> include/linux/fs.h | 3 +++
> mm/filemap.c | 18 ++++++++++++++++++
> mm/shmem.c | 4 ++++
> 10 files changed, 74 insertions(+), 15 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index d7e0da8..b798b5c 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -822,6 +822,8 @@ again:
> if ((got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) == 0 ||
> (iocb->ki_filp->f_flags & O_DIRECT) ||
> (fi->flags & CEPH_F_SYNC)) {
> + if (iocb->ki_rwflags & O_NONBLOCK)
> + return -EAGAIN;
>
> dout("aio_sync_read %p %llx.%llx %llu~%u got cap refs on %s\n",
> inode, ceph_vinop(inode), iocb->ki_pos, (unsigned)len,
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 3e4d00a..c485afa 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -3005,6 +3005,9 @@ ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to)
> struct cifs_readdata *rdata, *tmp;
> struct list_head rdata_list;
>
> + if (iocb->ki_rwflags & RWF_NONBLOCK)
> + return -EAGAIN;
> +
> len = iov_iter_count(to);
> if (!len)
> return 0;
> @@ -3123,6 +3126,9 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
> ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NOPOSIXBRL) == 0))
> return generic_file_read_iter(iocb, to);
>
> + if (iocb->ki_rwflags & RWF_NONBLOCK)
> + return -EAGAIN;
> +
> /*
> * We need to hold the sem to be sure nobody modifies lock list
> * with a brlock that prevents reading.
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 2ab6f00..aa9046f 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -171,8 +171,11 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
> struct inode *inode = file_inode(iocb->ki_filp);
> ssize_t result;
>
> - if (iocb->ki_filp->f_flags & O_DIRECT)
> + if (iocb->ki_filp->f_flags & O_DIRECT) {
> + if (iocb->ki_rwflags & O_NONBLOCK)
> + return -EAGAIN;
> return nfs_file_direct_read(iocb, to, iocb->ki_pos);
> + }
>
> dprintk("NFS: read(%pD2, %zu@%lu)\n",
> iocb->ki_filp,
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 324dc93..bb66ca4 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -2472,6 +2472,12 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
> filp->f_path.dentry->d_name.name,
> to->nr_segs); /* GRRRRR */
>
> + /*
> + * No non-blocking reads for ocfs2 for now. Might be doable with
> + * non-blocking cluster lock helpers.
> + */
> + if (iocb->ki_rwflags & RWF_NONBLOCK)
> + return -EAGAIN;
>
> if (!inode) {
> ret = -EINVAL;
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 21981e5..212bf68 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -302,7 +302,8 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
> */
> if (ret)
> break;
> - if (filp->f_flags & O_NONBLOCK) {
> + if ((filp->f_flags & O_NONBLOCK) ||
> + (iocb->ki_rwflags & RWF_NONBLOCK)) {
> ret = -EAGAIN;
> break;
> }
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 907735c..cba7d4c 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -835,14 +835,19 @@ static ssize_t do_readv_writev(int type, struct file *file,
> file_start_write(file);
> }
>
> - if (iter_fn)
> + if (iter_fn) {
> ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
> pos, iter_fn, flags);
> - else if (fnv)
> - ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
> - pos, fnv);
> - else
> - ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
> + } else {
> + if (type == READ && (flags & RWF_NONBLOCK))
> + return -EAGAIN;
> +
> + if (fnv)
> + ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
> + pos, fnv);
> + else
> + ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
> + }
>
> if (type != READ)
> file_end_write(file);
> @@ -866,8 +871,10 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
> return -EBADF;
> if (!(file->f_mode & FMODE_CAN_READ))
> return -EINVAL;
> - if (flags & ~0)
> + if (flags & ~RWF_NONBLOCK)
> return -EINVAL;
> + if ((file->f_flags & O_DIRECT) && (flags & RWF_NONBLOCK))
> + return -EAGAIN;
>
> return do_readv_writev(READ, file, vec, vlen, pos, flags);
> }
> @@ -1069,14 +1076,19 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
> file_start_write(file);
> }
>
> - if (iter_fn)
> + if (iter_fn) {
> ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
> pos, iter_fn, flags);
> - else if (fnv)
> - ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
> - pos, fnv);
> - else
> - ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
> + } else {
> + if (type == READ && (flags & RWF_NONBLOCK))
> + return -EAGAIN;
> +
> + if (fnv)
> + ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
> + pos, fnv);
> + else
> + ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
> + }
>
> if (type != READ)
> file_end_write(file);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index eb596b4..b1f6334 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -246,6 +246,10 @@ xfs_file_read_iter(
>
> XFS_STATS_INC(xs_read_calls);
>
> + /* XXX: need a non-blocking iolock helper, shouldn't be too hard */
> + if (iocb->ki_rwflags & RWF_NONBLOCK)
> + return -EAGAIN;
> +
> if (unlikely(file->f_flags & O_DIRECT))
> ioflags |= XFS_IO_ISDIRECT;
> if (file->f_mode & FMODE_NOCMTIME)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9ed5711..eaebd99 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1459,6 +1459,9 @@ struct block_device_operations;
> #define HAVE_COMPAT_IOCTL 1
> #define HAVE_UNLOCKED_IOCTL 1
>
> +/* These flags are used for the readv/writev syscalls with flags. */
> +#define RWF_NONBLOCK 0x00000001
> +
> struct iov_iter;
>
> struct file_operations {
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 530c263..09d3af3 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1494,6 +1494,8 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
> find_page:
> page = find_get_page(mapping, index);
> if (!page) {
> + if (flags & RWF_NONBLOCK)
> + goto would_block;
> page_cache_sync_readahead(mapping,
> ra, filp,
> index, last_index - index);
> @@ -1585,6 +1587,11 @@ page_ok:
> continue;
>
> page_not_up_to_date:
> + if (flags & RWF_NONBLOCK) {
> + page_cache_release(page);
> + goto would_block;
> + }
> +
> /* Get exclusive access to the page ... */
> error = lock_page_killable(page);
> if (unlikely(error))
> @@ -1604,6 +1611,12 @@ page_not_up_to_date_locked:
> goto page_ok;
> }
>
> + if (flags & RWF_NONBLOCK) {
> + unlock_page(page);
> + page_cache_release(page);
> + goto would_block;
> + }
> +
> readpage:
> /*
> * A previous I/O error may have been due to temporary
> @@ -1674,6 +1687,8 @@ no_cached_page:
> goto readpage;
> }
>
> +would_block:
> + error = -EAGAIN;
> out:
> ra->prev_pos = prev_index;
> ra->prev_pos <<= PAGE_CACHE_SHIFT;
> @@ -1707,6 +1722,9 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> size_t count = iov_iter_count(iter);
> loff_t size;
>
> + if (iocb->ki_rwflags & RWF_NONBLOCK)
> + return -EAGAIN;
> +
> if (!count)
> goto out; /* skip atime */
> size = i_size_read(inode);
> diff --git a/mm/shmem.c b/mm/shmem.c
> index cd6fc75..5c30f04 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1531,6 +1531,10 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> ssize_t retval = 0;
> loff_t *ppos = &iocb->ki_pos;
>
> + /* XXX: should be easily supportable */
> + if (iocb->ki_rwflags & RWF_NONBLOCK)
> + return -EAGAIN;
> +
> /*
> * Might this read be for a stacking filesystem? Then when reading
> * holes of a sparse file, we actually need to allocate those pages,
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>

2014-11-10 16:08:03

by Sage Weil

[permalink] [raw]
Subject: Re: [PATCH v5 7/7] fs: add a flag for per-operation O_DSYNC semantics

On Wed, 5 Nov 2014, Milosz Tanski wrote:
> From: Christoph Hellwig <[email protected]>
>
> With the new read/write with flags syscalls we can support a flag
> to enable O_DSYNC semantics on a per-operation basis. This ?s
> useful to implement protocols like SMB, NFS or SCSI that have such
> per-operation flags.
>
> Example program below:
>
> cat > pwritev2.c << EOF
>
> (off_t) val, \
> (off_t) ((((uint64_t) (val)) >> (sizeof (long) * 4)) >> (sizeof (long) * 4))
>
> static ssize_t
> pwritev2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags)
> {
> return syscall(__NR_pwritev2, fd, iov, iovcnt, LO_HI_LONG(offset),
> flags);
> }
>
> int main(int argc, char **argv)
> {
> int fd = open(argv[1], O_WRONLY|O_CREAT|O_TRUNC, 0666);
> char buf[1024];
> struct iovec iov = { .iov_base = buf, .iov_len = 1024 };
> int ret;
>
> if (fd < 0) {
> perror("open");
> return 0;
> }
>
> memset(buf, 0xfe, sizeof(buf));
>
> ret = pwritev2(fd, &iov, 1, 0, RWF_DSYNC);
> if (ret < 0)
> perror("pwritev2");
> else
> printf("ret = %d\n", ret);
>
> return 0;
> }
> EOF
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> [[email protected]: added flag check to compat_do_readv_writev()]
> Signed-off-by: Milosz Tanski <[email protected]>

Ceph bits

Acked-by: Sage Weil <[email protected]>

> ---
> fs/ceph/file.c | 4 +++-
> fs/fuse/file.c | 2 ++
> fs/nfs/file.c | 10 ++++++----
> fs/ocfs2/file.c | 6 ++++--
> fs/read_write.c | 20 +++++++++++++++-----
> include/linux/fs.h | 3 ++-
> mm/filemap.c | 4 +++-
> 7 files changed, 35 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index b798b5c..2d4e15a 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -983,7 +983,9 @@ retry_snap:
> ceph_put_cap_refs(ci, got);
>
> if (written >= 0 &&
> - ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host) ||
> + ((file->f_flags & O_SYNC) ||
> + IS_SYNC(file->f_mapping->host) ||
> + (iocb->ki_rwflags & RWF_DSYNC) ||
> ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
> err = vfs_fsync_range(file, pos, pos + written - 1, 1);
> if (err < 0)
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index caa8d95..bb4fb23 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1248,6 +1248,8 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> written += written_buffered;
> iocb->ki_pos = pos + written_buffered;
> } else {
> + if (iocb->ki_rwflags & RWF_DSYNC)
> + return -EINVAL;
> written = fuse_perform_write(file, mapping, from, pos);
> if (written >= 0)
> iocb->ki_pos = pos + written;
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index aa9046f..c59b0b7 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -652,13 +652,15 @@ static const struct vm_operations_struct nfs_file_vm_ops = {
> .remap_pages = generic_file_remap_pages,
> };
>
> -static int nfs_need_sync_write(struct file *filp, struct inode *inode)
> +static int nfs_need_sync_write(struct kiocb *iocb, struct inode *inode)
> {
> struct nfs_open_context *ctx;
>
> - if (IS_SYNC(inode) || (filp->f_flags & O_DSYNC))
> + if (IS_SYNC(inode) ||
> + (iocb->ki_filp->f_flags & O_DSYNC) ||
> + (iocb->ki_rwflags & RWF_DSYNC))
> return 1;
> - ctx = nfs_file_open_context(filp);
> + ctx = nfs_file_open_context(iocb->ki_filp);
> if (test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags) ||
> nfs_ctx_key_to_expire(ctx))
> return 1;
> @@ -705,7 +707,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
> written = result;
>
> /* Return error values for O_DSYNC and IS_SYNC() */
> - if (result >= 0 && nfs_need_sync_write(file, inode)) {
> + if (result >= 0 && nfs_need_sync_write(iocb, inode)) {
> int err = vfs_fsync(file, 0);
> if (err < 0)
> result = err;
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index bb66ca4..8f9a86b 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -2374,8 +2374,10 @@ out_dio:
> /* buffered aio wouldn't have proper lock coverage today */
> BUG_ON(ret == -EIOCBQUEUED && !(file->f_flags & O_DIRECT));
>
> - if (((file->f_flags & O_DSYNC) && !direct_io) || IS_SYNC(inode) ||
> - ((file->f_flags & O_DIRECT) && !direct_io)) {
> + if (((file->f_flags & O_DSYNC) && !direct_io) ||
> + IS_SYNC(inode) ||
> + ((file->f_flags & O_DIRECT) && !direct_io) ||
> + (iocb->ki_rwflags & RWF_DSYNC)) {
> ret = filemap_fdatawrite_range(file->f_mapping, *ppos,
> *ppos + count - 1);
> if (ret < 0)
> diff --git a/fs/read_write.c b/fs/read_write.c
> index cba7d4c..3443265 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -839,8 +839,13 @@ static ssize_t do_readv_writev(int type, struct file *file,
> ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
> pos, iter_fn, flags);
> } else {
> - if (type == READ && (flags & RWF_NONBLOCK))
> - return -EAGAIN;
> + if (type == READ) {
> + if (flags & RWF_NONBLOCK)
> + return -EAGAIN;
> + } else {
> + if (flags & RWF_DSYNC)
> + return -EINVAL;
> + }
>
> if (fnv)
> ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
> @@ -888,7 +893,7 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
> return -EBADF;
> if (!(file->f_mode & FMODE_CAN_WRITE))
> return -EINVAL;
> - if (flags & ~0)
> + if (flags & ~RWF_DSYNC)
> return -EINVAL;
>
> return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
> @@ -1080,8 +1085,13 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
> ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
> pos, iter_fn, flags);
> } else {
> - if (type == READ && (flags & RWF_NONBLOCK))
> - return -EAGAIN;
> + if (type == READ) {
> + if (flags & RWF_NONBLOCK)
> + return -EAGAIN;
> + } else {
> + if (flags & RWF_DSYNC)
> + return -EINVAL;
> + }
>
> if (fnv)
> ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7d0e116..7786b88 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1460,7 +1460,8 @@ struct block_device_operations;
> #define HAVE_UNLOCKED_IOCTL 1
>
> /* These flags are used for the readv/writev syscalls with flags. */
> -#define RWF_NONBLOCK 0x00000001
> +#define RWF_NONBLOCK 0x00000001
> +#define RWF_DSYNC 0x00000002
>
> struct iov_iter;
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6107058..4fbef99 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2669,7 +2669,9 @@ int generic_write_sync(struct kiocb *iocb, loff_t count)
> struct file *file = iocb->ki_filp;
>
> if (count > 0 &&
> - ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))) {
> + ((file->f_flags & O_DSYNC) ||
> + (iocb->ki_rwflags & RWF_DSYNC) ||
> + IS_SYNC(file->f_mapping->host))) {
> bool fdatasync = !(file->f_flags & __O_SYNC);
> ssize_t ret = 0;
>
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>