2011-03-03 06:34:08

by Sage Weil

[permalink] [raw]
Subject: [RFC] introduce sys_syncat to sync a single file system

It is frequently useful to sync a single file system, instead of all
mounted file systems via sync(2):

- On machines with many of mounts, it is not at all uncommon for some of
them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on
those and may never get to the one you do care about (e.g., /).
- Some applications (Ceph, dpkg) write lots of data to the file system and
then want to make sure it is flushed to disk. Calling fsync(2) on each
file introduces unnecessary ordering constraints that result in a large
amount of sub-optimal writeback/flush/commit behavior by the file
system.

There are currently two ways (that I know of) to sync a single super_block:

- BLKFLSBUF ioctl on the block device: That also invalidates the bdev
mapping, which isn't usually desirable, and doesn't work for non-block
file systems.
- 'mount -o remount,rw' will call sync_filesystem as an artifact of the
current implemention. Relying on this little-known side effect for
something like data safety sounds foolish.

Both of these approaches require root privileges, which some applications
do not have (nor should they need?) given that sync(2) is an unprivileged
operation.

This patch introduces a new system call syncat(2) that mimics the existing
*at() interfaces by taking an fd and/or path. The fd can be either an
open file descriptor or AT_FDCWD, and the pathname can be either a path or
(unlike the usual *at() style interface) NULL. Only the file system for
the referenced file is synced.

The syscall approach is motivated by comments by Al and Christoph at the
last LSF. A simpler ioctl was also proposed a while back, see
http://marc.info/?l=linux-fsdevel&m=127970513829285&w=2

Is this a reasonable approach? (Patch below is compile tested only. :)


---
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/include/asm/unistd_32.h | 3 +-
arch/x86/include/asm/unistd_64.h | 2 +
arch/x86/kernel/syscall_table_32.S | 1 +
fs/sync.c | 43 ++++++++++++++++++++++++++++++++++++
5 files changed, 49 insertions(+), 1 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 518bb99..1d610e4 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -851,4 +851,5 @@ ia32_sys_call_table:
.quad sys_fanotify_init
.quad sys32_fanotify_mark
.quad sys_prlimit64 /* 340 */
+ .quad sys_syncat
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index b766a5e..350bf94 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -346,10 +346,11 @@
#define __NR_fanotify_init 338
#define __NR_fanotify_mark 339
#define __NR_prlimit64 340
+#define __NR_syncat 341

#ifdef __KERNEL__

-#define NR_syscalls 341
+#define NR_syscalls 342

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 363e9b8..1ea0953 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -669,6 +669,8 @@ __SYSCALL(__NR_fanotify_init, sys_fanotify_init)
__SYSCALL(__NR_fanotify_mark, sys_fanotify_mark)
#define __NR_prlimit64 302
__SYSCALL(__NR_prlimit64, sys_prlimit64)
+#define __NR_syncat 303
+__SYSCALL(__NR_syncat, sys_syncat)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index b35786d..12de607 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -340,3 +340,4 @@ ENTRY(sys_call_table)
.long sys_fanotify_init
.long sys_fanotify_mark
.long sys_prlimit64 /* 340 */
+ .long sys_syncat
diff --git a/fs/sync.c b/fs/sync.c
index ba76b96..a57dfe4 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -7,6 +7,7 @@
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/module.h>
+#include <linux/namei.h>
#include <linux/sched.h>
#include <linux/writeback.h>
#include <linux/syscalls.h>
@@ -128,6 +129,48 @@ void emergency_sync(void)
}
}

+/*
+ * sync a single super
+ */
+SYSCALL_DEFINE3(syncat, int, dfd, const char __user *, filename, int, flags)
+{
+ struct path path;
+ struct file *file = 0;
+ struct super_block *sb;
+ int ret = -EINVAL;
+ int lookup_flags = 0;
+ int fput_needed = 0;
+
+ if ((flags & ~AT_SYMLINK_NOFOLLOW) != 0)
+ goto out;
+
+ if (!(flags & AT_SYMLINK_NOFOLLOW))
+ lookup_flags |= LOOKUP_FOLLOW;
+
+ if (filename) {
+ ret = user_path_at(dfd, filename, 0, &path);
+ if (ret)
+ goto out;
+ sb = path.dentry->d_sb;
+ } else {
+ file = fget_light(dfd, &fput_needed);
+ ret = -EBADF;
+ if (!file)
+ goto out;
+ sb = file->f_dentry->d_sb;
+ }
+
+ down_read(&sb->s_umount);
+ ret = sync_filesystem(sb);
+ up_read(&sb->s_umount);
+
+ if (filename)
+ path_put(&path);
+ fput_light(file, fput_needed);
+out:
+ return ret;
+}
+
/**
* vfs_fsync_range - helper to sync a range of data & metadata to disk
* @file: file to sync
--
1.7.0.4


2011-03-03 07:22:34

by Jonathan Nieder

[permalink] [raw]
Subject: Re: [RFC] introduce sys_syncat to sync a single file system

Hi,

Sage Weil wrote:

> - On machines with many of mounts, it is not at all uncommon for some of
> them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on
> those and may never get to the one you do care about (e.g., /).

Fun to see this again.

> - Some applications (Ceph, dpkg) write lots of data to the file system and
> then want to make sure it is flushed to disk. Calling fsync(2) on each
> file introduces unnecessary ordering constraints that result in a large
> amount of sub-optimal writeback/flush/commit behavior by the file
> system.

FWIW dpkg uses sync_file_range(2) and only syncs the files it needs to
nowadays. Other apps in the same position should probably do the
same.[1][2]

> This patch introduces a new system call syncat(2) that mimics the existing
> *at() interfaces by taking an fd and/or path. The fd can be either an
> open file descriptor or AT_FDCWD, and the pathname can be either a path or
> (unlike the usual *at() style interface) NULL. Only the file system for
> the referenced file is synced.

Sounds like overengineering. The openat(2) family of calls are meant
to add flexibility to familiar calls that perform an operation with a
path relative to the cwd. To maintain familiarity, they include some
complication (AT_FDCWD, taking a relative path, and so on).

Since sync_one_filesystem(2) is new, why not just take a file or
directory fd (and perhaps flags for future expansion)? I can use
open(".", O_NONBLOCK) to get a file descriptor for the cwd.

> Is this a reasonable approach? (Patch below is compile tested only. :)

Sounds reasonably sane.

As for the patch: without the pathname arg it becomes much simpler.
To my inexpert eyes, aside from that it looks good.

Thanks,
Jonathan

[1] http://thread.gmane.org/gmane.comp.file-systems.ext4/22190
[2] http://lists.debian.org/debian-dpkg/2010/11/threads.html#00075

2011-03-03 08:55:09

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC] introduce sys_syncat to sync a single file system

On Thu, 3 Mar 2011 01:22:24 -0600, Jonathan Nieder <[email protected]> wrote:
> Hi,
>
> Sage Weil wrote:
>
> > - On machines with many of mounts, it is not at all uncommon for some of
> > them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on
> > those and may never get to the one you do care about (e.g., /).
>
> Fun to see this again.
>
> > - Some applications (Ceph, dpkg) write lots of data to the file system and
> > then want to make sure it is flushed to disk. Calling fsync(2) on each
> > file introduces unnecessary ordering constraints that result in a large
> > amount of sub-optimal writeback/flush/commit behavior by the file
> > system.

This would be useful for 9p server in qemu

http://article.gmane.org/gmane.comp.emulators.qemu/95497

>
> FWIW dpkg uses sync_file_range(2) and only syncs the files it needs to
> nowadays. Other apps in the same position should probably do the
> same.[1][2]
>
> > This patch introduces a new system call syncat(2) that mimics the existing
> > *at() interfaces by taking an fd and/or path. The fd can be either an
> > open file descriptor or AT_FDCWD, and the pathname can be either a path or
> > (unlike the usual *at() style interface) NULL. Only the file system for
> > the referenced file is synced.
>
> Sounds like overengineering. The openat(2) family of calls are meant
> to add flexibility to familiar calls that perform an operation with a
> path relative to the cwd. To maintain familiarity, they include some
> complication (AT_FDCWD, taking a relative path, and so on).


With some of the proposed changes for VFS [1] some of the *at calls also allows to
specify "" names. So i guess having syncat is useful because now we can
call sync with either an fd or with a name.

ie

syncat(fd, "");
or
syncat(AT_FDCWD, "a");

[1] http://article.gmane.org/gmane.linux.file-systems/50773

>
> Since sync_one_filesystem(2) is new, why not just take a file or
> directory fd (and perhaps flags for future expansion)? I can use
> open(".", O_NONBLOCK) to get a file descriptor for the cwd.
>
> > Is this a reasonable approach? (Patch below is compile tested only. :)
>
> Sounds reasonably sane.
>
> As for the patch: without the pathname arg it becomes much simpler.
> To my inexpert eyes, aside from that it looks good.
>

-aneesh

2011-03-07 23:15:54

by Sage Weil

[permalink] [raw]
Subject: [RFC] introduce sys_syncfs to sync a single file system (v2)

Changes:
v1->v2: Rename, simplify to just take an fd.

It is frequently useful to sync a single file system, instead of all
mounted file systems via sync(2):

- On machines with many mounts, it is not at all uncommon for some of
them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on
those and may never get to the one you do care about (e.g., /).
- Some applications write lots of data to the file system and then
want to make sure it is flushed to disk. Calling fsync(2) on each
file introduces unnecessary ordering constraints that result in a large
amount of sub-optimal writeback/flush/commit behavior by the file
system.

There are currently two ways (that I know of) to sync a single super_block:

- BLKFLSBUF ioctl on the block device: That also invalidates the bdev
mapping, which isn't usually desirable, and doesn't work for non-block
file systems.
- 'mount -o remount,rw' will call sync_filesystem as an artifact of the
current implemention. Relying on this little-known side effect for
something like data safety sounds foolish.

Both of these approaches require root privileges, which some applications
do not have (nor should they need?) given that sync(2) is an unprivileged
operation.

This patch introduces a new system call syncfs(2) that takes an fd and
syncs only the file system it references. Maybe someday we can even

$ sync /some/path

and not get

sync: ignoring all arguments

The syscall is motivated by comments by Al and Christoph at the last LSF.
syncfs(2) seems like an appropriate name given statfs(2).

A similar ioctl was also proposed a while back, see
http://marc.info/?l=linux-fsdevel&m=127970513829285&w=2

Signed-off-by: Sage Weil <[email protected]>
---
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/include/asm/unistd_32.h | 3 ++-
arch/x86/include/asm/unistd_64.h | 2 ++
arch/x86/kernel/syscall_table_32.S | 1 +
fs/sync.c | 24 ++++++++++++++++++++++++
5 files changed, 30 insertions(+), 1 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 518bb99..24082b8 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -851,4 +851,5 @@ ia32_sys_call_table:
.quad sys_fanotify_init
.quad sys32_fanotify_mark
.quad sys_prlimit64 /* 340 */
+ .quad sys_syncfs
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index b766a5e..da3903f 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -346,10 +346,11 @@
#define __NR_fanotify_init 338
#define __NR_fanotify_mark 339
#define __NR_prlimit64 340
+#define __NR_syncfs 341

#ifdef __KERNEL__

-#define NR_syscalls 341
+#define NR_syscalls 342

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 363e9b8..a2e7516 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -669,6 +669,8 @@ __SYSCALL(__NR_fanotify_init, sys_fanotify_init)
__SYSCALL(__NR_fanotify_mark, sys_fanotify_mark)
#define __NR_prlimit64 302
__SYSCALL(__NR_prlimit64, sys_prlimit64)
+#define __NR_syncfs 303
+__SYSCALL(__NR_syncfs, sys_syncfs)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index b35786d..d40bd16 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -340,3 +340,4 @@ ENTRY(sys_call_table)
.long sys_fanotify_init
.long sys_fanotify_mark
.long sys_prlimit64 /* 340 */
+ .long sys_syncfs
diff --git a/fs/sync.c b/fs/sync.c
index ba76b96..92ca208 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -7,6 +7,7 @@
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/module.h>
+#include <linux/namei.h>
#include <linux/sched.h>
#include <linux/writeback.h>
#include <linux/syscalls.h>
@@ -128,6 +129,29 @@ void emergency_sync(void)
}
}

+/*
+ * sync a single super
+ */
+SYSCALL_DEFINE1(syncfs, int, fd)
+{
+ struct file *file;
+ struct super_block *sb;
+ int ret;
+ int fput_needed;
+
+ file = fget_light(fd, &fput_needed);
+ if (!file)
+ return -EBADF;
+ sb = file->f_dentry->d_sb;
+
+ down_read(&sb->s_umount);
+ ret = sync_filesystem(sb);
+ up_read(&sb->s_umount);
+
+ fput_light(file, fput_needed);
+ return ret;
+}
+
/**
* vfs_fsync_range - helper to sync a range of data & metadata to disk
* @file: file to sync
--
1.7.0

2011-03-08 05:27:11

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC] introduce sys_syncfs to sync a single file system (v2)

On Mon, 7 Mar 2011 15:17:57 -0800 (PST), Sage Weil <[email protected]> wrote:
> Changes:
> v1->v2: Rename, simplify to just take an fd.
>
> It is frequently useful to sync a single file system, instead of all
> mounted file systems via sync(2):
>
> - On machines with many mounts, it is not at all uncommon for some of
> them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on
> those and may never get to the one you do care about (e.g., /).
> - Some applications write lots of data to the file system and then
> want to make sure it is flushed to disk. Calling fsync(2) on each
> file introduces unnecessary ordering constraints that result in a large
> amount of sub-optimal writeback/flush/commit behavior by the file
> system.
>
> There are currently two ways (that I know of) to sync a single super_block:
>
> - BLKFLSBUF ioctl on the block device: That also invalidates the bdev
> mapping, which isn't usually desirable, and doesn't work for non-block
> file systems.
> - 'mount -o remount,rw' will call sync_filesystem as an artifact of the
> current implemention. Relying on this little-known side effect for
> something like data safety sounds foolish.
>
> Both of these approaches require root privileges, which some applications
> do not have (nor should they need?) given that sync(2) is an unprivileged
> operation.
>
> This patch introduces a new system call syncfs(2) that takes an fd and
> syncs only the file system it references. Maybe someday we can even
>
> $ sync /some/path
>
> and not get
>
> sync: ignoring all arguments
>
> The syscall is motivated by comments by Al and Christoph at the last LSF.
> syncfs(2) seems like an appropriate name given statfs(2).
>
> A similar ioctl was also proposed a while back, see
> http://marc.info/?l=linux-fsdevel&m=127970513829285&w=2
>
> Signed-off-by: Sage Weil <[email protected]>
> ---
> arch/x86/ia32/ia32entry.S | 1 +
> arch/x86/include/asm/unistd_32.h | 3 ++-
> arch/x86/include/asm/unistd_64.h | 2 ++
> arch/x86/kernel/syscall_table_32.S | 1 +
> fs/sync.c | 24 ++++++++++++++++++++++++
> 5 files changed, 30 insertions(+), 1 deletions(-)


include/asm-generic/unistd.h may also need an update.


-aneesh

2011-03-10 14:56:57

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC] introduce sys_syncfs to sync a single file system (v2)

On Tuesday 08 March 2011, Aneesh Kumar K. V wrote:
> > arch/x86/ia32/ia32entry.S | 1 +
> > arch/x86/include/asm/unistd_32.h | 3 ++-
> > arch/x86/include/asm/unistd_64.h | 2 ++
> > arch/x86/kernel/syscall_table_32.S | 1 +
> > fs/sync.c | 24 ++++++++++++++++++++++++
> > 5 files changed, 30 insertions(+), 1 deletions(-)
>
>
> include/asm-generic/unistd.h may also need an update.

I was just going to say the same.

Also, include/linux/syscalls.h should list the prototype, because
some architectures generate the system call table from C code.

Arnd

2011-03-10 19:26:44

by Sage Weil

[permalink] [raw]
Subject: Re: [RFC] introduce sys_syncfs to sync a single file system (v2)

On Thu, 10 Mar 2011, Arnd Bergmann wrote:
> On Tuesday 08 March 2011, Aneesh Kumar K. V wrote:
> > > arch/x86/ia32/ia32entry.S | 1 +
> > > arch/x86/include/asm/unistd_32.h | 3 ++-
> > > arch/x86/include/asm/unistd_64.h | 2 ++
> > > arch/x86/kernel/syscall_table_32.S | 1 +
> > > fs/sync.c | 24 ++++++++++++++++++++++++
> > > 5 files changed, 30 insertions(+), 1 deletions(-)
> >
> >
> > include/asm-generic/unistd.h may also need an update.
>
> I was just going to say the same.
>
> Also, include/linux/syscalls.h should list the prototype, because
> some architectures generate the system call table from C code.

Thanks! Sending an updated patch now.

sage

2011-03-10 19:29:21

by Sage Weil

[permalink] [raw]
Subject: [PATCH v3] introduce sys_syncfs to sync a single file system

It is frequently useful to sync a single file system, instead of all
mounted file systems via sync(2):

- On machines with many mounts, it is not at all uncommon for some of
them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on
those and may never get to the one you do care about (e.g., /).
- Some applications write lots of data to the file system and then
want to make sure it is flushed to disk. Calling fsync(2) on each
file introduces unnecessary ordering constraints that result in a large
amount of sub-optimal writeback/flush/commit behavior by the file
system.

There are currently two ways (that I know of) to sync a single super_block:

- BLKFLSBUF ioctl on the block device: That also invalidates the bdev
mapping, which isn't usually desirable, and doesn't work for non-block
file systems.
- 'mount -o remount,rw' will call sync_filesystem as an artifact of the
current implemention. Relying on this little-known side effect for
something like data safety sounds foolish.

Both of these approaches require root privileges, which some applications
do not have (nor should they need?) given that sync(2) is an unprivileged
operation.

This patch introduces a new system call syncfs(2) that takes an fd and
syncs only the file system it references. Maybe someday we can

$ sync /some/path

and not get

sync: ignoring all arguments

The syscall is motivated by comments by Al and Christoph at the last LSF.
syncfs(2) seems like an appropriate name given statfs(2).

A similar ioctl was also proposed a while back, see
http://marc.info/?l=linux-fsdevel&m=127970513829285&w=2

Signed-off-by: Sage Weil <[email protected]>
---
ChangeLog:
v3: Update include/linux/syscalls.h and asm-generic/unistd.h
v2: Rename to syncfs, simplify to just take an fd.
v1: syncat

arch/x86/ia32/ia32entry.S | 1 +
arch/x86/include/asm/unistd_32.h | 3 ++-
arch/x86/include/asm/unistd_64.h | 2 ++
arch/x86/kernel/syscall_table_32.S | 1 +
fs/sync.c | 24 ++++++++++++++++++++++++
include/asm-generic/unistd.h | 4 +++-
include/linux/syscalls.h | 1 +
7 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 518bb99..24082b8 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -851,4 +851,5 @@ ia32_sys_call_table:
.quad sys_fanotify_init
.quad sys32_fanotify_mark
.quad sys_prlimit64 /* 340 */
+ .quad sys_syncfs
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index b766a5e..da3903f 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -346,10 +346,11 @@
#define __NR_fanotify_init 338
#define __NR_fanotify_mark 339
#define __NR_prlimit64 340
+#define __NR_syncfs 341

#ifdef __KERNEL__

-#define NR_syscalls 341
+#define NR_syscalls 342

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 363e9b8..a2e7516 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -669,6 +669,8 @@ __SYSCALL(__NR_fanotify_init, sys_fanotify_init)
__SYSCALL(__NR_fanotify_mark, sys_fanotify_mark)
#define __NR_prlimit64 302
__SYSCALL(__NR_prlimit64, sys_prlimit64)
+#define __NR_syncfs 303
+__SYSCALL(__NR_syncfs, sys_syncfs)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index b35786d..d40bd16 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -340,3 +340,4 @@ ENTRY(sys_call_table)
.long sys_fanotify_init
.long sys_fanotify_mark
.long sys_prlimit64 /* 340 */
+ .long sys_syncfs
diff --git a/fs/sync.c b/fs/sync.c
index ba76b96..92ca208 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -7,6 +7,7 @@
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/module.h>
+#include <linux/namei.h>
#include <linux/sched.h>
#include <linux/writeback.h>
#include <linux/syscalls.h>
@@ -128,6 +129,29 @@ void emergency_sync(void)
}
}

+/*
+ * sync a single super
+ */
+SYSCALL_DEFINE1(syncfs, int, fd)
+{
+ struct file *file;
+ struct super_block *sb;
+ int ret;
+ int fput_needed;
+
+ file = fget_light(fd, &fput_needed);
+ if (!file)
+ return -EBADF;
+ sb = file->f_dentry->d_sb;
+
+ down_read(&sb->s_umount);
+ ret = sync_filesystem(sb);
+ up_read(&sb->s_umount);
+
+ fput_light(file, fput_needed);
+ return ret;
+}
+
/**
* vfs_fsync_range - helper to sync a range of data & metadata to disk
* @file: file to sync
diff --git a/include/asm-generic/unistd.h b/include/asm-generic/unistd.h
index b969770..3cf62eb 100644
--- a/include/asm-generic/unistd.h
+++ b/include/asm-generic/unistd.h
@@ -646,9 +646,11 @@ __SYSCALL(__NR_prlimit64, sys_prlimit64)
__SYSCALL(__NR_fanotify_init, sys_fanotify_init)
#define __NR_fanotify_mark 263
__SYSCALL(__NR_fanotify_mark, sys_fanotify_mark)
+#define __NR_syncfs 264
+__SYSCALL(__NR_syncfs, sys_syncfs)

#undef __NR_syscalls
-#define __NR_syscalls 264
+#define __NR_syscalls 265

/*
* All syscalls below here should go away really,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 98664db..0ceed21 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -820,6 +820,7 @@ asmlinkage long sys_fanotify_init(unsigned int flags, unsigned int event_f_flags
asmlinkage long sys_fanotify_mark(int fanotify_fd, unsigned int flags,
u64 mask, int fd,
const char __user *pathname);
+asmlinkage long sys_syncfs(int fd);

int kernel_execve(const char *filename, const char *const argv[], const char *const envp[]);

--
1.7.0

2011-03-10 22:08:37

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Thursday 10 March 2011 20:31:30 Sage Weil wrote:
> It is frequently useful to sync a single file system, instead of all
> mounted file systems via sync(2):
>
> - On machines with many mounts, it is not at all uncommon for some of
> them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on
> those and may never get to the one you do care about (e.g., /).
> - Some applications write lots of data to the file system and then
> want to make sure it is flushed to disk. Calling fsync(2) on each
> file introduces unnecessary ordering constraints that result in a large
> amount of sub-optimal writeback/flush/commit behavior by the file
> system.
>
> ...
> Signed-off-by: Sage Weil <[email protected]>

Reviewed-by: Arnd Bergmann <[email protected]>

2011-03-11 04:44:50

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Thu, 10 Mar 2011 11:31:30 -0800 (PST), Sage Weil <[email protected]> wrote:
> It is frequently useful to sync a single file system, instead of all
> mounted file systems via sync(2):
>
> - On machines with many mounts, it is not at all uncommon for some of
> them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on
> those and may never get to the one you do care about (e.g., /).
> - Some applications write lots of data to the file system and then
> want to make sure it is flushed to disk. Calling fsync(2) on each
> file introduces unnecessary ordering constraints that result in a large
> amount of sub-optimal writeback/flush/commit behavior by the file
> system.
>
> There are currently two ways (that I know of) to sync a single super_block:
>
> - BLKFLSBUF ioctl on the block device: That also invalidates the bdev
> mapping, which isn't usually desirable, and doesn't work for non-block
> file systems.
> - 'mount -o remount,rw' will call sync_filesystem as an artifact of the
> current implemention. Relying on this little-known side effect for
> something like data safety sounds foolish.
>
> Both of these approaches require root privileges, which some applications
> do not have (nor should they need?) given that sync(2) is an unprivileged
> operation.
>
> This patch introduces a new system call syncfs(2) that takes an fd and
> syncs only the file system it references. Maybe someday we can
>
> $ sync /some/path
>
> and not get
>
> sync: ignoring all arguments
>
> The syscall is motivated by comments by Al and Christoph at the last LSF.
> syncfs(2) seems like an appropriate name given statfs(2).
>
> A similar ioctl was also proposed a while back, see
> http://marc.info/?l=linux-fsdevel&m=127970513829285&w=2
>
> Signed-off-by: Sage Weil <[email protected]>

Reviewed-by: Aneesh Kumar <[email protected]>

2011-03-11 11:01:14

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

Hello,

On Thu, March 10, 2011 20:31, Sage Weil wrote:
> It is frequently useful to sync a single file system, instead of all
> mounted file systems via sync(2):
>
> - On machines with many mounts, it is not at all uncommon for some of
> them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on
> those and may never get to the one you do care about (e.g., /).
> - Some applications write lots of data to the file system and then
> want to make sure it is flushed to disk. Calling fsync(2) on each
> file introduces unnecessary ordering constraints that result in a large
> amount of sub-optimal writeback/flush/commit behavior by the file
> system.
>
> There are currently two ways (that I know of) to sync a single super_block:
>
> - BLKFLSBUF ioctl on the block device: That also invalidates the bdev
> mapping, which isn't usually desirable, and doesn't work for non-block
> file systems.
> - 'mount -o remount,rw' will call sync_filesystem as an artifact of the
> current implemention. Relying on this little-known side effect for
> something like data safety sounds foolish.
>
> Both of these approaches require root privileges, which some applications
> do not have (nor should they need?) given that sync(2) is an unprivileged
> operation.
>
> This patch introduces a new system call syncfs(2) that takes an fd and
> syncs only the file system it references. Maybe someday we can
>
> $ sync /some/path
>
> and not get
>
> sync: ignoring all arguments
>
> The syscall is motivated by comments by Al and Christoph at the last LSF.
> syncfs(2) seems like an appropriate name given statfs(2).
>
> A similar ioctl was also proposed a while back, see
> http://marc.info/?l=linux-fsdevel&m=127970513829285&w=2

The patch there seems much more reasonable than introducing a whole
new systemcall just for 20 lines of kernel code. New system calls are
added too easily nowadays.

As an alternative to the ioctl, I propose extending sync_file_range()
instead. E.g. add a SYNC_FILE_MOUNT flag and use that, either on any
fd on the mount or the root dir fd. That syscall is non-standard and
close enough that it can implement this behaviour too.

Greetings,

Indan

---

Something like:

diff --git a/fs/sync.c b/fs/sync.c
index ba76b96..9fa073c 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -18,7 +18,7 @@
#include "internal.h"

#define VALID_FLAGS (SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE| \
- SYNC_FILE_RANGE_WAIT_AFTER)
+ SYNC_FILE_RANGE_WAIT_AFTER|SYNC_FILE_MOUNT)

/*
* Do the filesystem syncing work. For simple filesystems
@@ -330,6 +330,15 @@ SYSCALL_DEFINE(sync_file_range)(int fd, loff_t offset, loff_t nbytes,
}

ret = 0;
+ if (flags & SYNC_FILE_MOUNT) {
+ struct super_block *sb;
+
+ sb = file->f_dentry->d_sb;
+ down_read(&sb->s_umount);
+ ret = sync_filesystem(sb);
+ up_read(&sb->s_umount);
+ goto out_put;
+ }
if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
ret = filemap_fdatawait_range(mapping, offset, endbyte);
if (ret < 0)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e38b50a..53e427e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -373,6 +373,7 @@ struct inodes_stat_t {
#define SYNC_FILE_RANGE_WAIT_BEFORE 1
#define SYNC_FILE_RANGE_WRITE 2
#define SYNC_FILE_RANGE_WAIT_AFTER 4
+#define SYNC_FILE_MOUNT 8

#ifdef __KERNEL__


2011-03-11 11:55:59

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Friday 11 March 2011, Indan Zupancic wrote:
> > http://marc.info/?l=linux-fsdevel&amp;m=127970513829285&amp;w=2
>
> The patch there seems much more reasonable than introducing a whole
> new systemcall just for 20 lines of kernel code. New system calls are
> added too easily nowadays.

The only problem with adding new system calls is that we are stuck with
the interface until the end of time, so we must be sure not to get it
wrong. The same thing is true for any other interface such as ioctl
or extensions to existing system calls. People usually get away with
adding new ioctls more easily because it is less obvious when they
are added.

Arnd

2011-03-11 23:46:03

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Fri, March 11, 2011 12:55, Arnd Bergmann wrote:
> On Friday 11 March 2011, Indan Zupancic wrote:
>> > http://marc.info/?l=linux-fsdevel&amp;m=127970513829285&amp;w=2
>>
>> The patch there seems much more reasonable than introducing a whole
>> new systemcall just for 20 lines of kernel code. New system calls are
>> added too easily nowadays.
>
> The only problem with adding new system calls is that we are stuck
> with the interface until the end of time, so we must be sure not
> to get it wrong. The same thing is true for any other interface
> such as ioctl or extensions to existing system calls. People usually
> get away with adding new ioctls more easily because it is less
> obvious when they are added.

Agreed.

I'm not sure this feature is important enough to add. I can't really
think of a regular use case where this would be useful, generally
it's transparent on which mount files are. Add symlinks, and you
give users a lot of rope. Any user has to make sure that all the
files they want to sync are on the same file system.

About the arguments against sync(2):

> - On machines with many mounts, it is not at all uncommon for some of
> them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on
> those and may never get to the one you do care about (e.g., /).

It would be better to fix NFS, or mount it with the fsc option (assuming
a sync will write to the local cache instead of hanging forever then).

> - Some applications write lots of data to the file system and then
> want to make sure it is flushed to disk. Calling fsync(2) on each
> file introduces unnecessary ordering constraints that result in a large
> amount of sub-optimal writeback/flush/commit behavior by the file
> system.

You can use sync_file_range() on those files to schedule the writes
and then do the fsync(2) as usual (both on files and dirs).

If there still is a good reason to implement this, please don't add it
as a new system call, but add it to sync_file_range(), as that seems
the best place for odd file synchronisation operations.

Greetings,

Indan

2011-03-11 23:56:25

by Jonathan Nieder

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

Indan Zupancic wrote:

> If there still is a good reason to implement this, please don't add it
> as a new system call, but add it to sync_file_range(), as that seems
> the best place for odd file synchronisation operations.

I have no strong preference about how this is added (and in fact I'm
quite ignorant about the usual conventions), but:

- as a sysadmin, it really _would_ be nice to be able to say
"sync /usr" to sync /usr;

- the existing functionality of sync_file_range is about controlling
writeback behavior for files, not mounts.

So unless there is a shortage of syscall numbers or something, I find
the request to omit this or tack it onto sync_file_range odd. Could
you explain the benefit?

Thanks,
Jonathan

2011-03-12 00:40:42

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On 03/11/2011 06:45 PM, Indan Zupancic wrote:
> On Fri, March 11, 2011 12:55, Arnd Bergmann wrote:
>> On Friday 11 March 2011, Indan Zupancic wrote:
>>>> http://marc.info/?l=linux-fsdevel&amp;m=127970513829285&amp;w=2
>>> The patch there seems much more reasonable than introducing a whole
>>> new systemcall just for 20 lines of kernel code. New system calls are
>>> added too easily nowadays.
>> The only problem with adding new system calls is that we are stuck
>> with the interface until the end of time, so we must be sure not
>> to get it wrong. The same thing is true for any other interface
>> such as ioctl or extensions to existing system calls. People usually
>> get away with adding new ioctls more easily because it is less
>> obvious when they are added.
> Agreed.
>
> I'm not sure this feature is important enough to add. I can't really
> think of a regular use case where this would be useful, generally
> it's transparent on which mount files are. Add symlinks, and you
> give users a lot of rope. Any user has to make sure that all the
> files they want to sync are on the same file system.
>
> About the arguments against sync(2):
>
>> - On machines with many mounts, it is not at all uncommon for some of
>> them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on
>> those and may never get to the one you do care about (e.g., /).
> It would be better to fix NFS, or mount it with the fsc option (assuming
> a sync will write to the local cache instead of hanging forever then).
>
>> - Some applications write lots of data to the file system and then
>> want to make sure it is flushed to disk. Calling fsync(2) on each
>> file introduces unnecessary ordering constraints that result in a large
>> amount of sub-optimal writeback/flush/commit behavior by the file
>> system.
> You can use sync_file_range() on those files to schedule the writes
> and then do the fsync(2) as usual (both on files and dirs).
>
> If there still is a good reason to implement this, please don't add it
> as a new system call, but add it to sync_file_range(), as that seems
> the best place for odd file synchronisation operations.
>
> Greetings,
>
> Indan
>
>
> --

Hi Indan,

I think that you missed the point of the extension.

Ric

2011-03-12 01:33:46

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Sat, March 12, 2011 01:40, Ric Wheeler wrote:
> On 03/11/2011 06:45 PM, Indan Zupancic wrote:
>> On Fri, March 11, 2011 12:55, Arnd Bergmann wrote:
>>> On Friday 11 March 2011, Indan Zupancic wrote:
>>>>> http://marc.info/?l=linux-fsdevel&amp;m=127970513829285&amp;w=2
>>>> The patch there seems much more reasonable than introducing a whole
>>>> new systemcall just for 20 lines of kernel code. New system calls are
>>>> added too easily nowadays.
>>> The only problem with adding new system calls is that we are stuck
>>> with the interface until the end of time, so we must be sure not
>>> to get it wrong. The same thing is true for any other interface
>>> such as ioctl or extensions to existing system calls. People usually
>>> get away with adding new ioctls more easily because it is less
>>> obvious when they are added.
>> Agreed.
>>
>> I'm not sure this feature is important enough to add. I can't really
>> think of a regular use case where this would be useful, generally
>> it's transparent on which mount files are. Add symlinks, and you
>> give users a lot of rope. Any user has to make sure that all the
>> files they want to sync are on the same file system.
>>
>> About the arguments against sync(2):
>>
>>> - On machines with many mounts, it is not at all uncommon for some of
>>> them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on
>>> those and may never get to the one you do care about (e.g., /).
>> It would be better to fix NFS, or mount it with the fsc option (assuming
>> a sync will write to the local cache instead of hanging forever then).
>>
>>> - Some applications write lots of data to the file system and then
>>> want to make sure it is flushed to disk. Calling fsync(2) on each
>>> file introduces unnecessary ordering constraints that result in a large
>>> amount of sub-optimal writeback/flush/commit behavior by the file
>>> system.
>> You can use sync_file_range() on those files to schedule the writes
>> and then do the fsync(2) as usual (both on files and dirs).
>>
>> If there still is a good reason to implement this, please don't add it
>> as a new system call, but add it to sync_file_range(), as that seems
>> the best place for odd file synchronisation operations.
>>
>> Greetings,
>>
>> Indan
>>
>>
>> --
>
> Hi Indan,
>
> I think that you missed the point of the extension.
>
> Ric

The point is clear, it's to synchronize a specific file system instead
of all of them.

But actually doing that from a program is harder than it looks, because
programs work with files, not file systems. To make this feature useful
the program needs meta information it can't easily get. That was my first
point.

The rest was just replying to the motivations given to add the feature.
If those motivations miss the point of the extension, don't blame me.

If you want to add a new one, "I'd like sync /mounpoint to work", then
do so, but please tell what practical use that has, if you're not going
to unmount the thing soon after anyway.

Other than this questionable use case, is there any other one that would
justify adding this extension?

Greetings,

Indan

2011-03-12 01:53:40

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Sat, March 12, 2011 00:56, Jonathan Nieder wrote:
> Indan Zupancic wrote:
>
>> If there still is a good reason to implement this, please don't add it
>> as a new system call, but add it to sync_file_range(), as that seems
>> the best place for odd file synchronisation operations.
>
> I have no strong preference about how this is added (and in fact I'm
> quite ignorant about the usual conventions), but:

I'm not pushing for any official convention, just what seems good taste.

> - as a sysadmin, it really _would_ be nice to be able to say
> "sync /usr" to sync /usr;

This is independent of the implementation.

> - the existing functionality of sync_file_range is about controlling
> writeback behavior for files, not mounts.

True, but what happens when you sync a mount? In the end it's about odd,
non-standard syncing behaviour, so I think it fits sync_file_range well.
Like sync_file_range, it's trickier to use than it seems at first, so it
fits well with the other "keep this in mind before using it" requirements.

> So unless there is a shortage of syscall numbers or something, I find
> the request to omit this or tack it onto sync_file_range odd. Could
> you explain the benefit?

Less code added, less bloat. Architecture independent, no need to update
all system call tables everywhere (all archs, libc versions and strace).
Two files changed, instead of 7 (which only hooks up x86). The code ends
up near the code it calls, so it makes contextual sense.

As for the reason not to add it at all, well, if everything that seemed
a good idea at the time was added as a system call it would be an even
bigger mess than it already is. So it doesn't have to be a bad idea to
not make it, not being useful enough is sufficient.

In this case it's just a performance improvement over sync(2). It doesn't
add a new feature. Main argument given for the performance problem seems
to be "NFS can be slow". Anything else?

Greetings,

Indan

2011-03-12 02:10:11

by Jonathan Nieder

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

Indan Zupancic wrote:

> I'm not pushing for any official convention, just what seems good taste.

In cases like this, conventions (consistency and best practices) are
very important.

> Less code added, less bloat. Architecture independent, no need to update
> all system call tables everywhere (all archs, libc versions and strace).
> Two files changed, instead of 7 (which only hooks up x86).

Thanks for explaining. Those do seem like good reasons to use a ioctl
instead of a new syscall.

> In this case it's just a performance improvement over sync(2). It doesn't
> add a new feature. Main argument given for the performance problem seems
> to be "NFS can be slow". Anything else?

Huh? It is not just the speed of the sync --- unnecessary writeback
will cause wear on your thumbdrive, eat up your laptop battery, and
kill I/O performance in other tasks running at the same time.

I'm afraid I don't understand what you're saying here at all. Would
you say that fsync is superfluous, too?

Jonathan

2011-03-12 02:52:20

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On 03/11/2011 08:33 PM, Indan Zupancic wrote:
> On Sat, March 12, 2011 01:40, Ric Wheeler wrote:
>> On 03/11/2011 06:45 PM, Indan Zupancic wrote:
>>> On Fri, March 11, 2011 12:55, Arnd Bergmann wrote:
>>>> On Friday 11 March 2011, Indan Zupancic wrote:
>>>>>> http://marc.info/?l=linux-fsdevel&amp;m=127970513829285&amp;w=2
>>>>> The patch there seems much more reasonable than introducing a whole
>>>>> new systemcall just for 20 lines of kernel code. New system calls are
>>>>> added too easily nowadays.
>>>> The only problem with adding new system calls is that we are stuck
>>>> with the interface until the end of time, so we must be sure not
>>>> to get it wrong. The same thing is true for any other interface
>>>> such as ioctl or extensions to existing system calls. People usually
>>>> get away with adding new ioctls more easily because it is less
>>>> obvious when they are added.
>>> Agreed.
>>>
>>> I'm not sure this feature is important enough to add. I can't really
>>> think of a regular use case where this would be useful, generally
>>> it's transparent on which mount files are. Add symlinks, and you
>>> give users a lot of rope. Any user has to make sure that all the
>>> files they want to sync are on the same file system.
>>>
>>> About the arguments against sync(2):
>>>
>>>> - On machines with many mounts, it is not at all uncommon for some of
>>>> them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on
>>>> those and may never get to the one you do care about (e.g., /).
>>> It would be better to fix NFS, or mount it with the fsc option (assuming
>>> a sync will write to the local cache instead of hanging forever then).
>>>
>>>> - Some applications write lots of data to the file system and then
>>>> want to make sure it is flushed to disk. Calling fsync(2) on each
>>>> file introduces unnecessary ordering constraints that result in a large
>>>> amount of sub-optimal writeback/flush/commit behavior by the file
>>>> system.
>>> You can use sync_file_range() on those files to schedule the writes
>>> and then do the fsync(2) as usual (both on files and dirs).
>>>
>>> If there still is a good reason to implement this, please don't add it
>>> as a new system call, but add it to sync_file_range(), as that seems
>>> the best place for odd file synchronisation operations.
>>>
>>> Greetings,
>>>
>>> Indan
>>>
>>>
>>> --
>> Hi Indan,
>>
>> I think that you missed the point of the extension.
>>
>> Ric
> The point is clear, it's to synchronize a specific file system instead
> of all of them.
>
> But actually doing that from a program is harder than it looks, because
> programs work with files, not file systems. To make this feature useful
> the program needs meta information it can't easily get. That was my first
> point.
>
> The rest was just replying to the motivations given to add the feature.
> If those motivations miss the point of the extension, don't blame me.
>
> If you want to add a new one, "I'd like sync /mounpoint to work", then
> do so, but please tell what practical use that has, if you're not going
> to unmount the thing soon after anyway.
>
> Other than this questionable use case, is there any other one that would
> justify adding this extension?
>
> Greetings,
>
> Indan
>

Sage was pretty clear in stating the motivation which is the use case you think
is questionable. Probably not interesting for consumer devices, but definitely
extremely interesting in large servers with multiple file systems.

In fact, we do it today as mentioned earlier in the thread - this simply exports
that useful capability in a clean way.

Ric

2011-03-12 03:51:10

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Sat, March 12, 2011 03:52, Ric Wheeler wrote:
> Sage was pretty clear in stating the motivation which is the use case you
> think is questionable. Probably not interesting for consumer devices, but
> definitely extremely interesting in large servers with multiple file systems.

Not really, he just said "It is frequently useful to sync a single file system",
without giving any use cases. He then gave two situations where either sync or
fsync isn't sufficient, to which I replied earlier and you called missing the
point. But that's not the same as giving a use case.

>
> In fact, we do it today as mentioned earlier in the thread - this simply
> exports that useful capability in a clean way.

Did you use the remount trick or the ioctl? If the latter, is it sufficient
for your need? If the first, would guaranteeing that mount -o remount,rw
trick will keep working solve the problem for you?

When or why would you want to sync one specific filesystem? As you're doing
it, you could explain your use case better instead of telling me I'm missing
the point.

If sync(2) didn't exist and people wanted to add it I'd complain too. This
has all the problems of sync(2), but with the "not sure if all the files are
on the file system I think" problem added.

Greetings,

Indan

2011-03-12 04:22:43

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Sat, March 12, 2011 03:10, Jonathan Nieder wrote:
> Indan Zupancic wrote:
>> I'm not pushing for any official convention, just what seems good taste.
>
> In cases like this, conventions (consistency and best practices) are
> very important.

I'm in no position to decide about this code's fate anyway, I only voice
my opinion in the hope people won't make the mistake of adding this as a
new system call.

>
>> Less code added, less bloat. Architecture independent, no need to update
>> all system call tables everywhere (all archs, libc versions and strace).
>> Two files changed, instead of 7 (which only hooks up x86).
>
> Thanks for explaining. Those do seem like good reasons to use a ioctl
> instead of a new syscall.

Ioctl or sync_file_range, it's obscure enough for an ioctl I guess.

>> In this case it's just a performance improvement over sync(2). It doesn't
>> add a new feature. Main argument given for the performance problem seems
>> to be "NFS can be slow". Anything else?
>
> Huh? It is not just the speed of the sync --- unnecessary writeback
> will cause wear on your thumbdrive, eat up your laptop battery, and
> kill I/O performance in other tasks running at the same time.

The writeback will happen sooner or later, so there is no unnecessary
writeback, except if you're overwriting/deleting just written data. If
you're worried about unnecessary writeback then don't do any synching.
You're actually arguing againt this feature and for fsync().

Syncfs won't be called frequently anyway, if it was then fsync could
be used instead. So it's pretty much a slightly better version of sync.

You could call it sync2() and add a path parameter. And after a few
years replace it with sync3 with a flag argument added. Then add a
sync4 with a sigmask too. That seems the new convention and would be
consistent...

I think all new system calls (or other highly visible ABI change) should
have half a year thinking time, and when they're in they're guaranteed to
be not stable for at least 2 years. If after that time they're still in,
they become stable and part of the official ABI. Removing something new
should be as easy as adding something new. But the current trend of easily
added, but hard to remove features is asking for long-term messiness. It
takes time for code to depend on a new feature, so removing bad new stuff
isn't as bad as removing oldstuff. Pity Linus didn't figure that out yet.

> I'm afraid I don't understand what you're saying here at all. Would
> you say that fsync is superfluous, too?

No, fsync actually makes sense.

Greetings,

Indan

2011-03-12 12:41:54

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On 03/11/2011 10:50 PM, Indan Zupancic wrote:
> On Sat, March 12, 2011 03:52, Ric Wheeler wrote:
>> Sage was pretty clear in stating the motivation which is the use case you
>> think is questionable. Probably not interesting for consumer devices, but
>> definitely extremely interesting in large servers with multiple file systems.
> Not really, he just said "It is frequently useful to sync a single file system",
> without giving any use cases. He then gave two situations where either sync or
> fsync isn't sufficient, to which I replied earlier and you called missing the
> point. But that's not the same as giving a use case.
>
>> In fact, we do it today as mentioned earlier in the thread - this simply
>> exports that useful capability in a clean way.
> Did you use the remount trick or the ioctl? If the latter, is it sufficient
> for your need? If the first, would guaranteeing that mount -o remount,rw
> trick will keep working solve the problem for you?
>
> When or why would you want to sync one specific filesystem? As you're doing
> it, you could explain your use case better instead of telling me I'm missing
> the point.
>
> If sync(2) didn't exist and people wanted to add it I'd complain too. This
> has all the problems of sync(2), but with the "not sure if all the files are
> on the file system I think" problem added.
>
> Greetings,
>
> Indan

System wide sync and file system specific sync are useful in several scenarios.
One example is something like a restore (or an rsync) where an application
copies lots of files into a target file system. It is *much* faster to do this
without doing an fsync() call per file (orders of magnitude faster).

Once your bulk write has finished, it is prudent to run "sync" to push that data
out to disk before you trust that your new copy will survive a crash or power
outage. Alternatively, you could reopen and fsync each file that was copied
which seems to be the method that you prefer.

Ric

2011-03-12 17:41:29

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Fri, Mar 11, 2011 at 08:10:01PM -0600, Jonathan Nieder wrote:
> Indan Zupancic wrote:
>
> > I'm not pushing for any official convention, just what seems good taste.
>
> In cases like this, conventions (consistency and best practices) are
> very important.
>
> > Less code added, less bloat. Architecture independent, no need to update
> > all system call tables everywhere (all archs, libc versions and strace).
> > Two files changed, instead of 7 (which only hooks up x86).
>
> Thanks for explaining. Those do seem like good reasons to use a ioctl
> instead of a new syscall.

No, make it a syscall, it's more obvious and will be documented much
better.

thanks,

greg k-h

2011-03-12 18:31:37

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On 03/11/2011 10:50 PM, Indan Zupancic wrote:
> If sync(2) didn't exist and people wanted to add it I'd complain too. This
> has all the problems of sync(2), but with the "not sure if all the files are
> on the file system I think" problem added.

You are decades too late, then...

sync_file_range() is not appropriate because that works on only one fd.

The new syscall is fine, and addresses a need.

Jeff

2011-03-12 19:22:32

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Sat, 2011-03-12 at 02:33 +0100, Indan Zupancic wrote:
> > I think that you missed the point of the extension.
> >
> > Ric
>
> The point is clear, it's to synchronize a specific file system instead
> of all of them.
>
> But actually doing that from a program is harder than it looks, because
> programs work with files, not file systems. To make this feature useful
> the program needs meta information it can't easily get. That was my first
> point.

I had a program, actually a set of programs, which test a file-system.
And this set of programs needed such a feature quite a lot, to sync the
FS which is being tested and nothing else - for both performance reasons
and to put more stress to the FS under testing. We used -o remount, rw
for this - but this forced us to run under root.

IOW, there are programs which take a mount point as an input parameter
and want do things with the whole FS, not only individual files.

--
Best Regards,
Artem Bityutskiy (Битюцкий Артём)

2011-03-12 19:28:16

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Sat, 2011-03-12 at 04:50 +0100, Indan Zupancic wrote:
> Did you use the remount trick or the ioctl? If the latter, is it sufficient
> for your need? If the first, would guaranteeing that mount -o remount,rw
> trick will keep working solve the problem for you?

-o remount,rw wants sysadmin capability, which limits the usage of this
trick.

--
Best Regards,
Artem Bityutskiy (Битюцкий Артём)

2011-03-13 21:08:21

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

> +/*
> + * sync a single super
> + */

Not exactly the most descriptive comment.

Otherwise looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

2011-03-14 01:32:15

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Sat, March 12, 2011 19:31, Jeff Garzik wrote:
> On 03/11/2011 10:50 PM, Indan Zupancic wrote:
>> If sync(2) didn't exist and people wanted to add it I'd complain too. This
>> has all the problems of sync(2), but with the "not sure if all the files are
>> on the file system I think" problem added.
>
> You are decades too late, then...

I know.

> sync_file_range() is not appropriate because that works on only one fd.
>
> The new syscall is fine, and addresses a need.

The new syscall works on only one fd too. The behaviour of the proposed
syncfs and an extended sync_file_range is exactly the same.

Greetings,

Indan

2011-03-14 01:38:02

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system


On Mar 13, 2011, at 9:31 PM, Indan Zupancic wrote:

> The new syscall works on only one fd too. The behaviour of the proposed
> syncfs and an extended sync_file_range is exactly the same.

No, it's quite different. One syncs the entire file system; sync_file_range
requests that the pages associated with a file be pushed to disk --- it says
nothing about the metadata associated with the file.

I'm in favor of the new sys_syncfs system call.

-- Ted

2011-03-14 01:38:08

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Sat, March 12, 2011 20:22, Artem Bityutskiy wrote:
> On Sat, 2011-03-12 at 02:33 +0100, Indan Zupancic wrote:
>> > I think that you missed the point of the extension.
>> >
>> > Ric
>>
>> The point is clear, it's to synchronize a specific file system instead
>> of all of them.
>>
>> But actually doing that from a program is harder than it looks, because
>> programs work with files, not file systems. To make this feature useful
>> the program needs meta information it can't easily get. That was my first
>> point.
>
> I had a program, actually a set of programs, which test a file-system.
> And this set of programs needed such a feature quite a lot, to sync the
> FS which is being tested and nothing else - for both performance reasons
> and to put more stress to the FS under testing. We used -o remount, rw
> for this - but this forced us to run under root.

You could use a tiny setuid root helper binary that does the remount trick.

> IOW, there are programs which take a mount point as an input parameter
> and want do things with the whole FS, not only individual files.

And those probably need to mount or unmount the FS at one point or the
other, or can use the setuid helper binary.

If you want to add syncfs, at least ad it together with BSD so it's slightly
portable.

Greetings,

Indan

2011-03-14 01:45:16

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On 03/13/2011 09:31 PM, Indan Zupancic wrote:
> The new syscall works on only one fd too. The behaviour of the proposed
> syncfs and an extended sync_file_range is exactly the same.

No, it's not. You should read the patch before commenting.

Jeff


2011-03-14 01:47:55

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Mon, March 14, 2011 02:37, Theodore Tso wrote:
>
> On Mar 13, 2011, at 9:31 PM, Indan Zupancic wrote:
>
>> The new syscall works on only one fd too. The behaviour of the proposed
>> syncfs and an extended sync_file_range is exactly the same.
>
> No, it's quite different. One syncs the entire file system; sync_file_range
> requests that the pages associated with a file be pushed to disk --- it says
> nothing about the metadata associated with the file.

You missed the part where a new flag for sync_file_range would be added that
does sync the whole file system.

> I'm in favor of the new sys_syncfs system call.

Care to define its behaviour? Does it wait till all data is written out,
does it flush disk caches, what if new data is written to the fs while
a sync is going on, etc.

Greetings,

Indan

2011-03-14 01:56:57

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Sat, March 12, 2011 18:32, Greg KH wrote:
> On Fri, Mar 11, 2011 at 08:10:01PM -0600, Jonathan Nieder wrote:
>> Indan Zupancic wrote:
>>
>> > I'm not pushing for any official convention, just what seems good taste.
>>
>> In cases like this, conventions (consistency and best practices) are
>> very important.
>>
>> > Less code added, less bloat. Architecture independent, no need to update
>> > all system call tables everywhere (all archs, libc versions and strace).
>> > Two files changed, instead of 7 (which only hooks up x86).
>>
>> Thanks for explaining. Those do seem like good reasons to use a ioctl
>> instead of a new syscall.
>
> No, make it a syscall, it's more obvious and will be documented much
> better.

There is no such guarantee. Everyone seems to want to add this new syncfs,
but it's not even defined what it does. "Same as sync, but only on one fs"
is IMHO not good enough, because sync's behaviour is pretty badly documented,
and that's a system call. The sync_file_range argument effects are quite
well defined, on the other hand, unlike sync behaviour. You're right for
ioctls though.

If it has to be added, I'm in favour of adding it as a new sync_file_range
flag option, but wouldn't mind if it was added as some obscure ioctl either.

Do you think that widespread sync() usage is something good? Don't you
think sync(2) should be used as little as possible? And if so, why is
it any different than syncfs?

IMHO we're talking about some system call that shouldn't be used often, and
only rarely in very specific cases. They could use sync(2) with exactly the
same effect, but for performance reasons syncfs() is preferred. So instead
of extremely slow it will be very slow in some cases. Hurray, progress!

Next time you want an async version, and have to add a new systemcall with
a flag argument. So instead of going down that path, please add one with
a flag argument immediately. E.g:

syncfs(const char* path, int fd, int flags);

A NULL path, -1 fd and 0 flags is equal to sync (or something), and then
vary the options. But as it's for a file system, wouldn't it make more
sense to have just a path argument instead of an fd one?

Greetings,

Indan

2011-03-14 01:59:04

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Mon, March 14, 2011 02:45, Jeff Garzik wrote:
> On 03/13/2011 09:31 PM, Indan Zupancic wrote:
>> The new syscall works on only one fd too. The behaviour of the proposed
>> syncfs and an extended sync_file_range is exactly the same.
>
> No, it's not. You should read the patch before commenting.

Have you read my sync_file_range patch? Because it's exactly the
same code, but put in sync_file_range instead of a new system call.
How that can have different behaviour, I have no idea.

2011-03-14 04:27:00

by Sage Weil

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Mon, 14 Mar 2011, Indan Zupancic wrote:
> Everyone seems to want to add this new syncfs, but it's not even defined
> what it does. "Same as sync, but only on one fs" is IMHO not good
> enough, because sync's behaviour is pretty badly documented, and that's
> a system call.

How about the man page below? I tried to avoid the somewhat antiquated
implementation specific terminology in the sync(2) man page.

I think adding this functionality into sync_file_range(2) is forcing
unrelated functionality into an existing interface; sync_file_range
operates on _files_, not an entire file system. With each API addition it
is more important to make the interface simple and intuitive than to
minimize the size of our patches. IMO that's why a new syscall is
preferable to, say, an equivalent ioctl.

Thanks-
sage


.TH SYNCFS 2 2011-03-13 "Linux" "Linux Programmer's Manual"
.SH NAME
syncfs \- commit cached file system state to stable storage
.SH SYNOPSIS
.B #include <unistd.h>
.sp
.B void syncfs(int fd);
.SH DESCRIPTION
.BR syncfs ()
flushes any cached data modifications to the file system containing the
file referenced by the file descriptor
.I fd
to stable storage (usually a disk). This includes the results of any
file modifications or other file system operations that have completed
prior to the call to
.BR syncfs(2).
This is similar to
.BR sync(2),
but will commit changes for only a single file system instead of all
mounted file systems.
.SH ERRORS
This function is always successful.
.SH "SEE ALSO"
.BR bdflush (2),
.BR fdatasync (2),
.BR fsync (2),
.BR sync (2),
.BR sync (8),
.BR update (8)

2011-03-14 05:53:01

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Mon, 2011-03-14 at 02:38 +0100, Indan Zupancic wrote:
> > I had a program, actually a set of programs, which test a file-system.
> > And this set of programs needed such a feature quite a lot, to sync the
> > FS which is being tested and nothing else - for both performance reasons
> > and to put more stress to the FS under testing. We used -o remount, rw
> > for this - but this forced us to run under root.
>
> You could use a tiny setuid root helper binary that does the remount trick.

Yes, but I think this is not elegant solution.

--
Best Regards,
Artem Bityutskiy (Битюцкий Артём)

2011-03-14 09:27:36

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Mon, March 14, 2011 05:29, Sage Weil wrote:
> On Mon, 14 Mar 2011, Indan Zupancic wrote:
>> Everyone seems to want to add this new syncfs, but it's not even defined
>> what it does. "Same as sync, but only on one fs" is IMHO not good
>> enough, because sync's behaviour is pretty badly documented, and that's
>> a system call.
>
> How about the man page below? I tried to avoid the somewhat antiquated
> implementation specific terminology in the sync(2) man page.

Good enough.

>
> I think adding this functionality into sync_file_range(2) is forcing
> unrelated functionality into an existing interface; sync_file_range
> operates on _files_, not an entire file system. With each API addition
> it is more important to make the interface simple and intuitive than
> to minimize the size of our patches. IMO that's why a new syscall
> is preferable to, say, an equivalent ioctl.

Your new syncfs also operates on files, it's only the name of sync_file_range
which is not very fitting. Other than that, syncing files in weird, nonstandard
ways is what it does and in that sense it's a good place for syncfs functionality.

But hey, it seems I'm the only one favouring that approach, and as you don't
mind the size of your patch, make sure to add support for syncfs to all other
23 archs too in your next patch submission.

Take care,

Indan


> Thanks-
> sage
>
>
> .TH SYNCFS 2 2011-03-13 "Linux" "Linux Programmer's Manual"
> .SH NAME
> syncfs \- commit cached file system state to stable storage
> .SH SYNOPSIS
> .B #include <unistd.h>
> .sp
> .B void syncfs(int fd);
> .SH DESCRIPTION
> .BR syncfs ()
> flushes any cached data modifications to the file system containing the
> file referenced by the file descriptor
> .I fd
> to stable storage (usually a disk). This includes the results of any
> file modifications or other file system operations that have completed
> prior to the call to
> .BR syncfs(2).
> This is similar to
> .BR sync(2),
> but will commit changes for only a single file system instead of all
> mounted file systems.
> .SH ERRORS
> This function is always successful.
> .SH "SEE ALSO"
> .BR bdflush (2),
> .BR fdatasync (2),
> .BR fsync (2),
> .BR sync (2),
> .BR sync (8),
> .BR update (8)
>

2011-03-14 10:22:56

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system


On Mar 14, 2011, at 5:27 AM, Indan Zupancic wrote:

>
> Your new syncfs also operates on files, it's only the name of sync_file_range
> which is not very fitting. Other than that, syncing files in weird, nonstandard
> ways is what it does and in that sense it's a good place for syncfs functionality.

It operates on a file system. It takes a file descriptor as an argument. (Much
like the statfs(2) system call.) There's a difference.

- Ted

2011-03-14 20:11:18

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Mon, 14 Mar 2011 02:56:52 +0100 (CET)
"Indan Zupancic" <[email protected]> wrote:

> On Sat, March 12, 2011 18:32, Greg KH wrote:
> > On Fri, Mar 11, 2011 at 08:10:01PM -0600, Jonathan Nieder wrote:
> >> Indan Zupancic wrote:
> >>
> >> > I'm not pushing for any official convention, just what seems good taste.
> >>
> >> In cases like this, conventions (consistency and best practices) are
> >> very important.
> >>
> >> > Less code added, less bloat. Architecture independent, no need to update
> >> > all system call tables everywhere (all archs, libc versions and strace).
> >> > Two files changed, instead of 7 (which only hooks up x86).
> >>
> >> Thanks for explaining. Those do seem like good reasons to use a ioctl
> >> instead of a new syscall.
> >
> > No, make it a syscall, it's more obvious and will be documented much
> > better.
>
> There is no such guarantee. Everyone seems to want to add this new syncfs,
> but it's not even defined what it does. "Same as sync, but only on one fs"
> is IMHO not good enough, because sync's behaviour is pretty badly documented,
> and that's a system call. The sync_file_range argument effects are quite
> well defined, on the other hand, unlike sync behaviour. You're right for
> ioctls though.

I think the semantics of sync are easily enough defined, even if
they're not well-defined in the documentation: all data which was dirty
at the time sync() was called will be written back and accessible when
the sync() returns.

I do agree that this should be a standalone syscall, not grafted into
sync_file_range() or into an ioctl.


That being said, we have two similar-looking-but-quite-different "sync"
concepts in the kernel. One is "sync for data integrity" and the other
is "sync to reduce the dirty memory load". The latter is not a data
integrity thing - it is a resource management thing.

There might one day be a requirement to be able to initiate a
resource-management-style writeback against a whole filesystem. When
that happens, we'll regret not having added a "mode" argument to
sys_syncfs().

Or maybe not - given that we're syncing the entire fs and that
sync_filesystem() does the two-pass "write for data cleaning then write
for data integrity" thing, it could be that a syncfs-for-data-cleaning
operation has little performance benefit over a
syncfs-for-data-integrity operation.

2011-03-14 20:30:06

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Mon, 2011-03-14 at 13:10 -0700, Andrew Morton wrote:
> There might one day be a requirement to be able to initiate a
> resource-management-style writeback against a whole filesystem. When
> that happens, we'll regret not having added a "mode" argument to
> sys_syncfs().

I think Indan is right about an additional argument which could be used
for future extensions, what is the problem adding it?

--
Best Regards,
Artem Bityutskiy (Битюцкий Артём)

2011-03-14 21:11:37

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Mon, Mar 14, 2011 at 01:10:42PM -0700, Andrew Morton wrote:
>
> There might one day be a requirement to be able to initiate a
> resource-management-style writeback against a whole filesystem. When
> that happens, we'll regret not having added a "mode" argument to
> sys_syncfs().

I'm a bit nervous about exposing WB_SYNC_NONE to userspace, because
its semantics are *definitely* hard to describe. For example, at the
moment if you do a WB_SYNC_NONE writeback, the writeback code will
clamp the amount of data written back for each inode to
MAX_WRITEBACK_PAGES (1024) pages. Do we want to document that?
Probably not! But if we don't document it, what can userspace expect?

If you just issue a writeback_inodes_sb(), it's not the case that it
will start a process that will eventually write out everything (i.e.,
it's not the equivalent of a non-blocking data integrity sync). It
just means, "write out some stuff".

I could imagine userspace wanting to start a non-blocking writeout of
all data blocking pages, and which doesn't cause queue flush / barrier
requests. (i.e., a non-blocking-non-barrier-issuing-but-otherwise-a-
data-integrity writeback) But that's not something that the current
writeback machinery can do easily, at least not today.

It wouldn't hurt to have a "flags" field which we could expand later
--- but that can lead to portability headaches for userspace programs
that don't know whether a particular kernel is going to support a
particular flag or not. So it's certainly not a panacea.

- Ted

2011-03-14 21:21:45

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Mon, 14 Mar 2011 17:11:19 -0400
"Ted Ts'o" <[email protected]> wrote:

> On Mon, Mar 14, 2011 at 01:10:42PM -0700, Andrew Morton wrote:
> >
> > There might one day be a requirement to be able to initiate a
> > resource-management-style writeback against a whole filesystem. When
> > that happens, we'll regret not having added a "mode" argument to
> > sys_syncfs().
>
> I'm a bit nervous about exposing WB_SYNC_NONE to userspace, because
> its semantics are *definitely* hard to describe. For example, at the
> moment if you do a WB_SYNC_NONE writeback, the writeback code will
> clamp the amount of data written back for each inode to
> MAX_WRITEBACK_PAGES (1024) pages.

Wha? It does? When did that get broken?

> Do we want to document that?
> Probably not! But if we don't document it, what can userspace expect?
>
> If you just issue a writeback_inodes_sb(), it's not the case that it
> will start a process that will eventually write out everything (i.e.,
> it's not the equivalent of a non-blocking data integrity sync). It
> just means, "write out some stuff".
>
> I could imagine userspace wanting to start a non-blocking writeout of
> all data blocking pages, and which doesn't cause queue flush / barrier
> requests. (i.e., a non-blocking-non-barrier-issuing-but-otherwise-a-
> data-integrity writeback) But that's not something that the current
> writeback machinery can do easily, at least not today.

Well. Current implementation shortcomings don't carry a lot of weight
when designing a permanent interface.

> It wouldn't hurt to have a "flags" field which we could expand later
> --- but that can lead to portability headaches for userspace programs
> that don't know whether a particular kernel is going to support a
> particular flag or not. So it's certainly not a panacea.

I don't see a need to add an arg to syncfs() really. But we should
demonstrate that we've thought about it ;)

2011-03-14 21:23:31

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Monday 14 March 2011 22:11:19 Ted Ts'o wrote:
> It wouldn't hurt to have a "flags" field which we could expand later
> --- but that can lead to portability headaches for userspace programs
> that don't know whether a particular kernel is going to support a
> particular flag or not. So it's certainly not a panacea.

I think adding an unused flags argument can't hurt.

We could be fancy and ignore half the bits but bail out on the other
half with -EINVAL. That would make it possible to add both compatible
(default being full sync on old kernels) and incompatible (getting
rejected on old kernels) flags.

Arnd

2011-03-14 23:18:09

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Mon, Mar 14, 2011 at 02:20:32PM -0700, Andrew Morton wrote:
> > I'm a bit nervous about exposing WB_SYNC_NONE to userspace, because
> > its semantics are *definitely* hard to describe. For example, at the
> > moment if you do a WB_SYNC_NONE writeback, the writeback code will
> > clamp the amount of data written back for each inode to
> > MAX_WRITEBACK_PAGES (1024) pages.
>
> Wha? It does? When did that get broken?

Oops, sorry, I misread the code in wb_writeback(). My bad! I
misinterpreted what write_chunk does in that function.
MAX_WRITEBACK_PAGES now really is the minimum amount of pages that
wb_writeback() will request the file system to write back. I'm not
sure why we bother with write_chunk any more, but it shouldn't be
doing any harm any more.

- Ted

2011-03-15 10:12:03

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Sun, Mar 13, 2011 at 09:29:17PM -0700, Sage Weil wrote:
> On Mon, 14 Mar 2011, Indan Zupancic wrote:
> > Everyone seems to want to add this new syncfs, but it's not even defined
> > what it does. "Same as sync, but only on one fs" is IMHO not good
> > enough, because sync's behaviour is pretty badly documented, and that's
> > a system call.
>
> How about the man page below? I tried to avoid the somewhat antiquated
> implementation specific terminology in the sync(2) man page.
>
> I think adding this functionality into sync_file_range(2) is forcing
> unrelated functionality into an existing interface; sync_file_range
> operates on _files_, not an entire file system. With each API addition it
> is more important to make the interface simple and intuitive than to
> minimize the size of our patches. IMO that's why a new syscall is
> preferable to, say, an equivalent ioctl.
>
> Thanks-
> sage
>
>
> .TH SYNCFS 2 2011-03-13 "Linux" "Linux Programmer's Manual"
> .SH NAME
> syncfs \- commit cached file system state to stable storage
> .SH SYNOPSIS
> .B #include <unistd.h>
> .sp
> .B void syncfs(int fd);
> .SH DESCRIPTION
> .BR syncfs ()
> flushes any cached data modifications to the file system containing the
> file referenced by the file descriptor
> .I fd
> to stable storage (usually a disk). This includes the results of any
> file modifications or other file system operations that have completed
> prior to the call to
> .BR syncfs(2).
> This is similar to
> .BR sync(2),
> but will commit changes for only a single file system instead of all
> mounted file systems.
> .SH ERRORS
> This function is always successful.

Perhaps we should consider propagating errors out to the user
application rather than discarding them in kernel and pretending we
can't ever have a write error?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-03-15 12:57:52

by Sage Weil

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Tue, 15 Mar 2011, Dave Chinner wrote:
> On Sun, Mar 13, 2011 at 09:29:17PM -0700, Sage Weil wrote:
> > On Mon, 14 Mar 2011, Indan Zupancic wrote:
> > > Everyone seems to want to add this new syncfs, but it's not even defined
> > > what it does. "Same as sync, but only on one fs" is IMHO not good
> > > enough, because sync's behaviour is pretty badly documented, and that's
> > > a system call.
> >
> > How about the man page below? I tried to avoid the somewhat antiquated
> > implementation specific terminology in the sync(2) man page.
> >
> > I think adding this functionality into sync_file_range(2) is forcing
> > unrelated functionality into an existing interface; sync_file_range
> > operates on _files_, not an entire file system. With each API addition it
> > is more important to make the interface simple and intuitive than to
> > minimize the size of our patches. IMO that's why a new syscall is
> > preferable to, say, an equivalent ioctl.
> >
> > Thanks-
> > sage
> >
> >
> > .TH SYNCFS 2 2011-03-13 "Linux" "Linux Programmer's Manual"
> > .SH NAME
> > syncfs \- commit cached file system state to stable storage
> > .SH SYNOPSIS
> > .B #include <unistd.h>
> > .sp
> > .B void syncfs(int fd);
> > .SH DESCRIPTION
> > .BR syncfs ()
> > flushes any cached data modifications to the file system containing the
> > file referenced by the file descriptor
> > .I fd
> > to stable storage (usually a disk). This includes the results of any
> > file modifications or other file system operations that have completed
> > prior to the call to
> > .BR syncfs(2).
> > This is similar to
> > .BR sync(2),
> > but will commit changes for only a single file system instead of all
> > mounted file systems.
> > .SH ERRORS
> > This function is always successful.
>
> Perhaps we should consider propagating errors out to the user
> application rather than discarding them in kernel and pretending we
> can't ever have a write error?

Yeah. It also occurred to me shortly after sending this that, at the very
least, we should return an error so that users can get ENOSYS on older
kernels.

sage

2011-03-15 15:56:13

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

Should there be a "wait" argument or flag that allows an app to start the syncfs(), do something, and then call again to wait for completion?

A more advanced implementation would have the "wait=0" call return an opaque wait handle (e.g. transaction ID) and then calling back with this handle as an argument would only wait if that handle hadn't committed yet. That allows userspace to do efficient batching without having to wait for the full sync to complete.

Cheers, Andreas

On 2011-03-15, at 4:11, Dave Chinner <[email protected]> wrote:

> On Sun, Mar 13, 2011 at 09:29:17PM -0700, Sage Weil wrote:
>> On Mon, 14 Mar 2011, Indan Zupancic wrote:
>>> Everyone seems to want to add this new syncfs, but it's not even defined
>>> what it does. "Same as sync, but only on one fs" is IMHO not good
>>> enough, because sync's behaviour is pretty badly documented, and that's
>>> a system call.
>>
>> How about the man page below? I tried to avoid the somewhat antiquated
>> implementation specific terminology in the sync(2) man page.
>>
>> I think adding this functionality into sync_file_range(2) is forcing
>> unrelated functionality into an existing interface; sync_file_range
>> operates on _files_, not an entire file system. With each API addition it
>> is more important to make the interface simple and intuitive than to
>> minimize the size of our patches. IMO that's why a new syscall is
>> preferable to, say, an equivalent ioctl.
>>
>> Thanks-
>> sage
>>
>>
>> .TH SYNCFS 2 2011-03-13 "Linux" "Linux Programmer's Manual"
>> .SH NAME
>> syncfs \- commit cached file system state to stable storage
>> .SH SYNOPSIS
>> .B #include <unistd.h>
>> .sp
>> .B void syncfs(int fd);
>> .SH DESCRIPTION
>> .BR syncfs ()
>> flushes any cached data modifications to the file system containing the
>> file referenced by the file descriptor
>> .I fd
>> to stable storage (usually a disk). This includes the results of any
>> file modifications or other file system operations that have completed
>> prior to the call to
>> .BR syncfs(2).
>> This is similar to
>> .BR sync(2),
>> but will commit changes for only a single file system instead of all
>> mounted file systems.
>> .SH ERRORS
>> This function is always successful.
>
> Perhaps we should consider propagating errors out to the user
> application rather than discarding them in kernel and pretending we
> can't ever have a write error?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2011-03-15 16:05:48

by Sage Weil

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Tue, 15 Mar 2011, Andreas Dilger wrote:
> Should there be a "wait" argument or flag that allows an app to start
> the syncfs(), do something, and then call again to wait for completion?
>
> A more advanced implementation would have the "wait=0" call return an
> opaque wait handle (e.g. transaction ID) and then calling back with this
> handle as an argument would only wait if that handle hadn't committed
> yet. That allows userspace to do efficient batching without having to
> wait for the full sync to complete.

FWIW this is what the btrfs {START,WAIT}_SYNC ioctls do. Because it ties
into btrfs' transactions, there's an additional semantic that operations
after the START_SYNC are not included in the committed state (this time
around, at least). Absent that (which I'm not sure maps cleanly onto many
other file systems currently, and is of limited value when not used in
concert with btrfs snapshots), I'm not sure a wait flag is helpful. Is it
any different than an application doing the syncfs() in a separate thread?

sage


>
> Cheers, Andreas
>
> On 2011-03-15, at 4:11, Dave Chinner <[email protected]> wrote:
>
> > On Sun, Mar 13, 2011 at 09:29:17PM -0700, Sage Weil wrote:
> >> On Mon, 14 Mar 2011, Indan Zupancic wrote:
> >>> Everyone seems to want to add this new syncfs, but it's not even defined
> >>> what it does. "Same as sync, but only on one fs" is IMHO not good
> >>> enough, because sync's behaviour is pretty badly documented, and that's
> >>> a system call.
> >>
> >> How about the man page below? I tried to avoid the somewhat antiquated
> >> implementation specific terminology in the sync(2) man page.
> >>
> >> I think adding this functionality into sync_file_range(2) is forcing
> >> unrelated functionality into an existing interface; sync_file_range
> >> operates on _files_, not an entire file system. With each API addition it
> >> is more important to make the interface simple and intuitive than to
> >> minimize the size of our patches. IMO that's why a new syscall is
> >> preferable to, say, an equivalent ioctl.
> >>
> >> Thanks-
> >> sage
> >>
> >>
> >> .TH SYNCFS 2 2011-03-13 "Linux" "Linux Programmer's Manual"
> >> .SH NAME
> >> syncfs \- commit cached file system state to stable storage
> >> .SH SYNOPSIS
> >> .B #include <unistd.h>
> >> .sp
> >> .B void syncfs(int fd);
> >> .SH DESCRIPTION
> >> .BR syncfs ()
> >> flushes any cached data modifications to the file system containing the
> >> file referenced by the file descriptor
> >> .I fd
> >> to stable storage (usually a disk). This includes the results of any
> >> file modifications or other file system operations that have completed
> >> prior to the call to
> >> .BR syncfs(2).
> >> This is similar to
> >> .BR sync(2),
> >> but will commit changes for only a single file system instead of all
> >> mounted file systems.
> >> .SH ERRORS
> >> This function is always successful.
> >
> > Perhaps we should consider propagating errors out to the user
> > application rather than discarding them in kernel and pretending we
> > can't ever have a write error?
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > [email protected]
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>

2011-03-15 20:20:24

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3] introduce sys_syncfs to sync a single file system

On Tue, 15 Mar 2011 09:56:08 -0600
Andreas Dilger <[email protected]> wrote:

> Should there be a "wait" argument or flag that allows an app to start the syncfs(), do something, and then call again to wait for completion?

I don't think so. If userspace wants to do that then fork().

> > Perhaps we should consider propagating errors out to the user
> > application rather than discarding them in kernel and pretending we
> > can't ever have a write error?

That would be nice, but is probably a pretty complex thing to
implement. The manpage should include words indicating that syncfs()
can return an errno. That way, userspace will hopefully have the
appropriate checks, whcih will become more useful if/when the kernel
implementation gets fixed up.