2012-05-31 20:44:08

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH 0/2] msync improvements

These two patches are improvements on the implementation of the msync
system call. To give some context, I found them while working on
the implementation of a persistent dirty bitmap.

Paolo Bonzini (2):
msync: support syncing a small part of the file
msync: start async writeout when MS_ASYNC

include/linux/fs.h | 3 +-
mm/fadvise.c | 2 +-
mm/filemap.c | 11 +++++---
mm/msync.c | 63 +++++++++++++++++++++++++++++++++-------------------
4 files changed, 50 insertions(+), 29 deletions(-)


2012-05-31 20:44:11

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH 1/2] msync: support syncing a small part of the file

msync does not need to flush changes to the entire file, even with MS_SYNC.
Instead, it can use vfs_fsync_range to only synchronize a part of the file.

In addition, not all metadata has to be synced; msync is closer to
fdatasync than it is to msync. So, pass 1 to vfs_fsync_range.

Cc: Andrew Morton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
mm/msync.c | 13 ++++++++++---
1 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 632df45..505fe99 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -15,7 +15,7 @@
#include <linux/sched.h>

/*
- * MS_SYNC syncs the entire file - including mappings.
+ * MS_SYNC syncs the specified range - including mappings.
*
* MS_ASYNC does not start I/O (it used to, up to 2.5.67).
* Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
@@ -58,6 +58,8 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
vma = find_vma(mm, start);
for (;;) {
struct file *file;
+ unsigned long next;
+ loff_t file_offset;

/* Still start < end. */
error = -ENOMEM;
@@ -77,18 +79,23 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
goto out_unlock;
}
file = vma->vm_file;
- start = vma->vm_end;
+ next = min(end, vma->vm_end);
if ((flags & MS_SYNC) && file &&
(vma->vm_flags & VM_SHARED)) {
+ file_offset = vma->vm_pgoff * PAGE_SIZE;
get_file(file);
up_read(&mm->mmap_sem);
- error = vfs_fsync(file, 0);
+ error = vfs_fsync_range(file,
+ start - vma->vm_start + file_offset,
+ next - vma->vm_start + file_offset, 1);
fput(file);
+ start = next;
if (error || start >= end)
goto out;
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
} else {
+ start = next;
if (start >= end) {
error = 0;
goto out_unlock;
--
1.7.1

2012-05-31 20:44:14

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH 2/2] msync: start async writeout when MS_ASYNC

msync.c says that applications had better use fsync() or fadvise(FADV_DONTNEED)
instead of MS_ASYNC. Both advices are really bad:

* fsync() can be a replacement for MS_SYNC, not for MS_ASYNC;

* fadvise(FADV_DONTNEED) invalidates the pages completely, which will make
later accesses expensive.

Having the possibility to schedule a writeback immediately is an advantage
for the applications. They can do the same thing that fadvise does,
but without the invalidation part. The implementation is also similar
to fadvise, but with tag-and-write enabled.

One example is if you are implementing a persistent dirty bitmap.
Whenever you set bits to 1 you need to synchronize it with MS_SYNC, so
that dirtiness is reported properly after a host crash. If you have set
any bits to 0, getting them to disk is not needed for correctness, but
it is still desirable to save some work after a host crash. You could
simply use MS_SYNC in a separate thread, but MS_ASYNC provides exactly
the desired semantics and is easily done in the kernel.

If the application does not want to start I/O, it can simply call msync
with flags equal to MS_INVALIDATE. This one remains a no-op, as it should
be on a reasonable implementation.

Cc: Andrew Morton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
include/linux/fs.h | 3 +-
mm/fadvise.c | 2 +-
mm/filemap.c | 11 ++++++---
mm/msync.c | 60 ++++++++++++++++++++++++++++++---------------------
4 files changed, 45 insertions(+), 31 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8de6755..0aeedb9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2196,7 +2196,8 @@ extern int filemap_write_and_wait(struct address_space *mapping);
extern int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend);
extern int __filemap_fdatawrite_range(struct address_space *mapping,
- loff_t start, loff_t end, int sync_mode);
+ loff_t start, loff_t end, int sync_mode,
+ bool tagged_writepages);
extern int filemap_fdatawrite_range(struct address_space *mapping,
loff_t start, loff_t end);

diff --git a/mm/fadvise.c b/mm/fadvise.c
index 469491e..a3579f1 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -118,7 +118,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
case POSIX_FADV_DONTNEED:
if (!bdi_write_congested(mapping->backing_dev_info))
__filemap_fdatawrite_range(mapping, offset, endbyte,
- WB_SYNC_NONE);
+ WB_SYNC_NONE, 0);

/* First and last FULL page! */
start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
diff --git a/mm/filemap.c b/mm/filemap.c
index 79c4b2b..641e2a8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -191,6 +191,7 @@ static int sleep_on_page_killable(void *word)
* @start: offset in bytes where the range starts
* @end: offset in bytes where the range ends (inclusive)
* @sync_mode: enable synchronous operation
+ * @tagged_writepages: tag-and-write to avoid livelock (implicit if WB_SYNC_ALL)
*
* Start writeback against all of a mapping's dirty pages that lie
* within the byte offsets <start, end> inclusive.
@@ -201,7 +202,8 @@ static int sleep_on_page_killable(void *word)
* be waited upon, and not just skipped over.
*/
int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
- loff_t end, int sync_mode)
+ loff_t end, int sync_mode,
+ bool tagged_writepages)
{
int ret;
struct writeback_control wbc = {
@@ -209,6 +211,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
.nr_to_write = LONG_MAX,
.range_start = start,
.range_end = end,
+ .tagged_writepages = tagged_writepages,
};

if (!mapping_cap_writeback_dirty(mapping))
@@ -221,7 +224,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
static inline int __filemap_fdatawrite(struct address_space *mapping,
int sync_mode)
{
- return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
+ return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode, 0);
}

int filemap_fdatawrite(struct address_space *mapping)
@@ -233,7 +236,7 @@ EXPORT_SYMBOL(filemap_fdatawrite);
int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
loff_t end)
{
- return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
+ return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL, 1);
}
EXPORT_SYMBOL(filemap_fdatawrite_range);

@@ -361,7 +364,7 @@ int filemap_write_and_wait_range(struct address_space *mapping,

if (mapping->nrpages) {
err = __filemap_fdatawrite_range(mapping, lstart, lend,
- WB_SYNC_ALL);
+ WB_SYNC_ALL, 1);
/* See comment of filemap_write_and_wait() */
if (err != -EIO) {
int err2 = filemap_fdatawait_range(mapping,
diff --git a/mm/msync.c b/mm/msync.c
index 505fe99..4d1f813 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -13,20 +13,16 @@
#include <linux/file.h>
#include <linux/syscalls.h>
#include <linux/sched.h>
+#include <linux/backing-dev.h>
+#include <linux/writeback.h>

/*
* MS_SYNC syncs the specified range - including mappings.
*
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
- *
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
- * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
- * applications.
+ * MS_ASYNC only starts I/O, as it did up to 2.5.67, but only dirty pages
+ * will now be written. While the application may run fadvise(FADV_DONTNEED)
+ * against the fd to start async writeout immediately, invalidating the
+ * pages will make later accesses expensive.
*/
SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
{
@@ -78,30 +74,44 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
error = -EBUSY;
goto out_unlock;
}
+
+ error = 0;
file = vma->vm_file;
next = min(end, vma->vm_end);
- if ((flags & MS_SYNC) && file &&
- (vma->vm_flags & VM_SHARED)) {
- file_offset = vma->vm_pgoff * PAGE_SIZE;
- get_file(file);
- up_read(&mm->mmap_sem);
- error = vfs_fsync_range(file,
- start - vma->vm_start + file_offset,
- next - vma->vm_start + file_offset, 1);
- fput(file);
- start = next;
- if (error || start >= end)
- goto out;
- down_read(&mm->mmap_sem);
- vma = find_vma(mm, start);
- } else {
+ if (!file || !(vma->vm_flags & VM_SHARED) ||
+ !(flags & ~MS_INVALIDATE)) {
start = next;
if (start >= end) {
error = 0;
goto out_unlock;
}
vma = vma->vm_next;
+ continue;
+ }
+
+ file_offset = vma->vm_pgoff * PAGE_SIZE;
+ get_file(file);
+ up_read(&mm->mmap_sem);
+ if (flags & MS_SYNC) {
+ error = vfs_fsync_range(file,
+ start - vma->vm_start + file_offset,
+ next - vma->vm_start + file_offset, 1);
+ } else {
+ struct address_space *mapping = file->f_mapping;
+ /* end offset is inclusive! */
+ if (mapping &&
+ !bdi_write_congested(mapping->backing_dev_info))
+ __filemap_fdatawrite_range(mapping,
+ start - vma->vm_start + file_offset,
+ next - 1 - vma->vm_start + file_offset,
+ WB_SYNC_NONE, 1);
}
+ fput(file);
+ start = next;
+ if (error || start >= end)
+ goto out;
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, start);
}
out_unlock:
up_read(&mm->mmap_sem);
--
1.7.1

2012-06-12 15:38:16

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 0/2] msync improvements

Il 31/05/2012 22:43, Paolo Bonzini ha scritto:
> These two patches are improvements on the implementation of the msync
> system call. To give some context, I found them while working on
> the implementation of a persistent dirty bitmap.
>
> Paolo Bonzini (2):
> msync: support syncing a small part of the file
> msync: start async writeout when MS_ASYNC
>
> include/linux/fs.h | 3 +-
> mm/fadvise.c | 2 +-
> mm/filemap.c | 11 +++++---
> mm/msync.c | 63 +++++++++++++++++++++++++++++++++-------------------
> 4 files changed, 50 insertions(+), 29 deletions(-)
>

Ping?

Paolo

2012-06-13 21:26:17

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/2] msync: support syncing a small part of the file

On Thu, 31 May 2012 22:43:54 +0200
Paolo Bonzini <[email protected]> wrote:

> msync does not need to flush changes to the entire file, even with MS_SYNC.
> Instead, it can use vfs_fsync_range to only synchronize a part of the file.
>
> In addition, not all metadata has to be synced; msync is closer to
> fdatasync than it is to msync. So, pass 1 to vfs_fsync_range.

Would be nice, but if applications were previously assuming that an
msync() was syncing the whole file, this patch will secretly and subtly
break them. Convince me that this change won't weaken anyone's data
integrity ;)

> @@ -77,18 +79,23 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
> goto out_unlock;
> }
> file = vma->vm_file;
> - start = vma->vm_end;
> + next = min(end, vma->vm_end);
> if ((flags & MS_SYNC) && file &&
> (vma->vm_flags & VM_SHARED)) {
> + file_offset = vma->vm_pgoff * PAGE_SIZE;
> get_file(file);
> up_read(&mm->mmap_sem);
> - error = vfs_fsync(file, 0);
> + error = vfs_fsync_range(file,
> + start - vma->vm_start + file_offset,
> + next - vma->vm_start + file_offset, 1);

Is that an off-by-one? The "end" argument to vfs_fsync_range() is
"inclusive". ie, it indexes the last byte, not the last byte+1. It's
done this way so we can refer to end-of-file with -1UL.

2012-06-13 21:29:51

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/2] msync: start async writeout when MS_ASYNC

On Thu, 31 May 2012 22:43:55 +0200
Paolo Bonzini <[email protected]> wrote:

> msync.c says that applications had better use fsync() or fadvise(FADV_DONTNEED)
> instead of MS_ASYNC. Both advices are really bad:
>
> * fsync() can be a replacement for MS_SYNC, not for MS_ASYNC;
>
> * fadvise(FADV_DONTNEED) invalidates the pages completely, which will make
> later accesses expensive.
>
> Having the possibility to schedule a writeback immediately is an advantage
> for the applications. They can do the same thing that fadvise does,
> but without the invalidation part. The implementation is also similar
> to fadvise, but with tag-and-write enabled.
>
> One example is if you are implementing a persistent dirty bitmap.
> Whenever you set bits to 1 you need to synchronize it with MS_SYNC, so
> that dirtiness is reported properly after a host crash. If you have set
> any bits to 0, getting them to disk is not needed for correctness, but
> it is still desirable to save some work after a host crash. You could
> simply use MS_SYNC in a separate thread, but MS_ASYNC provides exactly
> the desired semantics and is easily done in the kernel.
>
> If the application does not want to start I/O, it can simply call msync
> with flags equal to MS_INVALIDATE. This one remains a no-op, as it should
> be on a reasonable implementation.

Means that people will find that their msync(MS_ASYNC) call will newly
start IO. This may well be undesirable for some.

Also, it hardwires into the kernel behaviour which userspace itself
could have initiated, with sync_file_range(). ie: reduced flexibility.

Perhaps we can update the msync.c code comments to direct people to
sync_file_range()?


One wonders how msync() works with nonlinear mappings. I guess
"badly". I think this was all discussed when we merged
remap_file_pages() (what a mistake that was) and we decided "too hard".

2012-06-13 22:08:12

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/2] msync: support syncing a small part of the file

On Wed, 13 Jun 2012 15:51:33 -0600
Zan Lynx <[email protected]> wrote:

> On Wed, 2012-06-13 at 14:26 -0700, Andrew Morton wrote:
> > On Thu, 31 May 2012 22:43:54 +0200
> > Paolo Bonzini <[email protected]> wrote:
> >
> > > msync does not need to flush changes to the entire file, even with MS_SYNC.
> > > Instead, it can use vfs_fsync_range to only synchronize a part of the file.
> > >
> > > In addition, not all metadata has to be synced; msync is closer to
> > > fdatasync than it is to msync. So, pass 1 to vfs_fsync_range.
> >
> > Would be nice, but if applications were previously assuming that an
> > msync() was syncing the whole file, this patch will secretly and subtly
> > break them. Convince me that this change won't weaken anyone's data
> > integrity ;)
>
> As an interested observer and a programmer who uses msync()...
>
> I never assumed msync() did the whole file.

OK, that's one user accounted for. 3 million to go.

Look, I'm not terribly averse to the change, but it does have this
risk. And it's a nasty risk because anyone who is hit by it simply
will not know that his applcation has lost some of its data integrity.

> It has an address and length
> argument. I always assumed it only looked for dirty pages within that
> range. That is also what the msync() documentation claims.
>
> As for weakening data integrity because of assumptions programmers *may*
> have made, I think this is a bad argument which followed to its logical
> conclusion can only lead to requiring an implicit sync() before and
> after every system call. :-)

No, not at all. The issue is the *removal* of a side-effect upon which
some apps/designers may have been depending. Perhaps unintentionally!

2012-06-13 22:15:15

by Zan Lynx

[permalink] [raw]
Subject: Re: [PATCH 1/2] msync: support syncing a small part of the file

On Wed, 2012-06-13 at 14:26 -0700, Andrew Morton wrote:
> On Thu, 31 May 2012 22:43:54 +0200
> Paolo Bonzini <[email protected]> wrote:
>
> > msync does not need to flush changes to the entire file, even with MS_SYNC.
> > Instead, it can use vfs_fsync_range to only synchronize a part of the file.
> >
> > In addition, not all metadata has to be synced; msync is closer to
> > fdatasync than it is to msync. So, pass 1 to vfs_fsync_range.
>
> Would be nice, but if applications were previously assuming that an
> msync() was syncing the whole file, this patch will secretly and subtly
> break them. Convince me that this change won't weaken anyone's data
> integrity ;)

As an interested observer and a programmer who uses msync()...

I never assumed msync() did the whole file. It has an address and length
argument. I always assumed it only looked for dirty pages within that
range. That is also what the msync() documentation claims.

As for weakening data integrity because of assumptions programmers *may*
have made, I think this is a bad argument which followed to its logical
conclusion can only lead to requiring an implicit sync() before and
after every system call. :-)

Asking programmers to use sync_file_range() is not a replacement for
improving msync() because msync() is POSIX and can be used everywhere.
sync_file_range() requires Linux specific code.

--
Knowledge Is Power
Power Corrupts
Study Hard
Be Evil


Attachments:
signature.asc (490.00 B)
This is a digitally signed message part

2012-06-14 08:57:38

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 1/2] msync: support syncing a small part of the file

Il 14/06/2012 00:08, Andrew Morton ha scritto:
>> > As an interested observer and a programmer who uses msync()...
>> >
>> > I never assumed msync() did the whole file.
> OK, that's one user accounted for. 3 million to go.
>
> Look, I'm not terribly averse to the change, but it does have this
> risk. And it's a nasty risk because anyone who is hit by it simply
> will not know that his applcation has lost some of its data integrity.

Sure, it reminds me of the recent bug caused by using memcpy with
overlapping areas.

But really, in this case the possibility for confusion is really small,
and would entirely be the fault of the programmer. There are argument
for addr and length, and the man page says: "To be more precise, the
part of the file that corresponds to the memory area starting at addr
and having length length is updated". Our implementation having been
suboptimal for a long time, is not a good reason to keep it suboptimal.

Anyway yes, there is an off-by-one.

Paolo

2012-06-14 09:02:09

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 2/2] msync: start async writeout when MS_ASYNC

Il 13/06/2012 23:29, Andrew Morton ha scritto:
> > If the application does not want to start I/O, it can simply call msync
> > with flags equal to MS_INVALIDATE. This one remains a no-op, as it should
> > be on a reasonable implementation.
>
> Means that people will find that their msync(MS_ASYNC) call will newly
> start IO. This may well be undesirable for some.

If they knew it was a no-op, and relying on it, they might as well not
have called it at all and save a syscall.

> Also, it hardwires into the kernel behaviour which userspace itself
> could have initiated, with sync_file_range(). ie: reduced flexibility.

No, it hardwires into the kernel behaviour which is mentioned in the
spec. Strictly speaking it is correct that an asynchronous operation
is never done (the two are indistinguishable), but still this patch
improves the fidelity to the specification and in a portable manner.

Paolo

2012-06-14 10:08:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/2] msync: start async writeout when MS_ASYNC

On Thu, 14 Jun 2012 11:02:02 +0200 Paolo Bonzini <[email protected]> wrote:

> Il 13/06/2012 23:29, Andrew Morton ha scritto:
> > > If the application does not want to start I/O, it can simply call msync
> > > with flags equal to MS_INVALIDATE. This one remains a no-op, as it should
> > > be on a reasonable implementation.
> >
> > Means that people will find that their msync(MS_ASYNC) call will newly
> > start IO. This may well be undesirable for some.
>
> If they knew it was a no-op, and relying on it, they might as well not
> have called it at all and save a syscall.

Nope. Back in the day (3+ years ago?), msync(MS_ASYNC) would flush
pte-dirty bits into page->flags:PG_Dirty. So it was used to make the
filesystem aware that userspace had modified some MAP_SHARED memory.
The fs would then perform writeback within (typically) 30 seconds. So
a legitimate use would be for the app to periodically run
msync(MS_ASYNC) to avoid having modified data floating about
un-written-back for arbitrarily long periods.

Nowadays we mark the pte's read-only and take a fault on first write,
and mark the page dirty at that time.

So it wasn't a no-op, and with this patch, such applications will newly
find themselves to be starting I/O within the msync() call.

2012-06-14 10:19:38

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 2/2] msync: start async writeout when MS_ASYNC

Il 14/06/2012 12:07, Andrew Morton ha scritto:
>> > If they knew it was a no-op, and relying on it, they might as well not
>> > have called it at all and save a syscall.
> Nope. Back in the day (3+ years ago?), msync(MS_ASYNC) would flush
> pte-dirty bits into page->flags:PG_Dirty. So it was used to make the
> filesystem aware that userspace had modified some MAP_SHARED memory.
> The fs would then perform writeback within (typically) 30 seconds. So
> a legitimate use would be for the app to periodically run
> msync(MS_ASYNC) to avoid having modified data floating about
> un-written-back for arbitrarily long periods.

This is in fact the same thing I would like to do in my application.
The fact that I/O is started immediately rather than within 30 seconds
is an implementation detail.

An application that calls msync(MS_ASYNC) periodically indeed wants to
start I/O _sooner or later_, it doesn't really matter when.

> Nowadays we mark the pte's read-only and take a fault on first write,
> and mark the page dirty at that time.

We mark pages dirty but we don't promise writing them back (sooner or
later we will, of course). Otherwise there would be no need for
sync_file_range(..., SYNC_FILE_RANGE_WRITE).

Paolo

2012-06-14 12:24:56

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 2/2] msync: start async writeout when MS_ASYNC

Il 13/06/2012 23:29, Andrew Morton ha scritto:
> Also, it hardwires into the kernel behaviour which userspace itself
> could have initiated, with sync_file_range(). ie: reduced flexibility.

Actually sync_file_range calls filemap_fdatawrite_range with
WB_SYNC_ALL, hence it does the writeout synchronously and can block for
an extended period of time. Hence sync_file_range is more similar to
MS_SYNC except without the metadata write.

Instead, this patch uses WB_SYNC_NONE, while still using tag-and-write
to avoid livelock.

I'll wait a day or two to let others voice their opinions, and then send
a fixed v2 with updated commit messages from this discussion.

Paolo