From: Anton Salikhmetov <[email protected]>
Due to the lack of reaction in LKML I presume the message was lost
in the high traffic of that list. Resending it now with the addressee changed
to the memory management mailing list.
I would like to propose my solution for the bug #2645 from the kernel bug tracker:
http://bugzilla.kernel.org/show_bug.cgi?id=2645
The Open Group defines the behavior of the mmap() function as follows.
The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
and PROT_WRITE shall be marked for update at some point in the interval
between a write reference to the mapped region and the next call to msync()
with MS_ASYNC or MS_SYNC for that portion of the file by any process.
If there is no such call and if the underlying file is modified as a result
of a write reference, then these fields shall be marked for update at some
time after the write reference.
The above citation was taken from the following link:
http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html
Therefore, the msync() function should be called before verifying the time
stamps st_mtime and st_ctime in the test program Badari wrote in the context
of the bug #2645. Otherwise, the time stamps may be updated
at some unspecified moment according to the POSIX standard.
I changed his test program a little. The changed unit test can be downloaded
using the following link:
http://pygx.sourceforge.net/mmap.c
This program showed that the msync() function had a bug:
it did not update the st_mtime and st_ctime fields.
The program shows appropriate behavior of the msync()
function using the kernel with the proposed patch applied.
Specifically, the ctime and mtime time stamps do change
when modifying the mapped memory and do not change when
there have been no write references between the mmap()
and msync() system calls.
Additionally, the test cases for the msync() system call from
the LTP test suite (msync01 - msync05, mmapstress01, mmapstress09,
and mmapstress10) successfully passed using the kernel
with the patch included into this email.
The patch adds a call to the file_update_time() function to change
the file metadata before syncing. The patch also contains
substantial code cleanup: consolidated error check
for function parameters, using the PAGE_ALIGN() macro instead of
"manual" alignment, improved readability of the loop,
which traverses the process memory regions, updated comments.
Signed-off-by: Anton Salikhmetov <[email protected]>
---
diff --git a/mm/msync.c b/mm/msync.c
index 144a757..cb973eb 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,26 +1,32 @@
/*
* linux/mm/msync.c
*
+ * The msync() system call.
* Copyright (C) 1994-1999 Linus Torvalds
+ *
+ * Updating the mtime and ctime stamps for mapped files
+ * and code cleanup.
+ * Copyright (C) 2008 Anton Salikhmetov <[email protected]>
*/
-/*
- * The msync() system call.
- */
+#include <linux/file.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mman.h>
-#include <linux/file.h>
-#include <linux/syscalls.h>
#include <linux/sched.h>
+#include <linux/syscalls.h>
/*
* MS_SYNC syncs the entire file - including mappings.
*
* MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
* Now it doesn't do anything, since dirty pages are properly tracked.
*
+ * The msync() system call updates the ctime and mtime fields for
+ * the mapped file when called with the MS_SYNC or MS_ASYNC flags
+ * according to the POSIX standard.
+ *
* The application may now run fsync() to
* write out the dirty pages and wait on the writeout and check the result.
* Or the application may run fadvise(FADV_DONTNEED) against the fd to start
@@ -33,70 +39,68 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
unsigned long end;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
- int unmapped_error = 0;
- int error = -EINVAL;
+ int error = 0, unmapped_error = 0;
- if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
- goto out;
- if (start & ~PAGE_MASK)
+ if ((flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC)) ||
+ (start & ~PAGE_MASK) ||
+ ((flags & MS_ASYNC) && (flags & MS_SYNC))) {
+ error = -EINVAL;
goto out;
- if ((flags & MS_ASYNC) && (flags & MS_SYNC))
- goto out;
- error = -ENOMEM;
- len = (len + ~PAGE_MASK) & PAGE_MASK;
+ }
+
+ len = PAGE_ALIGN(len);
end = start + len;
- if (end < start)
+ if (end < start) {
+ error = -ENOMEM;
goto out;
- error = 0;
+ }
if (end == start)
goto out;
+
/*
* If the interval [start,end) covers some unmapped address ranges,
* just ignore them, but return -ENOMEM at the end.
*/
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
- for (;;) {
+ do {
struct file *file;
- /* Still start < end. */
- error = -ENOMEM;
- if (!vma)
- goto out_unlock;
- /* Here start < vma->vm_end. */
+ if (!vma) {
+ error = -ENOMEM;
+ break;
+ }
if (start < vma->vm_start) {
start = vma->vm_start;
- if (start >= end)
- goto out_unlock;
+ if (start >= end) {
+ error = -ENOMEM;
+ break;
+ }
unmapped_error = -ENOMEM;
}
- /* Here vma->vm_start <= start < vma->vm_end. */
if ((flags & MS_INVALIDATE) &&
(vma->vm_flags & VM_LOCKED)) {
error = -EBUSY;
- goto out_unlock;
+ break;
}
file = vma->vm_file;
- start = vma->vm_end;
- if ((flags & MS_SYNC) && file &&
- (vma->vm_flags & VM_SHARED)) {
+ if (file && (vma->vm_flags & VM_SHARED)) {
get_file(file);
- up_read(&mm->mmap_sem);
- error = do_fsync(file, 0);
- fput(file);
- if (error || start >= end)
- goto out;
- down_read(&mm->mmap_sem);
- vma = find_vma(mm, start);
- } else {
- if (start >= end) {
- error = 0;
- goto out_unlock;
+ if (file->f_mapping->host->i_state & I_DIRTY_PAGES)
+ file_update_time(file);
+ if (flags & MS_SYNC) {
+ up_read(&mm->mmap_sem);
+ error = do_fsync(file, 0);
+ down_read(&mm->mmap_sem);
}
- vma = vma->vm_next;
+ fput(file);
+ if (error)
+ break;
}
- }
-out_unlock:
+
+ start = vma->vm_end;
+ vma = vma->vm_next;
+ } while (start < end);
up_read(&mm->mmap_sem);
out:
return error ? : unmapped_error;
Since no reaction in LKML was recieved for this message it seemed
logical to suggest closing the bug #2645 as "WONTFIX":
http://bugzilla.kernel.org/show_bug.cgi?id=2645#c15
However, the reporter of the bug, Jacob Oestergaard, insisted the
solution to be resubmitted once more:
>>>
Please re-submit to LKML.
This bug causes backup systems to *miss* changed files.
This bug does cause data loss in common real-world deployments (I gave an
example with a database when posting the bug, but this affects the data from
all mmap using applications with common backup systems).
Silent exclusion from backups is very very nasty.
<<<
Please comment on my solution or commit it if it's acceptable in its
present form.
2008/1/7, Anton Salikhmetov <[email protected]>:
> From: Anton Salikhmetov <[email protected]>
>
> Due to the lack of reaction in LKML I presume the message was lost
> in the high traffic of that list. Resending it now with the addressee changed
> to the memory management mailing list.
>
> I would like to propose my solution for the bug #2645 from the kernel bug tracker:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=2645
>
> The Open Group defines the behavior of the mmap() function as follows.
>
> The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
> and PROT_WRITE shall be marked for update at some point in the interval
> between a write reference to the mapped region and the next call to msync()
> with MS_ASYNC or MS_SYNC for that portion of the file by any process.
> If there is no such call and if the underlying file is modified as a result
> of a write reference, then these fields shall be marked for update at some
> time after the write reference.
>
> The above citation was taken from the following link:
>
> http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html
>
> Therefore, the msync() function should be called before verifying the time
> stamps st_mtime and st_ctime in the test program Badari wrote in the context
> of the bug #2645. Otherwise, the time stamps may be updated
> at some unspecified moment according to the POSIX standard.
>
> I changed his test program a little. The changed unit test can be downloaded
> using the following link:
>
> http://pygx.sourceforge.net/mmap.c
>
> This program showed that the msync() function had a bug:
> it did not update the st_mtime and st_ctime fields.
>
> The program shows appropriate behavior of the msync()
> function using the kernel with the proposed patch applied.
> Specifically, the ctime and mtime time stamps do change
> when modifying the mapped memory and do not change when
> there have been no write references between the mmap()
> and msync() system calls.
>
> Additionally, the test cases for the msync() system call from
> the LTP test suite (msync01 - msync05, mmapstress01, mmapstress09,
> and mmapstress10) successfully passed using the kernel
> with the patch included into this email.
>
> The patch adds a call to the file_update_time() function to change
> the file metadata before syncing. The patch also contains
> substantial code cleanup: consolidated error check
> for function parameters, using the PAGE_ALIGN() macro instead of
> "manual" alignment, improved readability of the loop,
> which traverses the process memory regions, updated comments.
>
> Signed-off-by: Anton Salikhmetov <[email protected]>
>
> ---
>
> diff --git a/mm/msync.c b/mm/msync.c
> index 144a757..cb973eb 100644
> --- a/mm/msync.c
> +++ b/mm/msync.c
> @@ -1,26 +1,32 @@
> /*
> * linux/mm/msync.c
> *
> + * The msync() system call.
> * Copyright (C) 1994-1999 Linus Torvalds
> + *
> + * Updating the mtime and ctime stamps for mapped files
> + * and code cleanup.
> + * Copyright (C) 2008 Anton Salikhmetov <[email protected]>
> */
>
> -/*
> - * The msync() system call.
> - */
> +#include <linux/file.h>
> #include <linux/fs.h>
> #include <linux/mm.h>
> #include <linux/mman.h>
> -#include <linux/file.h>
> -#include <linux/syscalls.h>
> #include <linux/sched.h>
> +#include <linux/syscalls.h>
>
> /*
> * MS_SYNC syncs the entire file - including mappings.
> *
> * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> - * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
> + * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
> * Now it doesn't do anything, since dirty pages are properly tracked.
> *
> + * The msync() system call updates the ctime and mtime fields for
> + * the mapped file when called with the MS_SYNC or MS_ASYNC flags
> + * according to the POSIX standard.
> + *
> * The application may now run fsync() to
> * write out the dirty pages and wait on the writeout and check the result.
> * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
> @@ -33,70 +39,68 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
> unsigned long end;
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma;
> - int unmapped_error = 0;
> - int error = -EINVAL;
> + int error = 0, unmapped_error = 0;
>
> - if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
> - goto out;
> - if (start & ~PAGE_MASK)
> + if ((flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC)) ||
> + (start & ~PAGE_MASK) ||
> + ((flags & MS_ASYNC) && (flags & MS_SYNC))) {
> + error = -EINVAL;
> goto out;
> - if ((flags & MS_ASYNC) && (flags & MS_SYNC))
> - goto out;
> - error = -ENOMEM;
> - len = (len + ~PAGE_MASK) & PAGE_MASK;
> + }
> +
> + len = PAGE_ALIGN(len);
> end = start + len;
> - if (end < start)
> + if (end < start) {
> + error = -ENOMEM;
> goto out;
> - error = 0;
> + }
> if (end == start)
> goto out;
> +
> /*
> * If the interval [start,end) covers some unmapped address ranges,
> * just ignore them, but return -ENOMEM at the end.
> */
> down_read(&mm->mmap_sem);
> vma = find_vma(mm, start);
> - for (;;) {
> + do {
> struct file *file;
>
> - /* Still start < end. */
> - error = -ENOMEM;
> - if (!vma)
> - goto out_unlock;
> - /* Here start < vma->vm_end. */
> + if (!vma) {
> + error = -ENOMEM;
> + break;
> + }
> if (start < vma->vm_start) {
> start = vma->vm_start;
> - if (start >= end)
> - goto out_unlock;
> + if (start >= end) {
> + error = -ENOMEM;
> + break;
> + }
> unmapped_error = -ENOMEM;
> }
> - /* Here vma->vm_start <= start < vma->vm_end. */
> if ((flags & MS_INVALIDATE) &&
> (vma->vm_flags & VM_LOCKED)) {
> error = -EBUSY;
> - goto out_unlock;
> + break;
> }
> file = vma->vm_file;
> - start = vma->vm_end;
> - if ((flags & MS_SYNC) && file &&
> - (vma->vm_flags & VM_SHARED)) {
> + if (file && (vma->vm_flags & VM_SHARED)) {
> get_file(file);
> - up_read(&mm->mmap_sem);
> - error = do_fsync(file, 0);
> - fput(file);
> - if (error || start >= end)
> - goto out;
> - down_read(&mm->mmap_sem);
> - vma = find_vma(mm, start);
> - } else {
> - if (start >= end) {
> - error = 0;
> - goto out_unlock;
> + if (file->f_mapping->host->i_state & I_DIRTY_PAGES)
> + file_update_time(file);
> + if (flags & MS_SYNC) {
> + up_read(&mm->mmap_sem);
> + error = do_fsync(file, 0);
> + down_read(&mm->mmap_sem);
> }
> - vma = vma->vm_next;
> + fput(file);
> + if (error)
> + break;
> }
> - }
> -out_unlock:
> +
> + start = vma->vm_end;
> + vma = vma->vm_next;
> + } while (start < end);
> up_read(&mm->mmap_sem);
> out:
> return error ? : unmapped_error;
>
>
On Wed, Jan 09, 2008 at 02:32:53PM +0300, Anton Salikhmetov wrote:
> Since no reaction in LKML was recieved for this message it seemed
> logical to suggest closing the bug #2645 as "WONTFIX":
>
> http://bugzilla.kernel.org/show_bug.cgi?id=2645#c15
Thank you!
A quick run-down for those who don't know what this is about:
Some applications use mmap() to modify files. Common examples are databases.
Linux does not update the mtime of files that are modified using mmap, even if
msync() is called.
This is very clearly against OpenGroup specifications.
This misfeatures causes such files to be silently *excluded* from normal backup
runs.
Solaris implements this properly.
NT has the same bug as Linux, using their private bastardisation of the mmap
interface - but since they don't care about SuS and are broken in so many other
ways, that really doesn't matter.
So, dear kernel developers, can we please integrate this patch to make Linux
stop silently excluding peoples databases from their backup?
--
/ jakob
On Wed, Jan 09, 2008 at 02:32:53PM +0300, Anton Salikhmetov wrote:
...
>
> This bug causes backup systems to *miss* changed files.
>
This problem is seen with both Amanda and TSM (Tivoli Storage Manager).
A site running Amanda with, say, a full backup weekly and incremental backups
daily, will only get weekly backups of their mmap modified databases.
However, large sites running TSM will be hit even harder by this because TSM
will always perform incremental backups from the client (managing which
versions to keep for how long on the server side). TSM will *never* again take
a backup of the mmap modified database.
The really nasty part is; nothing is failing. Everything *appears* to work.
Your data is just not backed up because it appears to be untouched.
So, if you run TSM (or pretty much any other backup solution actually) on
Linux, maybe you should run a
find / -type f -print0 | xargs -0 touch
before starting your backup job. Sort of removes the point of using proper
backup software, but at least you'll get your data backed up.
--
/ jakob
On 09/01/2008, Anton Salikhmetov <[email protected]> wrote:
> Since no reaction in LKML was recieved for this message it seemed
> logical to suggest closing the bug #2645 as "WONTFIX":
>
> http://bugzilla.kernel.org/show_bug.cgi?id=2645#c15
>
> However, the reporter of the bug, Jacob Oestergaard, insisted the
> solution to be resubmitted once more:
>
Good idea. The bug is real and should be fixed IMHO.
...
> This bug causes backup systems to *miss* changed files.
>
> This bug does cause data loss in common real-world deployments (I gave an
> example with a database when posting the bug, but this affects the data from
> all mmap using applications with common backup systems).
>
Not just backup systems, but any application that relies on mtime
being correctly updated will be bitten by this.
> Silent exclusion from backups is very very nasty.
>
Agreed.
In fact if mtime is not reliable (which it is not) one could argue
that we might as well not update it at all, ever. But I think we can
all agree that just fixing it (as your patch does) is a lot better.
> Please comment on my solution or commit it if it's acceptable in its
> present form.
>
I've only looked briefly at your patch but it seems resonable. I'll
try to do some testing with it later.
Thank you for working on this long standing bug.
...
> > I would like to propose my solution for the bug #2645 from the kernel bug tracker:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645
> >
> > The Open Group defines the behavior of the mmap() function as follows.
> >
> > The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
> > and PROT_WRITE shall be marked for update at some point in the interval
> > between a write reference to the mapped region and the next call to msync()
> > with MS_ASYNC or MS_SYNC for that portion of the file by any process.
> > If there is no such call and if the underlying file is modified as a result
> > of a write reference, then these fields shall be marked for update at some
> > time after the write reference.
> >
> > The above citation was taken from the following link:
> >
> > http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html
> >
...
I agree that our current behaviour is certainly not what the standard
(sensibly) requires.
--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
2008/1/9, Jesper Juhl <[email protected]>:
> I've only looked briefly at your patch but it seems resonable. I'll
> try to do some testing with it later.
Jesper, thank you very much for your answer!
In fact, I tested my change quite extensively using test cases for the
mmap() and msync() system calls from the LTP test suite. Please note
that I did mention that in my previous message:
>>>
Additionally, the test cases for the msync() system call from
the LTP test suite (msync01 - msync05, mmapstress01, mmapstress09,
and mmapstress10) successfully passed using the kernel
with the patch included into this email.
<<<
On Mon, 07 Jan 2008 20:54:19 +0300
Anton Salikhmetov <[email protected]> wrote:
> This program showed that the msync() function had a bug:
> it did not update the st_mtime and st_ctime fields.
>
> The program shows appropriate behavior of the msync()
> function using the kernel with the proposed patch applied.
> Specifically, the ctime and mtime time stamps do change
> when modifying the mapped memory and do not change when
> there have been no write references between the mmap()
> and msync() system calls.
As long as the ctime and mtime stamps change when the memory is
written to, what exactly is the problem?
Is it that the ctime and mtime does not change again when the memory
is written to again?
Is there a way for backup programs to miss file modification times?
Could you explain (using short words and simple sentences) what the
exact problem is?
Eg.
1) program mmaps file
2) program writes to mmaped area
3) ??? <=== this part, in equally simple words :)
4) data loss
An explanation like that will help people understand exactly what the
bug is, and why the patch should be applied ASAP.
> The patch adds a call to the file_update_time() function to change
> the file metadata before syncing. The patch also contains
> substantial code cleanup: consolidated error check
> for function parameters, using the PAGE_ALIGN() macro instead of
> "manual" alignment, improved readability of the loop,
> which traverses the process memory regions, updated comments.
Due to the various cleanups all being in one file, it took me a while
to understand the patch. In an area of code this subtle, it may be
better to submit the cleanups in (a) separate patch(es) from the patch
that adds the call to file_update_time().
> - (vma->vm_flags & VM_SHARED)) {
> + if (file && (vma->vm_flags & VM_SHARED)) {
> get_file(file);
> - up_read(&mm->mmap_sem);
> - error = do_fsync(file, 0);
> - fput(file);
> - if (error || start >= end)
> - goto out;
> - down_read(&mm->mmap_sem);
> - vma = find_vma(mm, start);
> - } else {
> - if (start >= end) {
> - error = 0;
> - goto out_unlock;
> + if (file->f_mapping->host->i_state & I_DIRTY_PAGES)
> + file_update_time(file);
I wonder if calling file_update_time() from inside the loop is the
best idea. Why not call that function just once, after msync breaks
from the loop?
thanks,
Rik
--
All Rights Reversed
On Wed, 09 Jan 2008 15:50:15 EST, Rik van Riel said:
> Could you explain (using short words and simple sentences) what the
> exact problem is?
>
> Eg.
>
> 1) program mmaps file
> 2) program writes to mmaped area
> 3) ??? <=== this part, in equally simple words :)
> 4) data loss
It's like this:
Monday 9:04AM: System boots, database server starts up, mmaps file
Monday 9:06AM: Database server writes to mmap area, updates mtime/ctime
Monday <many times> Database server writes to mmap area, no further update..
Monday 11:45PM: Backup sees "file modified 9:06AM, let's back it up"
Tuesday 9:00AM-5:00PM: Database server touches it another 5,398 times, no mtime
Tuesday 11:45PM: Backup sees "file modified back on Monday, we backed this up..
Wed 9:00AM-5:00PM: More updates, more not touching the mtime
Wed 11:45PM: *yawn* It hasn't been touched in 2 days, no sense in backing it up..
Lather, rinse, repeat....
Anton Salikhmetov wrote:
> From: Anton Salikhmetov <[email protected]>
>
> I would like to propose my solution for the bug #2645 from the kernel
bug tracker:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=2645
>
> The Open Group defines the behavior of the mmap() function as follows.
>
> The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
> and PROT_WRITE shall be marked for update at some point in the interval
> between a write reference to the mapped region and the next call to
msync()
> with MS_ASYNC or MS_SYNC for that portion of the file by any process.
> If there is no such call and if the underlying file is modified as a
result
> of a write reference, then these fields shall be marked for update at
some
> time after the write reference.
>
> The above citation was taken from the following link:
>
> http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html
>
> Therefore, the msync() function should be called before verifying the
time
> stamps st_mtime and st_ctime in the test program Badari wrote in the
context
> of the bug #2645. Otherwise, the time stamps may be updated
> at some unspecified moment according to the POSIX standard.
>
> I changed his test program a little. The changed unit test can be
downloaded
> using the following link:
>
> http://pygx.sourceforge.net/mmap.c
>
> This program showed that the msync() function had a bug:
> it did not update the st_mtime and st_ctime fields.
>
> The program shows the appropriate behavior of the msync()
> function using the kernel with the proposed patch applied.
> Specifically, the ctime and mtime time stamps do change
> when modifying the mapped memory and do not change when
> there have been no write references between the mmap()
> and msync() system calls.
>
>
Sorry, I don't see where the test program shows that the file
times did not change if there had not been an intervening
modification to the mmap'd region. It appears to me that it
just shows the file times changing or not when there has been
intervening modification after the mmap call and before the
fstat call.
Or am I looking in the wrong place? :-)
> Additionally, the test cases for the msync() system call from
> the LTP test suite (msync01 - msync05, mmapstress01, mmapstress09,
> and mmapstress10) successfully passed using the kernel
> with the patch included into this email.
>
> The patch adds a call to the file_update_time() function to change
> the file metadata before syncing. The patch also contains
> substantial code cleanup: consolidated error check
> for function parameters, using the PAGE_ALIGN() macro instead of
> "manual" alignment check, improved readability of the loop,
> which traverses the process memory regions, updated comments.
>
>
These changes catch the simple case, where the file is mmap'd,
modified via the mmap'd region, and then an msync is done,
all on a mostly quiet system.
However, I don't see how they will work if there has been
something like a sync(2) done after the mmap'd region is
modified and the msync call. When the inode is written out
as part of the sync process, I_DIRTY_PAGES will be cleared,
thus causing a miss in this code.
The I_DIRTY_PAGES check here is good, but I think that there
needs to be some code elsewhere too, to catch the case where
I_DIRTY_PAGES is being cleared, but the time fields still need
to be updated.
--
A better architecture would be to arrange for the file times
to be updated when the page makes the transition from being
unmodified to modified. This is not straightforward due to
the current locking, but should be doable, I think. Perhaps
recording the current time and then using it to update the
file times at a more suitable time (no pun intended) might
work.
Thanx...
ps
> Signed-off-by: Anton Salikhmetov <[email protected]>
>
> ---
>
> diff --git a/mm/msync.c b/mm/msync.c
> index 144a757..cb973eb 100644
> --- a/mm/msync.c
> +++ b/mm/msync.c
> @@ -1,26 +1,32 @@
> /*
> * linux/mm/msync.c
> *
> + * The msync() system call.
> * Copyright (C) 1994-1999 Linus Torvalds
> + *
> + * Updating the mtime and ctime stamps for mapped files
> + * and code cleanup.
> + * Copyright (C) 2008 Anton Salikhmetov <[email protected]>
> */
>
> -/*
> - * The msync() system call.
> - */
> +#include <linux/file.h>
> #include <linux/fs.h>
> #include <linux/mm.h>
> #include <linux/mman.h>
> -#include <linux/file.h>
> -#include <linux/syscalls.h>
> #include <linux/sched.h>
> +#include <linux/syscalls.h>
>
> /*
> * MS_SYNC syncs the entire file - including mappings.
> *
> * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> - * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
> + * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
> * Now it doesn't do anything, since dirty pages are properly tracked.
> *
> + * The msync() system call updates the ctime and mtime fields for
> + * the mapped file when called with the MS_SYNC or MS_ASYNC flags
> + * according to the POSIX standard.
> + *
> * The application may now run fsync() to
> * write out the dirty pages and wait on the writeout and check the
result.
> * Or the application may run fadvise(FADV_DONTNEED) against the fd
to start
> @@ -33,70 +39,68 @@ asmlinkage long sys_msync(unsigned long start,
size_t len, int flags)
> unsigned long end;
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma;
> - int unmapped_error = 0;
> - int error = -EINVAL;
> + int error = 0, unmapped_error = 0;
>
> - if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
> - goto out;
> - if (start & ~PAGE_MASK)
> + if ((flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC)) ||
> + (start & ~PAGE_MASK) ||
> + ((flags & MS_ASYNC) && (flags & MS_SYNC))) {
> + error = -EINVAL;
> goto out;
> - if ((flags & MS_ASYNC) && (flags & MS_SYNC))
> - goto out;
> - error = -ENOMEM;
> - len = (len + ~PAGE_MASK) & PAGE_MASK;
> + }
> +
> + len = PAGE_ALIGN(len);
> end = start + len;
> - if (end < start)
> + if (end < start) {
> + error = -ENOMEM;
> goto out;
> - error = 0;
> + }
> if (end == start)
> goto out;
> +
> /*
> * If the interval [start,end) covers some unmapped address ranges,
> * just ignore them, but return -ENOMEM at the end.
> */
> down_read(&mm->mmap_sem);
> vma = find_vma(mm, start);
> - for (;;) {
> + do {
> struct file *file;
>
> - /* Still start < end. */
> - error = -ENOMEM;
> - if (!vma)
> - goto out_unlock;
> - /* Here start < vma->vm_end. */
> + if (!vma) {
> + error = -ENOMEM;
> + break;
> + }
> if (start < vma->vm_start) {
> start = vma->vm_start;
> - if (start >= end)
> - goto out_unlock;
> + if (start >= end) {
> + error = -ENOMEM;
> + break;
> + }
> unmapped_error = -ENOMEM;
> }
> - /* Here vma->vm_start <= start < vma->vm_end. */
> if ((flags & MS_INVALIDATE) &&
> (vma->vm_flags & VM_LOCKED)) {
> error = -EBUSY;
> - goto out_unlock;
> + break;
> }
> file = vma->vm_file;
> - start = vma->vm_end;
> - if ((flags & MS_SYNC) && file &&
> - (vma->vm_flags & VM_SHARED)) {
> + if (file && (vma->vm_flags & VM_SHARED)) {
> get_file(file);
> - up_read(&mm->mmap_sem);
> - error = do_fsync(file, 0);
> - fput(file);
> - if (error || start >= end)
> - goto out;
> - down_read(&mm->mmap_sem);
> - vma = find_vma(mm, start);
> - } else {
> - if (start >= end) {
> - error = 0;
> - goto out_unlock;
> + if (file->f_mapping->host->i_state & I_DIRTY_PAGES)
> + file_update_time(file);
> + if (flags & MS_SYNC) {
> + up_read(&mm->mmap_sem);
> + error = do_fsync(file, 0);
> + down_read(&mm->mmap_sem);
> }
> - vma = vma->vm_next;
> + fput(file);
> + if (error)
> + break;
> }
> - }
> -out_unlock:
> +
> + start = vma->vm_end;
> + vma = vma->vm_next;
> + } while (start < end);
> up_read(&mm->mmap_sem);
> out:
> return error ? : unmapped_error;
>
> --
> To unsubscribe from this list: send the line "unsubscribe
linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
On Wed, Jan 09, 2008 at 15:50:15 -0500, Rik van Riel wrote:
> > Specifically, the ctime and mtime time stamps do change
> > when modifying the mapped memory and do not change when
> > there have been no write references between the mmap()
> > and msync() system calls.
>
> As long as the ctime and mtime stamps change when the memory is
> written to, what exactly is the problem?
A "not" is missing from the sentence above. The quote above should have
read:
> > Specifically, the ctime and mtime time stamps do _not_ change when
> > modifying the mapped memory and do not change when there have been
> > no write references between the mmap() and msync() system calls.
So essentially the problem is that mtime stamps are _never_ changed when
the file is only modified through mmap. Not even when calling msync().
--
Kind regards,
Klaus S. Madsen
Anton Salikhmetov wrote:
> Since no reaction in LKML was recieved for this message it seemed
> logical to suggest closing the bug #2645 as "WONTFIX":
>
> http://bugzilla.kernel.org/show_bug.cgi?id=2645#c15
>
> However, the reporter of the bug, Jacob Oestergaard, insisted the
> solution to be resubmitted once more:
>
>
>
> Please re-submit to LKML.
>
>
Yes, please!
Let's have the right discussion and get this bug addressed for real.
It is a real bug and is causing data corruption for some very large
Red Hat customers because their applications were architected to
use mmap, but their backups are not backing up the modified files
due to this aspect of the system.
This is the 4'th or 5'th attempt in the last 2 years to submit a
patch to address this situation. None have been able to make it
all of the way through the process and to be integrated.
I posted some comments.
Thanx...
ps
> This bug causes backup systems to *miss* changed files.
>
> This bug does cause data loss in common real-world deployments (I gave an
> example with a database when posting the bug, but this affects the data from
> all mmap using applications with common backup systems).
>
> Silent exclusion from backups is very very nasty.
>
> <<<
>
> Please comment on my solution or commit it if it's acceptable in its
> present form.
>
> 2008/1/7, Anton Salikhmetov <[email protected]>:
>
>> From: Anton Salikhmetov <[email protected]>
>>
>> Due to the lack of reaction in LKML I presume the message was lost
>> in the high traffic of that list. Resending it now with the addressee changed
>> to the memory management mailing list.
>>
>> I would like to propose my solution for the bug #2645 from the kernel bug tracker:
>>
>> http://bugzilla.kernel.org/show_bug.cgi?id=2645
>>
>> The Open Group defines the behavior of the mmap() function as follows.
>>
>> The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
>> and PROT_WRITE shall be marked for update at some point in the interval
>> between a write reference to the mapped region and the next call to msync()
>> with MS_ASYNC or MS_SYNC for that portion of the file by any process.
>> If there is no such call and if the underlying file is modified as a result
>> of a write reference, then these fields shall be marked for update at some
>> time after the write reference.
>>
>> The above citation was taken from the following link:
>>
>> http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html
>>
>> Therefore, the msync() function should be called before verifying the time
>> stamps st_mtime and st_ctime in the test program Badari wrote in the context
>> of the bug #2645. Otherwise, the time stamps may be updated
>> at some unspecified moment according to the POSIX standard.
>>
>> I changed his test program a little. The changed unit test can be downloaded
>> using the following link:
>>
>> http://pygx.sourceforge.net/mmap.c
>>
>> This program showed that the msync() function had a bug:
>> it did not update the st_mtime and st_ctime fields.
>>
>> The program shows appropriate behavior of the msync()
>> function using the kernel with the proposed patch applied.
>> Specifically, the ctime and mtime time stamps do change
>> when modifying the mapped memory and do not change when
>> there have been no write references between the mmap()
>> and msync() system calls.
>>
>> Additionally, the test cases for the msync() system call from
>> the LTP test suite (msync01 - msync05, mmapstress01, mmapstress09,
>> and mmapstress10) successfully passed using the kernel
>> with the patch included into this email.
>>
>> The patch adds a call to the file_update_time() function to change
>> the file metadata before syncing. The patch also contains
>> substantial code cleanup: consolidated error check
>> for function parameters, using the PAGE_ALIGN() macro instead of
>> "manual" alignment, improved readability of the loop,
>> which traverses the process memory regions, updated comments.
>>
>> Signed-off-by: Anton Salikhmetov <[email protected]>
>>
>> ---
>>
>> diff --git a/mm/msync.c b/mm/msync.c
>> index 144a757..cb973eb 100644
>> --- a/mm/msync.c
>> +++ b/mm/msync.c
>> @@ -1,26 +1,32 @@
>> /*
>> * linux/mm/msync.c
>> *
>> + * The msync() system call.
>> * Copyright (C) 1994-1999 Linus Torvalds
>> + *
>> + * Updating the mtime and ctime stamps for mapped files
>> + * and code cleanup.
>> + * Copyright (C) 2008 Anton Salikhmetov <[email protected]>
>> */
>>
>> -/*
>> - * The msync() system call.
>> - */
>> +#include <linux/file.h>
>> #include <linux/fs.h>
>> #include <linux/mm.h>
>> #include <linux/mman.h>
>> -#include <linux/file.h>
>> -#include <linux/syscalls.h>
>> #include <linux/sched.h>
>> +#include <linux/syscalls.h>
>>
>> /*
>> * MS_SYNC syncs the entire file - including mappings.
>> *
>> * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
>> - * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
>> + * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
>> * Now it doesn't do anything, since dirty pages are properly tracked.
>> *
>> + * The msync() system call updates the ctime and mtime fields for
>> + * the mapped file when called with the MS_SYNC or MS_ASYNC flags
>> + * according to the POSIX standard.
>> + *
>> * The application may now run fsync() to
>> * write out the dirty pages and wait on the writeout and check the result.
>> * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
>> @@ -33,70 +39,68 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
>> unsigned long end;
>> struct mm_struct *mm = current->mm;
>> struct vm_area_struct *vma;
>> - int unmapped_error = 0;
>> - int error = -EINVAL;
>> + int error = 0, unmapped_error = 0;
>>
>> - if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
>> - goto out;
>> - if (start & ~PAGE_MASK)
>> + if ((flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC)) ||
>> + (start & ~PAGE_MASK) ||
>> + ((flags & MS_ASYNC) && (flags & MS_SYNC))) {
>> + error = -EINVAL;
>> goto out;
>> - if ((flags & MS_ASYNC) && (flags & MS_SYNC))
>> - goto out;
>> - error = -ENOMEM;
>> - len = (len + ~PAGE_MASK) & PAGE_MASK;
>> + }
>> +
>> + len = PAGE_ALIGN(len);
>> end = start + len;
>> - if (end < start)
>> + if (end < start) {
>> + error = -ENOMEM;
>> goto out;
>> - error = 0;
>> + }
>> if (end == start)
>> goto out;
>> +
>> /*
>> * If the interval [start,end) covers some unmapped address ranges,
>> * just ignore them, but return -ENOMEM at the end.
>> */
>> down_read(&mm->mmap_sem);
>> vma = find_vma(mm, start);
>> - for (;;) {
>> + do {
>> struct file *file;
>>
>> - /* Still start < end. */
>> - error = -ENOMEM;
>> - if (!vma)
>> - goto out_unlock;
>> - /* Here start < vma->vm_end. */
>> + if (!vma) {
>> + error = -ENOMEM;
>> + break;
>> + }
>> if (start < vma->vm_start) {
>> start = vma->vm_start;
>> - if (start >= end)
>> - goto out_unlock;
>> + if (start >= end) {
>> + error = -ENOMEM;
>> + break;
>> + }
>> unmapped_error = -ENOMEM;
>> }
>> - /* Here vma->vm_start <= start < vma->vm_end. */
>> if ((flags & MS_INVALIDATE) &&
>> (vma->vm_flags & VM_LOCKED)) {
>> error = -EBUSY;
>> - goto out_unlock;
>> + break;
>> }
>> file = vma->vm_file;
>> - start = vma->vm_end;
>> - if ((flags & MS_SYNC) && file &&
>> - (vma->vm_flags & VM_SHARED)) {
>> + if (file && (vma->vm_flags & VM_SHARED)) {
>> get_file(file);
>> - up_read(&mm->mmap_sem);
>> - error = do_fsync(file, 0);
>> - fput(file);
>> - if (error || start >= end)
>> - goto out;
>> - down_read(&mm->mmap_sem);
>> - vma = find_vma(mm, start);
>> - } else {
>> - if (start >= end) {
>> - error = 0;
>> - goto out_unlock;
>> + if (file->f_mapping->host->i_state & I_DIRTY_PAGES)
>> + file_update_time(file);
>> + if (flags & MS_SYNC) {
>> + up_read(&mm->mmap_sem);
>> + error = do_fsync(file, 0);
>> + down_read(&mm->mmap_sem);
>> }
>> - vma = vma->vm_next;
>> + fput(file);
>> + if (error)
>> + break;
>> }
>> - }
>> -out_unlock:
>> +
>> + start = vma->vm_end;
>> + vma = vma->vm_next;
>> + } while (start < end);
>> up_read(&mm->mmap_sem);
>> out:
>> return error ? : unmapped_error;
>>
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
On Wed, 09 Jan 2008 16:06:17 -0500
[email protected] wrote:
> On Wed, 09 Jan 2008 15:50:15 EST, Rik van Riel said:
>
> > Could you explain (using short words and simple sentences) what the
> > exact problem is?
>
> It's like this:
>
> Monday 9:04AM: System boots, database server starts up, mmaps file
> Monday 9:06AM: Database server writes to mmap area, updates mtime/ctime
> Monday <many times> Database server writes to mmap area, no further update..
> Monday 11:45PM: Backup sees "file modified 9:06AM, let's back it up"
> Tuesday 9:00AM-5:00PM: Database server touches it another 5,398 times, no mtime
> Tuesday 11:45PM: Backup sees "file modified back on Monday, we backed this up..
> Wed 9:00AM-5:00PM: More updates, more not touching the mtime
> Wed 11:45PM: *yawn* It hasn't been touched in 2 days, no sense in backing it up..
>
> Lather, rinse, repeat....
On the other hand, updating the mtime and ctime whenever a page is dirtied
also does not work right. Apparently that can break mutt.
Calling msync() every once in a while with Anton's patch does not look like a
fool proof method to me either, because the VM can write all the dirty pages
to disk by itself, leaving nothing for msync() to detect. (I think...)
Can we get by with simply updating the ctime and mtime every time msync()
is called, regardless of whether or not the mmaped pages were still dirty
by the time we called msync() ?
--
All Rights Reversed
Rik van Riel wrote:
> On Wed, 09 Jan 2008 16:06:17 -0500
> [email protected] wrote:
>
>> On Wed, 09 Jan 2008 15:50:15 EST, Rik van Riel said:
>>
>>
>>> Could you explain (using short words and simple sentences) what the
>>> exact problem is?
>>>
>> It's like this:
>>
>> Monday 9:04AM: System boots, database server starts up, mmaps file
>> Monday 9:06AM: Database server writes to mmap area, updates mtime/ctime
>> Monday <many times> Database server writes to mmap area, no further update..
>> Monday 11:45PM: Backup sees "file modified 9:06AM, let's back it up"
>> Tuesday 9:00AM-5:00PM: Database server touches it another 5,398 times, no mtime
>> Tuesday 11:45PM: Backup sees "file modified back on Monday, we backed this up..
>> Wed 9:00AM-5:00PM: More updates, more not touching the mtime
>> Wed 11:45PM: *yawn* It hasn't been touched in 2 days, no sense in backing it up..
>>
>> Lather, rinse, repeat....
>>
>
> On the other hand, updating the mtime and ctime whenever a page is dirtied
> also does not work right. Apparently that can break mutt.
>
>
Could you elaborate on why that would break mutt? I am assuming
that the pages being modified are mmap'd, but if they are not, then
it is very clear why mutt (and anything else) would break.
> Calling msync() every once in a while with Anton's patch does not look like a
> fool proof method to me either, because the VM can write all the dirty pages
> to disk by itself, leaving nothing for msync() to detect. (I think...)
>
> Can we get by with simply updating the ctime and mtime every time msync()
> is called, regardless of whether or not the mmaped pages were still dirty
> by the time we called msync() ?
As long as we can keep track of that information and then remember
it for an munmap so that eventually the file times do get updated,
then this should work.
It would seem that a better solution would be to update the file
times whenever the inode gets cleaned, ie. modified pages written
out and the inode synchronized to the disk. That way, long running
programs would not have to msync occasionally in order to have
the data file properly backed up.
Thanx...
ps
ps
On Wed, Jan 09, 2008 at 05:06:33PM -0500, Rik van Riel wrote:
...
> >
> > Lather, rinse, repeat....
Just verified this at one customer site; they had a db that was last backed up
in 2003 :/
>
> On the other hand, updating the mtime and ctime whenever a page is dirtied
> also does not work right. Apparently that can break mutt.
>
Thinking back on the atime discussion, I bet there would be some performance
problems in updating the ctime/mtime that often too :)
> Calling msync() every once in a while with Anton's patch does not look like a
> fool proof method to me either, because the VM can write all the dirty pages
> to disk by itself, leaving nothing for msync() to detect. (I think...)
Reading the man page:
"The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED and
PROT_WRITE will be marked for update at some point in the interval between a
write reference to the mapped region and the next call to msync() with
MS_ASYNC or MS_SYNC for that portion of the file by any process. If there is no
such call, these fields may be marked for update at any time after a write
reference if the underlying file is modified as a result."
So, whenever someone writes in the region, this must cause us to update the
mtime/ctime no later than at the time of the next call to msync().
Could one do something like the lazy atime updates, coupled with a forced flush
at msync()?
> Can we get by with simply updating the ctime and mtime every time msync()
> is called, regardless of whether or not the mmaped pages were still dirty
> by the time we called msync() ?
The update must still happen, eventually, after a write to the mapped region
followed by an unmap/close even if no msync is ever called.
The msync only serves as a "no later than" deadline. The write to the region
triggers the need for the update.
At least this is how I read the standard - please feel free to correct me if I
am mistaken.
--
/ jakob
On Wed, 9 Jan 2008 23:33:40 +0100
Jakob Oestergaard <[email protected]> wrote:
> On Wed, Jan 09, 2008 at 05:06:33PM -0500, Rik van Riel wrote:
> > Can we get by with simply updating the ctime and mtime every time msync()
> > is called, regardless of whether or not the mmaped pages were still dirty
> > by the time we called msync() ?
>
> The update must still happen, eventually, after a write to the mapped region
> followed by an unmap/close even if no msync is ever called.
>
> The msync only serves as a "no later than" deadline. The write to the region
> triggers the need for the update.
>
> At least this is how I read the standard - please feel free to correct me if I
> am mistaken.
You are absolutely right. If we wrote dirty pages to disk, the ctime
and mtime updates must happen no later than msync or close time.
I guess a third possible time (if we want to minimize the number of
updates) would be when natural syncing of the file data to disk, by
other things in the VM, would be about to clear the I_DIRTY_PAGES
flag on the inode. That way we do not need to remember any special
"we already flushed all dirty data, but we have not updated the mtime
and ctime yet" state.
Does this sound reasonable?
--
All rights reversed.
2008/1/10, Rik van Riel <[email protected]>:
> On Wed, 9 Jan 2008 23:33:40 +0100
> Jakob Oestergaard <[email protected]> wrote:
> > On Wed, Jan 09, 2008 at 05:06:33PM -0500, Rik van Riel wrote:
>
> > > Can we get by with simply updating the ctime and mtime every time msync()
> > > is called, regardless of whether or not the mmaped pages were still dirty
> > > by the time we called msync() ?
> >
> > The update must still happen, eventually, after a write to the mapped region
> > followed by an unmap/close even if no msync is ever called.
> >
> > The msync only serves as a "no later than" deadline. The write to the region
> > triggers the need for the update.
> >
> > At least this is how I read the standard - please feel free to correct me if I
> > am mistaken.
>
> You are absolutely right. If we wrote dirty pages to disk, the ctime
> and mtime updates must happen no later than msync or close time.
>
> I guess a third possible time (if we want to minimize the number of
> updates) would be when natural syncing of the file data to disk, by
> other things in the VM, would be about to clear the I_DIRTY_PAGES
> flag on the inode. That way we do not need to remember any special
> "we already flushed all dirty data, but we have not updated the mtime
> and ctime yet" state.
>
> Does this sound reasonable?
No, it doesn't. The msync() system call called with the MS_ASYNC flag
should (the POSIX standard requires that) update the st_ctime and
st_mtime stamps in the same manner as for the MS_SYNC flag. However,
the current implementation of msync() doesn't call the do_fsync()
function for the MS_ASYNC case. The msync() function may be called
with the MS_ASYNC flag before "natural syncing".
2008/1/9, Rik van Riel <[email protected]>:
> On Mon, 07 Jan 2008 20:54:19 +0300
> Anton Salikhmetov <[email protected]> wrote:
>
> > This program showed that the msync() function had a bug:
> > it did not update the st_mtime and st_ctime fields.
> >
> > The program shows appropriate behavior of the msync()
> > function using the kernel with the proposed patch applied.
> > Specifically, the ctime and mtime time stamps do change
> > when modifying the mapped memory and do not change when
> > there have been no write references between the mmap()
> > and msync() system calls.
>
> As long as the ctime and mtime stamps change when the memory is
> written to, what exactly is the problem?
>
> Is it that the ctime and mtime does not change again when the memory
> is written to again?
>
> Is there a way for backup programs to miss file modification times?
>
> Could you explain (using short words and simple sentences) what the
> exact problem is?
>
> Eg.
>
> 1) program mmaps file
> 2) program writes to mmaped area
> 3) ??? <=== this part, in equally simple words :)
> 4) data loss
>
> An explanation like that will help people understand exactly what the
> bug is, and why the patch should be applied ASAP.
>
> > The patch adds a call to the file_update_time() function to change
> > the file metadata before syncing. The patch also contains
> > substantial code cleanup: consolidated error check
> > for function parameters, using the PAGE_ALIGN() macro instead of
> > "manual" alignment, improved readability of the loop,
> > which traverses the process memory regions, updated comments.
>
> Due to the various cleanups all being in one file, it took me a while
> to understand the patch. In an area of code this subtle, it may be
> better to submit the cleanups in (a) separate patch(es) from the patch
> that adds the call to file_update_time().
Now I'm working on my next solution for this bug. It will probably
modify more than one file and be split into several parts.
>
> > - (vma->vm_flags & VM_SHARED)) {
> > + if (file && (vma->vm_flags & VM_SHARED)) {
> > get_file(file);
> > - up_read(&mm->mmap_sem);
> > - error = do_fsync(file, 0);
> > - fput(file);
> > - if (error || start >= end)
> > - goto out;
> > - down_read(&mm->mmap_sem);
> > - vma = find_vma(mm, start);
> > - } else {
> > - if (start >= end) {
> > - error = 0;
> > - goto out_unlock;
> > + if (file->f_mapping->host->i_state & I_DIRTY_PAGES)
> > + file_update_time(file);
>
> I wonder if calling file_update_time() from inside the loop is the
> best idea. Why not call that function just once, after msync breaks
> from the loop?
That function should be called inside of the loop because the memory
region, which msync() is called with, may contain pages mapped for
different files.
>
> thanks,
>
> Rik
> --
> All Rights Reversed
>
2008/1/10, Rik van Riel <[email protected]>:
> On Wed, 09 Jan 2008 16:06:17 -0500
> [email protected] wrote:
> > On Wed, 09 Jan 2008 15:50:15 EST, Rik van Riel said:
> >
> > > Could you explain (using short words and simple sentences) what the
> > > exact problem is?
> >
> > It's like this:
> >
> > Monday 9:04AM: System boots, database server starts up, mmaps file
> > Monday 9:06AM: Database server writes to mmap area, updates mtime/ctime
> > Monday <many times> Database server writes to mmap area, no further update..
> > Monday 11:45PM: Backup sees "file modified 9:06AM, let's back it up"
> > Tuesday 9:00AM-5:00PM: Database server touches it another 5,398 times, no mtime
> > Tuesday 11:45PM: Backup sees "file modified back on Monday, we backed this up..
> > Wed 9:00AM-5:00PM: More updates, more not touching the mtime
> > Wed 11:45PM: *yawn* It hasn't been touched in 2 days, no sense in backing it up..
> >
> > Lather, rinse, repeat....
>
> On the other hand, updating the mtime and ctime whenever a page is dirtied
> also does not work right. Apparently that can break mutt.
Please tell why you think that can break mutt? Such approach was
suggested by Peter once and looks reasonable to me too.
>
> Calling msync() every once in a while with Anton's patch does not look like a
> fool proof method to me either, because the VM can write all the dirty pages
> to disk by itself, leaving nothing for msync() to detect. (I think...)
>
> Can we get by with simply updating the ctime and mtime every time msync()
> is called, regardless of whether or not the mmaped pages were still dirty
> by the time we called msync() ?
>
> --
> All Rights Reversed
>
On Thu, Jan 10, 2008 at 03:03:03AM +0300, Anton Salikhmetov wrote:
...
> > I guess a third possible time (if we want to minimize the number of
> > updates) would be when natural syncing of the file data to disk, by
> > other things in the VM, would be about to clear the I_DIRTY_PAGES
> > flag on the inode. That way we do not need to remember any special
> > "we already flushed all dirty data, but we have not updated the mtime
> > and ctime yet" state.
> >
> > Does this sound reasonable?
>
> No, it doesn't. The msync() system call called with the MS_ASYNC flag
> should (the POSIX standard requires that) update the st_ctime and
> st_mtime stamps in the same manner as for the MS_SYNC flag. However,
> the current implementation of msync() doesn't call the do_fsync()
> function for the MS_ASYNC case. The msync() function may be called
> with the MS_ASYNC flag before "natural syncing".
If the update was done as Rik suggested, with the addition that msync()
triggered an explicit sync of the inode data, then everything would be ok,
right?
--
/ jakob
2008/1/10, Jakob Oestergaard <[email protected]>:
> On Thu, Jan 10, 2008 at 03:03:03AM +0300, Anton Salikhmetov wrote:
> ...
> > > I guess a third possible time (if we want to minimize the number of
> > > updates) would be when natural syncing of the file data to disk, by
> > > other things in the VM, would be about to clear the I_DIRTY_PAGES
> > > flag on the inode. That way we do not need to remember any special
> > > "we already flushed all dirty data, but we have not updated the mtime
> > > and ctime yet" state.
> > >
> > > Does this sound reasonable?
> >
> > No, it doesn't. The msync() system call called with the MS_ASYNC flag
> > should (the POSIX standard requires that) update the st_ctime and
> > st_mtime stamps in the same manner as for the MS_SYNC flag. However,
> > the current implementation of msync() doesn't call the do_fsync()
> > function for the MS_ASYNC case. The msync() function may be called
> > with the MS_ASYNC flag before "natural syncing".
>
> If the update was done as Rik suggested, with the addition that msync()
> triggered an explicit sync of the inode data, then everything would be ok,
> right?
Indeed, if msync() is called with MS_SYNC an explicit sync is
triggered, and Rik's suggestion would work. However, the POSIX
standard requires a call to msync() with MS_ASYNC to update the
st_ctime and st_mtime stamps too. No explicit sync of the inode data
is triggered in the current implementation of msync(). Hence Rik's
suggestion would fail to satisfy POSIX in the latter case.
>
> --
>
> / jakob
>
>
On Thu, 10 Jan 2008 13:53:59 +0300
"Anton Salikhmetov" <[email protected]> wrote:
> Indeed, if msync() is called with MS_SYNC an explicit sync is
> triggered, and Rik's suggestion would work. However, the POSIX
> standard requires a call to msync() with MS_ASYNC to update the
> st_ctime and st_mtime stamps too. No explicit sync of the inode data
> is triggered in the current implementation of msync(). Hence Rik's
> suggestion would fail to satisfy POSIX in the latter case.
Since your patch is already changing msync(), has it occurred
to you that your patch could change msync() to do the right
thing?
--
All rights reversed.
2008/1/10, Rik van Riel <[email protected]>:
> On Thu, 10 Jan 2008 13:53:59 +0300
> "Anton Salikhmetov" <[email protected]> wrote:
>
> > Indeed, if msync() is called with MS_SYNC an explicit sync is
> > triggered, and Rik's suggestion would work. However, the POSIX
> > standard requires a call to msync() with MS_ASYNC to update the
> > st_ctime and st_mtime stamps too. No explicit sync of the inode data
> > is triggered in the current implementation of msync(). Hence Rik's
> > suggestion would fail to satisfy POSIX in the latter case.
>
> Since your patch is already changing msync(), has it occurred
> to you that your patch could change msync() to do the right
> thing?
No, not quite. Peter Staubach mentioned an issue in my solution:
>>>
> The patch adds a call to the file_update_time() function to change
> the file metadata before syncing. The patch also contains
> substantial code cleanup: consolidated error check
> for function parameters, using the PAGE_ALIGN() macro instead of
> "manual" alignment check, improved readability of the loop,
> which traverses the process memory regions, updated comments.
>
>
These changes catch the simple case, where the file is mmap'd,
modified via the mmap'd region, and then an msync is done,
all on a mostly quiet system.
However, I don't see how they will work if there has been
something like a sync(2) done after the mmap'd region is
modified and the msync call. When the inode is written out
as part of the sync process, I_DIRTY_PAGES will be cleared,
thus causing a miss in this code.
The I_DIRTY_PAGES check here is good, but I think that there
needs to be some code elsewhere too, to catch the case where
I_DIRTY_PAGES is being cleared, but the time fields still need
to be updated.
<<<
So I'm working on my next solution for this bug now.
>
> --
> All rights reversed.
>
On Thu, 10 Jan 2008 18:56:07 +0300
"Anton Salikhmetov" <[email protected]> wrote:
> However, I don't see how they will work if there has been
> something like a sync(2) done after the mmap'd region is
> modified and the msync call. When the inode is written out
> as part of the sync process, I_DIRTY_PAGES will be cleared,
> thus causing a miss in this code.
>
> The I_DIRTY_PAGES check here is good, but I think that there
> needs to be some code elsewhere too, to catch the case where
> I_DIRTY_PAGES is being cleared, but the time fields still need
> to be updated.
Agreed. The mtime and ctime should probably also be updated
when I_DIRTY_PAGES is cleared.
The alternative would be to remember that the inode had been
dirty in the past, and have the mtime and ctime updated on
msync or close - which would be more complex.
--
All rights reversed.
2008/1/10, Rik van Riel <[email protected]>:
> On Thu, 10 Jan 2008 18:56:07 +0300
> "Anton Salikhmetov" <[email protected]> wrote:
>
> > However, I don't see how they will work if there has been
> > something like a sync(2) done after the mmap'd region is
> > modified and the msync call. When the inode is written out
> > as part of the sync process, I_DIRTY_PAGES will be cleared,
> > thus causing a miss in this code.
> >
> > The I_DIRTY_PAGES check here is good, but I think that there
> > needs to be some code elsewhere too, to catch the case where
> > I_DIRTY_PAGES is being cleared, but the time fields still need
> > to be updated.
>
> Agreed. The mtime and ctime should probably also be updated
> when I_DIRTY_PAGES is cleared.
>
> The alternative would be to remember that the inode had been
> dirty in the past, and have the mtime and ctime updated on
> msync or close - which would be more complex.
Adding the new flag (AS_MCTIME) has been already suggested by Peter
Staubach in his first solution for this bug. Now I understand that the
AS_MCTIME flag is required for fixing the bug.
>
> --
> All rights reversed.
>
Rik van Riel wrote:
> On Thu, 10 Jan 2008 18:56:07 +0300
> "Anton Salikhmetov" <[email protected]> wrote:
>
>
>> However, I don't see how they will work if there has been
>> something like a sync(2) done after the mmap'd region is
>> modified and the msync call. When the inode is written out
>> as part of the sync process, I_DIRTY_PAGES will be cleared,
>> thus causing a miss in this code.
>>
>> The I_DIRTY_PAGES check here is good, but I think that there
>> needs to be some code elsewhere too, to catch the case where
>> I_DIRTY_PAGES is being cleared, but the time fields still need
>> to be updated.
>>
>
> Agreed. The mtime and ctime should probably also be updated
> when I_DIRTY_PAGES is cleared.
>
> The alternative would be to remember that the inode had been
> dirty in the past, and have the mtime and ctime updated on
> msync or close - which would be more complex.
And also remembering that the file times should not be updated
if the pages were modified via a write(2) operation. Or if
there has been an intervening write(2) operation...
The number of cases to consider and the boundary conditions
quickly make this reasonably complex to get right. That's why
this is the 4'th or 5'th attempt in the last 18 months or so
to get this situation addressed.
ps
Anton Salikhmetov wrote:
> 2008/1/10, Rik van Riel <[email protected]>:
>
>> On Thu, 10 Jan 2008 18:56:07 +0300
>> "Anton Salikhmetov" <[email protected]> wrote:
>>
>>
>>> However, I don't see how they will work if there has been
>>> something like a sync(2) done after the mmap'd region is
>>> modified and the msync call. When the inode is written out
>>> as part of the sync process, I_DIRTY_PAGES will be cleared,
>>> thus causing a miss in this code.
>>>
>>> The I_DIRTY_PAGES check here is good, but I think that there
>>> needs to be some code elsewhere too, to catch the case where
>>> I_DIRTY_PAGES is being cleared, but the time fields still need
>>> to be updated.
>>>
>> Agreed. The mtime and ctime should probably also be updated
>> when I_DIRTY_PAGES is cleared.
>>
>> The alternative would be to remember that the inode had been
>> dirty in the past, and have the mtime and ctime updated on
>> msync or close - which would be more complex.
>>
>
> Adding the new flag (AS_MCTIME) has been already suggested by Peter
> Staubach in his first solution for this bug. Now I understand that the
> AS_MCTIME flag is required for fixing the bug.
Well, that was the approach before we had I_DIRTY_PAGES. I am
still wondering whether we can get this approach to work, with
a little more support and heuristics. PeterZ's work to better
track dirty pages should be helpful in determining when and why
a patch was dirty.
I keep thinking that by recording the time when a page was found
to be dirty and the file is mmap'd and then updating the mtime
and ctime fields in the inode during msync() and sync_single_inode()
if that time is newer than the mtime and ctime fields, then we
can solve the problem of when and when not to update those two
time fields.
I haven't had a chance to think it all through completely or do
the appropriate analysis yet though.
ps
On Wed, 09 Jan 2008 18:41:41 EST, Rik van Riel said:
> I guess a third possible time (if we want to minimize the number of
> updates) would be when natural syncing of the file data to disk, by
> other things in the VM, would be about to clear the I_DIRTY_PAGES
> flag on the inode. That way we do not need to remember any special
> "we already flushed all dirty data, but we have not updated the mtime
> and ctime yet" state.
>
> Does this sound reasonable?
Is it possible that a *very* large file (multi-gigabyte or even bigger database,
for example) would never get out of I_DIRTY_PAGES, because there's always a
few dozen just-recently dirtied pages that haven't made it out to disk yet?
Of course, getting a *consistent* backup of a file like that is quite the
challenge already, because of the high likelyhood of the file being changed
while the backup runs - that's why big sites often do a 'quiesce/snapshot/wakeup'
on a database and then backup the snapshot...