2008-01-17 22:32:18

by Anton Salikhmetov

Subject: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times

This is the sixth version of my solution for bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

New since the previous version:

1) a few cosmetic changes to the cleanup patch, according to the
latest feedback;

2) implementation of the following suggestion by Miklos Szeredi:

http://lkml.org/lkml/2008/1/17/158

These changes were tested as explained below. Please note that all
tests were performed with all recommended kernel debug options
enabled. Also note that the tests were performed on regular files
residing on both an ext3 partition and a tmpfs filesystem. I also
checked the block device case, which worked for me as well.

1. My own unit test:

http://bugzilla.kernel.org/attachment.cgi?id=14430

Result: all test cases passed successfully.

2. Unit test provided by Miklos Szeredi:

http://lkml.org/lkml/2008/1/14/104

Result: this test produced the following output:

debian-64:~# ./miklos_test test
begin 1200598736 1200598736 1200598617
write 1200598737 1200598737 1200598617
mmap 1200598737 1200598737 1200598738
b 1200598739 1200598739 1200598738
msync b 1200598739 1200598739 1200598738
c 1200598741 1200598741 1200598738
msync c 1200598741 1200598741 1200598738
d 1200598743 1200598743 1200598738
munmap 1200598743 1200598743 1200598738
close 1200598743 1200598743 1200598738
sync 1200598743 1200598743 1200598738
debian-64:~#

3. Regression tests were performed using the following test cases from
the LTP test suite:

msync01
msync02
msync03
msync04
msync05
mmapstress01
mmapstress09
mmapstress10

Result: no regressions were found while running these test cases.

4. Performance test was done using the program available from the
following link:

http://bugzilla.kernel.org/attachment.cgi?id=14493

Result: the impact of the changes was negligible for files of a few
hundred megabytes.

I wonder if these changes can be applied now.

These patches are the result of many fruitful discussions with
Miklos Szeredi, Peter Zijlstra, Rik van Riel, Peter Staubach,
and Jacob Oestergaard. I am grateful to you all for your support
during the days I was working on this long-standing nasty bug.


2008-01-17 22:32:31

by Anton Salikhmetov

Subject: [PATCH -v6 1/2] Massive code cleanup of sys_msync()

Use the PAGE_ALIGN() macro instead of "manual" alignment, and improve
the readability of the loop traversing the process memory regions.

Signed-off-by: Anton Salikhmetov <[email protected]>
---
mm/msync.c | 77 ++++++++++++++++++++++++++++--------------------------------
1 files changed, 36 insertions(+), 41 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..a4de868 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,85 +1,83 @@
/*
- * linux/mm/msync.c
+ * The msync() system call.
*
- * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov <[email protected]>
*/

-/*
- * The msync() system call.
- */
+#include <linux/file.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mman.h>
-#include <linux/file.h>
-#include <linux/syscalls.h>
#include <linux/sched.h>
+#include <linux/syscalls.h>

/*
* MS_SYNC syncs the entire file - including mappings.
*
* MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
* Now it doesn't do anything, since dirty pages are properly tracked.
*
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
+ * The application may now run fsync() to write out the dirty pages and
+ * wait on the writeout and check the result. Or the application may run
+ * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately.
* So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
* applications.
*/
asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
{
unsigned long end;
- struct mm_struct *mm = current->mm;
+ int error, unmapped_error;
struct vm_area_struct *vma;
- int unmapped_error = 0;
- int error = -EINVAL;
+ struct mm_struct *mm;

+ error = -EINVAL;
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
if (start & ~PAGE_MASK)
goto out;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
goto out;
+
error = -ENOMEM;
- len = (len + ~PAGE_MASK) & PAGE_MASK;
+ len = PAGE_ALIGN(len);
end = start + len;
if (end < start)
goto out;
- error = 0;
+
+ error = unmapped_error = 0;
if (end == start)
goto out;
+
/*
* If the interval [start,end) covers some unmapped address ranges,
* just ignore them, but return -ENOMEM at the end.
*/
+ mm = current->mm;
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
- for (;;) {
+ do {
struct file *file;

- /* Still start < end. */
error = -ENOMEM;
if (!vma)
- goto out_unlock;
- /* Here start < vma->vm_end. */
+ break;
if (start < vma->vm_start) {
start = vma->vm_start;
if (start >= end)
- goto out_unlock;
- unmapped_error = -ENOMEM;
- }
- /* Here vma->vm_start <= start < vma->vm_end. */
- if ((flags & MS_INVALIDATE) &&
- (vma->vm_flags & VM_LOCKED)) {
- error = -EBUSY;
- goto out_unlock;
+ break;
+ unmapped_error = error;
}
- file = vma->vm_file;
+
+ error = -EBUSY;
+ if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
+ break;
+
+ error = 0;
start = vma->vm_end;
- if ((flags & MS_SYNC) && file &&
- (vma->vm_flags & VM_SHARED)) {
+ file = vma->vm_file;
+ if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
get_file(file);
up_read(&mm->mmap_sem);
error = do_fsync(file, 0);
@@ -88,16 +86,13 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
goto out;
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
- } else {
- if (start >= end) {
- error = 0;
- goto out_unlock;
- }
- vma = vma->vm_next;
+ continue;
}
- }
-out_unlock:
+
+ vma = vma->vm_next;
+ } while (start < end);
up_read(&mm->mmap_sem);
+
out:
- return error ? : unmapped_error;
+ return error ? error : unmapped_error;
}
--
1.4.4.4
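
For reference, PAGE_ALIGN(len) computes the same value as the open-coded
(len + ~PAGE_MASK) & PAGE_MASK expression it replaces. A small stand-alone
illustration (the PAGE_SIZE value and the user-space macro definitions are
assumptions made only for this example):

#include <stdio.h>

/* Illustration only: user-space stand-ins for the kernel definitions. */
#define PAGE_SIZE     4096UL
#define PAGE_MASK     (~(PAGE_SIZE - 1))
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & PAGE_MASK)

int main(void)
{
        unsigned long len = 5000;
        unsigned long old_style = (len + ~PAGE_MASK) & PAGE_MASK;

        /* Both round 5000 up to the next page boundary, 8192. */
        printf("%lu %lu\n", PAGE_ALIGN(len), old_style);
        return 0;
}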

2008-01-17 22:32:48

by Anton Salikhmetov

Subject: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

Update file times at write references to memory-mapped files, and
force a file times update at the next write reference after the
msync() system call has been invoked with the MS_ASYNC flag.

Signed-off-by: Anton Salikhmetov <[email protected]>
---
mm/memory.c | 6 ++++++
mm/msync.c | 52 +++++++++++++++++++++++++++++++++++++++-------------
2 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4bf0b6d..13d5bbf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1668,6 +1668,9 @@ gotten:
unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+ if (vma->vm_file)
+ file_update_time(vma->vm_file);
+
/*
* Yes, Virginia, this is actually required to prevent a race
* with clear_page_dirty_for_io() from clearing the page dirty
@@ -2341,6 +2344,9 @@ out_unlocked:
if (anon)
page_cache_release(vmf.page);
else if (dirty_page) {
+ if (vma->vm_file)
+ file_update_time(vma->vm_file);
+
set_page_dirty_balance(dirty_page, page_mkwrite);
put_page(dirty_page);
}
diff --git a/mm/msync.c b/mm/msync.c
index a4de868..a49af28 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -13,11 +13,33 @@
#include <linux/syscalls.h>

/*
+ * Scan the PTEs for pages belonging to the VMA and mark them read-only.
+ * It will force a pagefault on the next write access.
+ */
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+ unsigned long addr;
+
+ for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+ spinlock_t *ptl;
+ pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
+ pud_t *pud = pud_offset(pgd, addr);
+ pmd_t *pmd = pmd_offset(pud, addr);
+ pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+
+ if (pte_dirty(*pte) && pte_write(*pte))
+ *pte = pte_wrprotect(*pte);
+ pte_unmap_unlock(pte, ptl);
+ }
+}
+
+/*
* MS_SYNC syncs the entire file - including mappings.
*
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
+ * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
+ * read-only by calling vma_wrprotect(). This is needed to catch the next
+ * write reference to the mapped region and update the file times
+ * accordingly.
*
* The application may now run fsync() to write out the dirty pages and
* wait on the writeout and check the result. Or the application may run
@@ -77,16 +99,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
error = 0;
start = vma->vm_end;
file = vma->vm_file;
- if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
- get_file(file);
- up_read(&mm->mmap_sem);
- error = do_fsync(file, 0);
- fput(file);
- if (error || start >= end)
- goto out;
- down_read(&mm->mmap_sem);
- vma = find_vma(mm, start);
- continue;
+ if (file && (vma->vm_flags & VM_SHARED)) {
+ if (flags & MS_ASYNC)
+ vma_wrprotect(vma);
+ if (flags & MS_SYNC) {
+ get_file(file);
+ up_read(&mm->mmap_sem);
+ error = do_fsync(file, 0);
+ fput(file);
+ if (error || start >= end)
+ goto out;
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, start);
+ continue;
+ }
}

vma = vma->vm_next;
--
1.4.4.4
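
For context, the user-visible behavior this patch targets can be exercised
with a small program along the following lines (a sketch only; the file
name, sizes, and sleep interval are arbitrary assumptions, and this is not
the unit test referenced in the cover letter): write through a shared
mapping, call msync(MS_ASYNC), then write again and expect the file times
to move at the second write reference.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        struct stat st;
        char *p;
        int fd = open("mapped-file", O_RDWR | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return 1;
        ftruncate(fd, 4096);
        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        p[0] = 'a';                     /* first write reference */
        msync(p, 4096, MS_ASYNC);       /* write-protects the dirty PTEs */
        sleep(2);
        p[0] = 'b';                     /* faults again -> times updated */

        fstat(fd, &st);
        printf("mtime %ld ctime %ld\n", (long)st.st_mtime, (long)st.st_ctime);

        munmap(p, 4096);
        close(fd);
        return 0;
}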

2008-01-18 09:34:16

by Miklos Szeredi

Subject: Re: [PATCH -v6 1/2] Massive code cleanup of sys_msync()

> unsigned long end;
> - struct mm_struct *mm = current->mm;
> + int error, unmapped_error;
> struct vm_area_struct *vma;
> - int unmapped_error = 0;
> - int error = -EINVAL;
> + struct mm_struct *mm;
>
> + error = -EINVAL;

I think you may have misunderstood my last comment. These are OK:

struct mm_struct *mm = current->mm;
int unmapped_error = 0;
int error = -EINVAL;

This is not so good:

int error, unmapped_error;

This is the worst:

int error = -EINVAL, unmapped_error = 0;

So I think the original code is fine as it is.

Otherwise the patch looks OK now.

Miklos

2008-01-18 09:40:26

by Miklos Szeredi

Subject: Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times

> 4. Performance test was done using the program available from the
> following link:
>
> http://bugzilla.kernel.org/attachment.cgi?id=14493
>
> Result: the impact of the changes was negligible for files of a few
> hundred megabytes.

Could you also test with ext4 and post some numbers? Afaik, ext4 uses
nanosecond timestamps, so the time updating code would be exercised
more during the page faults.

What about the performance impact on msync(MS_ASYNC)? Could you please
do some measurements of that as well?

Thanks,
Miklos

2008-01-18 09:51:35

by Miklos Szeredi

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

> Updating file times at write references to memory-mapped files and
> forcing file times update at the next write reference after
> calling the msync() system call with the MS_ASYNC flag.
>
> Signed-off-by: Anton Salikhmetov <[email protected]>
> ---
> mm/memory.c | 6 ++++++
> mm/msync.c | 52 +++++++++++++++++++++++++++++++++++++++-------------
> 2 files changed, 45 insertions(+), 13 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 4bf0b6d..13d5bbf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1668,6 +1668,9 @@ gotten:
> unlock:
> pte_unmap_unlock(page_table, ptl);
> if (dirty_page) {
> + if (vma->vm_file)
> + file_update_time(vma->vm_file);
> +
> /*
> * Yes, Virginia, this is actually required to prevent a race
> * with clear_page_dirty_for_io() from clearing the page dirty
> @@ -2341,6 +2344,9 @@ out_unlocked:
> if (anon)
> page_cache_release(vmf.page);
> else if (dirty_page) {
> + if (vma->vm_file)
> + file_update_time(vma->vm_file);
> +
> set_page_dirty_balance(dirty_page, page_mkwrite);
> put_page(dirty_page);
> }
> diff --git a/mm/msync.c b/mm/msync.c
> index a4de868..a49af28 100644
> --- a/mm/msync.c
> +++ b/mm/msync.c
> @@ -13,11 +13,33 @@
> #include <linux/syscalls.h>
>
> /*
> + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> + * It will force a pagefault on the next write access.
> + */
> +static void vma_wrprotect(struct vm_area_struct *vma)
> +{
> + unsigned long addr;
> +
> + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> + spinlock_t *ptl;
> + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> + pud_t *pud = pud_offset(pgd, addr);
> + pmd_t *pmd = pmd_offset(pud, addr);
> + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +
> + if (pte_dirty(*pte) && pte_write(*pte))
> + *pte = pte_wrprotect(*pte);
> + pte_unmap_unlock(pte, ptl);
> + }
> +}

What about ram based filesystems? They don't start out with read-only
pte's, so I think they don't want them read-protected now either.
Unless this is essential for correct mtime/ctime accounting on these
filesystems (I don't think it really is). But then the mapping should
start out read-only as well, otherwise the time update will only work
after an msync(MS_ASYNC).
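
For illustration, one way that distinction could be expressed in the
msync() path - sketched here as an assumption, not as part of the posted
patch - is to key the write-protection off the dirty-accounting
capability of the backing mapping:

        /* Sketch only: restrict vma_wrprotect() to mappings that do
         * dirty accounting, so tmpfs/ramfs mappings keep their writable
         * PTEs.  mapping_cap_account_dirty() is the existing predicate
         * used elsewhere in mm for this distinction. */
        if (file && (vma->vm_flags & VM_SHARED) &&
            mapping_cap_account_dirty(file->f_mapping) &&
            (flags & MS_ASYNC))
                vma_wrprotect(vma);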

> +
> +/*
> * MS_SYNC syncs the entire file - including mappings.
> *
> - * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> - * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
> - * Now it doesn't do anything, since dirty pages are properly tracked.
> + * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
> + * read-only by calling vma_wrprotect(). This is needed to catch the next
> + * write reference to the mapped region and update the file times
> + * accordingly.
> *
> * The application may now run fsync() to write out the dirty pages and
> * wait on the writeout and check the result. Or the application may run
> @@ -77,16 +99,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
> error = 0;
> start = vma->vm_end;
> file = vma->vm_file;
> - if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
> - get_file(file);
> - up_read(&mm->mmap_sem);
> - error = do_fsync(file, 0);
> - fput(file);
> - if (error || start >= end)
> - goto out;
> - down_read(&mm->mmap_sem);
> - vma = find_vma(mm, start);
> - continue;
> + if (file && (vma->vm_flags & VM_SHARED)) {
> + if (flags & MS_ASYNC)
> + vma_wrprotect(vma);
> + if (flags & MS_SYNC) {
> + get_file(file);
> + up_read(&mm->mmap_sem);
> + error = do_fsync(file, 0);
> + fput(file);
> + if (error || start >= end)
> + goto out;
> + down_read(&mm->mmap_sem);
> + vma = find_vma(mm, start);
> + continue;
> + }
> }
>
> vma = vma->vm_next;
> --
> 1.4.4.4
>
>

2008-01-18 10:15:53

by Peter Zijlstra

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files


On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:

> > diff --git a/mm/msync.c b/mm/msync.c
> > index a4de868..a49af28 100644
> > --- a/mm/msync.c
> > +++ b/mm/msync.c
> > @@ -13,11 +13,33 @@
> > #include <linux/syscalls.h>
> >
> > /*
> > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > + * It will force a pagefault on the next write access.
> > + */
> > +static void vma_wrprotect(struct vm_area_struct *vma)
> > +{
> > + unsigned long addr;
> > +
> > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > + spinlock_t *ptl;
> > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > + pud_t *pud = pud_offset(pgd, addr);
> > + pmd_t *pmd = pmd_offset(pud, addr);
> > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +
> > + if (pte_dirty(*pte) && pte_write(*pte))
> > + *pte = pte_wrprotect(*pte);
> > + pte_unmap_unlock(pte, ptl);
> > + }
> > +}
>
> What about ram based filesystems? They don't start out with read-only
> pte's, so I think they don't want them read-protected now either.
> Unless this is essential for correct mtime/ctime accounting on these
> filesystems (I don't think it really is). But then the mapping should
> start out read-only as well, otherwise the time update will only work
> after an msync(MS_ASYNC).

page_mkclean() has all the needed logic for this; it also walks the rmap
and cleans out all other users, which I think is needed too for
consistency's sake:

Process A Process B

mmap(foo.txt) mmap(foo.txt)

dirty page
dirty page

msync(MS_ASYNC)

dirty page

msync(MS_ASYNC) <--- now what?!


So what I would suggest is using the page table walkers from mm: walk
the page range, obtain the page using vm_normal_page(), and call
page_mkclean(). (Oh, and ensure you don't nest the pte lock :-)

All in all, that sounds rather expensive..


2008-01-18 10:26:17

by Peter Zijlstra

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files


On Fri, 2008-01-18 at 11:15 +0100, Peter Zijlstra wrote:
> On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
>
> > > diff --git a/mm/msync.c b/mm/msync.c
> > > index a4de868..a49af28 100644
> > > --- a/mm/msync.c
> > > +++ b/mm/msync.c
> > > @@ -13,11 +13,33 @@
> > > #include <linux/syscalls.h>
> > >
> > > /*
> > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > > + * It will force a pagefault on the next write access.
> > > + */
> > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > +{
> > > + unsigned long addr;
> > > +
> > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > > + spinlock_t *ptl;
> > > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > + pud_t *pud = pud_offset(pgd, addr);
> > > + pmd_t *pmd = pmd_offset(pud, addr);
> > > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > +
> > > + if (pte_dirty(*pte) && pte_write(*pte))
> > > + *pte = pte_wrprotect(*pte);
> > > + pte_unmap_unlock(pte, ptl);
> > > + }
> > > +}
> >
> > What about ram based filesystems? They don't start out with read-only
> > pte's, so I think they don't want them read-protected now either.
> > Unless this is essential for correct mtime/ctime accounting on these
> > filesystems (I don't think it really is). But then the mapping should
> > start out read-only as well, otherwise the time update will only work
> > after an msync(MS_ASYNC).
>
> page_mkclean() has all the needed logic for this, it also walks the rmap
> and cleans out all other users, which I think is needed too for
> consistencies sake:
>
> Process A Process B
>
> mmap(foo.txt) mmap(foo.txt)
>
> dirty page
> dirty page
>
> msync(MS_ASYNC)
>
> dirty page
>
> msync(MS_ASYNC) <--- now what?!
>
>
> So what I would suggest is using the page table walkers from mm, and
> walks the page range, obtain the page using vm_normal_page() and call
> page_mkclean(). (Oh, and ensure you don't nest the pte lock :-)
>
> All in all, that sounds rather expensive..

Bah, and will break on s390... so we'd need a page_mkclean() variant
that doesn't actually clear dirty.

2008-01-18 10:30:28

by Anton Salikhmetov

Subject: Re: [PATCH -v6 1/2] Massive code cleanup of sys_msync()

2008/1/18, Miklos Szeredi <[email protected]>:
> > unsigned long end;
> > - struct mm_struct *mm = current->mm;
> > + int error, unmapped_error;
> > struct vm_area_struct *vma;
> > - int unmapped_error = 0;
> > - int error = -EINVAL;
> > + struct mm_struct *mm;
> >
> > + error = -EINVAL;
>
> I think you may have misunderstood my last comment. These are OK:
>
> struct mm_struct *mm = current->mm;
> int unmapped_error = 0;
> int error = -EINVAL;
>
> This is not so good:
>
> int error, unmapped_error;
>
> This is the worst:
>
> int error = -EINVAL, unmapped_error = 0;
>
> So I think the original code is fine as it is.
>
> Othewise patch looks OK now.

I moved the initialization of the variables to the code where they are needed.

I don't agree that "int a; int b;" is better than "int a, b".

>
> Miklos
>

2008-01-18 10:31:41

by Anton Salikhmetov

Subject: Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times

2008/1/18, Miklos Szeredi <[email protected]>:
> > 4. Performance test was done using the program available from the
> > following link:
> >
> > http://bugzilla.kernel.org/attachment.cgi?id=14493
> >
> > Result: the impact of the changes was negligible for files of a few
> > hundred megabytes.
>
> Could you also test with ext4 and post some numbers? Afaik, ext4 uses
> nanosecond timestamps, so the time updating code would be exercised
> more during the page faults.
>
> What about performance impact on msync(MS_ASYNC)? Could you please do
> some measurment of that as well?

I'll do the measurements for the MS_ASYNC case and for the Ext4 filesystem.

>
> Thanks,
> Miklos
>
>

2008-01-18 10:38:19

by Miklos Szeredi

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

> On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
>
> > > diff --git a/mm/msync.c b/mm/msync.c
> > > index a4de868..a49af28 100644
> > > --- a/mm/msync.c
> > > +++ b/mm/msync.c
> > > @@ -13,11 +13,33 @@
> > > #include <linux/syscalls.h>
> > >
> > > /*
> > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > > + * It will force a pagefault on the next write access.
> > > + */
> > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > +{
> > > + unsigned long addr;
> > > +
> > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > > + spinlock_t *ptl;
> > > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > + pud_t *pud = pud_offset(pgd, addr);
> > > + pmd_t *pmd = pmd_offset(pud, addr);
> > > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > +
> > > + if (pte_dirty(*pte) && pte_write(*pte))
> > > + *pte = pte_wrprotect(*pte);
> > > + pte_unmap_unlock(pte, ptl);
> > > + }
> > > +}
> >
> > What about ram based filesystems? They don't start out with read-only
> > pte's, so I think they don't want them read-protected now either.
> > Unless this is essential for correct mtime/ctime accounting on these
> > filesystems (I don't think it really is). But then the mapping should
> > start out read-only as well, otherwise the time update will only work
> > after an msync(MS_ASYNC).
>
> page_mkclean() has all the needed logic for this, it also walks the rmap
> and cleans out all other users, which I think is needed too for
> consistencies sake:
>
> Process A Process B
>
> mmap(foo.txt) mmap(foo.txt)
>
> dirty page
> dirty page
>
> msync(MS_ASYNC)
>
> dirty page
>
> msync(MS_ASYNC) <--- now what?!

Nothing. I think it's perfectly acceptable behavior if msync in
process A doesn't care about any dirtying in process B.

> All in all, that sounds rather expensive..

Right. The advantage of Anton's current approach is that it's at
least simple, and possibly not so expensive, while providing some
quite sane semantics for MS_ASYNC vs. mtime updates.

Miklos

2008-01-18 10:40:15

by Anton Salikhmetov

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008/1/18, Peter Zijlstra <[email protected]>:
>
> On Fri, 2008-01-18 at 11:15 +0100, Peter Zijlstra wrote:
> > On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
> >
> > > > diff --git a/mm/msync.c b/mm/msync.c
> > > > index a4de868..a49af28 100644
> > > > --- a/mm/msync.c
> > > > +++ b/mm/msync.c
> > > > @@ -13,11 +13,33 @@
> > > > #include <linux/syscalls.h>
> > > >
> > > > /*
> > > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > > > + * It will force a pagefault on the next write access.
> > > > + */
> > > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > > +{
> > > > + unsigned long addr;
> > > > +
> > > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > > > + spinlock_t *ptl;
> > > > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > > + pud_t *pud = pud_offset(pgd, addr);
> > > > + pmd_t *pmd = pmd_offset(pud, addr);
> > > > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > +
> > > > + if (pte_dirty(*pte) && pte_write(*pte))
> > > > + *pte = pte_wrprotect(*pte);
> > > > + pte_unmap_unlock(pte, ptl);
> > > > + }
> > > > +}
> > >
> > > What about ram based filesystems? They don't start out with read-only
> > > pte's, so I think they don't want them read-protected now either.
> > > Unless this is essential for correct mtime/ctime accounting on these
> > > filesystems (I don't think it really is). But then the mapping should
> > > start out read-only as well, otherwise the time update will only work
> > > after an msync(MS_ASYNC).
> >
> > page_mkclean() has all the needed logic for this, it also walks the rmap
> > and cleans out all other users, which I think is needed too for
> > consistencies sake:
> >
> > Process A Process B
> >
> > mmap(foo.txt) mmap(foo.txt)
> >
> > dirty page
> > dirty page
> >
> > msync(MS_ASYNC)
> >
> > dirty page
> >
> > msync(MS_ASYNC) <--- now what?!
> >
> >
> > So what I would suggest is using the page table walkers from mm, and
> > walks the page range, obtain the page using vm_normal_page() and call
> > page_mkclean(). (Oh, and ensure you don't nest the pte lock :-)
> >
> > All in all, that sounds rather expensive..
>
> Bah, and will break on s390... so we'd need a page_mkclean() variant
> that doesn't actually clear dirty.

So the current version of the functional changes patch takes this into account.

>
>

2008-01-18 11:01:49

by Peter Zijlstra

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files


On Fri, 2008-01-18 at 11:38 +0100, Miklos Szeredi wrote:
> > On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
> >
> > > > diff --git a/mm/msync.c b/mm/msync.c
> > > > index a4de868..a49af28 100644
> > > > --- a/mm/msync.c
> > > > +++ b/mm/msync.c
> > > > @@ -13,11 +13,33 @@
> > > > #include <linux/syscalls.h>
> > > >
> > > > /*
> > > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > > > + * It will force a pagefault on the next write access.
> > > > + */
> > > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > > +{
> > > > + unsigned long addr;
> > > > +
> > > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > > > + spinlock_t *ptl;
> > > > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > > + pud_t *pud = pud_offset(pgd, addr);
> > > > + pmd_t *pmd = pmd_offset(pud, addr);
> > > > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > +
> > > > + if (pte_dirty(*pte) && pte_write(*pte))
> > > > + *pte = pte_wrprotect(*pte);
> > > > + pte_unmap_unlock(pte, ptl);
> > > > + }
> > > > +}
> > >
> > > What about ram based filesystems? They don't start out with read-only
> > > pte's, so I think they don't want them read-protected now either.
> > > Unless this is essential for correct mtime/ctime accounting on these
> > > filesystems (I don't think it really is). But then the mapping should
> > > start out read-only as well, otherwise the time update will only work
> > > after an msync(MS_ASYNC).
> >
> > page_mkclean() has all the needed logic for this, it also walks the rmap
> > and cleans out all other users, which I think is needed too for
> > consistencies sake:
> >
> > Process A Process B
> >
> > mmap(foo.txt) mmap(foo.txt)
> >
> > dirty page
> > dirty page
> >
> > msync(MS_ASYNC)
> >
> > dirty page
> >
> > msync(MS_ASYNC) <--- now what?!

how about:

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..a1b3fc6 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -14,6 +14,122 @@
#include <linux/syscalls.h>
#include <linux/sched.h>

+unsigned long masync_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long addr, unsigned long end)
+{
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ arch_enter_lazy_mmu_mode();
+ do {
+ pte_t ptent = *pte;
+
+ if (pte_none(ptent))
+ continue;
+
+ if (!pte_present(ptent))
+ continue;
+
+ if (pte_dirty(ptent) && pte_write(ptent)) {
+ flush_cache_page(vma, addr, pte_pfn(ptent));
+ ptent = ptep_clear_flush(vma, addr, pte);
+ ptent = pte_wrprotect(ptent);
+ set_pte_at(vma->vm_mm, addr, pte, ptent);
+ }
+ } while (pte++, addr += PAGE_SIZE, addr != end);
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(pte - 1, ptl);
+
+ return addr;
+}
+
+unsigned long masync_pmd_range(struct vm_area_struct *vma, pud_t *pud,
+ unsigned long addr, unsigned long end)
+{
+ pmd_t *pmd;
+ unsigned long next;
+
+ pmd = pmd_offset(pud, addr);
+ do {
+ next = pmd_addr_end(addr, end);
+ if (pmd_none_or_clear_bad(pmd))
+ continue;
+ next = masync_pte_range(vma, pmd, addr, next);
+ } while (pmd++, addr = next, addr != end);
+
+ return addr;
+}
+
+unsigned long masync_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
+ unsigned long addr, unsigned long end)
+{
+ pud_t *pud;
+ unsigned long next;
+
+ pud = pud_offset(pgd, addr);
+ do {
+ next = pud_addr_end(addr, end);
+ if (pud_none_or_clear_bad(pud))
+ continue;
+ next = masync_pmd_range(vma, pud, addr, next);
+ } while (pud++, addr = next, addr != end);
+
+ return addr;
+}
+
+unsigned long masync_pgd_range(struct vm_area_struct *vma, unsigned long addr, unsigned long end)
+{
+ pgd_t *pgd;
+ unsigned long next;
+
+ pgd = pgd_offset(vma->vm_mm, addr);
+ do {
+ next = pgd_addr_end(addr, end);
+ if (pgd_none_or_clear_bad(pgd))
+ continue;
+ next = masync_pud_range(vma, pgd, addr, next);
+ } while (pgd++, addr = next, addr != end);
+
+ return addr;
+}
+
+int masync_vma_one(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ if (start < vma->vm_start)
+ start = vma->vm_start;
+
+ if (end > vma->vm_end)
+ end = vma->vm_end;
+
+ masync_pgd_range(vma, start, end);
+
+ return 0;
+}
+
+int masync_vma(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ struct address_space *mapping;
+ struct vm_area_struct *vma_iter;
+
+ if (!(vma->vm_flags & VM_SHARED))
+ return 0;
+
+ mapping = vma->vm_file->f_mapping;
+
+ if (!mapping_cap_account_dirty(mapping))
+ return 0;
+
+ spin_lock(&mapping->i_mmap_lock);
+ vma_prio_tree_foreach(vma_iter, &iter, &mapping->i_mmap, start, end)
+ masync_vma_one(vma_iter, start, end);
+ spin_unlock(&mapping->i_mmap_lock);
+
+ return 0;
+}
+
/*
* MS_SYNC syncs the entire file - including mappings.
*
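
Presumably the MS_ASYNC branch of sys_msync() would then hand the range
to masync_vma() instead of touching the PTEs directly. A sketch of that
call site, which is an assumption and not part of the draft above:

        /* Hypothetical call site inside the sys_msync() VMA loop,
         * before start is advanced past the current VMA. */
        if (vma->vm_file && (vma->vm_flags & VM_SHARED) &&
            (flags & MS_ASYNC))
                error = masync_vma(vma, start, end);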


2008-01-18 11:17:23

by Miklos Szeredi

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

> diff --git a/mm/msync.c b/mm/msync.c
> index 144a757..a1b3fc6 100644
> --- a/mm/msync.c
> +++ b/mm/msync.c
> @@ -14,6 +14,122 @@
> #include <linux/syscalls.h>
> #include <linux/sched.h>
>
> +unsigned long masync_pte_range(struct vm_area_struct *vma, pmd_t *pdm,
> + unsigned long addr, unsigned long end)
> +{
> + pte_t *pte;
> + spinlock_t *ptl;
> +
> + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> + arch_enter_lazy_mmu_mode();
> + do {
> + pte_t ptent = *pte;
> +
> + if (pte_none(ptent))
> + continue;
> +
> + if (!pte_present(ptent))
> + continue;
> +
> + if (pte_dirty(ptent) && pte_write(ptent)) {
> + flush_cache_page(vma, addr, pte_pfn(ptent));

Hmm, I'm not sure flush_cache_page() is needed. Or does dirty
data in the cache somehow interfere with the page protection?

> + ptent = ptep_clear_flush(vma, addr, pte);
> + ptent = pte_wrprotect(ptent);
> + set_pte_at(vma->vm_mnm, addr, pte, ptent);
> + }
> + } while (pte++, addr += PAGE_SIZE, addr != end);
> + arch_leave_lazy_mmu_mode();
> + pte_unmap_unlock(pte - 1, ptl);
> +
> + return addr;
> +}
> +
> +unsigned long masync_pmd_range(struct vm_area_struct *vma, pud_t *pud,
> + unsigned long addr, unsigned long end)
> +{
> + pmd_t *pmd;
> + unsigned long next;
> +
> + pmd = pmd_offset(pud, addr);
> + do {
> + next = pmd_addr_end(addr, end);
> + if (pmd_none_or_clear_bad(pmd))
> + continue;
> + next = masync_pte_range(vma, pmd, addr, next);
> + } while (pmd++, addr = next, addr != end);
> +
> + return addr;
> +}
> +
> +unsigned long masync_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
> + unsigned long addr, unsigned long end)
> +{
> + pud_t *pud;
> + unsigned long next;
> +
> + pud = pud_offset(pgd, addr);
> + do {
> + next = pud_addr_end(addr, end);
> + if (pud_none_or_clear_bad(pud))
> + continue;
> + next = masync_pmd_range(vma, pud, addr, next);
> + } while (pud++, addr = next, addr != end);
> +
> + return addr;
> +}
> +
> +unsigned long masync_pgd_range()
> +{
> + pgd_t *pgd;
> + unsigned long next;
> +
> + pgd = pgd_offset(vma->vm_mm, addr);
> + do {
> + next = pgd_addr_end(addr, end);
> + if (pgd_none_of_clear_bad(pgd))
> + continue;
> + next = masync_pud_range(vma, pgd, addr, next);
> + } while (pgd++, addr = next, addr != end);
> +
> + return addr;
> +}
> +
> +int masync_vma_one(struct vm_area_struct *vma,
> + unsigned long start, unsigned long end)
> +{
> + if (start < vma->vm_start)
> + start = vma->vm_start;
> +
> + if (end > vma->vm_end)
> + end = vma->vm_end;
> +
> + masync_pgd_range(vma, start, end);
> +
> + return 0;
> +}
> +
> +int masync_vma(struct vm_area_struct *vma,
> + unsigned long start, unsigned long end)
> +{
> + struct address_space *mapping;
> + struct vm_area_struct *vma_iter;
> +
> + if (!(vma->vm_flags & VM_SHARED))
> + return 0;
> +
> + mapping = vma->vm_file->f_mapping;
> +
> + if (!mapping_cap_account_dirty(mapping))
> + return 0;
> +
> + spin_lock(&mapping->i_mmap_lock);
> + vma_prio_tree_foreach(vma_iter, &iter, &mapping->i_mmap, start, end)
> + masync_vma_one(vma_iter, start, end);
> + spin_unlock(&mapping->i_mmap_lock);

This is holding i_mmap_lock for possibly quite a long time. Isn't that
going to cause problems?

Miklos

> +
> + return 0;
> +}
> +
> /*
> * MS_SYNC syncs the entire file - including mappings.
> *
>
>
>
>

2008-01-18 11:24:19

by Peter Zijlstra

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files


On Fri, 2008-01-18 at 12:17 +0100, Miklos Szeredi wrote:
> > diff --git a/mm/msync.c b/mm/msync.c
> > index 144a757..a1b3fc6 100644
> > --- a/mm/msync.c
> > +++ b/mm/msync.c
> > @@ -14,6 +14,122 @@
> > #include <linux/syscalls.h>
> > #include <linux/sched.h>
> >
> > +unsigned long masync_pte_range(struct vm_area_struct *vma, pmd_t *pdm,
> > + unsigned long addr, unsigned long end)
> > +{
> > + pte_t *pte;
> > + spinlock_t *ptl;
> > +
> > + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > + arch_enter_lazy_mmu_mode();
> > + do {
> > + pte_t ptent = *pte;
> > +
> > + if (pte_none(ptent))
> > + continue;
> > +
> > + if (!pte_present(ptent))
> > + continue;
> > +
> > + if (pte_dirty(ptent) && pte_write(ptent)) {
> > + flush_cache_page(vma, addr, pte_pfn(ptent));
>
> Hmm, I'm not sure flush_cache_page() is needed. Or does does dirty
> data in the cache somehow interfere with the page protection?

No, just being paranoid..

> > + ptent = ptep_clear_flush(vma, addr, pte);
> > + ptent = pte_wrprotect(ptent);
> > + set_pte_at(vma->vm_mnm, addr, pte, ptent);
> > + }
> > + } while (pte++, addr += PAGE_SIZE, addr != end);
> > + arch_leave_lazy_mmu_mode();
> > + pte_unmap_unlock(pte - 1, ptl);
> > +
> > + return addr;
> > +}
> > +
> > +unsigned long masync_pmd_range(struct vm_area_struct *vma, pud_t *pud,
> > + unsigned long addr, unsigned long end)
> > +{
> > + pmd_t *pmd;
> > + unsigned long next;
> > +
> > + pmd = pmd_offset(pud, addr);
> > + do {
> > + next = pmd_addr_end(addr, end);
> > + if (pmd_none_or_clear_bad(pmd))
> > + continue;
> > + next = masync_pte_range(vma, pmd, addr, next);
> > + } while (pmd++, addr = next, addr != end);
> > +
> > + return addr;
> > +}
> > +
> > +unsigned long masync_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
> > + unsigned long addr, unsigned long end)
> > +{
> > + pud_t *pud;
> > + unsigned long next;
> > +
> > + pud = pud_offset(pgd, addr);
> > + do {
> > + next = pud_addr_end(addr, end);
> > + if (pud_none_or_clear_bad(pud))
> > + continue;
> > + next = masync_pmd_range(vma, pud, addr, next);
> > + } while (pud++, addr = next, addr != end);
> > +
> > + return addr;
> > +}
> > +
> > +unsigned long masync_pgd_range()
> > +{
> > + pgd_t *pgd;
> > + unsigned long next;
> > +
> > + pgd = pgd_offset(vma->vm_mm, addr);
> > + do {
> > + next = pgd_addr_end(addr, end);
> > + if (pgd_none_of_clear_bad(pgd))
> > + continue;
> > + next = masync_pud_range(vma, pgd, addr, next);
> > + } while (pgd++, addr = next, addr != end);
> > +
> > + return addr;
> > +}
> > +
> > +int masync_vma_one(struct vm_area_struct *vma,
> > + unsigned long start, unsigned long end)
> > +{
> > + if (start < vma->vm_start)
> > + start = vma->vm_start;
> > +
> > + if (end > vma->vm_end)
> > + end = vma->vm_end;
> > +
> > + masync_pgd_range(vma, start, end);
> > +
> > + return 0;
> > +}
> > +
> > +int masync_vma(struct vm_area_struct *vma,
> > + unsigned long start, unsigned long end)
> > +{
> > + struct address_space *mapping;
> > + struct vm_area_struct *vma_iter;
> > +
> > + if (!(vma->vm_flags & VM_SHARED))
> > + return 0;
> > +
> > + mapping = vma->vm_file->f_mapping;
> > +
> > + if (!mapping_cap_account_dirty(mapping))
> > + return 0;
> > +
> > + spin_lock(&mapping->i_mmap_lock);
> > + vma_prio_tree_foreach(vma_iter, &iter, &mapping->i_mmap, start, end)
> > + masync_vma_one(vma_iter, start, end);
> > + spin_unlock(&mapping->i_mmap_lock);
>
> This is hoding i_mmap_lock for possibly quite long. Isn't that going
> to cause problems?

Possibly, I didn't see a quick way to break that iteration.
From a quick glance at prio_tree.c the iterator isn't valid anymore
after releasing i_mmap_lock. Fixing that would be... 'fun'.

I also realized I forgot to copy/paste the prio_tree_iter declaration
and ought to make all these functions static.

But for a quick draft it conveys the idea pretty well, I guess :-)

2008-01-18 11:36:33

by Miklos Szeredi

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

> Possibly, I didn't see a quick way to break that iteration.
> From a quick glance at prio_tree.c the iterator isn't valid anymore
> after releasing i_mmap_lock. Fixing that would be,.. 'fun'.

Maybe i_mmap_lock isn't needed at all, since msync holds mmap_sem,
which protects the prio tree as well, no?

> I also realized I forgot to copy/paste the prio_tree_iter declaration
> and ought to make all these functions static.
>
> But for a quick draft it conveys the idea pretty well, I guess :-)

Yes :)

There could also be nasty performance corner cases, like having a huge
file mapped thousands of times, and having only a couple of pages
dirtied between MS_ASYNC invocations. Then most of that page table
walking would be just unnecessary overhead.

There's something to be said for walking only the dirty pages, and
doing page_mkclean on them, even if in some cases that would be
slower.

But I have a strong feeling of deja vu, and last time it ended with
Andrew not liking the whole thing...

Miklos

2008-01-18 18:00:25

by Linus Torvalds

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files



On Fri, 18 Jan 2008, Peter Zijlstra wrote:
>
> Bah, and will break on s390... so we'd need a page_mkclean() variant
> that doesn't actually clear dirty.

No, we simply want to not play all these very expensive games with dirty
in the first place.

Guys, mmap access times aren't important enough for this. It's not
specified closely enough, and people don't care enough.

Of the patches around so far, the best one by far seems to be the simple
four-liner from Miklos.

And even in that four-liner, I suspect that the *last* two lines are
actually incorrect: there's no point in updating the file time when the
page *becomes* dirty; we should update the file time when it is marked
clean, and "msync(MS_SYNC)" should update it as part of *that*.

So I think the file time update should be part of just the page writeout
logic, not driven by msync() or page faulting itself or anything like that.

Linus

2008-01-18 18:12:08

by Miklos Szeredi

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

> And even in that four-liner, I suspect that the *last* two lines are
> actually incorrect: there's no point in updating the file time when the
> page *becomes* dirty,

Actually all four lines do that. The first two for a write access on
a present, read-only pte, the other two for a write on a non-present
pte.

> we should update the file time when it is marked
> clean, and "msync(MS_SYNC)" should update it as part of *that*.

That would need a new page flag (PG_mmap_dirty?). Do we have one
available?

Miklos

2008-01-18 18:29:39

by Rik van Riel

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

On Fri, 18 Jan 2008 19:11:47 +0100
Miklos Szeredi <[email protected]> wrote:

> > And even in that four-liner, I suspect that the *last* two lines are
> > actually incorrect: there's no point in updating the file time when the
> > page *becomes* dirty,
>
> Actually all four lines do that. The first two for a write access on
> a present, read-only pte, the other two for a write on a non-present
> pte.
>
> > we should update the file time when it is marked
> > clean, and "msync(MS_SYNC)" should update it as part of *that*.
>
> That would need a new page flag (PG_mmap_dirty?). Do we have one
> available?

I thought the page writing stuff looked at (and cleared) the pte
dirty bit, too?

--
All rights reversed.

2008-01-18 18:45:35

by Linus Torvalds

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files



On Fri, 18 Jan 2008, Miklos Szeredi wrote:
>
> That would need a new page flag (PG_mmap_dirty?). Do we have one
> available?

Yeah, that would be bad. We probably have flags free, but those page flags
are always a pain. Scratch that.

How about just setting a per-vma dirty flag, and then instead of updating
the mtime when taking the dirty-page fault, we just set that flag?

Then, on unmap and msync, we just do

        if (vma->dirty_flag) {
                vma->dirty_flag = 0;
                update_file_times(vma->vm_file);
        }

and be done with it?


Linus

2008-01-18 18:51:57

by Miklos Szeredi

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

> > > And even in that four-liner, I suspect that the *last* two lines are
> > > actually incorrect: there's no point in updating the file time when the
> > > page *becomes* dirty,
> >
> > Actually all four lines do that. The first two for a write access on
> > a present, read-only pte, the other two for a write on a non-present
> > pte.
> >
> > > we should update the file time when it is marked
> > > clean, and "msync(MS_SYNC)" should update it as part of *that*.
> >
> > That would need a new page flag (PG_mmap_dirty?). Do we have one
> > available?
>
> I thought the page writing stuff looked at (and cleared) the pte
> dirty bit, too?

Yeah, it does. Hmm...

What happens on munmap? The times _could_ get updated from there as
well, but it's getting complicated.

Miklos

2008-01-18 18:57:34

by Miklos Szeredi

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

> > That would need a new page flag (PG_mmap_dirty?). Do we have one
> > available?
>
> Yeah, that would be bad. We probably have flags free, but those page flags
> are always a pain. Scratch that.
>
> How about just setting a per-vma dirty flag, and then instead of updating
> the mtime when taking the dirty-page fault, we just set that flag?
>
> Then, on unmap and msync, we just do
>
> if (vma->dirty-flag) {
> vma->dirty_flag = 0;
> update_file_times(vma->vm_file);
> }
>
> and be done with it?

But then background writeout, sync(2), etc, wouldn't update the times.
Dunno. I don't think actual _physical_ writeout matters much, so it's
not worse to be 30s early with the timestamp, than to be 30s or more
late.

Miklos

2008-01-18 19:11:01

by Linus Torvalds

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files



On Fri, 18 Jan 2008, Miklos Szeredi wrote:
>
> But then background writeout, sync(2), etc, wouldn't update the times.

Sure it would, but only when doing the final unmap.

Did you miss the "on unmap and msync" part?

Linus

2008-01-18 19:22:49

by Miklos Szeredi

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

> >
> > But then background writeout, sync(2), etc, wouldn't update the times.
>
> Sure it would, but only when doing the final unmap.
>
> Did you miss the "on unmap and msync" part?

No :)

What I'm saying is that the times could be left un-updated for a long
time if the program doesn't do munmap() or msync(MS_SYNC) for a long time.

If program has this pattern:

mmap()
write to map
msync(MS_ASYNC)
sleep(long)
write to map
msync(MS_ASYNC)
sleep(long)
...

Then we'd never see time updates (until the program exits, but that
could be years).

Maybe this doesn't matter; I'm just saying this is a disadvantage
compared to the "update on first dirtying" approach, which would
ensure that times are updated at least once per 30s.

Miklos

2008-01-18 19:37:28

by Linus Torvalds

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files



On Fri, 18 Jan 2008, Miklos Szeredi wrote:
>
> What I'm saying is that the times could be left un-updated for a long
> time if program doesn't do munmap() or msync(MS_SYNC) for a long time.

Sure.

But in those circumstances, the programmer cannot depend on the mtime
*anyway* (because there is no synchronization), so what's the downside?

Let's face it, there's exactly three possible solutions:

- the insane one: trap EVERY SINGLE instruction that does a write to the
page, and update mtime each and every time.

This one is so obviously STUPID that it's not even worth discussing
further, except to say that "yes, there is an 'exact' algorithm, but
no, we are never EVER going to use it".

- the non-exact solutions that don't give you mtime updates every time
a write to the page happens, but give *some* guarantees for things that
will update it.

This is the one I think we can do, and the only things a programmer can
impact using it are "msync()" and "munmap()", since no other operations
really have anything to do with it in a programmer-visible way (ie a
normal "sync" operation may happen in the background and has no
program-relevant timing information)

Other things *may* or may not update mtime (some filesystems - take
most networked ones as an example - will *always* update mtime on the
server on writeback, so we cannot ever guarantee that nothing but
msync/munmap does so), but at least we'll have a minimum set of things
that people can depend on.

- the "we don't care at all solutions".

mmap(MAP_WRITE) doesn't really update times reliably after the write
has happened (but might do it *before* - maybe the mmap() itself does).

Those are the three choices, I think. We currently approximate #3. We
*can* do #2 (and there are various flavors of it). And even *aiming* for
#1 is totally insane and stupid.

Linus

2008-01-18 19:48:49

by Anton Salikhmetov

Subject: Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times

2008/1/18, Miklos Szeredi <[email protected]>:
> > 4. Performance test was done using the program available from the
> > following link:
> >
> > http://bugzilla.kernel.org/attachment.cgi?id=14493
> >
> > Result: the impact of the changes was negligible for files of a few
> > hundred megabytes.
>
> Could you also test with ext4 and post some numbers? Afaik, ext4 uses
> nanosecond timestamps, so the time updating code would be exercised
> more during the page faults.
>
> What about performance impact on msync(MS_ASYNC)? Could you please do
> some measurment of that as well?

Did a quick test on an ext4 partition. This is what it looks like:

debian:~/miklos# ./miklos_test /mnt/file
begin 1200662360 1200662360 1200662353
write 1200662361 1200662361 1200662353
mmap 1200662361 1200662361 1200662362
b 1200662363 1200662363 1200662362
msync b 1200662363 1200662363 1200662362
c 1200662365 1200662365 1200662362
msync c 1200662365 1200662365 1200662362
d 1200662367 1200662367 1200662362
munmap 1200662367 1200662367 1200662362
close 1200662367 1200662367 1200662362
sync 1200662367 1200662367 1200662362
debian:~/miklos# mount | grep /mnt
/root/image.ext4 on /mnt type ext4dev (rw,loop=/dev/loop0)

> What about performance impact on msync(MS_ASYNC)? Could you please do
> some measurment of that as well?

Following are the results of the measurements. Here are the relevant
portions of the test program:

>>>

#define FILE_SIZE (1024 * 1024 * 512)

p = mmap(0, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

/* Bring the pages in */
for (i = 0; i < FILE_SIZE; i += 4096)
tmp = p[i];

/* Dirty the pages */
for (i = 0; i < FILE_SIZE; i += 4096)
p[i] = i;

/* How long did we spend in msync(MS_ASYNC)? */
gettimeofday(&tv_start, NULL);
msync(p, FILE_SIZE, MS_ASYNC);
gettimeofday(&tv_stop, NULL);

<<<
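
A self-contained version of that measurement might look roughly as
follows; the file argument, the error handling, and the volatile read
are assumptions added only to make the sketch compile:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <unistd.h>

#define FILE_SIZE (1024UL * 1024 * 512)

int main(int argc, char **argv)
{
        struct timeval tv_start, tv_stop;
        volatile char tmp;
        unsigned long i;
        long usec;
        char *p;
        int fd;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDWR);
        if (fd < 0)
                return 1;

        p = mmap(0, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        /* Bring the pages in */
        for (i = 0; i < FILE_SIZE; i += 4096)
                tmp = p[i];
        (void)tmp;

        /* Dirty the pages */
        for (i = 0; i < FILE_SIZE; i += 4096)
                p[i] = i;

        /* How long did we spend in msync(MS_ASYNC)? */
        gettimeofday(&tv_start, NULL);
        msync(p, FILE_SIZE, MS_ASYNC);
        gettimeofday(&tv_stop, NULL);

        usec = (tv_stop.tv_sec - tv_start.tv_sec) * 1000000L +
               (tv_stop.tv_usec - tv_start.tv_usec);
        printf("msync(MS_ASYNC) took %ld usec\n", usec);
        return 0;
}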

For reference tests, the following platforms were used:

1. HP-UX B.11.31, PA-RISC 8800 processor (800 MHz, 64 MB), Memory: 4 GB.

2. HP-UX B.11.31, 2 Intel(R) Itanium 2 9000 series processors (1.59 GHz, 18 MB),
Memory: 15.98 GB.

3. FreeBSD 6.2-RELEASE, Intel(R) Pentium(R) III CPU family 1400MHz, 2 CPUs.
Memory: 4G.

The tests of my solution were performed using the following platform:

A KVM x86_64 guest OS, current Git kernel. Host system: Intel(R) Core(TM)2
Duo CPU T7300 @ 2.00GHz. Further referred to as "the first case".

The following tables give the time difference between the two calls
to gettimeofday(). The test program was run three times in a row
with a delay of one second between consecutive runs. On Linux
systems, the following commands were issued prior to running the
tests:

echo 80 >/proc/sys/vm/dirty_ratio
echo 80 >/proc/sys/vm/dirty_background_ratio
echo 30000 >/proc/sys/vm/dirty_expire_centisecs
sync

Table 1. Reference platforms.

------------------------------------------------------------
| | HP-UX/PA-RISC | HP-UX/Itanium | FreeBSD |
------------------------------------------------------------
| First run | 263405 usec | 202283 usec | 90 SECONDS |
------------------------------------------------------------
| Second run | 262253 usec | 172837 usec | 90 SECONDS |
------------------------------------------------------------
| Third run | 238465 usec | 238465 usec | 90 SECONDS |
------------------------------------------------------------

It looks like FreeBSD is a clear outlier here. Note that FreeBSD
showed an almost linear dependence of the time spent in the
msync(MS_ASYNC) call on the file size.

Table 2. The Qemu system. File size is 512M.

---------------------------------------------------
| | Before the patch | After the patch |
---------------------------------------------------
| First run | 35 usec | 5852 usec |
---------------------------------------------------
| Second run | 35 usec | 4444 usec |
---------------------------------------------------
| Third run | 35 usec | 6330 usec |
---------------------------------------------------

I think that the data above prove the viability of the solution I
presented. Also, I guess that this bug fix is most probably ready
to go upstream.

Please apply the sixth version of my solution.

>
> Thanks,
> Miklos
>
>

2008-01-18 19:59:01

by Anton Salikhmetov

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008/1/18, Linus Torvalds <[email protected]>:
>
>
> On Fri, 18 Jan 2008, Miklos Szeredi wrote:
> >
> > What I'm saying is that the times could be left un-updated for a long
> > time if program doesn't do munmap() or msync(MS_SYNC) for a long time.
>
> Sure.
>
> But in those circumstances, the programmer cannot depend on the mtime
> *anyway* (because there is no synchronization), so what's the downside?
>
> Let's face it, there's exactly three possible solutions:
>
> - the insane one: trap EVERY SINGLE instruction that does a write to the
> page, and update mtime each and every time.
>
> This one is so obviously STUPID that it's not even worth discussing
> further, except to say that "yes, there is an 'exact' algorithm, but
> no, we are never EVER going to use it".
>
> - the non-exact solutions that don't give you mtime updates every time
> a write to the page happens, but give *some* guarantees for things that
> will update it.
>
> This is the one I think we can do, and the only things a programmer can
> impact using it is "msync()" and "munmap()", since no other operations
> really have any thing to do with it in a programmer-visible way (ie a
> normal "sync" operation may happen in the background and has no
> progam-relevant timing information)
>
> Other things *may* or may not update mtime (some filesystems - take
> most networked one as an example - will *always* update mtime on the
> server on writeback, so we cannot ever guarantee that nothing but
> msync/munmap does so), but at least we'll have a minimum set of things
> that people can depend on.
>
> - the "we don't care at all solutions".
>
> mmap(MAP_WRITE) doesn't really update times reliably after the write
> has happened (but might do it *before* - maybe the mmap() itself does).
>
> Those are the three choices, I think. We currently approximate #3. We
> *can* do #2 (and there are various flavors of it). And even *aiming* for
> #1 is totally insane and stupid.

The current solution doesn't hurt performance at all when compared to
competing POSIX-compliant systems. It is faster and does even more
than the POSIX standard requires.

Please see the test results I've sent to the thread "-v6 0/2":

http://lkml.org/lkml/2008/1/18/447

I guess the current solution is ready to use.

>
> Linus
>

2008-01-18 20:23:48

by Linus Torvalds

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files



On Fri, 18 Jan 2008, Anton Salikhmetov wrote:
>
> The current solution doesn't hit the performance at all when compared to
> the competitor POSIX-compliant systems. It is faster and does even more
> than the POSIX standard requires.

Your current patches have two problems:
- they are simply unnecessarily invasive for a relatively simple issue
- all versions I've looked at closer are buggy too

Example:

+ if (pte_dirty(*pte) && pte_write(*pte))
+ *pte = pte_wrprotect(*pte);

Uhhuh. Looks simple enough. Except it does a non-atomic pte access while
other CPUs may be accessing it and updating it from their hw page table
walkers. What will happen? Who knows? I can see lost access bits at a
minimum.

IOW, this isn't simple code. It's code that is simple to screw up. In
this case, you really need to use ptep_set_wrprotect(), for example.
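
For reference, a minimal sketch of what the loop body might look like
with the atomic helper; the pte_offset_map_lock() framing from the
original patch is assumed to stay around this snippet:

        /* Sketch only: write-protect a present, dirty, writable PTE
         * without open-coding the pte update, so concurrent hardware
         * walkers on other CPUs cannot lose accessed/dirty bits. */
        if (pte_present(*pte) && pte_dirty(*pte) && pte_write(*pte)) {
                flush_cache_page(vma, addr, pte_pfn(*pte));
                ptep_set_wrprotect(vma->vm_mm, addr, pte);
                flush_tlb_page(vma, addr);
        }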

So why not do it in many fewer lines with that simpler vma->dirty flag?

Linus

2008-01-18 21:03:25

by Anton Salikhmetov

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008/1/18, Linus Torvalds <[email protected]>:
>
>
> On Fri, 18 Jan 2008, Anton Salikhmetov wrote:
> >
> > The current solution doesn't hit the performance at all when compared to
> > the competitor POSIX-compliant systems. It is faster and does even more
> > than the POSIX standard requires.
>
> Your current patches have two problems:
> - they are simply unnecessarily invasive for a relatively simple issue
> - all versions I've looked at closer are buggy too
>
> Example:
>
> + if (pte_dirty(*pte) && pte_write(*pte))
> + *pte = pte_wrprotect(*pte);
>
> Uhhuh. Looks simple enough. Except it does a non-atomic pte access while
> other CPU's may be accessing it and updating it from their hw page table
> walkers. What will happen? Who knows? I can see lost access bits at a
> minimum.
>
> IOW, this isn't simple code. It's code that it is simple to screw up. In
> this case, you really need to use ptep_set_wrprotect(), for example.

Before using pte_wrprotect() the vma_wrprotect() routine uses the
pte_offset_map_lock() macro to get the PTE and to acquire the ptl
spinlock. Why did you say that this code was not SMP-safe? It should
be atomic, I think.


>
> So why not do it in many fewer lines with that simpler vma->dirty flag?

Neither the dirty flag you suggest nor the AS_MCTIME flag I
introduced in my previous solutions solves the following problem:

- mmap()
- a write reference
- msync() with MS_ASYNC
- a write reference
- msync() with MS_ASYNC

The POSIX standard requires the ctime and mtime stamps to be updated
not later than at the second call to msync() with the MS_ASYNC flag.

Some other POSIX-compliant operating systems, such as HP-UX and FreeBSD,
satisfy this POSIX requirement. Linux does not.

>
> Linus
>

2008-01-18 21:28:39

by Linus Torvalds

Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files



On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
>
> Before using pte_wrprotect() the vma_wrprotect() routine uses the
> pte_offset_map_lock() macro to get the PTE and to acquire the ptl
> spinlock. Why did you say that this code was not SMP-safe? It should
> be atomic, I think.

It's atomic WITH RESPECT TO OTHER PEOPLE WHO GET THE LOCK.

Guess how much another x86 CPU cares when it sets the accessed bit in
hardware?

> The POSIX standard requires the ctime and mtime stamps to be updated
> not later than at the second call to msync() with the MS_ASYNC flag.

.. and that is no excuse for bad code.

Linus

2008-01-18 22:05:00

by Anton Salikhmetov

[permalink] [raw]
Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008/1/19, Linus Torvalds <[email protected]>:
>
>
> On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
> >
> > Before using pte_wrprotect() the vma_wrprotect() routine uses the
> > pte_offset_map_lock() macro to get the PTE and to acquire the ptl
> > spinlock. Why did you say that this code was not SMP-safe? It should
> > be atomic, I think.
>
> It's atomic WITH RESPECT TO OTHER PEOPLE WHO GET THE LOCK.
>
> Guess how much another x86 CPU cares when it sets the accessed bit in
> hardware?

Thank you very much for taking part in this discussion. It's very
important to me personally. But I'm not sure I understand which bit
can be lost.

Please let me explain.

The logic for my vma_wrprotect() routine was taken from the
page_check_address() function in mm/rmap.c. Here is a code snippet of
the latter function:

	pgd = pgd_offset(mm, address);
	if (!pgd_present(*pgd))
		return NULL;

	pud = pud_offset(pgd, address);
	if (!pud_present(*pud))
		return NULL;

	pmd = pmd_offset(pud, address);
	if (!pmd_present(*pmd))
		return NULL;

	pte = pte_offset_map(pmd, address);
	/* Make a quick check before getting the lock */
	if (!pte_present(*pte)) {
		pte_unmap(pte);
		return NULL;
	}

	ptl = pte_lockptr(mm, pmd);
	spin_lock(ptl);
	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
		*ptlp = ptl;
		return pte;
	}
	pte_unmap_unlock(pte, ptl);

The page_check_address() function is called from the
page_mkclean_one() routine as follows:

	pte = page_check_address(page, mm, address, &ptl);
	if (!pte)
		goto out;

	if (pte_dirty(*pte) || pte_write(*pte)) {
		pte_t entry;

		flush_cache_page(vma, address, pte_pfn(*pte));
		entry = ptep_clear_flush(vma, address, pte);
		entry = pte_wrprotect(entry);
		entry = pte_mkclean(entry);
		set_pte_at(mm, address, pte, entry);
		ret = 1;
	}

	pte_unmap_unlock(pte, ptl);

The write-protection of the PTE is done using the pte_wrprotect()
helper. I intended to do the same during msync() with MS_ASYNC. I
understand that I'm taking the risk of looking like a complete idiot
now; however, I don't see any difference between the two situations.

I presumed that the code in mm/rmap.c was absolutely correct, which is
why I basically reused its design.

>
> > The POSIX standard requires the ctime and mtime stamps to be updated
> > not later than at the second call to msync() with the MS_ASYNC flag.
>
> .. and that is no excuse for bad code.
>
> Linus
>

2008-01-18 22:22:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files



On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
>
> The page_check_address() function is called from the
> page_mkclean_one() routine as follows:

.. and the page_mkclean_one() function is totally different.

Lookie here, this is the correct and complex sequence:

> entry = ptep_clear_flush(vma, address, pte);
> entry = pte_wrprotect(entry);
> entry = pte_mkclean(entry);
> set_pte_at(mm, address, pte, entry);

That's a rather expensive sequence, but it's done exactly because it has
to be done that way. What it does is to

 - *atomically* load the pte entry _and_ clear the old one in memory.

   That's the

	entry = ptep_clear_flush(vma, address, pte);

   thing, and it basically means that it's doing some
   architecture-specific magic to make sure that another CPU that accesses
   the PTE at the same time will never actually modify the pte (because
   it's clear and not valid).

 - it then - while the page table is actually clear and invalid - takes
   the old value and turns it into the new one:

	entry = pte_wrprotect(entry);
	entry = pte_mkclean(entry);

 - and finally, it replaces the entry with the new one:

	set_pte_at(mm, address, pte, entry);

   which takes care to write the new entry in some specific way that is
   atomic wrt other CPU's (ie on 32-bit x86 with a 64-bit page table
   entry it writes the high word first, see the write barriers in
   "native_set_pte()" in include/asm-x86/pgtable-3level.h).

Now, compare that subtle and correct thing with what is *not* correct:

	if (pte_dirty(*pte) && pte_write(*pte))
		*pte = pte_wrprotect(*pte);

which makes no effort at all to make sure that it's safe in case another
CPU updates the accessed bit.

Now, arguably it's unlikely to cause horrible problems at least on x86,
because:

- we only do this if the pte is already marked dirty, so while we can
lose the accessed bit, we can *not* lose the dirty bit. And the
accessed bit isn't such a big deal.

- it's not doing any of the "be careful about" ordering things, but since
the really important bits aren't changing, ordering probably won't
practically matter.

But the problem is that we have something like 24 different architectures,
it's hard to make sure that none of them have issues.

In other words: it may well work in practice. But when these things go
subtly wrong, they are *really* nasty to find, and the unsafe sequence is
really not how it's supposed to be done. For example, you don't even flush
the TLB, so even if there are no cross-CPU issues, there's probably going
to be writable entries in the TLB that now don't match the page tables.

Will it matter? Again, probably impossible to see in practice. But ...

Linus
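
For reference, here is a sketch of how that clear/modify/write pattern
could be applied to a single PTE in an msync()-style walk. This is
hypothetical code, not taken from any of the posted patches; the helper
name and signature are assumptions, and unlike page_mkclean_one() it
deliberately leaves the dirty bit alone and only drops write permission:

	#include <linux/mm.h>
	#include <asm/cacheflush.h>
	#include <asm/tlbflush.h>

	/* Hypothetical helper: write-protect one present, dirty, writable
	 * PTE using the same atomic sequence as page_mkclean_one(),
	 * instead of the non-atomic "*pte = pte_wrprotect(*pte)" form. */
	static int msync_wrprotect_one(struct vm_area_struct *vma,
				       unsigned long address, pmd_t *pmd)
	{
		struct mm_struct *mm = vma->vm_mm;
		spinlock_t *ptl;
		pte_t *pte;
		int ret = 0;

		pte = pte_offset_map_lock(mm, pmd, address, &ptl);
		if (pte_present(*pte) && pte_dirty(*pte) && pte_write(*pte)) {
			pte_t entry;

			flush_cache_page(vma, address, pte_pfn(*pte));
			/* Atomically read and clear the PTE and flush the
			 * TLB, so no other CPU can race with the update. */
			entry = ptep_clear_flush(vma, address, pte);
			/* Modify the saved value while the slot is invalid... */
			entry = pte_wrprotect(entry);
			/* ...and install it with the architecture's atomic store. */
			set_pte_at(mm, address, pte, entry);
			ret = 1;
		}
		pte_unmap_unlock(pte, ptl);
		return ret;
	}

Whether MS_ASYNC should keep the dirty bit set (page_mkclean_one() also
clears it) is exactly the design question being argued in this thread.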

2008-01-18 22:29:36

by Ingo Oeser

[permalink] [raw]
Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

Hi Linus,

On Friday 18 January 2008, Linus Torvalds wrote:
> On Fri, 18 Jan 2008, Miklos Szeredi wrote:
> >
> > What I'm saying is that the times could be left un-updated for a long
> > time if program doesn't do munmap() or msync(MS_SYNC) for a long time.
>
> Sure.
>
> But in those circumstances, the programmer cannot depend on the mtime
> *anyway* (because there is no synchronization), so what's the downside?

Can we get "if the write to the page hits the disk, the mtime has hit the disk
already no less than SOME_GRANULARITY before"?

That is very important for computer forensics. Esp. in saving your ass!

Ok, now back again to making that fast :-)


Best Regards

Ingo Oeser

2008-01-18 22:35:35

by Anton Salikhmetov

[permalink] [raw]
Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008/1/19, Linus Torvalds <[email protected]>:
>
>
> On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
> >
> > The page_check_address() function is called from the
> > page_mkclean_one() routine as follows:
>
> .. and the page_mkclean_one() function is totally different.
>
> Lookie here, this is the correct and complex sequence:
>
> > entry = ptep_clear_flush(vma, address, pte);
> > entry = pte_wrprotect(entry);
> > entry = pte_mkclean(entry);
> > set_pte_at(mm, address, pte, entry);
>
> That's a rather expensive sequence, but it's done exactly because it has
> to be done that way. What it does is to
>
> - *atomically* load the pte entry _and_ clear the old one in memory.
>
> That's the
>
> entry = ptep_clear_flush(vma, address, pte);
>
> thing, and it basically means that it's doing some
> architecture-specific magic to make sure that another CPU that accesses
> the PTE at the same time will never actually modify the pte (because
> it's clear and not valid)
>
> - it then - while the page table is actually clear and invalid - takes
> the old value and turns it into the new one:
>
> entry = pte_wrprotect(entry);
> entry = pte_mkclean(entry);
>
> - and finally, it replaces the entry with the new one:
>
> set_pte_at(mm, address, pte, entry);
>
> which takes care to write the new entry in some specific way that is
> atomic wrt other CPU's (ie on 32-bit x86 with a 64-bit page table
> entry it writes the high word first, see the write barriers in
> "native_set_pte()" in include/asm-x86/pgtable-3level.h
>
> Now, compare that subtle and correct thing with what is *not* correct:
>
> if (pte_dirty(*pte) && pte_write(*pte))
> *pte = pte_wrprotect(*pte);
>
> which makes no effort at all to make sure that it's safe in case another
> CPU updates the accessed bit.
>
> Now, arguably it's unlikely to cause horrible problems at least on x86,
> because:
>
> - we only do this if the pte is already marked dirty, so while we can
> lose the accessed bit, we can *not* lose the dirty bit. And the
> accessed bit isn't such a big deal.
>
> - it's not doing any of the "be careful about" ordering things, but since
> the really important bits aren't changing, ordering probably won't
> practically matter.
>
> But the problem is that we have something like 24 different architectures,
> it's hard to make sure that none of them have issues.
>
> In other words: it may well work in practice. But when these things go
> subtly wrong, they are *really* nasty to find, and the unsafe sequence is
> really not how it's supposed to be done. For example, you don't even flush
> the TLB, so even if there are no cross-CPU issues, there's probably going
> to be writable entries in the TLB that now don't match the page tables.
>
> Will it matter? Again, probably impossible to see in practice. But ...

Linus, I am very grateful to you for your extremely clear explanation
of the issue I have overlooked!

Back to the msync() issue, I'm going to come back with a new design
for the bug fix.

Thank you once again.

Anton

>
> Linus
>

2008-01-18 22:48:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files



On Fri, 18 Jan 2008, Ingo Oeser wrote:
>
> Can we get "if the write to the page hits the disk, the mtime has hit the disk
> already no less than SOME_GRANULARITY before"?
>
> That is very important for computer forensics. Esp. in saving your ass!
>
> Ok, now back again to making that fast :-)

I certainly don't mind it if we have some tighter guarantees, but what I'd
want is:

 - keep it simple. Let's face it, Linux has never ever given those
   guarantees before, and it's not as if anybody has really cared. Even
   now, the issue seems to be more about paper standards conformance than
   anything else.

 - I get worried about people playing around with the dirty bit in
   particular. We have had some really rather nasty bugs here, most of
   which are totally impossible to trigger under normal loads (for
   example the old random-access utorrent writable mmap issue from about
   a year ago).

So these two issues - the big red danger signs flashing in my brain,
coupled with the fact that no application has apparently ever really
noticed in the last 15 years - just makes it a case where I'd like each
step of the way to be obvious and simple and no larger than really
absolutely necessary.

Linus

2008-01-18 22:55:57

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

On Fri, 18 Jan 2008 14:47:33 -0800 (PST)
Linus Torvalds <[email protected]> wrote:

> - keep it simple. Let's face it, Linux has never ever given those
> guarantees before, and it's not is if anybody has really cared. Even
> now, the issue seems to be more about paper standards conformance than
> anything else.

There is one issue which is way more than just standards conformance.

When a program changes file data through mmap(), at some point the
mtime needs to be updated so that backup programs know to back up the
new version of the file.

Backup programs not seeing an updated mtime is a really big deal.

--
All rights reversed.

2008-01-19 00:52:25

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files


On Fri, 2008-01-18 at 17:54 -0500, Rik van Riel wrote:
> On Fri, 18 Jan 2008 14:47:33 -0800 (PST)
> Linus Torvalds <[email protected]> wrote:
>
> > - keep it simple. Let's face it, Linux has never ever given those
> > guarantees before, and it's not is if anybody has really cared. Even
> > now, the issue seems to be more about paper standards conformance than
> > anything else.
>
> There is one issue which is way more than just standards conformance.
>
> When a program changes file data through mmap(), at some point the
> mtime needs to be update so that backup programs know to back up the
> new version of the file.
>
> Backup programs not seeing an updated mtime is a really big deal.

And that's fixed with the 4-line approach.

Reminds me, I've got a patch here for addressing that problem with loop mounts:

Writes to loop should update the mtime of the underlying file.

Signed-off-by: Matt Mackall <[email protected]>

Index: l/drivers/block/loop.c
===================================================================
--- l.orig/drivers/block/loop.c 2007-11-05 17:50:07.000000000 -0600
+++ l/drivers/block/loop.c 2007-11-05 19:03:51.000000000 -0600
@@ -221,6 +221,7 @@ static int do_lo_send_aops(struct loop_d
 	offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1);
 	bv_offs = bvec->bv_offset;
 	len = bvec->bv_len;
+	file_update_time(file);
 	while (len > 0) {
 		sector_t IV;
 		unsigned size;
@@ -299,6 +300,7 @@ static int __do_lo_send_write(struct fil
 
 	set_fs(get_ds());
 	bw = file->f_op->write(file, buf, len, &pos);
+	file_update_time(file);
 	set_fs(old_fs);
 	if (likely(bw == len))
 		return 0;


--
Mathematics is the supreme nostalgia of our time.

2008-01-19 04:26:49

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

On Fri, 18 Jan 2008 18:50:03 -0600
Matt Mackall <[email protected]> wrote:
> On Fri, 2008-01-18 at 17:54 -0500, Rik van Riel wrote:

> > Backup programs not seeing an updated mtime is a really big deal.
>
> And that's fixed with the 4-line approach.
>
> Reminds me, I've got a patch here for addressing that problem with loop mounts:
>
> Writes to loop should update the mtime of the underlying file.
>
> Signed-off-by: Matt Mackall <[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed.

2008-01-19 10:23:16

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

> Reminds me, I've got a patch here for addressing that problem with loop mounts:
>
> Writes to loop should update the mtime of the underlying file.
>
> Signed-off-by: Matt Mackall <[email protected]>
>
> Index: l/drivers/block/loop.c
> ===================================================================
> --- l.orig/drivers/block/loop.c 2007-11-05 17:50:07.000000000 -0600
> +++ l/drivers/block/loop.c 2007-11-05 19:03:51.000000000 -0600
> @@ -221,6 +221,7 @@ static int do_lo_send_aops(struct loop_d
> offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1);
> bv_offs = bvec->bv_offset;
> len = bvec->bv_len;
> + file_update_time(file);
> while (len > 0) {
> sector_t IV;
> unsigned size;
> @@ -299,6 +300,7 @@ static int __do_lo_send_write(struct fil
>
> set_fs(get_ds());
> bw = file->f_op->write(file, buf, len, &pos);
> + file_update_time(file);

->write should have already updated the times, no?

> set_fs(old_fs);
> if (likely(bw == len))
> return 0;
>
>

2008-01-19 10:45:49

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times

> 2008/1/18, Miklos Szeredi <[email protected]>:
> > > 4. Performance test was done using the program available from the
> > > following link:
> > >
> > > http://bugzilla.kernel.org/attachment.cgi?id=14493
> > >
> > > Result: the impact of the changes was negligible for files of a few
> > > hundred megabytes.
> >
> > Could you also test with ext4 and post some numbers? Afaik, ext4 uses
> > nanosecond timestamps, so the time updating code would be exercised
> > more during the page faults.
> >
> > What about performance impact on msync(MS_ASYNC)? Could you please do
> > some measurment of that as well?
>
> Did a quick test on an ext4 partition. This is what it looks like:

Thanks for running these tests.

I was more interested in the slowdown on ext4 (checked with the
above-mentioned program). Can you do such a test as well, and post
the resulting times with and without the patch?

> Table 1. Reference platforms.
>
> ------------------------------------------------------------
> | | HP-UX/PA-RISC | HP-UX/Itanium | FreeBSD |
> ------------------------------------------------------------
> | First run | 263405 usec | 202283 usec | 90 SECONDS |
> ------------------------------------------------------------
> | Second run | 262253 usec | 172837 usec | 90 SECONDS |
> ------------------------------------------------------------
> | Third run | 238465 usec | 238465 usec | 90 SECONDS |
> ------------------------------------------------------------
>
> It looks like FreeBSD is a clear outsider here. Note that FreeBSD
> showed an almost linear dependence of the time spent in the
> msync(MS_ASYNC) call on the file size.
>
> Table 2. The Qemu system. File size is 512M.
>
> ---------------------------------------------------
> | | Before the patch | After the patch |
> ---------------------------------------------------
> | First run | 35 usec | 5852 usec |
> ---------------------------------------------------
> | Second run | 35 usec | 4444 usec |
> ---------------------------------------------------
> | Third run | 35 usec | 6330 usec |
> ---------------------------------------------------

Interesting.

Thanks,
Miklos
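
For context, the numbers above come from timing msync(MS_ASYNC) on a
large dirty mapping. A minimal sketch of such a measurement follows
(the actual program is the bugzilla attachment linked earlier in the
thread; the file name and size here are assumptions):

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/time.h>
	#include <unistd.h>

	int main(void)
	{
		const size_t len = 512UL << 20;	/* 512M, as in Table 2 */
		int fd = open("/tmp/msync-bench", O_RDWR | O_CREAT | O_TRUNC, 0644);
		struct timeval t1, t2;
		char *p;

		ftruncate(fd, len);
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		memset(p, 0xaa, len);		/* dirty every page */

		gettimeofday(&t1, NULL);
		msync(p, len, MS_ASYNC);
		gettimeofday(&t2, NULL);

		printf("msync(MS_ASYNC): %ld usec\n",
		       (t2.tv_sec - t1.tv_sec) * 1000000L +
		       (t2.tv_usec - t1.tv_usec));

		munmap(p, len);
		close(fd);
		return 0;
	}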

2008-01-19 15:52:06

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files


On Sat, 2008-01-19 at 11:22 +0100, Miklos Szeredi wrote:
> > Reminds me, I've got a patch here for addressing that problem with loop mounts:
> >
> > Writes to loop should update the mtime of the underlying file.
> >
> > Signed-off-by: Matt Mackall <[email protected]>
> >
> > Index: l/drivers/block/loop.c
> > ===================================================================
> > --- l.orig/drivers/block/loop.c 2007-11-05 17:50:07.000000000 -0600
> > +++ l/drivers/block/loop.c 2007-11-05 19:03:51.000000000 -0600
> > @@ -221,6 +221,7 @@ static int do_lo_send_aops(struct loop_d
> > offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1);
> > bv_offs = bvec->bv_offset;
> > len = bvec->bv_len;
> > + file_update_time(file);
> > while (len > 0) {
> > sector_t IV;
> > unsigned size;
> > @@ -299,6 +300,7 @@ static int __do_lo_send_write(struct fil
> >
> > set_fs(get_ds());
> > bw = file->f_op->write(file, buf, len, &pos);
> > + file_update_time(file);
>
> ->write should have already updated the times, no?

Yes, this second case is redundant. Still needed in the first case.
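
In other words the patch presumably reduces to the first hunk only; a
sketch of the reduced diff (same base file as above, header timestamps
omitted):

Index: l/drivers/block/loop.c
===================================================================
--- l.orig/drivers/block/loop.c
+++ l/drivers/block/loop.c
@@ -221,6 +221,7 @@ static int do_lo_send_aops(struct loop_d
 	offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1);
 	bv_offs = bvec->bv_offset;
 	len = bvec->bv_len;
+	file_update_time(file);
 	while (len > 0) {
 		sector_t IV;
 		unsigned size;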

--
Mathematics is the supreme nostalgia of our time.

2008-01-21 14:26:22

by Peter Staubach

[permalink] [raw]
Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

Linus Torvalds wrote:
> On Fri, 18 Jan 2008, Ingo Oeser wrote:
>
>> Can we get "if the write to the page hits the disk, the mtime has hit the disk
>> already no less than SOME_GRANULARITY before"?
>>
>> That is very important for computer forensics. Esp. in saving your ass!
>>
>> Ok, now back again to making that fast :-)
>>
>
> I certainly don't mind it if we have some tighter guarantees, but what I'd
> want is:
>
> - keep it simple. Let's face it, Linux has never ever given those
> guarantees before, and it's not is if anybody has really cared. Even
> now, the issue seems to be more about paper standards conformance than
> anything else.
>
>

I have been working on getting something supported here because
I have some very large Wall Street customers who do care about
getting the mtime updated, because their backups are getting
corrupted. The backups are incomplete: although their applications
update the files, they don't get backed up, because the mtime
never changes.

> - I get worried about people playing around with the dirty bit in
> particular. We have had some really rather nasty bugs here. Most of
> which are totally impossible to trigger under normal loads (for
> example the old random-access utorrent writable mmap issue from about
> a year ago).
>
> So these two issues - the big red danger signs flashing in my brain,
> coupled with the fact that no application has apparently ever really
> noticed in the last 15 years - just makes it a case where I'd like each
> step of the way to be obvious and simple and no larger than really
> absolutely necessary.

Simple is good. However, too simple is not good. I would suggest
that we implement file time updates which make sense and if they
happen to follow POSIX, then nifty, otherwise, oh well.

Thanx...

ps

2008-01-21 14:36:44

by Anton Salikhmetov

[permalink] [raw]
Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008/1/21, Peter Staubach <[email protected]>:
> Linus Torvalds wrote:
> > On Fri, 18 Jan 2008, Ingo Oeser wrote:
> >
> >> Can we get "if the write to the page hits the disk, the mtime has hit the disk
> >> already no less than SOME_GRANULARITY before"?
> >>
> >> That is very important for computer forensics. Esp. in saving your ass!
> >>
> >> Ok, now back again to making that fast :-)
> >>
> >
> > I certainly don't mind it if we have some tighter guarantees, but what I'd
> > want is:
> >
> > - keep it simple. Let's face it, Linux has never ever given those
> > guarantees before, and it's not is if anybody has really cared. Even
> > now, the issue seems to be more about paper standards conformance than
> > anything else.
> >
> >
>
> I have been working on getting something supported here for
> because I have some very large Wall Street customers who do
> care about getting the mtime updated because their backups
> are getting corrupted. They are incomplete because although
> their applications update files, they don't get backed up
> because the mtime never changes.
>
> > - I get worried about people playing around with the dirty bit in
> > particular. We have had some really rather nasty bugs here. Most of
> > which are totally impossible to trigger under normal loads (for
> > example the old random-access utorrent writable mmap issue from about
> > a year ago).
> >
> > So these two issues - the big red danger signs flashing in my brain,
> > coupled with the fact that no application has apparently ever really
> > noticed in the last 15 years - just makes it a case where I'd like each
> > step of the way to be obvious and simple and no larger than really
> > absolutely necessary.
>
> Simple is good. However, too simple is not good. I would suggest
> that we implement file time updates which make sense and if they
> happen to follow POSIX, then nifty, otherwise, oh well.

Thank you very much for your support, Peter!

I'm going to submit the design document, the next version of the patch
series, and the performance test results soon.

>
> Thanx...
>
> ps
>