This is the seventh version of my solution for bug #2645:
http://bugzilla.kernel.org/show_bug.cgi?id=2645
Since the previous version, the following has changed: based on
Linus' comment, an SMP-safe PTE update has been implemented.
The discussions that followed my past submissions showed that it is
tempting to approach this problem using very different assumptions.
However, many such designs have proved to be incomplete or inefficient.
Given the complexity of this problem, I prepared a design document
whose purpose is twofold. First, it summarizes
previous attempts to resolve the ctime/mtime issue. Second, it attempts
to prove that what the patches do is necessary and sufficient for mtime
and ctime update provided that we start from a certain sane set of
requirements. The design document is available via the following link:
http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40
For the seventh version, I performed comprehensive performance testing.
The results, including numbers, are available here:
http://bugzilla.kernel.org/show_bug.cgi?id=2645#c43
Use the PAGE_ALIGN() macro instead of "manual" alignment.
Improve the readability of the loop that traverses the process
memory regions. Make the code more symmetric, and possibly boost
performance on some RISC CPUs, by moving variable assignments.
Signed-off-by: Anton Salikhmetov <[email protected]>
---
mm/msync.c | 77 ++++++++++++++++++++++++++++--------------------------------
1 files changed, 36 insertions(+), 41 deletions(-)
diff --git a/mm/msync.c b/mm/msync.c
index 144a757..a4de868 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,85 +1,83 @@
/*
- * linux/mm/msync.c
+ * The msync() system call.
*
- * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov <[email protected]>
*/
-/*
- * The msync() system call.
- */
+#include <linux/file.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mman.h>
-#include <linux/file.h>
-#include <linux/syscalls.h>
#include <linux/sched.h>
+#include <linux/syscalls.h>
/*
* MS_SYNC syncs the entire file - including mappings.
*
* MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
* Now it doesn't do anything, since dirty pages are properly tracked.
*
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
+ * The application may now run fsync() to write out the dirty pages and
+ * wait on the writeout and check the result. Or the application may run
+ * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately.
* So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
* applications.
*/
asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
{
unsigned long end;
- struct mm_struct *mm = current->mm;
+ int error, unmapped_error;
struct vm_area_struct *vma;
- int unmapped_error = 0;
- int error = -EINVAL;
+ struct mm_struct *mm;
+ error = -EINVAL;
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
if (start & ~PAGE_MASK)
goto out;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
goto out;
+
error = -ENOMEM;
- len = (len + ~PAGE_MASK) & PAGE_MASK;
+ len = PAGE_ALIGN(len);
end = start + len;
if (end < start)
goto out;
- error = 0;
+
+ error = unmapped_error = 0;
if (end == start)
goto out;
+
/*
* If the interval [start,end) covers some unmapped address ranges,
* just ignore them, but return -ENOMEM at the end.
*/
+ mm = current->mm;
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
- for (;;) {
+ do {
struct file *file;
- /* Still start < end. */
error = -ENOMEM;
if (!vma)
- goto out_unlock;
- /* Here start < vma->vm_end. */
+ break;
if (start < vma->vm_start) {
start = vma->vm_start;
if (start >= end)
- goto out_unlock;
- unmapped_error = -ENOMEM;
- }
- /* Here vma->vm_start <= start < vma->vm_end. */
- if ((flags & MS_INVALIDATE) &&
- (vma->vm_flags & VM_LOCKED)) {
- error = -EBUSY;
- goto out_unlock;
+ break;
+ unmapped_error = error;
}
- file = vma->vm_file;
+
+ error = -EBUSY;
+ if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
+ break;
+
+ error = 0;
start = vma->vm_end;
- if ((flags & MS_SYNC) && file &&
- (vma->vm_flags & VM_SHARED)) {
+ file = vma->vm_file;
+ if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
get_file(file);
up_read(&mm->mmap_sem);
error = do_fsync(file, 0);
@@ -88,16 +86,13 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
goto out;
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
- } else {
- if (start >= end) {
- error = 0;
- goto out_unlock;
- }
- vma = vma->vm_next;
+ continue;
}
- }
-out_unlock:
+
+ vma = vma->vm_next;
+ } while (start < end);
up_read(&mm->mmap_sem);
+
out:
- return error ? : unmapped_error;
+ return error ? error : unmapped_error;
}
--
1.4.4.4
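For reference, the PAGE_ALIGN() change in the patch above is a pure
rewrite: ~PAGE_MASK equals PAGE_SIZE - 1, so both expressions round len
up to the next page boundary. A minimal userspace check of that identity
(a sketch; the kernel constants are redefined locally here, assuming
4 KiB pages):

#include <assert.h>
#include <stdio.h>

/* Local stand-ins for the kernel definitions; 4 KiB pages assumed. */
#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))
#define PAGE_ALIGN(x)	(((x) + PAGE_SIZE - 1) & PAGE_MASK)

int main(void)
{
	unsigned long len;

	for (len = 0; len < 3 * PAGE_SIZE; len++) {
		/* The "manual" form removed by the patch... */
		unsigned long manual = (len + ~PAGE_MASK) & PAGE_MASK;

		/* ...matches the macro, since ~PAGE_MASK == PAGE_SIZE - 1. */
		assert(manual == PAGE_ALIGN(len));
	}
	printf("both forms agree\n");
	return 0;
}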
http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40
Update file times at write references to memory-mapped files.
Force file times update at the next write reference after
calling the msync() system call with the MS_ASYNC flag.
Signed-off-by: Anton Salikhmetov <[email protected]>
---
mm/memory.c | 6 ++++++
mm/msync.c | 57 ++++++++++++++++++++++++++++++++++++++++++++-------------
2 files changed, 50 insertions(+), 13 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 6dd1cd8..4b0144b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1670,6 +1670,9 @@ gotten:
unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+ if (vma->vm_file)
+ file_update_time(vma->vm_file);
+
/*
* Yes, Virginia, this is actually required to prevent a race
* with clear_page_dirty_for_io() from clearing the page dirty
@@ -2343,6 +2346,9 @@ out_unlocked:
if (anon)
page_cache_release(vmf.page);
else if (dirty_page) {
+ if (vma->vm_file)
+ file_update_time(vma->vm_file);
+
set_page_dirty_balance(dirty_page, page_mkwrite);
put_page(dirty_page);
}
diff --git a/mm/msync.c b/mm/msync.c
index a4de868..394130d 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -5,6 +5,7 @@
* Copyright (C) 2008 Anton Salikhmetov <[email protected]>
*/
+#include <asm/tlbflush.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/mm.h>
@@ -13,11 +14,37 @@
#include <linux/syscalls.h>
/*
+ * Scan the PTEs for pages belonging to the VMA and mark them read-only.
+ * It will force a pagefault on the next write access.
+ */
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+ unsigned long addr;
+
+ for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+ spinlock_t *ptl;
+ pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
+ pud_t *pud = pud_offset(pgd, addr);
+ pmd_t *pmd = pmd_offset(pud, addr);
+ pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+
+ if (pte_dirty(*pte) && pte_write(*pte)) {
+ pte_t entry = ptep_clear_flush(vma, addr, pte);
+
+ entry = pte_wrprotect(entry);
+ set_pte_at(vma->vm_mm, addr, pte, entry);
+ }
+ pte_unmap_unlock(pte, ptl);
+ }
+}
+
+/*
* MS_SYNC syncs the entire file - including mappings.
*
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
+ * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
+ * read-only by calling vma_wrprotect(). This is needed to catch the next
+ * write reference to the mapped region and update the file times
+ * accordingly.
*
* The application may now run fsync() to write out the dirty pages and
* wait on the writeout and check the result. Or the application may run
@@ -77,16 +104,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
error = 0;
start = vma->vm_end;
file = vma->vm_file;
- if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
- get_file(file);
- up_read(&mm->mmap_sem);
- error = do_fsync(file, 0);
- fput(file);
- if (error || start >= end)
- goto out;
- down_read(&mm->mmap_sem);
- vma = find_vma(mm, start);
- continue;
+ if (file && (vma->vm_flags & VM_SHARED)) {
+ if (flags & MS_ASYNC)
+ vma_wrprotect(vma);
+ if (flags & MS_SYNC) {
+ get_file(file);
+ up_read(&mm->mmap_sem);
+ error = do_fsync(file, 0);
+ fput(file);
+ if (error || start >= end)
+ goto out;
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, start);
+ continue;
+ }
}
vma = vma->vm_next;
--
1.4.4.4
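The intended semantics of the two patches together can be observed from
userspace with a small test program along the following lines (a sketch
with error handling omitted; the file path is arbitrary). After
msync(MS_ASYNC) write-protects the PTEs, the next store to the mapping
faults, and the fault path now calls file_update_time(), so st_mtime
moves forward:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/tmp/msync-mtime-test";	/* arbitrary */
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
	struct stat st;
	char *p;

	ftruncate(fd, 4096);
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	p[0] = 'x';			/* dirty the page */
	msync(p, 4096, MS_ASYNC);	/* write-protects the PTE */
	stat(path, &st);
	printf("mtime before second write: %ld\n", (long)st.st_mtime);

	sleep(1);
	p[0] = 'y';			/* faults; file times get updated */
	stat(path, &st);
	printf("mtime after second write:  %ld\n", (long)st.st_mtime);

	munmap(p, 4096);
	close(fd);
	return 0;
}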
On 22/01/2008, Anton Salikhmetov <[email protected]> wrote:
> This is the seventh version of my solution for bug #2645:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=2645
>
> Since the previous version, the following has changed: based on
> Linus' comment, an SMP-safe PTE update has been implemented.
>
> The discussions that followed my past submissions showed that it is
> tempting to approach this problem using very different assumptions.
> However, many such designs have proved to be incomplete or inefficient.
>
> Given the complexity of this problem, I prepared a design document
> whose purpose is twofold. First, it summarizes
> previous attempts to resolve the ctime/mtime issue. Second, it attempts
> to prove that what the patches do is necessary and sufficient for mtime
> and ctime update provided that we start from a certain sane set of
> requirements. The design document is available via the following link:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40
>
> For the seventh version, I performed comprehensive performance testing.
> The results, including numbers, are available here:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=2645#c43
>
Hi Anton,
I applied your patches here, and as far as my own test programs go,
they solve the problems I previously observed with mtime not getting
updated.
Thank you very much for working so persistently on these long-standing issues.
--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
Some very pedantic nitpicking below;
On 22/01/2008, Anton Salikhmetov <[email protected]> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40
>
> Update file times at write references to memory-mapped files.
> Force file times update at the next write reference after
> calling the msync() system call with the MS_ASYNC flag.
>
> Signed-off-by: Anton Salikhmetov <[email protected]>
> ---
> mm/memory.c | 6 ++++++
> mm/msync.c | 57 ++++++++++++++++++++++++++++++++++++++++++++-------------
> 2 files changed, 50 insertions(+), 13 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 6dd1cd8..4b0144b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1670,6 +1670,9 @@ gotten:
> unlock:
> pte_unmap_unlock(page_table, ptl);
> if (dirty_page) {
> + if (vma->vm_file)
> + file_update_time(vma->vm_file);
> +
> /*
> * Yes, Virginia, this is actually required to prevent a race
> * with clear_page_dirty_for_io() from clearing the page dirty
> @@ -2343,6 +2346,9 @@ out_unlocked:
> if (anon)
> page_cache_release(vmf.page);
> else if (dirty_page) {
> + if (vma->vm_file)
> + file_update_time(vma->vm_file);
> +
> set_page_dirty_balance(dirty_page, page_mkwrite);
> put_page(dirty_page);
> }
> diff --git a/mm/msync.c b/mm/msync.c
> index a4de868..394130d 100644
> --- a/mm/msync.c
> +++ b/mm/msync.c
> @@ -5,6 +5,7 @@
> * Copyright (C) 2008 Anton Salikhmetov <[email protected]>
> */
>
> +#include <asm/tlbflush.h>
> #include <linux/file.h>
> #include <linux/fs.h>
> #include <linux/mm.h>
> @@ -13,11 +14,37 @@
> #include <linux/syscalls.h>
>
> /*
> + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> + * It will force a pagefault on the next write access.
> + */
> +static void vma_wrprotect(struct vm_area_struct *vma)
> +{
> + unsigned long addr;
> +
> + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
I know it's not the common "Linux Kernel way", but 'addr' could be
made to have just 'for' scope here according to C99;
for (unsigned long addr = vma->vm_start; addr < vma->vm_end;
addr += PAGE_SIZE) {
> + spinlock_t *ptl;
> + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> + pud_t *pud = pud_offset(pgd, addr);
> + pmd_t *pmd = pmd_offset(pud, addr);
> + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +
> + if (pte_dirty(*pte) && pte_write(*pte)) {
> + pte_t entry = ptep_clear_flush(vma, addr, pte);
> +
> + entry = pte_wrprotect(entry);
> + set_pte_at(vma->vm_mm, addr, pte, entry);
> + }
> + pte_unmap_unlock(pte, ptl);
> + }
> +}
> +
> +/*
> * MS_SYNC syncs the entire file - including mappings.
> *
> - * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> - * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
> - * Now it doesn't do anything, since dirty pages are properly tracked.
I think keeping some version of the "up to ..." comments makes sense.
It documents that we previously had different behaviour.
> + * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
> + * read-only by calling vma_wrprotect(). This is needed to catch the next
> + * write reference to the mapped region and update the file times
> + * accordingly.
> *
> * The application may now run fsync() to write out the dirty pages and
> * wait on the writeout and check the result. Or the application may run
> @@ -77,16 +104,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
> error = 0;
> start = vma->vm_end;
> file = vma->vm_file;
> - if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
> - get_file(file);
> - up_read(&mm->mmap_sem);
> - error = do_fsync(file, 0);
> - fput(file);
> - if (error || start >= end)
> - goto out;
> - down_read(&mm->mmap_sem);
> - vma = find_vma(mm, start);
> - continue;
> + if (file && (vma->vm_flags & VM_SHARED)) {
> + if (flags & MS_ASYNC)
> + vma_wrprotect(vma);
> + if (flags & MS_SYNC) {
"else if" ??
> + get_file(file);
> + up_read(&mm->mmap_sem);
> + error = do_fsync(file, 0);
> + fput(file);
> + if (error || start >= end)
> + goto out;
> + down_read(&mm->mmap_sem);
> + vma = find_vma(mm, start);
> + continue;
> + }
> }
>
> vma = vma->vm_next;
--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
2008/1/22, Jesper Juhl <[email protected]>:
> On 22/01/2008, Anton Salikhmetov <[email protected]> wrote:
> > This is the seventh version of my solution for bug #2645:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645
> >
> > Since the previous version, the following has changed: based on
> > Linus' comment, an SMP-safe PTE update has been implemented.
> >
> > The discussions that followed my past submissions showed that it is
> > tempting to approach this problem using very different assumptions.
> > However, many such designs have proved to be incomplete or inefficient.
> >
> > Given the complexity of this problem, I prepared a design document
> > whose purpose is twofold. First, it summarizes
> > previous attempts to resolve the ctime/mtime issue. Second, it attempts
> > to prove that what the patches do is necessary and sufficient for mtime
> > and ctime update provided that we start from a certain sane set of
> > requirements. The design document is available via the following link:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40
> >
> > For the seventh version, I performed comprehensive performance testing.
> > The results, including numbers, are available here:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645#c43
> >
>
> Hi Anton,
>
> I applied your patches here, and as far as my own test programs go,
> they solve the problems I previously observed with mtime not getting
> updated.
>
> Thank you very much for working so persistently on these long-standing issues.
Thank you very much for testing these patches!
>
> --
> Jesper Juhl <[email protected]>
> Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
> Plain text mails only, please http://www.expita.com/nomime.html
>
2008/1/22, Jesper Juhl <[email protected]>:
> Some very pedantic nitpicking below;
>
> On 22/01/2008, Anton Salikhmetov <[email protected]> wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40
> >
> > Update file times at write references to memory-mapped files.
> > Force file times update at the next write reference after
> > calling the msync() system call with the MS_ASYNC flag.
> >
> > Signed-off-by: Anton Salikhmetov <[email protected]>
> > ---
> > mm/memory.c | 6 ++++++
> > mm/msync.c | 57 ++++++++++++++++++++++++++++++++++++++++++++-------------
> > 2 files changed, 50 insertions(+), 13 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6dd1cd8..4b0144b 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1670,6 +1670,9 @@ gotten:
> > unlock:
> > pte_unmap_unlock(page_table, ptl);
> > if (dirty_page) {
> > + if (vma->vm_file)
> > + file_update_time(vma->vm_file);
> > +
> > /*
> > * Yes, Virginia, this is actually required to prevent a race
> > * with clear_page_dirty_for_io() from clearing the page dirty
> > @@ -2343,6 +2346,9 @@ out_unlocked:
> > if (anon)
> > page_cache_release(vmf.page);
> > else if (dirty_page) {
> > + if (vma->vm_file)
> > + file_update_time(vma->vm_file);
> > +
> > set_page_dirty_balance(dirty_page, page_mkwrite);
> > put_page(dirty_page);
> > }
> > diff --git a/mm/msync.c b/mm/msync.c
> > index a4de868..394130d 100644
> > --- a/mm/msync.c
> > +++ b/mm/msync.c
> > @@ -5,6 +5,7 @@
> > * Copyright (C) 2008 Anton Salikhmetov <[email protected]>
> > */
> >
> > +#include <asm/tlbflush.h>
> > #include <linux/file.h>
> > #include <linux/fs.h>
> > #include <linux/mm.h>
> > @@ -13,11 +14,37 @@
> > #include <linux/syscalls.h>
> >
> > /*
> > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > + * It will force a pagefault on the next write access.
> > + */
> > +static void vma_wrprotect(struct vm_area_struct *vma)
> > +{
> > + unsigned long addr;
> > +
> > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
>
> I know it's not the common "Linux Kernel way", but 'addr' could be
> made to have just 'for' scope here according to C99;
I believe that the C89 style is more common in the Linux kernel, so
I've declared the variable outside the loop.
>
> for (unsigned long addr = vma->vm_start; addr < vma->vm_end;
> addr += PAGE_SIZE) {
>
>
> > + spinlock_t *ptl;
> > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > + pud_t *pud = pud_offset(pgd, addr);
> > + pmd_t *pmd = pmd_offset(pud, addr);
> > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +
> > + if (pte_dirty(*pte) && pte_write(*pte)) {
> > + pte_t entry = ptep_clear_flush(vma, addr, pte);
> > +
> > + entry = pte_wrprotect(entry);
> > + set_pte_at(vma->vm_mm, addr, pte, entry);
> > + }
> > + pte_unmap_unlock(pte, ptl);
> > + }
> > +}
> > +
> > +/*
> > * MS_SYNC syncs the entire file - including mappings.
> > *
> > - * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> > - * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
> > - * Now it doesn't do anything, since dirty pages are properly tracked.
>
> I think keeping some version of the "up to ..." comments makes sense.
> It documents that we previously had different behaviour.
Earlier I had a request to remove any "changelog-style" comments from the code.
>
> > + * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
> > + * read-only by calling vma_wrprotect(). This is needed to catch the next
> > + * write reference to the mapped region and update the file times
> > + * accordingly.
> > *
> > * The application may now run fsync() to write out the dirty pages and
> > * wait on the writeout and check the result. Or the application may run
> > @@ -77,16 +104,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
> > error = 0;
> > start = vma->vm_end;
> > file = vma->vm_file;
> > - if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
> > - get_file(file);
> > - up_read(&mm->mmap_sem);
> > - error = do_fsync(file, 0);
> > - fput(file);
> > - if (error || start >= end)
> > - goto out;
> > - down_read(&mm->mmap_sem);
> > - vma = find_vma(mm, start);
> > - continue;
> > + if (file && (vma->vm_flags & VM_SHARED)) {
> > + if (flags & MS_ASYNC)
> > + vma_wrprotect(vma);
> > + if (flags & MS_SYNC) {
>
> "else if" ??
The MS_ASYNC and MS_SYNC flags are mutually exclusive, which is why I
did not use "else if" here. Moreover, this function itself checks
that they never come together.
>
> > + get_file(file);
> > + up_read(&mm->mmap_sem);
> > + error = do_fsync(file, 0);
> > + fput(file);
> > + if (error || start >= end)
> > + goto out;
> > + down_read(&mm->mmap_sem);
> > + vma = find_vma(mm, start);
> > + continue;
> > + }
> > }
> >
> > vma = vma->vm_next;
>
> --
> Jesper Juhl <[email protected]>
> Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
> Plain text mails only, please http://www.expita.com/nomime.html
>
On 22/01/2008, Anton Salikhmetov <[email protected]> wrote:
> 2008/1/22, Jesper Juhl <[email protected]>:
> > Some very pedantic nitpicking below;
> >
> > On 22/01/2008, Anton Salikhmetov <[email protected]> wrote:
...
> > > + if (file && (vma->vm_flags & VM_SHARED)) {
> > > + if (flags & MS_ASYNC)
> > > + vma_wrprotect(vma);
> > > + if (flags & MS_SYNC) {
> >
> > "else if" ??
>
> The MS_ASYNC and MS_SYNC flags are mutually exclusive, which is why I
> did not use "else if" here. Moreover, this function itself checks
> that they never come together.
>
I would say that them being mutually exclusive would be a reason *for*
using "else-if" here.
--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
2008/1/22, Jesper Juhl <[email protected]>:
> On 22/01/2008, Anton Salikhmetov <[email protected]> wrote:
> > 2008/1/22, Jesper Juhl <[email protected]>:
> > > Some very pedantic nitpicking below;
> > >
> > > On 22/01/2008, Anton Salikhmetov <[email protected]> wrote:
> ...
> > > > + if (file && (vma->vm_flags & VM_SHARED)) {
> > > > + if (flags & MS_ASYNC)
> > > > + vma_wrprotect(vma);
> > > > + if (flags & MS_SYNC) {
> > >
> > > "else if" ??
> >
> > The MS_ASYNC and MS_SYNC flags are mutually exclusive, which is why I
> > did not use "else if" here. Moreover, this function itself checks
> > that they never come together.
> >
>
> I would say that them being mutually exclusive would be a reason *for*
> using "else-if" here.
This check is performed at the very beginning of sys_msync() itself.
We don't need to repeat it later.
>
> --
> Jesper Juhl <[email protected]>
> Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
> Plain text mails only, please http://www.expita.com/nomime.html
>
2008/1/22, Anton Salikhmetov <[email protected]>:
> 2008/1/22, Jesper Juhl <[email protected]>:
> > Some very pedantic nitpicking below;
> >
> > On 22/01/2008, Anton Salikhmetov <[email protected]> wrote:
> > > http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40
> > >
> > > Update file times at write references to memory-mapped files.
> > > Force file times update at the next write reference after
> > > calling the msync() system call with the MS_ASYNC flag.
> > >
> > > Signed-off-by: Anton Salikhmetov <[email protected]>
> > > ---
> > > mm/memory.c | 6 ++++++
> > > mm/msync.c | 57 ++++++++++++++++++++++++++++++++++++++++++++-------------
> > > 2 files changed, 50 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 6dd1cd8..4b0144b 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1670,6 +1670,9 @@ gotten:
> > > unlock:
> > > pte_unmap_unlock(page_table, ptl);
> > > if (dirty_page) {
> > > + if (vma->vm_file)
> > > + file_update_time(vma->vm_file);
> > > +
> > > /*
> > > * Yes, Virginia, this is actually required to prevent a race
> > > * with clear_page_dirty_for_io() from clearing the page dirty
> > > @@ -2343,6 +2346,9 @@ out_unlocked:
> > > if (anon)
> > > page_cache_release(vmf.page);
> > > else if (dirty_page) {
> > > + if (vma->vm_file)
> > > + file_update_time(vma->vm_file);
> > > +
> > > set_page_dirty_balance(dirty_page, page_mkwrite);
> > > put_page(dirty_page);
> > > }
> > > diff --git a/mm/msync.c b/mm/msync.c
> > > index a4de868..394130d 100644
> > > --- a/mm/msync.c
> > > +++ b/mm/msync.c
> > > @@ -5,6 +5,7 @@
> > > * Copyright (C) 2008 Anton Salikhmetov <[email protected]>
> > > */
> > >
> > > +#include <asm/tlbflush.h>
> > > #include <linux/file.h>
> > > #include <linux/fs.h>
> > > #include <linux/mm.h>
> > > @@ -13,11 +14,37 @@
> > > #include <linux/syscalls.h>
> > >
> > > /*
> > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > > + * It will force a pagefault on the next write access.
> > > + */
> > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > +{
> > > + unsigned long addr;
> > > +
> > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> >
> > I know it's not the common "Linux Kernel way", but 'addr' could be
> > made to have just 'for' scope here according to C99;
>
> I believe that the C89 style is more common in the Linux kernel, so
> I've declared the variable outside the loop.
>
> >
> > for (unsigned long addr = vma->vm_start; addr < vma->vm_end;
> > addr += PAGE_SIZE) {
By the way, if we're talking "pedantic", then:
>>>
debian:/tmp$ cat c.c
void f()
{
for (unsigned long i = 0; i < 10; i++)
continue;
}
debian:/tmp$ gcc -c -pedantic c.c
c.c: In function 'f':
c.c:3: error: 'for' loop initial declaration used outside C99 mode
debian:/tmp$
<<<
No pun intended :)
> >
> >
> > > + spinlock_t *ptl;
> > > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > + pud_t *pud = pud_offset(pgd, addr);
> > > + pmd_t *pmd = pmd_offset(pud, addr);
> > > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > +
> > > + if (pte_dirty(*pte) && pte_write(*pte)) {
> > > + pte_t entry = ptep_clear_flush(vma, addr, pte);
> > > +
> > > + entry = pte_wrprotect(entry);
> > > + set_pte_at(vma->vm_mm, addr, pte, entry);
> > > + }
> > > + pte_unmap_unlock(pte, ptl);
> > > + }
> > > +}
> > > +
> > > +/*
> > > * MS_SYNC syncs the entire file - including mappings.
> > > *
> > > - * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> > > - * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
> > > - * Now it doesn't do anything, since dirty pages are properly tracked.
> >
> > I think keeping some version of the "up to ..." comments makes sense.
> > It documents that we previously had different behaviour.
>
> Earlier I had a request to remove any "changelog-style" comments from the code.
>
> >
> > > + * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
> > > + * read-only by calling vma_wrprotect(). This is needed to catch the next
> > > + * write reference to the mapped region and update the file times
> > > + * accordingly.
> > > *
> > > * The application may now run fsync() to write out the dirty pages and
> > > * wait on the writeout and check the result. Or the application may run
> > > @@ -77,16 +104,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
> > > error = 0;
> > > start = vma->vm_end;
> > > file = vma->vm_file;
> > > - if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
> > > - get_file(file);
> > > - up_read(&mm->mmap_sem);
> > > - error = do_fsync(file, 0);
> > > - fput(file);
> > > - if (error || start >= end)
> > > - goto out;
> > > - down_read(&mm->mmap_sem);
> > > - vma = find_vma(mm, start);
> > > - continue;
> > > + if (file && (vma->vm_flags & VM_SHARED)) {
> > > + if (flags & MS_ASYNC)
> > > + vma_wrprotect(vma);
> > > + if (flags & MS_SYNC) {
> >
> > "else if" ??
>
> The MS_ASYNC and MS_SYNC flags are mutually exclusive, which is why I
> did not use "else if" here. Moreover, this function itself checks
> that they never come together.
>
> >
> > > + get_file(file);
> > > + up_read(&mm->mmap_sem);
> > > + error = do_fsync(file, 0);
> > > + fput(file);
> > > + if (error || start >= end)
> > > + goto out;
> > > + down_read(&mm->mmap_sem);
> > > + vma = find_vma(mm, start);
> > > + continue;
> > > + }
> > > }
> > >
> > > vma = vma->vm_next;
> >
> > --
> > Jesper Juhl <[email protected]>
> > Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
> > Plain text mails only, please http://www.expita.com/nomime.html
> >
>
On 22/01/2008, Anton Salikhmetov <[email protected]> wrote:
> 2008/1/22, Anton Salikhmetov <[email protected]>:
> > 2008/1/22, Jesper Juhl <[email protected]>:
> > > Some very pedantic nitpicking below;
> > >
...
>
> By the way, if we're talking "pedantic", then:
>
> >>>
>
> debian:/tmp$ cat c.c
> void f()
> {
> for (unsigned long i = 0; i < 10; i++)
> continue;
> }
> debian:/tmp$ gcc -c -pedantic c.c
> c.c: In function 'f':
> c.c:3: error: 'for' loop initial declaration used outside C99 mode
> debian:/tmp$
>
Well, I just wrote the loop the way I'd have written it; I know it's
not the common kernel style. Just offering my feedback/input :)
--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
On Tue, 22 Jan 2008, Anton Salikhmetov wrote:
>
> /*
> + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> + * It will force a pagefault on the next write access.
> + */
> +static void vma_wrprotect(struct vm_area_struct *vma)
> +{
> + unsigned long addr;
> +
> + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> + spinlock_t *ptl;
> + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> + pud_t *pud = pud_offset(pgd, addr);
> + pmd_t *pmd = pmd_offset(pud, addr);
> + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
This is extremely expensive over bigger areas, especially sparsely mapped
ones (it does all the lookups for all four levels over and over and over
again for each page).
I think Peter Zijlstra posted a version that uses the regular kind of
nested loop (with inline functions to keep the thing nice and clean),
which gets rid of that.
[ The sad/funny part is that this is all how we *used* to do msync(), back
in the days: we're literally going back to the "pre-cleanup" logic. See
commit 204ec841fbea3e5138168edbc3a76d46747cc987: "mm: msync() cleanup"
for details ]
Quite frankly, I really think you might be better off just doing a
git revert 204ec841fbea3e5138168edbc3a76d46747cc987
and working from there! I just checked, and it still reverts cleanly, and
you'd end up with a nice code-base that (a) has gotten years of testing
and (b) already has the looping-over-the-pagetables code.
Linus
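For readers who don't have the old code in front of them, the walker
pattern Linus and Peter refer to splits the loop per page-table level,
so the upper-level lookups are done once per range they cover instead of
once per page. An untested sketch of vma_wrprotect() restructured that
way, using the usual 2.6-era helpers:

static void wrprotect_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
				unsigned long addr, unsigned long end)
{
	spinlock_t *ptl;
	pte_t *pte;

	/* One map/lock per PTE leaf page instead of one per PTE. */
	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
	do {
		if (pte_dirty(*pte) && pte_write(*pte)) {
			pte_t entry = ptep_clear_flush(vma, addr, pte);

			set_pte_at(vma->vm_mm, addr, pte,
				   pte_wrprotect(entry));
		}
	} while (pte++, addr += PAGE_SIZE, addr != end);
	pte_unmap_unlock(pte - 1, ptl);
}

static void wrprotect_pmd_range(struct vm_area_struct *vma, pud_t *pud,
				unsigned long addr, unsigned long end)
{
	pmd_t *pmd = pmd_offset(pud, addr);
	unsigned long next;

	do {
		next = pmd_addr_end(addr, end);
		if (pmd_none_or_clear_bad(pmd))
			continue;
		wrprotect_pte_range(vma, pmd, addr, next);
	} while (pmd++, addr = next, addr != end);
}

static void wrprotect_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
				unsigned long addr, unsigned long end)
{
	pud_t *pud = pud_offset(pgd, addr);
	unsigned long next;

	do {
		next = pud_addr_end(addr, end);
		if (pud_none_or_clear_bad(pud))
			continue;
		wrprotect_pmd_range(vma, pud, addr, next);
	} while (pud++, addr = next, addr != end);
}

static void vma_wrprotect(struct vm_area_struct *vma)
{
	unsigned long addr = vma->vm_start, end = vma->vm_end, next;
	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);

	do {
		next = pgd_addr_end(addr, end);
		if (pgd_none_or_clear_bad(pgd))
			continue;
		wrprotect_pud_range(vma, pgd, addr, next);
	} while (pgd++, addr = next, addr != end);
}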
On 22/01/2008, Anton Salikhmetov <[email protected]> wrote:
> 2008/1/22, Jesper Juhl <[email protected]>:
> > On 22/01/2008, Anton Salikhmetov <[email protected]> wrote:
> > > 2008/1/22, Jesper Juhl <[email protected]>:
> > > > Some very pedantic nitpicking below;
> > > >
> > > > On 22/01/2008, Anton Salikhmetov <[email protected]> wrote:
> > ...
> > > > > + if (file && (vma->vm_flags & VM_SHARED)) {
> > > > > + if (flags & MS_ASYNC)
> > > > > + vma_wrprotect(vma);
> > > > > + if (flags & MS_SYNC) {
> > > >
> > > > "else if" ??
> > >
> > > The MS_ASYNC and MS_SYNC flags are mutually exclusive, which is why I
> > > did not use "else if" here. Moreover, this function itself checks
> > > that they never come together.
> > >
> >
> > I would say that them being mutually exclusive would be a reason *for*
> > using "else-if" here.
>
> This check is performed at the very beginning of sys_msync() itself.
>
> We don't need to repeat it later.
>
Sure, it's just that, to me, using 'else-if' makes it explicit that
the two are mutually exclusive. Using "if (...), if (...)" doesn't.
Maybe it's just me, but I feel that 'else-if' here better shows the
intention... No big deal.
--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
2008/1/22, Linus Torvalds <[email protected]>:
>
>
> On Tue, 22 Jan 2008, Anton Salikhmetov wrote:
> >
> > /*
> > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > + * It will force a pagefault on the next write access.
> > + */
> > +static void vma_wrprotect(struct vm_area_struct *vma)
> > +{
> > + unsigned long addr;
> > +
> > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > + spinlock_t *ptl;
> > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > + pud_t *pud = pud_offset(pgd, addr);
> > + pmd_t *pmd = pmd_offset(pud, addr);
> > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>
> This is extremely expensive over bigger areas, especially sparsely mapped
> ones (it does all the lookups for all four levels over and over and over
> again for each page).
>
> I think Peter Zijlstra posted a version that uses the regular kind of
> nested loop (with inline functions to keep the thing nice and clean),
> which gets rid of that.
Thanks for your feedback, Linus!
I will use Peter Zijlstra's version of this operation in my next
patch series.
>
> [ The sad/funny part is that this is all how we *used* to do msync(), back
> in the days: we're literally going back to the "pre-cleanup" logic. See
> commit 204ec841fbea3e5138168edbc3a76d46747cc987: "mm: msync() cleanup"
> for details ]
>
> Quite frankly, I really think you might be better off just doing a
>
> git revert 204ec841fbea3e5138168edbc3a76d46747cc987
>
> and working from there! I just checked, and it still reverts cleanly, and
> you'd end up with a nice code-base that (a) has gotten years of testing
> and (b) already has the looping-over-the-pagetables code.
>
> Linus
>
Anton Salikhmetov <[email protected]> writes:
You should probably put your design document somewhere in Documentation
with a patch.
> + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> + * It will force a pagefault on the next write access.
> + */
> +static void vma_wrprotect(struct vm_area_struct *vma)
> +{
> + unsigned long addr;
> +
> + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> + spinlock_t *ptl;
> + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> + pud_t *pud = pud_offset(pgd, addr);
> + pmd_t *pmd = pmd_offset(pud, addr);
> + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
This means on i386 with highmem ptes you will map/flush tlb/unmap each
PTE individually. You will do 512 times as much work as really needed
per PTE leaf page.
The performance critical address space walkers use a different design
pattern that avoids this.
> + if (pte_dirty(*pte) && pte_write(*pte)) {
> + pte_t entry = ptep_clear_flush(vma, addr, pte);
Flushing TLBs unbatched can also be very expensive because if the MM is
shared by several CPUs you'll have an inter-processor interrupt for
each iteration. They are quite costly even on smaller systems.
It would be better if you did a single flush_tlb_range() at the end.
This means on x86 this will currently always do a full flush, but that's
still better than really slowing down in the heavily multithreaded case.
-Andi
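Concretely, the batching Andi describes means keeping the single
pte_offset_map_lock() per PTE leaf page from the nested walker above,
dropping the per-PTE ptep_clear_flush(), and issuing one ranged flush at
the end. A sketch of the changed pieces (untested; note that until the
final flush another CPU may still write through a stale TLB entry, which
is the accepted trade-off here):

	/* In the pte-level loop: modify the PTE without flushing. */
	if (pte_dirty(*pte) && pte_write(*pte)) {
		pte_t entry = ptep_get_and_clear(vma->vm_mm, addr, pte);

		set_pte_at(vma->vm_mm, addr, pte, pte_wrprotect(entry));
	}

	/* At the end of vma_wrprotect(): one flush for the whole range. */
	flush_tlb_range(vma, vma->vm_start, vma->vm_end);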
> > >
> > > /*
> > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > > + * It will force a pagefault on the next write access.
> > > + */
> > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > +{
> > > + unsigned long addr;
> > > +
> > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > > + spinlock_t *ptl;
> > > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > + pud_t *pud = pud_offset(pgd, addr);
> > > + pmd_t *pmd = pmd_offset(pud, addr);
> > > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> >
> > This is extremely expensive over bigger areas, especially sparsely mapped
> > ones (it does all the lookups for all four levels over and over and over
> > again for each page).
> >
> > I think Peter Zijlstra posted a version that uses the regular kind of
> > nested loop (with inline functions to keep the thing nice and clean),
> > which gets rid of that.
>
> Thanks for your feedback, Linus!
>
> I will use Peter Zijlstra's version of this operation in my next
> patch series.
But note that those functions iterate over all the vmas for the given
page range, not just the one msync() was performed on. This might get
even more expensive if the file is mapped many times.
The old version that Linus was referring to needs some modification as
well, because it doesn't write-protect the ptes, it just marks them
clean.
Miklos
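In other words, where the reverted msync_pte_range() only transferred
the dirty bit from the PTE to the struct page, the restored code would
also have to drop write permission. Roughly, its inner test would become
something like this (a sketch of the idea, not the actual modification):

	if (pte_dirty(*pte) && pte_write(*pte)) {
		pte_t entry = ptep_clear_flush(vma, addr, pte);

		/* As before, hand the dirty bit over to the page... */
		set_page_dirty(pte_page(entry));
		/* ...but also write-protect the PTE, so that the next
		 * write faults and the file times get updated. */
		entry = pte_wrprotect(pte_mkclean(entry));
		set_pte_at(vma->vm_mm, addr, pte, entry);
	}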