2008-01-17 00:58:00

by Anton Salikhmetov

[permalink] [raw]
Subject: [PATCH -v5 0/2] Updating ctime and mtime for memory-mapped files

This is the fifth version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

New since the previous version:

1) the case of retouching an already-dirty page pointed out
by Miklos Szeredi has been correctly addressed;

2) a few cosmetic changes according to the latest feedback;

3) fixed the error of calling a possibly sleeping function
from an atomic context.

The design for the first item above was suggested by Peter Zijlstra:

> It would require scanning the PTEs and marking them read-only again on
> MS_ASYNC, and some more logic in set_page_dirty() because that currently
> bails out early if the page in question is already dirty.

Miklos' test program now produces the following output for
the repeated calls to msync() with the MS_ASYNC flag:

debian:~/miklos# ./miklos_test file
begin 1200529196 1200529196 1200528798
write 1200529197 1200529197 1200528798
mmap 1200529197 1200529197 1200529198
b 1200529197 1200529197 1200529198
msync b 1200529199 1200529199 1200529198
c 1200529199 1200529199 1200529198
msync c 1200529201 1200529201 1200529198
d 1200529201 1200529201 1200529198
munmap 1200529201 1200529201 1200529198
close 1200529201 1200529201 1200529198
sync 1200529204 1200529204 1200529198
debian:~/miklos#

Miklos' test program can be found using the following link:

http://lkml.org/lkml/2008/1/14/104


2008-01-17 00:58:22

by Anton Salikhmetov

[permalink] [raw]
Subject: [PATCH -v5 1/2] Massive code cleanup of sys_msync()

Substantial code cleanup of the sys_msync() function:

1) using the PAGE_ALIGN() macro instead of "manual" alignment;
2) improved readability of the loop traversing the process memory regions.

Signed-off-by: Anton Salikhmetov <[email protected]>
---
mm/msync.c | 74 +++++++++++++++++++++++++++++------------------------------
1 files changed, 36 insertions(+), 38 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..44997bf 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,24 +1,22 @@
/*
- * linux/mm/msync.c
+ * The msync() system call.
*
- * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov <[email protected]>
*/

-/*
- * The msync() system call.
- */
+#include <linux/file.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mman.h>
-#include <linux/file.h>
-#include <linux/syscalls.h>
#include <linux/sched.h>
+#include <linux/syscalls.h>

/*
* MS_SYNC syncs the entire file - including mappings.
*
* MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
* Now it doesn't do anything, since dirty pages are properly tracked.
*
* The application may now run fsync() to
@@ -33,8 +31,7 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
unsigned long end;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
- int unmapped_error = 0;
- int error = -EINVAL;
+ int error = -EINVAL, unmapped_error = 0;

if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
@@ -42,62 +39,63 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
goto out;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
goto out;
- error = -ENOMEM;
- len = (len + ~PAGE_MASK) & PAGE_MASK;
+
+ len = PAGE_ALIGN(len);
end = start + len;
- if (end < start)
+ if (end < start) {
+ error = -ENOMEM;
goto out;
+ }
+
error = 0;
+
if (end == start)
goto out;
+
/*
* If the interval [start,end) covers some unmapped address ranges,
* just ignore them, but return -ENOMEM at the end.
*/
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
- for (;;) {
+ do {
struct file *file;

- /* Still start < end. */
- error = -ENOMEM;
- if (!vma)
- goto out_unlock;
- /* Here start < vma->vm_end. */
+ if (!vma) {
+ error = -ENOMEM;
+ break;
+ }
if (start < vma->vm_start) {
start = vma->vm_start;
- if (start >= end)
- goto out_unlock;
+ if (start >= end) {
+ error = -ENOMEM;
+ break;
+ }
unmapped_error = -ENOMEM;
}
- /* Here vma->vm_start <= start < vma->vm_end. */
- if ((flags & MS_INVALIDATE) &&
- (vma->vm_flags & VM_LOCKED)) {
+ if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED)) {
error = -EBUSY;
- goto out_unlock;
+ break;
}
- file = vma->vm_file;
start = vma->vm_end;
- if ((flags & MS_SYNC) && file &&
- (vma->vm_flags & VM_SHARED)) {
+
+ file = vma->vm_file;
+ if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
get_file(file);
up_read(&mm->mmap_sem);
error = do_fsync(file, 0);
fput(file);
- if (error || start >= end)
+ if (error)
goto out;
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
- } else {
- if (start >= end) {
- error = 0;
- goto out_unlock;
- }
- vma = vma->vm_next;
+ continue;
}
- }
-out_unlock:
+
+ vma = vma->vm_next;
+ } while (start < end);
up_read(&mm->mmap_sem);
+
out:
- return error ? : unmapped_error;
+ return error ? error : unmapped_error;
}
--
1.4.4.4

2008-01-17 00:58:37

by Anton Salikhmetov

[permalink] [raw]
Subject: [PATCH -v5 2/2] Updating ctime and mtime at syncing

http://bugzilla.kernel.org/show_bug.cgi?id=2645

Changes for updating the ctime and mtime fields for memory-mapped files:

1) a new flag triggering update of the inode data;
2) a new field in the address_space structure for saving modification time;
3) a new helper function to update ctime and mtime when needed;
4) updating time stamps for mapped files in sys_msync() and do_fsync();
5) implementing lazy ctime and mtime update.

Signed-off-by: Anton Salikhmetov <[email protected]>
---
fs/buffer.c | 3 ++
fs/fs-writeback.c | 2 +
fs/inode.c | 43 +++++++++++++++++++++++----------
fs/sync.c | 2 +
include/linux/fs.h | 13 +++++++++-
include/linux/pagemap.h | 3 +-
mm/msync.c | 61 +++++++++++++++++++++++++++++++++-------------
mm/page-writeback.c | 54 ++++++++++++++++++++++-------------------
8 files changed, 124 insertions(+), 57 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..3967aa7 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -701,6 +701,9 @@ static int __set_page_dirty(struct page *page,
if (unlikely(!mapping))
return !TestSetPageDirty(page);

+ mapping->mtime = CURRENT_TIME;
+ set_bit(AS_MCTIME, &mapping->flags);
+
if (TestSetPageDirty(page))
return 0;

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 300324b..affd291 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -243,6 +243,8 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)

spin_unlock(&inode_lock);

+ mapping_update_time(mapping);
+
ret = do_writepages(mapping, wbc);

/* Don't write the inode if only I_DIRTY_PAGES was set */
diff --git a/fs/inode.c b/fs/inode.c
index ed35383..edd5bf4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1243,8 +1243,10 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
EXPORT_SYMBOL(touch_atime);

/**
- * file_update_time - update mtime and ctime time
- * @file: file accessed
+ * inode_update_time - update mtime and ctime time
+ * @inode: inode accessed
+ * @ts: time when inode was accessed
+ * @sync: whether to do synchronous update
*
* Update the mtime and ctime members of an inode and mark the inode
* for writeback. Note that this function is meant exclusively for
@@ -1253,11 +1255,8 @@ EXPORT_SYMBOL(touch_atime);
* S_NOCTIME inode flag, e.g. for network filesystem where these
* timestamps are handled by the server.
*/
-
-void file_update_time(struct file *file)
+void inode_update_time(struct inode *inode, struct timespec *ts)
{
- struct inode *inode = file->f_path.dentry->d_inode;
- struct timespec now;
int sync_it = 0;

if (IS_NOCMTIME(inode))
@@ -1265,22 +1264,41 @@ void file_update_time(struct file *file)
if (IS_RDONLY(inode))
return;

- now = current_fs_time(inode->i_sb);
- if (!timespec_equal(&inode->i_mtime, &now)) {
- inode->i_mtime = now;
+ if (timespec_compare(&inode->i_mtime, ts) < 0) {
+ inode->i_mtime = *ts;
sync_it = 1;
}

- if (!timespec_equal(&inode->i_ctime, &now)) {
- inode->i_ctime = now;
+ if (timespec_compare(&inode->i_ctime, ts) < 0) {
+ inode->i_ctime = *ts;
sync_it = 1;
}

if (sync_it)
mark_inode_dirty_sync(inode);
}
+EXPORT_SYMBOL(inode_update_time);

-EXPORT_SYMBOL(file_update_time);
+/*
+ * Update the ctime and mtime stamps after checking if they are to be updated.
+ */
+void mapping_update_time(struct address_space *mapping)
+{
+ if (test_and_clear_bit(AS_MCTIME, &mapping->flags)) {
+ struct inode *inode = mapping->host;
+ struct timespec *ts = &mapping->mtime;
+
+ if (S_ISBLK(inode->i_mode)) {
+ struct block_device *bdev = inode->i_bdev;
+
+ mutex_lock(&bdev->bd_mutex);
+ list_for_each_entry(inode, &bdev->bd_inodes, i_devices)
+ inode_update_time(inode, ts);
+ mutex_unlock(&bdev->bd_mutex);
+ } else
+ inode_update_time(inode, ts);
+ }
+}

int inode_needs_sync(struct inode *inode)
{
@@ -1290,7 +1308,6 @@ int inode_needs_sync(struct inode *inode)
return 1;
return 0;
}
-
EXPORT_SYMBOL(inode_needs_sync);

int inode_wait(void *word)
diff --git a/fs/sync.c b/fs/sync.c
index 7cd005e..5561464 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -87,6 +87,8 @@ long do_fsync(struct file *file, int datasync)
goto out;
}

+ mapping_update_time(mapping);
+
ret = filemap_fdatawrite(mapping);

/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3ec4a4..f0d3ced 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -511,6 +511,7 @@ struct address_space {
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
+ struct timespec mtime; /* modification time */
} __attribute__((aligned(sizeof(long))));
/*
* On most architectures that alignment is already the case; but
@@ -1977,7 +1978,17 @@ extern int buffer_migrate_page(struct address_space *,
extern int inode_change_ok(struct inode *, struct iattr *);
extern int __must_check inode_setattr(struct inode *, struct iattr *);

-extern void file_update_time(struct file *file);
+extern void inode_update_time(struct inode *, struct timespec *);
+
+static inline void file_update_time(struct file *file)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+ struct timespec ts = current_fs_time(inode->i_sb);
+
+ inode_update_time(inode, &ts);
+}
+
+extern void mapping_update_time(struct address_space *);

static inline ino_t parent_ino(struct dentry *dentry)
{
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index db8a410..bf0f9e7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -17,8 +17,9 @@
* Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page
* allocation mode flags.
*/
-#define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */
+#define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */
#define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */
+#define AS_MCTIME (__GFP_BITS_SHIFT + 2) /* mtime and ctime to update */

static inline void mapping_set_error(struct address_space *mapping, int error)
{
diff --git a/mm/msync.c b/mm/msync.c
index 44997bf..7657776 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -13,16 +13,37 @@
#include <linux/syscalls.h>

/*
+ * Scan the PTEs for pages belonging to the VMA and mark them read-only.
+ * It will force a pagefault on the next write access.
+ */
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+ unsigned long addr;
+
+ for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+ spinlock_t *ptl;
+ pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
+ pud_t *pud = pud_offset(pgd, addr);
+ pmd_t *pmd = pmd_offset(pud, addr);
+ pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+
+ if (pte_dirty(*pte) && pte_write(*pte))
+ *pte = pte_wrprotect(*pte);
+ pte_unmap_unlock(pte, ptl);
+ }
+}
+
+/*
* MS_SYNC syncs the entire file - including mappings.
*
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
+ * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
+ * read-only by calling vma_wrprotect(). This is needed to catch the next
+ * write reference to the mapped region and update the file times
+ * accordingly.
*
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
+ * The application may now run fsync() to write out the dirty pages and
+ * wait on the writeout and check the result. Or the application may run
+ * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately.
* So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
* applications.
*/
@@ -80,16 +101,22 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
start = vma->vm_end;

file = vma->vm_file;
- if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
- get_file(file);
- up_read(&mm->mmap_sem);
- error = do_fsync(file, 0);
- fput(file);
- if (error)
- goto out;
- down_read(&mm->mmap_sem);
- vma = find_vma(mm, start);
- continue;
+ if (file && (vma->vm_flags & VM_SHARED)) {
+ if (flags & MS_ASYNC) {
+ vma_wrprotect(vma);
+ mapping_update_time(file->f_mapping);
+ }
+ if (flags & MS_SYNC) {
+ get_file(file);
+ up_read(&mm->mmap_sem);
+ error = do_fsync(file, 0);
+ fput(file);
+ if (error)
+ goto out;
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, start);
+ continue;
+ }
}

vma = vma->vm_next;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3d3848f..53d0e34 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -997,35 +997,39 @@ int __set_page_dirty_no_writeback(struct page *page)
*/
int __set_page_dirty_nobuffers(struct page *page)
{
- if (!TestSetPageDirty(page)) {
- struct address_space *mapping = page_mapping(page);
- struct address_space *mapping2;
+ struct address_space *mapping = page_mapping(page);
+ struct address_space *mapping2;

- if (!mapping)
- return 1;
+ if (!mapping)
+ return 1;

- write_lock_irq(&mapping->tree_lock);
- mapping2 = page_mapping(page);
- if (mapping2) { /* Race with truncate? */
- BUG_ON(mapping2 != mapping);
- WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
- if (mapping_cap_account_dirty(mapping)) {
- __inc_zone_page_state(page, NR_FILE_DIRTY);
- __inc_bdi_stat(mapping->backing_dev_info,
- BDI_RECLAIMABLE);
- task_io_account_write(PAGE_CACHE_SIZE);
- }
- radix_tree_tag_set(&mapping->page_tree,
- page_index(page), PAGECACHE_TAG_DIRTY);
- }
- write_unlock_irq(&mapping->tree_lock);
- if (mapping->host) {
- /* !PageAnon && !swapper_space */
- __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+ mapping->mtime = CURRENT_TIME;
+ set_bit(AS_MCTIME, &mapping->flags);
+
+ if (TestSetPageDirty(page))
+ return 0;
+
+ write_lock_irq(&mapping->tree_lock);
+ mapping2 = page_mapping(page);
+ if (mapping2) {
+ /* Race with truncate? */
+ BUG_ON(mapping2 != mapping);
+ WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
+ if (mapping_cap_account_dirty(mapping)) {
+ __inc_zone_page_state(page, NR_FILE_DIRTY);
+ __inc_bdi_stat(mapping->backing_dev_info,
+ BDI_RECLAIMABLE);
+ task_io_account_write(PAGE_CACHE_SIZE);
}
- return 1;
+ radix_tree_tag_set(&mapping->page_tree,
+ page_index(page), PAGECACHE_TAG_DIRTY);
}
- return 0;
+ write_unlock_irq(&mapping->tree_lock);
+
+ if (mapping->host)
+ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+ return 1;
}
EXPORT_SYMBOL(__set_page_dirty_nobuffers);

--
1.4.4.4

2008-01-17 11:01:38

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH -v5 1/2] Massive code cleanup of sys_msync()

> Substantial code cleanup of the sys_msync() function:
>
> 1) using the PAGE_ALIGN() macro instead of "manual" alignment;
> 2) improved readability of the loop traversing the process memory regions.
>
> Signed-off-by: Anton Salikhmetov <[email protected]>
> ---
> mm/msync.c | 74 +++++++++++++++++++++++++++++------------------------------
> 1 files changed, 36 insertions(+), 38 deletions(-)
>
> diff --git a/mm/msync.c b/mm/msync.c
> index 144a757..44997bf 100644
> --- a/mm/msync.c
> +++ b/mm/msync.c
> @@ -1,24 +1,22 @@
> /*
> - * linux/mm/msync.c
> + * The msync() system call.
> *
> - * Copyright (C) 1994-1999 Linus Torvalds
> + * Copyright (C) 1994-1999 Linus Torvalds
> + * Copyright (C) 2008 Anton Salikhmetov <[email protected]>
> */
>
> -/*
> - * The msync() system call.
> - */
> +#include <linux/file.h>
> #include <linux/fs.h>
> #include <linux/mm.h>
> #include <linux/mman.h>
> -#include <linux/file.h>
> -#include <linux/syscalls.h>
> #include <linux/sched.h>
> +#include <linux/syscalls.h>
>
> /*
> * MS_SYNC syncs the entire file - including mappings.
> *
> * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> - * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
> + * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
> * Now it doesn't do anything, since dirty pages are properly tracked.
> *
> * The application may now run fsync() to
> @@ -33,8 +31,7 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
> unsigned long end;
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma;
> - int unmapped_error = 0;
> - int error = -EINVAL;
> + int error = -EINVAL, unmapped_error = 0;

I prefer multi-line variable declarations, especially for ones with an
initializer.

>
> if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
> goto out;
> @@ -42,62 +39,63 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
> goto out;
> if ((flags & MS_ASYNC) && (flags & MS_SYNC))
> goto out;
> - error = -ENOMEM;
> - len = (len + ~PAGE_MASK) & PAGE_MASK;
> +
> + len = PAGE_ALIGN(len);
> end = start + len;
> - if (end < start)
> + if (end < start) {
> + error = -ENOMEM;

The usual style is to have the error assignment outside the
conditional. That way is shorter, clearer, as well as possibly
generating better code.

> goto out;
> + }
> +
> error = 0;
> +

Unnecessary empty line here, these two statements actually belong
together.

> if (end == start)
> goto out;
> +
> /*
> * If the interval [start,end) covers some unmapped address ranges,
> * just ignore them, but return -ENOMEM at the end.
> */
> down_read(&mm->mmap_sem);
> vma = find_vma(mm, start);
> - for (;;) {
> + do {
> struct file *file;
>
> - /* Still start < end. */
> - error = -ENOMEM;
> - if (!vma)
> - goto out_unlock;
> - /* Here start < vma->vm_end. */
> + if (!vma) {
> + error = -ENOMEM;
> + break;
> + }

Again, error asignment should be outside the conditional. This of
course means, you'll have to set the error back to zero at the end of
the loop, but that's fine.

> if (start < vma->vm_start) {
> start = vma->vm_start;
> - if (start >= end)
> - goto out_unlock;
> + if (start >= end) {
> + error = -ENOMEM;
> + break;
> + }

Ditto.

> unmapped_error = -ENOMEM;
> }
> - /* Here vma->vm_start <= start < vma->vm_end. */
> - if ((flags & MS_INVALIDATE) &&
> - (vma->vm_flags & VM_LOCKED)) {
> + if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED)) {
> error = -EBUSY;

Ditto2.

> - goto out_unlock;
> + break;
> }
> - file = vma->vm_file;
> start = vma->vm_end;
> - if ((flags & MS_SYNC) && file &&
> - (vma->vm_flags & VM_SHARED)) {
> +
> + file = vma->vm_file;
> + if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
> get_file(file);
> up_read(&mm->mmap_sem);
> error = do_fsync(file, 0);
> fput(file);
> - if (error || start >= end)
> + if (error)

This simplifies, but also does unnecessary down/find_vma/up.

> goto out;
> down_read(&mm->mmap_sem);
> vma = find_vma(mm, start);
> - } else {
> - if (start >= end) {
> - error = 0;
> - goto out_unlock;
> - }
> - vma = vma->vm_next;
> + continue;
> }
> - }
> -out_unlock:
> +
> + vma = vma->vm_next;
> + } while (start < end);
> up_read(&mm->mmap_sem);
> +
> out:
> - return error ? : unmapped_error;
> + return error ? error : unmapped_error;
> }

2008-01-17 11:13:24

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

> http://bugzilla.kernel.org/show_bug.cgi?id=2645
>
> Changes for updating the ctime and mtime fields for memory-mapped files:
>
> 1) a new flag triggering update of the inode data;
> 2) a new field in the address_space structure for saving modification time;
> 3) a new helper function to update ctime and mtime when needed;
> 4) updating time stamps for mapped files in sys_msync() and do_fsync();
> 5) implementing lazy ctime and mtime update.

OK, the functionality seems to be there now. As a next step, I think
you should try to simplify the patch, removing everything, that is not
strictly necessary.

>
> Signed-off-by: Anton Salikhmetov <[email protected]>
> ---
> fs/buffer.c | 3 ++
> fs/fs-writeback.c | 2 +
> fs/inode.c | 43 +++++++++++++++++++++++----------
> fs/sync.c | 2 +
> include/linux/fs.h | 13 +++++++++-
> include/linux/pagemap.h | 3 +-
> mm/msync.c | 61 +++++++++++++++++++++++++++++++++-------------
> mm/page-writeback.c | 54 ++++++++++++++++++++++-------------------
> 8 files changed, 124 insertions(+), 57 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 7249e01..3967aa7 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -701,6 +701,9 @@ static int __set_page_dirty(struct page *page,
> if (unlikely(!mapping))
> return !TestSetPageDirty(page);
>
> + mapping->mtime = CURRENT_TIME;

Why is this needed? POSIX explicitly states, that the modification
time can be set to anywhere between the first write and the msync.

> + set_bit(AS_MCTIME, &mapping->flags);

A bigger problem is that doing this in __set_page_dirty() and friends
will mean, that the flag will be set for non-mapped writes as well,
which we definitely don't want.

A better place to put it is do_wp_page and __do_fault, where
set_page_dirty_balance() is called.

> +
> if (TestSetPageDirty(page))
> return 0;
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 300324b..affd291 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -243,6 +243,8 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
>
> spin_unlock(&inode_lock);
>
> + mapping_update_time(mapping);
> +

I think this is unnecessary. Rather put the update into remove_vma().

> ret = do_writepages(mapping, wbc);
>
> /* Don't write the inode if only I_DIRTY_PAGES was set */
> diff --git a/fs/inode.c b/fs/inode.c
> index ed35383..edd5bf4 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1243,8 +1243,10 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
> EXPORT_SYMBOL(touch_atime);
>
> /**
> - * file_update_time - update mtime and ctime time
> - * @file: file accessed
> + * inode_update_time - update mtime and ctime time
> + * @inode: inode accessed
> + * @ts: time when inode was accessed
> + * @sync: whether to do synchronous update
> *
> * Update the mtime and ctime members of an inode and mark the inode
> * for writeback. Note that this function is meant exclusively for
> @@ -1253,11 +1255,8 @@ EXPORT_SYMBOL(touch_atime);
> * S_NOCTIME inode flag, e.g. for network filesystem where these
> * timestamps are handled by the server.
> */
> -
> -void file_update_time(struct file *file)
> +void inode_update_time(struct inode *inode, struct timespec *ts)
> {
> - struct inode *inode = file->f_path.dentry->d_inode;
> - struct timespec now;
> int sync_it = 0;
>
> if (IS_NOCMTIME(inode))
> @@ -1265,22 +1264,41 @@ void file_update_time(struct file *file)
> if (IS_RDONLY(inode))
> return;
>
> - now = current_fs_time(inode->i_sb);
> - if (!timespec_equal(&inode->i_mtime, &now)) {
> - inode->i_mtime = now;
> + if (timespec_compare(&inode->i_mtime, ts) < 0) {
> + inode->i_mtime = *ts;
> sync_it = 1;
> }
>
> - if (!timespec_equal(&inode->i_ctime, &now)) {
> - inode->i_ctime = now;
> + if (timespec_compare(&inode->i_ctime, ts) < 0) {
> + inode->i_ctime = *ts;
> sync_it = 1;
> }
>
> if (sync_it)
> mark_inode_dirty_sync(inode);
> }
> +EXPORT_SYMBOL(inode_update_time);
>
> -EXPORT_SYMBOL(file_update_time);
> +/*
> + * Update the ctime and mtime stamps after checking if they are to be updated.
> + */
> +void mapping_update_time(struct address_space *mapping)
> +{
> + if (test_and_clear_bit(AS_MCTIME, &mapping->flags)) {
> + struct inode *inode = mapping->host;
> + struct timespec *ts = &mapping->mtime;
> +
> + if (S_ISBLK(inode->i_mode)) {
> + struct block_device *bdev = inode->i_bdev;
> +
> + mutex_lock(&bdev->bd_mutex);
> + list_for_each_entry(inode, &bdev->bd_inodes, i_devices)
> + inode_update_time(inode, ts);
> + mutex_unlock(&bdev->bd_mutex);
> + } else
> + inode_update_time(inode, ts);
> + }
> +}

All these changes to inode.c are unnecessary, I think.

>
> int inode_needs_sync(struct inode *inode)
> {
> @@ -1290,7 +1308,6 @@ int inode_needs_sync(struct inode *inode)
> return 1;
> return 0;
> }
> -
> EXPORT_SYMBOL(inode_needs_sync);
>
> int inode_wait(void *word)
> diff --git a/fs/sync.c b/fs/sync.c
> index 7cd005e..5561464 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -87,6 +87,8 @@ long do_fsync(struct file *file, int datasync)
> goto out;
> }
>
> + mapping_update_time(mapping);
> +
> ret = filemap_fdatawrite(mapping);
>
> /*
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b3ec4a4..f0d3ced 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -511,6 +511,7 @@ struct address_space {
> spinlock_t private_lock; /* for use by the address_space */
> struct list_head private_list; /* ditto */
> struct address_space *assoc_mapping; /* ditto */
> + struct timespec mtime; /* modification time */
> } __attribute__((aligned(sizeof(long))));
> /*
> * On most architectures that alignment is already the case; but
> @@ -1977,7 +1978,17 @@ extern int buffer_migrate_page(struct address_space *,
> extern int inode_change_ok(struct inode *, struct iattr *);
> extern int __must_check inode_setattr(struct inode *, struct iattr *);
>
> -extern void file_update_time(struct file *file);
> +extern void inode_update_time(struct inode *, struct timespec *);
> +
> +static inline void file_update_time(struct file *file)
> +{
> + struct inode *inode = file->f_dentry->d_inode;
> + struct timespec ts = current_fs_time(inode->i_sb);
> +
> + inode_update_time(inode, &ts);
> +}
> +
> +extern void mapping_update_time(struct address_space *);

As is this.

>
> static inline ino_t parent_ino(struct dentry *dentry)
> {
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index db8a410..bf0f9e7 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -17,8 +17,9 @@
> * Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page
> * allocation mode flags.
> */
> -#define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */
> +#define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */
> #define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */
> +#define AS_MCTIME (__GFP_BITS_SHIFT + 2) /* mtime and ctime to update */
>
> static inline void mapping_set_error(struct address_space *mapping, int error)
> {
> diff --git a/mm/msync.c b/mm/msync.c
> index 44997bf..7657776 100644
> --- a/mm/msync.c
> +++ b/mm/msync.c
> @@ -13,16 +13,37 @@
> #include <linux/syscalls.h>
>
> /*
> + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> + * It will force a pagefault on the next write access.
> + */
> +static void vma_wrprotect(struct vm_area_struct *vma)
> +{
> + unsigned long addr;
> +
> + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> + spinlock_t *ptl;
> + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> + pud_t *pud = pud_offset(pgd, addr);
> + pmd_t *pmd = pmd_offset(pud, addr);
> + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +
> + if (pte_dirty(*pte) && pte_write(*pte))
> + *pte = pte_wrprotect(*pte);
> + pte_unmap_unlock(pte, ptl);
> + }
> +}
> +

There might be a more efficient way to do this, than getting and
releasing the lock for each pte.

> +/*
> * MS_SYNC syncs the entire file - including mappings.
> *
> - * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> - * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
> - * Now it doesn't do anything, since dirty pages are properly tracked.
> + * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
> + * read-only by calling vma_wrprotect(). This is needed to catch the next
> + * write reference to the mapped region and update the file times
> + * accordingly.
> *
> - * The application may now run fsync() to
> - * write out the dirty pages and wait on the writeout and check the result.
> - * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
> - * async writeout immediately.
> + * The application may now run fsync() to write out the dirty pages and
> + * wait on the writeout and check the result. Or the application may run
> + * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately.
> * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
> * applications.
> */
> @@ -80,16 +101,22 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
> start = vma->vm_end;
>
> file = vma->vm_file;
> - if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
> - get_file(file);
> - up_read(&mm->mmap_sem);
> - error = do_fsync(file, 0);
> - fput(file);
> - if (error)
> - goto out;
> - down_read(&mm->mmap_sem);
> - vma = find_vma(mm, start);
> - continue;
> + if (file && (vma->vm_flags & VM_SHARED)) {
> + if (flags & MS_ASYNC) {
> + vma_wrprotect(vma);
> + mapping_update_time(file->f_mapping);
> + }
> + if (flags & MS_SYNC) {
> + get_file(file);
> + up_read(&mm->mmap_sem);
> + error = do_fsync(file, 0);
> + fput(file);
> + if (error)
> + goto out;
> + down_read(&mm->mmap_sem);
> + vma = find_vma(mm, start);
> + continue;
> + }
> }
>
> vma = vma->vm_next;
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 3d3848f..53d0e34 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -997,35 +997,39 @@ int __set_page_dirty_no_writeback(struct page *page)
> */
> int __set_page_dirty_nobuffers(struct page *page)
> {
> - if (!TestSetPageDirty(page)) {
> - struct address_space *mapping = page_mapping(page);
> - struct address_space *mapping2;
> + struct address_space *mapping = page_mapping(page);
> + struct address_space *mapping2;
>
> - if (!mapping)
> - return 1;
> + if (!mapping)
> + return 1;
>
> - write_lock_irq(&mapping->tree_lock);
> - mapping2 = page_mapping(page);
> - if (mapping2) { /* Race with truncate? */
> - BUG_ON(mapping2 != mapping);
> - WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> - if (mapping_cap_account_dirty(mapping)) {
> - __inc_zone_page_state(page, NR_FILE_DIRTY);
> - __inc_bdi_stat(mapping->backing_dev_info,
> - BDI_RECLAIMABLE);
> - task_io_account_write(PAGE_CACHE_SIZE);
> - }
> - radix_tree_tag_set(&mapping->page_tree,
> - page_index(page), PAGECACHE_TAG_DIRTY);
> - }
> - write_unlock_irq(&mapping->tree_lock);
> - if (mapping->host) {
> - /* !PageAnon && !swapper_space */
> - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> + mapping->mtime = CURRENT_TIME;
> + set_bit(AS_MCTIME, &mapping->flags);
> +
> + if (TestSetPageDirty(page))
> + return 0;
> +
> + write_lock_irq(&mapping->tree_lock);
> + mapping2 = page_mapping(page);
> + if (mapping2) {
> + /* Race with truncate? */
> + BUG_ON(mapping2 != mapping);
> + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> + if (mapping_cap_account_dirty(mapping)) {
> + __inc_zone_page_state(page, NR_FILE_DIRTY);
> + __inc_bdi_stat(mapping->backing_dev_info,
> + BDI_RECLAIMABLE);
> + task_io_account_write(PAGE_CACHE_SIZE);
> }
> - return 1;
> + radix_tree_tag_set(&mapping->page_tree,
> + page_index(page), PAGECACHE_TAG_DIRTY);
> }
> - return 0;
> + write_unlock_irq(&mapping->tree_lock);
> +
> + if (mapping->host)
> + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +
> + return 1;
> }
> EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>
> --
> 1.4.4.4
>
>

2008-01-17 11:47:48

by Anton Salikhmetov

[permalink] [raw]
Subject: Re: [PATCH -v5 1/2] Massive code cleanup of sys_msync()

2008/1/17, Miklos Szeredi <[email protected]>:
> > Substantial code cleanup of the sys_msync() function:
> >
> > 1) using the PAGE_ALIGN() macro instead of "manual" alignment;
> > 2) improved readability of the loop traversing the process memory regions.
> >
> > Signed-off-by: Anton Salikhmetov <[email protected]>
> > ---
> > mm/msync.c | 74 +++++++++++++++++++++++++++++------------------------------
> > 1 files changed, 36 insertions(+), 38 deletions(-)
> >
> > diff --git a/mm/msync.c b/mm/msync.c
> > index 144a757..44997bf 100644
> > --- a/mm/msync.c
> > +++ b/mm/msync.c
> > @@ -1,24 +1,22 @@
> > /*
> > - * linux/mm/msync.c
> > + * The msync() system call.
> > *
> > - * Copyright (C) 1994-1999 Linus Torvalds
> > + * Copyright (C) 1994-1999 Linus Torvalds
> > + * Copyright (C) 2008 Anton Salikhmetov <[email protected]>
> > */
> >
> > -/*
> > - * The msync() system call.
> > - */
> > +#include <linux/file.h>
> > #include <linux/fs.h>
> > #include <linux/mm.h>
> > #include <linux/mman.h>
> > -#include <linux/file.h>
> > -#include <linux/syscalls.h>
> > #include <linux/sched.h>
> > +#include <linux/syscalls.h>
> >
> > /*
> > * MS_SYNC syncs the entire file - including mappings.
> > *
> > * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> > - * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
> > + * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
> > * Now it doesn't do anything, since dirty pages are properly tracked.
> > *
> > * The application may now run fsync() to
> > @@ -33,8 +31,7 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
> > unsigned long end;
> > struct mm_struct *mm = current->mm;
> > struct vm_area_struct *vma;
> > - int unmapped_error = 0;
> > - int error = -EINVAL;
> > + int error = -EINVAL, unmapped_error = 0;
>
> I prefer multi-line variable declarations, especially for ones with an
> initializer.
>
> >
> > if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
> > goto out;
> > @@ -42,62 +39,63 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
> > goto out;
> > if ((flags & MS_ASYNC) && (flags & MS_SYNC))
> > goto out;
> > - error = -ENOMEM;
> > - len = (len + ~PAGE_MASK) & PAGE_MASK;
> > +
> > + len = PAGE_ALIGN(len);
> > end = start + len;
> > - if (end < start)
> > + if (end < start) {
> > + error = -ENOMEM;
>
> The usual style is to have the error assignment outside the
> conditional. That way is shorter, clearer, as well as possibly
> generating better code.
>
> > goto out;
> > + }
> > +
> > error = 0;
> > +
>
> Unnecessary empty line here, these two statements actually belong
> together.
>
> > if (end == start)
> > goto out;
> > +
> > /*
> > * If the interval [start,end) covers some unmapped address ranges,
> > * just ignore them, but return -ENOMEM at the end.
> > */
> > down_read(&mm->mmap_sem);
> > vma = find_vma(mm, start);
> > - for (;;) {
> > + do {
> > struct file *file;
> >
> > - /* Still start < end. */
> > - error = -ENOMEM;
> > - if (!vma)
> > - goto out_unlock;
> > - /* Here start < vma->vm_end. */
> > + if (!vma) {
> > + error = -ENOMEM;
> > + break;
> > + }
>
> Again, error asignment should be outside the conditional. This of
> course means, you'll have to set the error back to zero at the end of
> the loop, but that's fine.
>
> > if (start < vma->vm_start) {
> > start = vma->vm_start;
> > - if (start >= end)
> > - goto out_unlock;
> > + if (start >= end) {
> > + error = -ENOMEM;
> > + break;
> > + }
>
> Ditto.
>
> > unmapped_error = -ENOMEM;
> > }
> > - /* Here vma->vm_start <= start < vma->vm_end. */
> > - if ((flags & MS_INVALIDATE) &&
> > - (vma->vm_flags & VM_LOCKED)) {
> > + if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED)) {
> > error = -EBUSY;
>
> Ditto2.
>
> > - goto out_unlock;
> > + break;
> > }
> > - file = vma->vm_file;
> > start = vma->vm_end;
> > - if ((flags & MS_SYNC) && file &&
> > - (vma->vm_flags & VM_SHARED)) {
> > +
> > + file = vma->vm_file;
> > + if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
> > get_file(file);
> > up_read(&mm->mmap_sem);
> > error = do_fsync(file, 0);
> > fput(file);
> > - if (error || start >= end)
> > + if (error)
>
> This simplifies, but also does unnecessary down/find_vma/up.
>
> > goto out;
> > down_read(&mm->mmap_sem);
> > vma = find_vma(mm, start);
> > - } else {
> > - if (start >= end) {
> > - error = 0;
> > - goto out_unlock;
> > - }
> > - vma = vma->vm_next;
> > + continue;
> > }
> > - }
> > -out_unlock:
> > +
> > + vma = vma->vm_next;
> > + } while (start < end);
> > up_read(&mm->mmap_sem);
> > +
> > out:
> > - return error ? : unmapped_error;
> > + return error ? error : unmapped_error;
> > }
>

Thanks for your recommendations!

I'll take them into account for the next version.

2008-01-17 12:17:15

by Anton Salikhmetov

[permalink] [raw]
Subject: Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008/1/17, Miklos Szeredi <[email protected]>:
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645
> >
> > Changes for updating the ctime and mtime fields for memory-mapped files:
> >
> > 1) a new flag triggering update of the inode data;
> > 2) a new field in the address_space structure for saving modification time;
> > 3) a new helper function to update ctime and mtime when needed;
> > 4) updating time stamps for mapped files in sys_msync() and do_fsync();
> > 5) implementing lazy ctime and mtime update.
>
> OK, the functionality seems to be there now. As a next step, I think
> you should try to simplify the patch, removing everything, that is not
> strictly necessary.
>
> >
> > Signed-off-by: Anton Salikhmetov <[email protected]>
> > ---
> > fs/buffer.c | 3 ++
> > fs/fs-writeback.c | 2 +
> > fs/inode.c | 43 +++++++++++++++++++++++----------
> > fs/sync.c | 2 +
> > include/linux/fs.h | 13 +++++++++-
> > include/linux/pagemap.h | 3 +-
> > mm/msync.c | 61 +++++++++++++++++++++++++++++++++-------------
> > mm/page-writeback.c | 54 ++++++++++++++++++++++-------------------
> > 8 files changed, 124 insertions(+), 57 deletions(-)
> >
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 7249e01..3967aa7 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -701,6 +701,9 @@ static int __set_page_dirty(struct page *page,
> > if (unlikely(!mapping))
> > return !TestSetPageDirty(page);
> >
> > + mapping->mtime = CURRENT_TIME;
>
> Why is this needed? POSIX explicitly states, that the modification
> time can be set to anywhere between the first write and the msync.

I've already described this change in one of my previous emails:

> 4. Recording the time was the file data changed
>
> Finally, I noticed yet another issue with the previous version of my patch.
> Specifically, the time stamps were set to the current time of the moment
> when syncing but not the write reference was being done. This led to the
> following adverse effect on my development system:
>
> 1) a text file A was updated by process B;
> 2) process B exits without calling any of the *sync() functions;
> 3) vi editor opens the file A;
> 4) file data synced, file times updated;
> 5) vi is confused by "thinking" that the file was changed after 3).
>
> This version overcomes this problem by introducing another field into the
> address_space structure. This field is used to "remember" the time of
> dirtying, and then this time value is propagated to the file metadata.
>
> This approach is based upon the following suggestion given by Peter
> Staubach during one of our previous discussions:
>
> http://lkml.org/lkml/2008/1/9/267
>
> > A better architecture would be to arrange for the file times
> > to be updated when the page makes the transition from being
> > unmodified to modified. This is not straightforward due to
> > the current locking, but should be doable, I think. Perhaps
> > recording the current time and then using it to update the
> > file times at a more suitable time (no pun intended) might
> > work.
>
> The solution I propose now proves the viability of the latter
> approach.

See:

http://lkml.org/lkml/2008/1/15/202

>
> > + set_bit(AS_MCTIME, &mapping->flags);
>
> A bigger problem is that doing this in __set_page_dirty() and friends
> will mean, that the flag will be set for non-mapped writes as well,
> which we definitely don't want.
>
> A better place to put it is do_wp_page and __do_fault, where
> set_page_dirty_balance() is called.

Thanks for your suggestion, I'll consider how I can apply it.

>
> > +
> > if (TestSetPageDirty(page))
> > return 0;
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 300324b..affd291 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -243,6 +243,8 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
> >
> > spin_unlock(&inode_lock);
> >
> > + mapping_update_time(mapping);
> > +
>
> I think this is unnecessary. Rather put the update into remove_vma().

Thanks for your suggestion, I'll consider how I can apply it.

However, I suspect that this is a bit problematic. Let me check it out.

>
> > ret = do_writepages(mapping, wbc);
> >
> > /* Don't write the inode if only I_DIRTY_PAGES was set */
> > diff --git a/fs/inode.c b/fs/inode.c
> > index ed35383..edd5bf4 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -1243,8 +1243,10 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
> > EXPORT_SYMBOL(touch_atime);
> >
> > /**
> > - * file_update_time - update mtime and ctime time
> > - * @file: file accessed
> > + * inode_update_time - update mtime and ctime time
> > + * @inode: inode accessed
> > + * @ts: time when inode was accessed
> > + * @sync: whether to do synchronous update
> > *
> > * Update the mtime and ctime members of an inode and mark the inode
> > * for writeback. Note that this function is meant exclusively for
> > @@ -1253,11 +1255,8 @@ EXPORT_SYMBOL(touch_atime);
> > * S_NOCTIME inode flag, e.g. for network filesystem where these
> > * timestamps are handled by the server.
> > */
> > -
> > -void file_update_time(struct file *file)
> > +void inode_update_time(struct inode *inode, struct timespec *ts)
> > {
> > - struct inode *inode = file->f_path.dentry->d_inode;
> > - struct timespec now;
> > int sync_it = 0;
> >
> > if (IS_NOCMTIME(inode))
> > @@ -1265,22 +1264,41 @@ void file_update_time(struct file *file)
> > if (IS_RDONLY(inode))
> > return;
> >
> > - now = current_fs_time(inode->i_sb);
> > - if (!timespec_equal(&inode->i_mtime, &now)) {
> > - inode->i_mtime = now;
> > + if (timespec_compare(&inode->i_mtime, ts) < 0) {
> > + inode->i_mtime = *ts;
> > sync_it = 1;
> > }
> >
> > - if (!timespec_equal(&inode->i_ctime, &now)) {
> > - inode->i_ctime = now;
> > + if (timespec_compare(&inode->i_ctime, ts) < 0) {
> > + inode->i_ctime = *ts;
> > sync_it = 1;
> > }
> >
> > if (sync_it)
> > mark_inode_dirty_sync(inode);
> > }
> > +EXPORT_SYMBOL(inode_update_time);
> >
> > -EXPORT_SYMBOL(file_update_time);
> > +/*
> > + * Update the ctime and mtime stamps after checking if they are to be updated.
> > + */
> > +void mapping_update_time(struct address_space *mapping)
> > +{
> > + if (test_and_clear_bit(AS_MCTIME, &mapping->flags)) {
> > + struct inode *inode = mapping->host;
> > + struct timespec *ts = &mapping->mtime;
> > +
> > + if (S_ISBLK(inode->i_mode)) {
> > + struct block_device *bdev = inode->i_bdev;
> > +
> > + mutex_lock(&bdev->bd_mutex);
> > + list_for_each_entry(inode, &bdev->bd_inodes, i_devices)
> > + inode_update_time(inode, ts);
> > + mutex_unlock(&bdev->bd_mutex);
> > + } else
> > + inode_update_time(inode, ts);
> > + }
> > +}
>
> All these changes to inode.c are unnecessary, I think.

The first part is necessary to account for "remembering" the modification time.

The second part is for handling block device files. I cannot see any other
sane way to update file times for them.

>
> >
> > int inode_needs_sync(struct inode *inode)
> > {
> > @@ -1290,7 +1308,6 @@ int inode_needs_sync(struct inode *inode)
> > return 1;
> > return 0;
> > }
> > -
> > EXPORT_SYMBOL(inode_needs_sync);
> >
> > int inode_wait(void *word)
> > diff --git a/fs/sync.c b/fs/sync.c
> > index 7cd005e..5561464 100644
> > --- a/fs/sync.c
> > +++ b/fs/sync.c
> > @@ -87,6 +87,8 @@ long do_fsync(struct file *file, int datasync)
> > goto out;
> > }
> >
> > + mapping_update_time(mapping);
> > +
> > ret = filemap_fdatawrite(mapping);
> >
> > /*
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index b3ec4a4..f0d3ced 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -511,6 +511,7 @@ struct address_space {
> > spinlock_t private_lock; /* for use by the address_space */
> > struct list_head private_list; /* ditto */
> > struct address_space *assoc_mapping; /* ditto */
> > + struct timespec mtime; /* modification time */
> > } __attribute__((aligned(sizeof(long))));
> > /*
> > * On most architectures that alignment is already the case; but
> > @@ -1977,7 +1978,17 @@ extern int buffer_migrate_page(struct address_space *,
> > extern int inode_change_ok(struct inode *, struct iattr *);
> > extern int __must_check inode_setattr(struct inode *, struct iattr *);
> >
> > -extern void file_update_time(struct file *file);
> > +extern void inode_update_time(struct inode *, struct timespec *);
> > +
> > +static inline void file_update_time(struct file *file)
> > +{
> > + struct inode *inode = file->f_dentry->d_inode;
> > + struct timespec ts = current_fs_time(inode->i_sb);
> > +
> > + inode_update_time(inode, &ts);
> > +}
> > +
> > +extern void mapping_update_time(struct address_space *);
>
> As is this.

This change is a logical consequence of introducing
the new inode_update_time() routine, which was explained above.

>
> >
> > static inline ino_t parent_ino(struct dentry *dentry)
> > {
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index db8a410..bf0f9e7 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -17,8 +17,9 @@
> > * Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page
> > * allocation mode flags.
> > */
> > -#define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */
> > +#define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */
> > #define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */
> > +#define AS_MCTIME (__GFP_BITS_SHIFT + 2) /* mtime and ctime to update */
> >
> > static inline void mapping_set_error(struct address_space *mapping, int error)
> > {
> > diff --git a/mm/msync.c b/mm/msync.c
> > index 44997bf..7657776 100644
> > --- a/mm/msync.c
> > +++ b/mm/msync.c
> > @@ -13,16 +13,37 @@
> > #include <linux/syscalls.h>
> >
> > /*
> > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > + * It will force a pagefault on the next write access.
> > + */
> > +static void vma_wrprotect(struct vm_area_struct *vma)
> > +{
> > + unsigned long addr;
> > +
> > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > + spinlock_t *ptl;
> > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > + pud_t *pud = pud_offset(pgd, addr);
> > + pmd_t *pmd = pmd_offset(pud, addr);
> > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +
> > + if (pte_dirty(*pte) && pte_write(*pte))
> > + *pte = pte_wrprotect(*pte);
> > + pte_unmap_unlock(pte, ptl);
> > + }
> > +}
> > +
>
> There might be a more efficient way to do this, than getting and
> releasing the lock for each pte.

I haven't found any locks with bigger granularity here. Frankly, I do
not think that this should be a big performance hit if we acquire and
release the lock for each PTE. However, I'll be grateful for any further
suggestions.

Rik, can you please comment on this?

>
> > +/*
> > * MS_SYNC syncs the entire file - including mappings.
> > *
> > - * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> > - * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
> > - * Now it doesn't do anything, since dirty pages are properly tracked.
> > + * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
> > + * read-only by calling vma_wrprotect(). This is needed to catch the next
> > + * write reference to the mapped region and update the file times
> > + * accordingly.
> > *
> > - * The application may now run fsync() to
> > - * write out the dirty pages and wait on the writeout and check the result.
> > - * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
> > - * async writeout immediately.
> > + * The application may now run fsync() to write out the dirty pages and
> > + * wait on the writeout and check the result. Or the application may run
> > + * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately.
> > * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
> > * applications.
> > */
> > @@ -80,16 +101,22 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
> > start = vma->vm_end;
> >
> > file = vma->vm_file;
> > - if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
> > - get_file(file);
> > - up_read(&mm->mmap_sem);
> > - error = do_fsync(file, 0);
> > - fput(file);
> > - if (error)
> > - goto out;
> > - down_read(&mm->mmap_sem);
> > - vma = find_vma(mm, start);
> > - continue;
> > + if (file && (vma->vm_flags & VM_SHARED)) {
> > + if (flags & MS_ASYNC) {
> > + vma_wrprotect(vma);
> > + mapping_update_time(file->f_mapping);
> > + }
> > + if (flags & MS_SYNC) {
> > + get_file(file);
> > + up_read(&mm->mmap_sem);
> > + error = do_fsync(file, 0);
> > + fput(file);
> > + if (error)
> > + goto out;
> > + down_read(&mm->mmap_sem);
> > + vma = find_vma(mm, start);
> > + continue;
> > + }
> > }
> >
> > vma = vma->vm_next;
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 3d3848f..53d0e34 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -997,35 +997,39 @@ int __set_page_dirty_no_writeback(struct page *page)
> > */
> > int __set_page_dirty_nobuffers(struct page *page)
> > {
> > - if (!TestSetPageDirty(page)) {
> > - struct address_space *mapping = page_mapping(page);
> > - struct address_space *mapping2;
> > + struct address_space *mapping = page_mapping(page);
> > + struct address_space *mapping2;
> >
> > - if (!mapping)
> > - return 1;
> > + if (!mapping)
> > + return 1;
> >
> > - write_lock_irq(&mapping->tree_lock);
> > - mapping2 = page_mapping(page);
> > - if (mapping2) { /* Race with truncate? */
> > - BUG_ON(mapping2 != mapping);
> > - WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> > - if (mapping_cap_account_dirty(mapping)) {
> > - __inc_zone_page_state(page, NR_FILE_DIRTY);
> > - __inc_bdi_stat(mapping->backing_dev_info,
> > - BDI_RECLAIMABLE);
> > - task_io_account_write(PAGE_CACHE_SIZE);
> > - }
> > - radix_tree_tag_set(&mapping->page_tree,
> > - page_index(page), PAGECACHE_TAG_DIRTY);
> > - }
> > - write_unlock_irq(&mapping->tree_lock);
> > - if (mapping->host) {
> > - /* !PageAnon && !swapper_space */
> > - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > + mapping->mtime = CURRENT_TIME;
> > + set_bit(AS_MCTIME, &mapping->flags);
> > +
> > + if (TestSetPageDirty(page))
> > + return 0;
> > +
> > + write_lock_irq(&mapping->tree_lock);
> > + mapping2 = page_mapping(page);
> > + if (mapping2) {
> > + /* Race with truncate? */
> > + BUG_ON(mapping2 != mapping);
> > + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> > + if (mapping_cap_account_dirty(mapping)) {
> > + __inc_zone_page_state(page, NR_FILE_DIRTY);
> > + __inc_bdi_stat(mapping->backing_dev_info,
> > + BDI_RECLAIMABLE);
> > + task_io_account_write(PAGE_CACHE_SIZE);
> > }
> > - return 1;
> > + radix_tree_tag_set(&mapping->page_tree,
> > + page_index(page), PAGECACHE_TAG_DIRTY);
> > }
> > - return 0;
> > + write_unlock_irq(&mapping->tree_lock);
> > +
> > + if (mapping->host)
> > + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > +
> > + return 1;
> > }
> > EXPORT_SYMBOL(__set_page_dirty_nobuffers);
> >
> > --
> > 1.4.4.4
> >
> >
>

2008-01-17 12:46:19

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

> > 4. Recording the time was the file data changed
> >
> > Finally, I noticed yet another issue with the previous version of my patch.
> > Specifically, the time stamps were set to the current time of the moment
> > when syncing but not the write reference was being done. This led to the
> > following adverse effect on my development system:
> >
> > 1) a text file A was updated by process B;
> > 2) process B exits without calling any of the *sync() functions;
> > 3) vi editor opens the file A;
> > 4) file data synced, file times updated;
> > 5) vi is confused by "thinking" that the file was changed after 3).

Updating the time in remove_vma() would fix this, no?

> > All these changes to inode.c are unnecessary, I think.
>
> The first part is necessary to account for "remembering" the modification time.
>
> The second part is for handling block device files. I cannot see any other
> sane way to update file times for them.

Use file_update_time(), which will do the right thing. It will in
fact do the same thing as write(2) on the device, which is really what
we want.

Block devices being mapped for write through different device
nodes..., well, I don't think we really need to handle such weird
corner cases 100% acurately.

Miklos

2008-01-17 12:51:49

by Rogier Wolff

[permalink] [raw]
Subject: Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

On Thu, Jan 17, 2008 at 01:45:43PM +0100, Miklos Szeredi wrote:
> > > 4. Recording the time was the file data changed
> > >
> > > Finally, I noticed yet another issue with the previous version of my patch.
> > > Specifically, the time stamps were set to the current time of the moment
> > > when syncing but not the write reference was being done. This led to the
> > > following adverse effect on my development system:
> > >
> > > 1) a text file A was updated by process B;
> > > 2) process B exits without calling any of the *sync() functions;
> > > 3) vi editor opens the file A;
> > > 4) file data synced, file times updated;
> > > 5) vi is confused by "thinking" that the file was changed after 3).
>
> Updating the time in remove_vma() would fix this, no?

That sounds to me as the right thing to do. Although not explcitly
mentioned in the standard, it is the logical (latest allowable)
timestamp to put on the modifications by process B.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2008-01-17 13:16:56

by Anton Salikhmetov

[permalink] [raw]
Subject: Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008/1/17, Miklos Szeredi <[email protected]>:
> > > 4. Recording the time was the file data changed
> > >
> > > Finally, I noticed yet another issue with the previous version of my patch.
> > > Specifically, the time stamps were set to the current time of the moment
> > > when syncing but not the write reference was being done. This led to the
> > > following adverse effect on my development system:
> > >
> > > 1) a text file A was updated by process B;
> > > 2) process B exits without calling any of the *sync() functions;
> > > 3) vi editor opens the file A;
> > > 4) file data synced, file times updated;
> > > 5) vi is confused by "thinking" that the file was changed after 3).
>
> Updating the time in remove_vma() would fix this, no?

We need to save modification time. Otherwise, updating time stamps
will be confusing the vi editor.

>
> > > All these changes to inode.c are unnecessary, I think.
> >
> > The first part is necessary to account for "remembering" the modification time.
> >
> > The second part is for handling block device files. I cannot see any other
> > sane way to update file times for them.
>
> Use file_update_time(), which will do the right thing. It will in
> fact do the same thing as write(2) on the device, which is really what
> we want.
>
> Block devices being mapped for write through different device
> nodes..., well, I don't think we really need to handle such weird
> corner cases 100% acurately.

The file_update_time() cannot be used for implementing
the "auto-update" feature, because the sync() system call
doesn't "know" about the file which was memory-mapped.

>
> Miklos
>

2008-01-17 13:24:41

by Rogier Wolff

[permalink] [raw]
Subject: Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

On Thu, Jan 17, 2008 at 04:16:47PM +0300, Anton Salikhmetov wrote:
> 2008/1/17, Miklos Szeredi <[email protected]>:
> > > > 4. Recording the time was the file data changed
> > > >
> > > > Finally, I noticed yet another issue with the previous version of my patch.
> > > > Specifically, the time stamps were set to the current time of the moment
> > > > when syncing but not the write reference was being done. This led to the
> > > > following adverse effect on my development system:
> > > >
> > > > 1) a text file A was updated by process B;
> > > > 2) process B exits without calling any of the *sync() functions;
> > > > 3) vi editor opens the file A;
> > > > 4) file data synced, file times updated;
> > > > 5) vi is confused by "thinking" that the file was changed after 3).
> >
> > Updating the time in remove_vma() would fix this, no?
>
> We need to save modification time. Otherwise, updating time stamps
> will be confusing the vi editor.

If process B exits before vi opens the file, the timestamp should at
the latest be the time that process B exits. There is no excuse for
setting the timestamp later than the time that B exits.

If process B no longer modifies the file, but still keeps it mapped
until after vi starts, then the system can't help the
situation. Wether or not B acesses those pages is unknown to the
system. So you get what you deserve.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2008-01-17 13:33:27

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

> 2008/1/17, Miklos Szeredi <[email protected]>:
> > > > 4. Recording the time was the file data changed
> > > >
> > > > Finally, I noticed yet another issue with the previous version of my patch.
> > > > Specifically, the time stamps were set to the current time of the moment
> > > > when syncing but not the write reference was being done. This led to the
> > > > following adverse effect on my development system:
> > > >
> > > > 1) a text file A was updated by process B;
> > > > 2) process B exits without calling any of the *sync() functions;
> > > > 3) vi editor opens the file A;
> > > > 4) file data synced, file times updated;
> > > > 5) vi is confused by "thinking" that the file was changed after 3).
> >
> > Updating the time in remove_vma() would fix this, no?
>
> We need to save modification time. Otherwise, updating time stamps
> will be confusing the vi editor.

remove_vma() will be called when process B exits, so if the times are
updated there, and the flag is cleared, the times won't be updated
later.

> >
> > > > All these changes to inode.c are unnecessary, I think.
> > >
> > > The first part is necessary to account for "remembering" the modification time.
> > >
> > > The second part is for handling block device files. I cannot see any other
> > > sane way to update file times for them.
> >
> > Use file_update_time(), which will do the right thing. It will in
> > fact do the same thing as write(2) on the device, which is really what
> > we want.
> >
> > Block devices being mapped for write through different device
> > nodes..., well, I don't think we really need to handle such weird
> > corner cases 100% acurately.
>
> The file_update_time() cannot be used for implementing
> the "auto-update" feature, because the sync() system call
> doesn't "know" about the file which was memory-mapped.

I'm not sure this auto-updating is really needed (POSIX doesn't
mandate it).

At least split it out into a separate patch, so it can be considered
separately on it's own merit.

I think doing the same with the page-table reprotecting in MS_ASYNC is
also a good idea. That will leave us with

1) a base patch: update time just from fsync() and remove_vma()
2) update time on sync(2) as well
3) update time on MS_ASYNC as well

I'd happily ack the first one, which would solve the most serious
issues, but have some reservations about the other two.

Miklos

2008-01-17 13:34:35

by Anton Salikhmetov

[permalink] [raw]
Subject: Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008/1/17, Rogier Wolff <[email protected]>:
> On Thu, Jan 17, 2008 at 04:16:47PM +0300, Anton Salikhmetov wrote:
> > 2008/1/17, Miklos Szeredi <[email protected]>:
> > > > > 4. Recording the time was the file data changed
> > > > >
> > > > > Finally, I noticed yet another issue with the previous version of my patch.
> > > > > Specifically, the time stamps were set to the current time of the moment
> > > > > when syncing but not the write reference was being done. This led to the
> > > > > following adverse effect on my development system:
> > > > >
> > > > > 1) a text file A was updated by process B;
> > > > > 2) process B exits without calling any of the *sync() functions;
> > > > > 3) vi editor opens the file A;
> > > > > 4) file data synced, file times updated;
> > > > > 5) vi is confused by "thinking" that the file was changed after 3).
> > >
> > > Updating the time in remove_vma() would fix this, no?
> >
> > We need to save modification time. Otherwise, updating time stamps
> > will be confusing the vi editor.
>
> If process B exits before vi opens the file, the timestamp should at
> the latest be the time that process B exits. There is no excuse for
> setting the timestamp later than the time that B exits.
>
> If process B no longer modifies the file, but still keeps it mapped
> until after vi starts, then the system can't help the
> situation. Wether or not B acesses those pages is unknown to the
> system. So you get what you deserve.

The exit() system call closes the file memory-mapped file. Therefore,
it's better to catch the close() system call. However, the close() system call
does not trigger unmapping the file. The mapped area for the file may be
used after closing the file. So, we should catch only the unmap() call,
not close() or exit().

>
> Roger.
>
> --
> ** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
> ** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
> *-- BitWizard writes Linux device drivers for any device you may have! --*
> Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
> Does it sit on the couch all day? Is it unemployed? Please be specific!
> Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
>

2008-01-17 13:40:46

by Anton Salikhmetov

[permalink] [raw]
Subject: Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008/1/17, Miklos Szeredi <[email protected]>:
> > 2008/1/17, Miklos Szeredi <[email protected]>:
> > > > > 4. Recording the time was the file data changed
> > > > >
> > > > > Finally, I noticed yet another issue with the previous version of my patch.
> > > > > Specifically, the time stamps were set to the current time of the moment
> > > > > when syncing but not the write reference was being done. This led to the
> > > > > following adverse effect on my development system:
> > > > >
> > > > > 1) a text file A was updated by process B;
> > > > > 2) process B exits without calling any of the *sync() functions;
> > > > > 3) vi editor opens the file A;
> > > > > 4) file data synced, file times updated;
> > > > > 5) vi is confused by "thinking" that the file was changed after 3).
> > >
> > > Updating the time in remove_vma() would fix this, no?
> >
> > We need to save modification time. Otherwise, updating time stamps
> > will be confusing the vi editor.
>
> remove_vma() will be called when process B exits, so if the times are
> updated there, and the flag is cleared, the times won't be updated
> later.
>
> > >
> > > > > All these changes to inode.c are unnecessary, I think.
> > > >
> > > > The first part is necessary to account for "remembering" the modification time.
> > > >
> > > > The second part is for handling block device files. I cannot see any other
> > > > sane way to update file times for them.
> > >
> > > Use file_update_time(), which will do the right thing. It will in
> > > fact do the same thing as write(2) on the device, which is really what
> > > we want.
> > >
> > > Block devices being mapped for write through different device
> > > nodes..., well, I don't think we really need to handle such weird
> > > corner cases 100% acurately.
> >
> > The file_update_time() cannot be used for implementing
> > the "auto-update" feature, because the sync() system call
> > doesn't "know" about the file which was memory-mapped.
>
> I'm not sure this auto-updating is really needed (POSIX doesn't
> mandate it).

Peter Shtaubach, author of the first solution for this bug,
and Jacob Ostergaard, the reporter of this bug, insist the "auto-update"
feature to be implemented.

>
> At least split it out into a separate patch, so it can be considered
> separately on it's own merit.
>
> I think doing the same with the page-table reprotecting in MS_ASYNC is
> also a good idea. That will leave us with
>
> 1) a base patch: update time just from fsync() and remove_vma()
> 2) update time on sync(2) as well
> 3) update time on MS_ASYNC as well
>
> I'd happily ack the first one, which would solve the most serious
> issues, but have some reservations about the other two.
>
> Miklos
>

2008-01-17 15:45:55

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

> > I'm not sure this auto-updating is really needed (POSIX doesn't
> > mandate it).
>
> Peter Shtaubach, author of the first solution for this bug,
> and Jacob Ostergaard, the reporter of this bug, insist the "auto-update"
> feature to be implemented.

Can they state their reasons for the insistence?

> 1) a base patch: update time just from fsync() and remove_vma()
> 2) update time on sync(2) as well
> 3) update time on MS_ASYNC as well

Oh, and the four-liner I posted the other day will give you 1) + 2) +
even more at a small fraction of the complexity. And tacking on the
reprotect code will solve the MS_ASYNC issue just the same.

I agree, that having the timestamp updated on sync() is nice, and that
trivial patch will give you that, and will also update the timestamp
at least each 30 seconds if the file is being constantly modified,
even if no explicit syncing is done.

So maybe it's worth a little effort benchmarking how much that patch
affects the cost of writing to a page.

You could write a little test program like this (if somebody hasn't
yet done so):

- do some preparation:

echo 80 > dirty_ratio
echo 80 > dirty_background_ratio
echo 30000 > dirty_expire_centisecs
sync

- map a large file, one that fits comfortably into free memory
- bring the whole file in, by reading a byte from each page
- start the timer
- write a byte to each page
- stop the timer

It would be most interesting to try this on a filesystem supporting
nanosecond timestamps. Anyone know which these are?

Miklos
----

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c 2008-01-09 21:16:30.000000000 +0100
+++ linux/mm/memory.c 2008-01-15 21:16:14.000000000 +0100
@@ -1680,6 +1680,8 @@ gotten:
unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+ if (vma->vm_file)
+ file_update_time(vma->vm_file);
/*
* Yes, Virginia, this is actually required to prevent a race
* with clear_page_dirty_for_io() from clearing the page dirty
@@ -2313,6 +2315,8 @@ out_unlocked:
if (anon)
page_cache_release(vmf.page);
else if (dirty_page) {
+ if (vma->vm_file)
+ file_update_time(vma->vm_file);
set_page_dirty_balance(dirty_page, page_mkwrite);
put_page(dirty_page);
}

2008-01-17 16:21:15

by Anton Salikhmetov

[permalink] [raw]
Subject: Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008/1/17, Miklos Szeredi <[email protected]>:
> > > I'm not sure this auto-updating is really needed (POSIX doesn't
> > > mandate it).
> >
> > Peter Shtaubach, author of the first solution for this bug,
> > and Jacob Ostergaard, the reporter of this bug, insist the "auto-update"
> > feature to be implemented.
>
> Can they state their reasons for the insistence?
>
> > 1) a base patch: update time just from fsync() and remove_vma()
> > 2) update time on sync(2) as well
> > 3) update time on MS_ASYNC as well
>
> Oh, and the four-liner I posted the other day will give you 1) + 2) +
> even more at a small fraction of the complexity. And tacking on the
> reprotect code will solve the MS_ASYNC issue just the same.
>
> I agree, that having the timestamp updated on sync() is nice, and that
> trivial patch will give you that, and will also update the timestamp
> at least each 30 seconds if the file is being constantly modified,
> even if no explicit syncing is done.
>
> So maybe it's worth a little effort benchmarking how much that patch
> affects the cost of writing to a page.
>
> You could write a little test program like this (if somebody hasn't
> yet done so):
>
> - do some preparation:
>
> echo 80 > dirty_ratio
> echo 80 > dirty_background_ratio
> echo 30000 > dirty_expire_centisecs
> sync
>
> - map a large file, one that fits comfortably into free memory
> - bring the whole file in, by reading a byte from each page
> - start the timer
> - write a byte to each page
> - stop the timer
>
> It would be most interesting to try this on a filesystem supporting
> nanosecond timestamps. Anyone know which these are?

The do_wp_page() function is called in mm/memory.c after locking PTE.
And the file_update_time() routine calls the filesystem operation that can
sleep. It's not accepted, I guess.

>
> Miklos
> ----
>
> Index: linux/mm/memory.c
> ===================================================================
> --- linux.orig/mm/memory.c 2008-01-09 21:16:30.000000000 +0100
> +++ linux/mm/memory.c 2008-01-15 21:16:14.000000000 +0100
> @@ -1680,6 +1680,8 @@ gotten:
> unlock:
> pte_unmap_unlock(page_table, ptl);
> if (dirty_page) {
> + if (vma->vm_file)
> + file_update_time(vma->vm_file);
> /*
> * Yes, Virginia, this is actually required to prevent a race
> * with clear_page_dirty_for_io() from clearing the page dirty
> @@ -2313,6 +2315,8 @@ out_unlocked:
> if (anon)
> page_cache_release(vmf.page);
> else if (dirty_page) {
> + if (vma->vm_file)
> + file_update_time(vma->vm_file);
> set_page_dirty_balance(dirty_page, page_mkwrite);
> put_page(dirty_page);
> }
>
>
>

2008-01-17 16:26:50

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

> The do_wp_page() function is called in mm/memory.c after locking PTE.
> And the file_update_time() routine calls the filesystem operation that can
> sleep. It's not accepted, I guess.

do_wp_page() is called with the pte lock but drops it, so that's fine.

Miklos

2008-01-17 16:33:26

by Anton Salikhmetov

[permalink] [raw]
Subject: Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008/1/17, Miklos Szeredi <[email protected]>:
> > The do_wp_page() function is called in mm/memory.c after locking PTE.
> > And the file_update_time() routine calls the filesystem operation that can
> > sleep. It's not accepted, I guess.
>
> do_wp_page() is called with the pte lock but drops it, so that's fine.

OK, I agree.

I'll take into account your suggestion to move updating time stamps from
the __set_page_dirty() and __set_page_dirty_nobuffers() routines to
do_wp_page(). Thank you!

>
> Miklos
>