Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755075AbYAQMRP (ORCPT ); Thu, 17 Jan 2008 07:17:15 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751968AbYAQMQ7 (ORCPT ); Thu, 17 Jan 2008 07:16:59 -0500 Received: from wa-out-1112.google.com ([209.85.146.181]:47839 "EHLO wa-out-1112.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751498AbYAQMQ6 (ORCPT ); Thu, 17 Jan 2008 07:16:58 -0500 DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=I9mwK2lqiEDWbXBfDb2OI8rDht75S9VJ0vlIPuRHgm+z04WFYqGCMgMr1Ej6718qgolEAkub/N6cnoy/VcdfwK1lxvHtXWZHqezgrYxgwyJ5/tkT4k3AgVX1MlEbgj6wUFyKu3l7qoDMG0gu0XGuKU1arW0sLUCID//dEbQA9tA= Message-ID: <4df4ef0c0801170416s5581ae28h90d91578baa77738@mail.gmail.com> Date: Thu, 17 Jan 2008 15:16:55 +0300 From: "Anton Salikhmetov" To: "Miklos Szeredi" Subject: Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing Cc: linux-mm@kvack.org, jakob@unthought.net, linux-kernel@vger.kernel.org, valdis.kletnieks@vt.edu, riel@redhat.com, ksm@42.dk, staubach@redhat.com, jesper.juhl@gmail.com, torvalds@linux-foundation.org, a.p.zijlstra@chello.nl, akpm@linux-foundation.org, protasnb@gmail.com, r.e.wolff@bitwizard.nl, hidave.darkstar@gmail.com, hch@infradead.org In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <12005314662518-git-send-email-salikhmetov@gmail.com> <1200531471556-git-send-email-salikhmetov@gmail.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 17637 Lines: 452 2008/1/17, Miklos Szeredi : > > http://bugzilla.kernel.org/show_bug.cgi?id=2645 > > > > Changes for updating the ctime and mtime fields for memory-mapped files: > > > > 1) a new flag triggering update of the inode data; > > 2) a new field in the address_space structure for saving modification time; > > 3) a new helper function to update ctime and mtime when needed; > > 4) updating time stamps for mapped files in sys_msync() and do_fsync(); > > 5) implementing lazy ctime and mtime update. > > OK, the functionality seems to be there now. As a next step, I think > you should try to simplify the patch, removing everything, that is not > strictly necessary. > > > > > Signed-off-by: Anton Salikhmetov > > --- > > fs/buffer.c | 3 ++ > > fs/fs-writeback.c | 2 + > > fs/inode.c | 43 +++++++++++++++++++++++---------- > > fs/sync.c | 2 + > > include/linux/fs.h | 13 +++++++++- > > include/linux/pagemap.h | 3 +- > > mm/msync.c | 61 +++++++++++++++++++++++++++++++++------------- > > mm/page-writeback.c | 54 ++++++++++++++++++++++------------------- > > 8 files changed, 124 insertions(+), 57 deletions(-) > > > > diff --git a/fs/buffer.c b/fs/buffer.c > > index 7249e01..3967aa7 100644 > > --- a/fs/buffer.c > > +++ b/fs/buffer.c > > @@ -701,6 +701,9 @@ static int __set_page_dirty(struct page *page, > > if (unlikely(!mapping)) > > return !TestSetPageDirty(page); > > > > + mapping->mtime = CURRENT_TIME; > > Why is this needed? POSIX explicitly states, that the modification > time can be set to anywhere between the first write and the msync. I've already described this change in one of my previous emails: > 4. Recording the time was the file data changed > > Finally, I noticed yet another issue with the previous version of my patch. > Specifically, the time stamps were set to the current time of the moment > when syncing but not the write reference was being done. This led to the > following adverse effect on my development system: > > 1) a text file A was updated by process B; > 2) process B exits without calling any of the *sync() functions; > 3) vi editor opens the file A; > 4) file data synced, file times updated; > 5) vi is confused by "thinking" that the file was changed after 3). > > This version overcomes this problem by introducing another field into the > address_space structure. This field is used to "remember" the time of > dirtying, and then this time value is propagated to the file metadata. > > This approach is based upon the following suggestion given by Peter > Staubach during one of our previous discussions: > > http://lkml.org/lkml/2008/1/9/267 > > > A better architecture would be to arrange for the file times > > to be updated when the page makes the transition from being > > unmodified to modified. This is not straightforward due to > > the current locking, but should be doable, I think. Perhaps > > recording the current time and then using it to update the > > file times at a more suitable time (no pun intended) might > > work. > > The solution I propose now proves the viability of the latter > approach. See: http://lkml.org/lkml/2008/1/15/202 > > > + set_bit(AS_MCTIME, &mapping->flags); > > A bigger problem is that doing this in __set_page_dirty() and friends > will mean, that the flag will be set for non-mapped writes as well, > which we definitely don't want. > > A better place to put it is do_wp_page and __do_fault, where > set_page_dirty_balance() is called. Thanks for your suggestion, I'll consider how I can apply it. > > > + > > if (TestSetPageDirty(page)) > > return 0; > > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > > index 300324b..affd291 100644 > > --- a/fs/fs-writeback.c > > +++ b/fs/fs-writeback.c > > @@ -243,6 +243,8 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc) > > > > spin_unlock(&inode_lock); > > > > + mapping_update_time(mapping); > > + > > I think this is unnecessary. Rather put the update into remove_vma(). Thanks for your suggestion, I'll consider how I can apply it. However, I suspect that this is a bit problematic. Let me check it out. > > > ret = do_writepages(mapping, wbc); > > > > /* Don't write the inode if only I_DIRTY_PAGES was set */ > > diff --git a/fs/inode.c b/fs/inode.c > > index ed35383..edd5bf4 100644 > > --- a/fs/inode.c > > +++ b/fs/inode.c > > @@ -1243,8 +1243,10 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry) > > EXPORT_SYMBOL(touch_atime); > > > > /** > > - * file_update_time - update mtime and ctime time > > - * @file: file accessed > > + * inode_update_time - update mtime and ctime time > > + * @inode: inode accessed > > + * @ts: time when inode was accessed > > + * @sync: whether to do synchronous update > > * > > * Update the mtime and ctime members of an inode and mark the inode > > * for writeback. Note that this function is meant exclusively for > > @@ -1253,11 +1255,8 @@ EXPORT_SYMBOL(touch_atime); > > * S_NOCTIME inode flag, e.g. for network filesystem where these > > * timestamps are handled by the server. > > */ > > - > > -void file_update_time(struct file *file) > > +void inode_update_time(struct inode *inode, struct timespec *ts) > > { > > - struct inode *inode = file->f_path.dentry->d_inode; > > - struct timespec now; > > int sync_it = 0; > > > > if (IS_NOCMTIME(inode)) > > @@ -1265,22 +1264,41 @@ void file_update_time(struct file *file) > > if (IS_RDONLY(inode)) > > return; > > > > - now = current_fs_time(inode->i_sb); > > - if (!timespec_equal(&inode->i_mtime, &now)) { > > - inode->i_mtime = now; > > + if (timespec_compare(&inode->i_mtime, ts) < 0) { > > + inode->i_mtime = *ts; > > sync_it = 1; > > } > > > > - if (!timespec_equal(&inode->i_ctime, &now)) { > > - inode->i_ctime = now; > > + if (timespec_compare(&inode->i_ctime, ts) < 0) { > > + inode->i_ctime = *ts; > > sync_it = 1; > > } > > > > if (sync_it) > > mark_inode_dirty_sync(inode); > > } > > +EXPORT_SYMBOL(inode_update_time); > > > > -EXPORT_SYMBOL(file_update_time); > > +/* > > + * Update the ctime and mtime stamps after checking if they are to be updated. > > + */ > > +void mapping_update_time(struct address_space *mapping) > > +{ > > + if (test_and_clear_bit(AS_MCTIME, &mapping->flags)) { > > + struct inode *inode = mapping->host; > > + struct timespec *ts = &mapping->mtime; > > + > > + if (S_ISBLK(inode->i_mode)) { > > + struct block_device *bdev = inode->i_bdev; > > + > > + mutex_lock(&bdev->bd_mutex); > > + list_for_each_entry(inode, &bdev->bd_inodes, i_devices) > > + inode_update_time(inode, ts); > > + mutex_unlock(&bdev->bd_mutex); > > + } else > > + inode_update_time(inode, ts); > > + } > > +} > > All these changes to inode.c are unnecessary, I think. The first part is necessary to account for "remembering" the modification time. The second part is for handling block device files. I cannot see any other sane way to update file times for them. > > > > > int inode_needs_sync(struct inode *inode) > > { > > @@ -1290,7 +1308,6 @@ int inode_needs_sync(struct inode *inode) > > return 1; > > return 0; > > } > > - > > EXPORT_SYMBOL(inode_needs_sync); > > > > int inode_wait(void *word) > > diff --git a/fs/sync.c b/fs/sync.c > > index 7cd005e..5561464 100644 > > --- a/fs/sync.c > > +++ b/fs/sync.c > > @@ -87,6 +87,8 @@ long do_fsync(struct file *file, int datasync) > > goto out; > > } > > > > + mapping_update_time(mapping); > > + > > ret = filemap_fdatawrite(mapping); > > > > /* > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index b3ec4a4..f0d3ced 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -511,6 +511,7 @@ struct address_space { > > spinlock_t private_lock; /* for use by the address_space */ > > struct list_head private_list; /* ditto */ > > struct address_space *assoc_mapping; /* ditto */ > > + struct timespec mtime; /* modification time */ > > } __attribute__((aligned(sizeof(long)))); > > /* > > * On most architectures that alignment is already the case; but > > @@ -1977,7 +1978,17 @@ extern int buffer_migrate_page(struct address_space *, > > extern int inode_change_ok(struct inode *, struct iattr *); > > extern int __must_check inode_setattr(struct inode *, struct iattr *); > > > > -extern void file_update_time(struct file *file); > > +extern void inode_update_time(struct inode *, struct timespec *); > > + > > +static inline void file_update_time(struct file *file) > > +{ > > + struct inode *inode = file->f_dentry->d_inode; > > + struct timespec ts = current_fs_time(inode->i_sb); > > + > > + inode_update_time(inode, &ts); > > +} > > + > > +extern void mapping_update_time(struct address_space *); > > As is this. This change is a logical consequence of introducing the new inode_update_time() routine, which was explained above. > > > > > static inline ino_t parent_ino(struct dentry *dentry) > > { > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > > index db8a410..bf0f9e7 100644 > > --- a/include/linux/pagemap.h > > +++ b/include/linux/pagemap.h > > @@ -17,8 +17,9 @@ > > * Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page > > * allocation mode flags. > > */ > > -#define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */ > > +#define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */ > > #define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */ > > +#define AS_MCTIME (__GFP_BITS_SHIFT + 2) /* mtime and ctime to update */ > > > > static inline void mapping_set_error(struct address_space *mapping, int error) > > { > > diff --git a/mm/msync.c b/mm/msync.c > > index 44997bf..7657776 100644 > > --- a/mm/msync.c > > +++ b/mm/msync.c > > @@ -13,16 +13,37 @@ > > #include > > > > /* > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only. > > + * It will force a pagefault on the next write access. > > + */ > > +static void vma_wrprotect(struct vm_area_struct *vma) > > +{ > > + unsigned long addr; > > + > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) { > > + spinlock_t *ptl; > > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr); > > + pud_t *pud = pud_offset(pgd, addr); > > + pmd_t *pmd = pmd_offset(pud, addr); > > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > > + > > + if (pte_dirty(*pte) && pte_write(*pte)) > > + *pte = pte_wrprotect(*pte); > > + pte_unmap_unlock(pte, ptl); > > + } > > +} > > + > > There might be a more efficient way to do this, than getting and > releasing the lock for each pte. I haven't found any locks with bigger granularity here. Frankly, I do not think that this should be a big performance hit if we acquire and release the lock for each PTE. However, I'll be grateful for any further suggestions. Rik, can you please comment on this? > > > +/* > > * MS_SYNC syncs the entire file - including mappings. > > * > > - * MS_ASYNC does not start I/O (it used to, up to 2.5.67). > > - * Nor does it mark the relevant pages dirty (it used to up to 2.6.17). > > - * Now it doesn't do anything, since dirty pages are properly tracked. > > + * MS_ASYNC does not start I/O. Instead, it marks the relevant pages > > + * read-only by calling vma_wrprotect(). This is needed to catch the next > > + * write reference to the mapped region and update the file times > > + * accordingly. > > * > > - * The application may now run fsync() to > > - * write out the dirty pages and wait on the writeout and check the result. > > - * Or the application may run fadvise(FADV_DONTNEED) against the fd to start > > - * async writeout immediately. > > + * The application may now run fsync() to write out the dirty pages and > > + * wait on the writeout and check the result. Or the application may run > > + * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately. > > * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to > > * applications. > > */ > > @@ -80,16 +101,22 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags) > > start = vma->vm_end; > > > > file = vma->vm_file; > > - if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) { > > - get_file(file); > > - up_read(&mm->mmap_sem); > > - error = do_fsync(file, 0); > > - fput(file); > > - if (error) > > - goto out; > > - down_read(&mm->mmap_sem); > > - vma = find_vma(mm, start); > > - continue; > > + if (file && (vma->vm_flags & VM_SHARED)) { > > + if (flags & MS_ASYNC) { > > + vma_wrprotect(vma); > > + mapping_update_time(file->f_mapping); > > + } > > + if (flags & MS_SYNC) { > > + get_file(file); > > + up_read(&mm->mmap_sem); > > + error = do_fsync(file, 0); > > + fput(file); > > + if (error) > > + goto out; > > + down_read(&mm->mmap_sem); > > + vma = find_vma(mm, start); > > + continue; > > + } > > } > > > > vma = vma->vm_next; > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > > index 3d3848f..53d0e34 100644 > > --- a/mm/page-writeback.c > > +++ b/mm/page-writeback.c > > @@ -997,35 +997,39 @@ int __set_page_dirty_no_writeback(struct page *page) > > */ > > int __set_page_dirty_nobuffers(struct page *page) > > { > > - if (!TestSetPageDirty(page)) { > > - struct address_space *mapping = page_mapping(page); > > - struct address_space *mapping2; > > + struct address_space *mapping = page_mapping(page); > > + struct address_space *mapping2; > > > > - if (!mapping) > > - return 1; > > + if (!mapping) > > + return 1; > > > > - write_lock_irq(&mapping->tree_lock); > > - mapping2 = page_mapping(page); > > - if (mapping2) { /* Race with truncate? */ > > - BUG_ON(mapping2 != mapping); > > - WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); > > - if (mapping_cap_account_dirty(mapping)) { > > - __inc_zone_page_state(page, NR_FILE_DIRTY); > > - __inc_bdi_stat(mapping->backing_dev_info, > > - BDI_RECLAIMABLE); > > - task_io_account_write(PAGE_CACHE_SIZE); > > - } > > - radix_tree_tag_set(&mapping->page_tree, > > - page_index(page), PAGECACHE_TAG_DIRTY); > > - } > > - write_unlock_irq(&mapping->tree_lock); > > - if (mapping->host) { > > - /* !PageAnon && !swapper_space */ > > - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); > > + mapping->mtime = CURRENT_TIME; > > + set_bit(AS_MCTIME, &mapping->flags); > > + > > + if (TestSetPageDirty(page)) > > + return 0; > > + > > + write_lock_irq(&mapping->tree_lock); > > + mapping2 = page_mapping(page); > > + if (mapping2) { > > + /* Race with truncate? */ > > + BUG_ON(mapping2 != mapping); > > + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); > > + if (mapping_cap_account_dirty(mapping)) { > > + __inc_zone_page_state(page, NR_FILE_DIRTY); > > + __inc_bdi_stat(mapping->backing_dev_info, > > + BDI_RECLAIMABLE); > > + task_io_account_write(PAGE_CACHE_SIZE); > > } > > - return 1; > > + radix_tree_tag_set(&mapping->page_tree, > > + page_index(page), PAGECACHE_TAG_DIRTY); > > } > > - return 0; > > + write_unlock_irq(&mapping->tree_lock); > > + > > + if (mapping->host) > > + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); > > + > > + return 1; > > } > > EXPORT_SYMBOL(__set_page_dirty_nobuffers); > > > > -- > > 1.4.4.4 > > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/