Date: Thu, 22 Dec 2005 18:55:51 +0100
From: Andrea Arcangeli
To: Andrew Morton, linux-kernel@vger.kernel.org
Subject: nfs invalidates lose pte dirty bits
Message-ID: <20051222175550.GT9576@opteron.random>

Hello Andrew,

this is yet another problem exposed by the new invalidate_inode_pages2
semantics. If a mapping has dirty data (pte dirty and page clean) and an
invalidate triggers, filemap_write_and_wait will do nothing, but
invalidate_inode_pages2 will destroy the pte dirty bit and discard the
pagecache.

This is an attempted, untested fix. However, I'm not sure it's enough,
because we'd still be losing dirty bits if the application keeps touching
the mapping between the unmap_mapping_range and the
invalidate_inode_pages2. So perhaps this should be combined with a
non-destructive version of invalidate_inode_pages2 that simply skips over
any dirty page (because we have the guarantee that the dirty information
was generated by userspace after the filemap_write_and_wait). Invalidates
may be spurious; it's not like O_DIRECT, where we're guaranteed we've
overwritten the data in the atomic context of the write() syscall.

However, for the application being tested I think this should be enough,
since it looked single-threaded. An incremental patch could change the
call to invalidate_inode_pages2 to something like
invalidate_inode_clean_pages, so that multithreaded programs won't risk
losing dirty bits in the ptes either. Or you may change your mind and we
can go back to the much simpler, relaxed invalidate semantics, where
people notice changes in a mapping only after an munmap (like 2.4); that
only requires allowing PG_uptodate pages to be mapped in userspace, like
2.4 (you know I tend to prefer those semantics, even though they're not
even close to distributed shared memory ;).

I'm also wondering why you added this -EIO in invalidate_inode_pages2:

		if (!invalidate_complete_page(mapping, page)) {
			if (was_dirty)
				set_page_dirty(page);
			ret = -EIO;
		}

When the -EIO failure triggers because of a racy invalidate, the db write
will get -EIO as the retval. In particular the PageDirty check in
invalidate_complete_page could return 0, but it doesn't make sense to
check the PageDirty bit inside invalidate_inode_pages2, and we're not
required to do the set_page_dirty either; the whole point of
invalidate_inode_pages2 is to destroy dirty bits. We should go ahead and
never return -EIO there if the page was found dirty again, right? I guess
it's ok, but it doesn't make much sense, and I'd rather avoid -EIO unless
it's strictly necessary, because every time a db gets a -EIO we get a
critical bug report. Only a potential invalidate_inode_clean_pages for nfs
invalidates would have to preserve the dirty bits and never call
test_clear_page_dirty in the first place.

Thanks!
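
To make the invalidate_inode_clean_pages idea a bit more concrete, here is
a rough, untested sketch, patterned on the existing pagevec walk next to
invalidate_complete_page in mm/truncate.c; the placement and the exact set
of checks are only assumptions, the point is just that dirty (or
writeback) pages get skipped instead of having their dirty bits destroyed:

	/*
	 * Hypothetical, untested sketch (placement in mm/truncate.c next to
	 * invalidate_complete_page is an assumption): drop only clean, idle
	 * pagecache pages and leave anything dirty or under writeback alone,
	 * so dirty bits transferred from the ptes are preserved.
	 */
	static void invalidate_inode_clean_pages(struct address_space *mapping)
	{
		struct pagevec pvec;
		pgoff_t next = 0;
		int i;

		pagevec_init(&pvec, 0);
		while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
			for (i = 0; i < pagevec_count(&pvec); i++) {
				struct page *page = pvec.pages[i];

				if (page->index > next)
					next = page->index;
				next++;
				if (TestSetPageLocked(page))
					continue;
				/* never destroy dirty data: skip dirty/writeback pages */
				if (!PageDirty(page) && !PageWriteback(page) &&
				    !page_mapped(page))
					invalidate_complete_page(mapping, page);
				unlock_page(page);
			}
			pagevec_release(&pvec);
			cond_resched();
		}
	}
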
Signed-off-by: Andrea Arcangeli

diff -r 994d5b26cde2 fs/nfs/inode.c
--- a/fs/nfs/inode.c	Mon Dec 19 23:00:04 2005 +0000
+++ b/fs/nfs/inode.c	Thu Dec 22 18:37:25 2005 +0100
@@ -1180,8 +1180,9 @@
 	if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
 		if (S_ISREG(inode->i_mode)) {
-			if (filemap_fdatawrite(mapping) == 0)
-				filemap_fdatawait(mapping);
+			if (mapping_mapped(inode->i_mapping))
+				unmap_mapping_range(inode->i_mapping, 0, -1, 0);
+			filemap_write_and_wait(inode->i_mapping);
 			nfs_wb_all(inode);
 		}
 		invalidate_inode_pages2(mapping);
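
For reference, a hypothetical userspace reproducer of the access pattern
that loses the dirty bit (untested; the NFS mount path and the 4096 byte
length are made up):

	/*
	 * Hypothetical reproducer sketch, untested: the store below dirties
	 * only the pte, the page stays clean until msync()/munmap(), so a
	 * server-triggered invalidate racing in between can throw the data
	 * away without the fix above.
	 */
	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/mnt/nfs/testfile", O_RDWR | O_CREAT, 0644);
		char *p;

		if (fd < 0 || ftruncate(fd, 4096) < 0)
			return 1;
		p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;

		strcpy(p, "written through the mapping, not through write()");

		msync(p, 4096, MS_SYNC);	/* dirty bit moves from pte to page here */
		munmap(p, 4096);
		close(fd);
		return 0;
	}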