Hi all,
This is v3.2 of the stable-page-writes patchset for ext4 and xfs. The purpose
of this patchset is to prohibit processes from writing to memory pages that are
currently being written to disk because certain storage setups (e.g. SCSI disks
with DIF integrity checksums) will fail a write if the page contents don't
match the checksum. btrfs already guarantees page stability, so it does not
use these changes.
The technique used is fairly simple -- whenever a page is about to become
writable (either because of a write fault to a mapped page, or a buffered write
is in progress), wait for the page writeback flag to be clear, indicating that
the page is not being written to disk. This requires two changes: (1)
grab_cache_page_write_begin must wait for writeback, to take care of buffered
writes, and (2) every filesystem must have a page_mkwrite that locks the
page, waits for writeback, and returns the locked page. For filesystems that
piggyback on the generic block_page_mkwrite, the patchset adds the writeback
wait to that function; for filesystems that do not use the page_mkwrite hook at
all, the patchset provides a stub page_mkwrite.
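To make the failure mode and the fix concrete, here is a small single-threaded
userspace model. It is only a sketch and not kernel code: struct toy_page, the
toy_* helpers, and the checksum are all made up for illustration (real DIF uses
a per-sector CRC guard tag), and "waiting" is modeled by letting the in-flight
IO finish before the store happens.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SZ 64 /* toy page size, not the real PAGE_SIZE */

/* Hypothetical model of a page cache page; not the kernel's struct page. */
struct toy_page {
	uint8_t data[TOY_PAGE_SZ];
	bool writeback;  /* models the page writeback flag */
	uint32_t guard;  /* checksum captured when writeback started */
	int io_errors;   /* writes the toy device rejected */
};

/* Stand-in checksum (assumption); real DIF uses a CRC guard tag. */
static uint32_t toy_checksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31u + buf[i];
	return sum;
}

/* IO is submitted: the checksum is computed and the flag is set. */
static void toy_start_writeback(struct toy_page *p)
{
	assert(!p->writeback);
	p->guard = toy_checksum(p->data, TOY_PAGE_SZ);
	p->writeback = true;
}

/* IO completes: the device verifies the data against the checksum. */
static void toy_end_writeback(struct toy_page *p)
{
	assert(p->writeback);
	if (toy_checksum(p->data, TOY_PAGE_SZ) != p->guard)
		p->io_errors++;
	p->writeback = false;
}

/* Models waiting for writeback: here, just let the pending IO finish. */
static void toy_wait_on_writeback(struct toy_page *p)
{
	if (p->writeback)
		toy_end_writeback(p);
}

/* A store to the page; with stable pages it first waits for writeback. */
static void toy_write(struct toy_page *p, size_t off, uint8_t val, bool stable)
{
	assert(off < TOY_PAGE_SZ);
	if (stable)
		toy_wait_on_writeback(p);
	p->data[off] = val;
}

/* One "store while the page is under writeback" run; returns IO errors. */
static int toy_scenario(bool stable)
{
	struct toy_page p;

	memset(&p, 0, sizeof(p));
	toy_start_writeback(&p);
	toy_write(&p, 0, 0xAB, stable);
	if (p.writeback)
		toy_end_writeback(&p);
	return p.io_errors;
}
```

In this model toy_scenario(false) ends with one rejected write, because the
data changed after the checksum was taken, while toy_scenario(true) ends with
none, because the store is held back until the IO has completed.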
I ran my write-after-checksum ("wac") reproducer program to try to create the
DIF checksum errors by madly rewriting the same memory pages. In fact, I tried
the following combinations against ext4, xfs, and btrfs:
a. 64 write() threads + sync_file_range
b. 64 mmap write threads + msync
c. 32 write() threads + sync_file_range + 32 mmap write threads + msync
d. Same as C, but with all threads in directio mode
e. Same as A, but with all threads in directio mode
f. Same as B, but with all threads in directio mode
After running profiles A-F for 30 minutes each on 6 different machines, ext4
and xfs report no errors. btrfs eventually reports -ENOSPC and fails the
test, though it does that even without any of the patches applied.
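For readers who have not seen wac, the inner loop of an mmap + msync rewriter
(profile b) might look roughly like the sketch below. This is not the actual
wac source; rewrite_page_mmap and its parameters are hypothetical, and it is a
single thread, so it never races with itself. The real test relies on many
threads hitting the same pages so that one thread's store lands while another
thread's msync has the page under writeback.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: rewrite the first page of `path` `rounds` times
 * through a shared mapping, flushing with msync after every rewrite.
 * Returns 0 on success, -1 on any failure.
 */
static int rewrite_page_mmap(const char *path, int rounds)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *map;
	int rc = 0;
	int fd;

	if (page_size <= 0)
		return -1;

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;

	/* Make sure the file covers the page we are about to map. */
	if (ftruncate(fd, (off_t)page_size) != 0) {
		close(fd);
		return -1;
	}

	map = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}

	for (int i = 0; i < rounds; i++) {
		/* Dirty the whole page, then push it out to the file. */
		memset(map, i & 0xff, (size_t)page_size);
		if (msync(map, (size_t)page_size, MS_SYNC) != 0) {
			rc = -1;
			break;
		}
	}

	munmap(map, (size_t)page_size);
	close(fd);
	return rc;
}
```

After rewrite_page_mmap(path, 4) succeeds, the first page of the file holds
the fill byte of the last round, which is 3.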
To assess the performance impact of stable page writes, I moved to a disk that
doesn't have DIF support so that I could measure just the impact of waiting for
writeback. I first ran wac with 64 threads madly scribbling on a 64k file and
saw about a 12 percent performance decrease. I then reran the wac program with
64 threads and a 64MB file and saw about the same performance numbers with and
without the patchset. As I
suspected, the patchset only seems to impact workloads that rewrite the same
memory page frequently.
Per various comments regarding v3 of this patchset, I've integrated the reviewers'
suggestions, reworked the patch descriptions to make it clearer which ones
touch all the filesystems and which ones are to fix remaining holes in specific
filesystems, and expanded the scope of filesystems that got fixed.
As always, questions and comments are welcome; and thank you to all the
previous reviewers of this patchset. I am also soliciting people's opinions on
whether or not these patches could go upstream for .40.
This latest iteration of the patchset focuses solely on the generic changes
necessary to provide stable pages. It is being sent to Al Viro (just like v3.1
was).
--D
When grabbing a page for a buffered IO write, the mm should wait for writeback
on the page to complete so that the page does not become writable during the IO
operation. This change is needed to provide page stability during writes for
all filesystems.
Signed-off-by: Darrick J. Wong <[email protected]>
---
mm/filemap.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index c641edf..fd0e7f2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2288,7 +2288,7 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
repeat:
page = find_lock_page(mapping, index);
if (page)
- return page;
+ goto found;
page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask);
if (!page)
@@ -2301,6 +2301,8 @@ repeat:
goto repeat;
return NULL;
}
+found:
+ wait_on_page_writeback(page);
return page;
}
EXPORT_SYMBOL(grab_cache_page_write_begin);
For filesystems such as nilfs2 and xfs that use block_page_mkwrite, modify that
function to wait for pending writeback before allowing the page to become
writable. This is needed to stabilize pages during writeback for those two
filesystems.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/buffer.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index a08bb8e..0e7fa16 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2367,8 +2367,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
ret = VM_FAULT_OOM;
else /* -ENOSPC, -EIO, etc */
ret = VM_FAULT_SIGBUS;
- } else
+ } else {
+ wait_on_page_writeback(page);
ret = VM_FAULT_LOCKED;
+ }
out:
return ret;
For filesystems that do not provide any page_mkwrite handler, provide a stub
page_mkwrite function that locks the page and waits for pending writeback to
complete. This is needed to stabilize pages during writes for a large variety
of filesystem drivers (ext2, ext3, vfat, hfs...).
Signed-off-by: Darrick J. Wong <[email protected]>
---
mm/filemap.c | 19 +++++++++++++++++++
1 files changed, 19 insertions(+), 0 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index fd0e7f2..2a922b4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1713,8 +1713,27 @@ page_not_uptodate:
}
EXPORT_SYMBOL(filemap_fault);
+static int stub_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct page *page = vmf->page;
+ struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
+ loff_t size;
+
+ lock_page(page);
+ size = i_size_read(inode);
+ if ((page->mapping != inode->i_mapping) ||
+ (page_offset(page) > size)) {
+ /* page got truncated out from underneath us */
+ unlock_page(page);
+ return VM_FAULT_NOPAGE;
+ }
+ wait_on_page_writeback(page);
+ return VM_FAULT_LOCKED;
+}
+
const struct vm_operations_struct generic_file_vm_ops = {
.fault = filemap_fault,
+ .page_mkwrite = stub_page_mkwrite,
};
/* This is used for a general mmap of a disk file */
Can you resend patches 1 and 2 on top of current Linus' tree with Jan's
page_mkwrite changes? I don't think there's much point of patch 3 until
we get a user for simple_page_mkwrite.
On Fri, May 27, 2011 at 03:33:26AM -0400, Christoph Hellwig wrote:
> Can you resend patches 1 and 2 on top of current Linus' tree with Jan's
> page_mkwrite changes? I don't think there's much point of patch 3 until
> we get a user for simple_page_mkwrite.
Sure thing. I'll have something ready by the afternoon (here).
--D