2009-09-17 15:21:40

by Jan Kara

Subject: [RFC] [PATCH 0/7] Improve VFS to better handle mmaps when blocksize < pagesize (v3)


Hi,

here is my next attempt to solve the problems arising with mmapped writes when
blocksize < pagesize. To recall the problem:

We'd like to use page_mkwrite() to allocate blocks under a page which is
becoming writeably mmapped in some process' address space. This allows a
filesystem to fail the page fault if there is not enough space available, the
user exceeds quota, or a similar problem happens, rather than silently
discarding data later when writepage is called.

On filesystems where blocksize < pagesize, though, the situation is
complicated. Suppose, for example, that blocksize = 1024, pagesize = 4096, and
a process does:
ftruncate(fd, 0);
pwrite(fd, buf, 1024, 0);
map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
map[0] = 'a'; ----> page_mkwrite() for index 0 is called
ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
fsync(fd); ----> writepage() for index 0 is called

At the moment page_mkwrite() is called, the filesystem can allocate only one
block for the page because i_size == 1024. Otherwise it would create blocks
beyond i_size, which is generally undesirable. But later, at writepage() time,
we would like to have blocks allocated for the whole page (and in principle we
have to allocate them because the user could have filled the page with data
after the second ftruncate()).
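
For reference, here is the sequence above as a complete test program (a
minimal sketch; error checking is omitted, the file name is arbitrary, and
PROT_READ is added so the mapping works portably):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	char buf[1024], *map;
	int fd = open("testfile", O_CREAT | O_RDWR, 0644);

	memset(buf, 'x', sizeof(buf));
	ftruncate(fd, 0);
	pwrite(fd, buf, 1024, 0);
	map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	map[0] = 'a';		/* page_mkwrite() for index 0 */
	ftruncate(fd, 10000);	/* extend the file past the written block */
	fsync(fd);		/* writepage() for index 0 */
	return 0;
}
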
---

The patches depend on Nick's truncate calling convention rewrite. The first
three patches in the patchset are just cleanups. The series converts the ext4
and ext2 filesystems to give an idea of what conversion of a filesystem will
look like.
A few notes on the changes the main patch (patch 4) makes:
1) Zeroing of the tail of the last block no longer happens in writepage (which
is racy anyway, as Nick pointed out) or in foo_truncate_page, but rather when
i_size is about to be extended.
2) The writeback path no longer cares about i_size; it uses buffer flags
instead. An exception is the nobh case, where we have to use i_size. Thus
filesystems not using the nobh code can update i_size in write_end without
holding the page lock. Filesystems using the nobh code still have to update
i_size under the page lock, since otherwise __mpage_writepage could come in
early, write just part of the page, and clear all dirty bits, thus causing
data loss.
3) Converted filesystems have to make sure that the buffers with valid data
to write are either mapped or marked delayed before they call
block_write_full_page. The idea is that they should use page_mkwrite() to set
up the buffers; a condensed sketch of what this looks like follows below.
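
For illustration, a converted writepage ends up doing roughly the following
(a condensed sketch based on the ext2 conversion in patch 7; fs_get_block
stands for the filesystem's get_block callback):

static int fs_writepage(struct page *page, struct writeback_control *wbc)
{
	struct inode *inode = page->mapping->host;
	int blocksize = 1 << inode->i_blkbits;
	loff_t i_size = i_size_read(inode), start;
	struct buffer_head *head, *bh;

	if (!page_has_buffers(page))
		create_empty_buffers(page, blocksize,
				     (1 << BH_Dirty)|(1 << BH_Uptodate));
	head = bh = page_buffers(page);
	start = page_offset(page);
	do {
		if (start >= i_size)
			break;
		/* Delayed buffers get mapped by block_write_full_page() */
		if (!buffer_mapped(bh) && buffer_dirty(bh))
			set_buffer_delay(bh);
		start += blocksize;
		bh = bh->b_this_page;
	} while (bh != head);

	return block_write_full_page(page, fs_get_block, wbc);
}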

Both ext2 and ext4 have survived some beating with fsx-linux, so they should
be at least moderately safe to use :). Any comments?
Honza



2009-09-17 15:21:41

by Jan Kara

Subject: [PATCH 1/7] fs: buffer_head writepage no invalidate

From: Nick Piggin <[email protected]>

An invalidate should not be required in the writeout path. The truncate
sequence will first reduce i_size, then clean and discard any existing
pagecache (no new dirty pagecache can be added because i_size was reduced
and i_mutex is held), and then update the filesystem data structures.
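
Schematically (a simplified sketch; fs_truncate_blocks() stands in for the
filesystem-specific last step):

	/* 1) Reduce i_size, under i_mutex. */
	i_size_write(inode, newsize);
	/* 2) Clean and discard the now-stale pagecache. */
	truncate_pagecache(inode, oldsize, newsize);
	/* 3) Only then update the filesystem's data structures. */
	fs_truncate_blocks(inode, newsize);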

The filesystem needs to be able to handle writeout at any point before the
last step, and once the second step completes, there should be no unfreeable
dirty buffers anyway (truncate performs the do_invalidatepage).

Having filesystem changes depend on reading i_size without holding i_mutex is
confusing, to say the least. There is still a case where the writepage paths
in buffer.c use i_size (testing which blocks to write out), but this is a
small improvement.
---
fs/buffer.c | 20 ++------------------
1 files changed, 2 insertions(+), 18 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index d8d1b46..67b260a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2666,18 +2666,8 @@ int nobh_writepage(struct page *page, get_block_t *get_block,
/* Is the page fully outside i_size? (truncate in progress) */
offset = i_size & (PAGE_CACHE_SIZE-1);
if (page->index >= end_index+1 || !offset) {
- /*
- * The page may have dirty, unmapped buffers. For example,
- * they may have been added in ext3_writepage(). Make them
- * freeable here, so the page does not leak.
- */
-#if 0
- /* Not really sure about this - do we need this ? */
- if (page->mapping->a_ops->invalidatepage)
- page->mapping->a_ops->invalidatepage(page, offset);
-#endif
unlock_page(page);
- return 0; /* don't care */
+ return 0;
}

/*
@@ -2870,14 +2860,8 @@ int block_write_full_page_endio(struct page *page, get_block_t *get_block,
/* Is the page fully outside i_size? (truncate in progress) */
offset = i_size & (PAGE_CACHE_SIZE-1);
if (page->index >= end_index+1 || !offset) {
- /*
- * The page may have dirty, unmapped buffers. For example,
- * they may have been added in ext3_writepage(). Make them
- * freeable here, so the page does not leak.
- */
- do_invalidatepage(page, 0);
unlock_page(page);
- return 0; /* don't care */
+ return 0;
}

/*
--
1.6.0.2


2009-09-17 15:21:47

by Jan Kara

Subject: [PATCH 7/7] ext2: Convert ext2 to new mkwrite code

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext2/file.c | 28 +++++++++++++++++++++++++++-
fs/ext2/inode.c | 54 ++++++++++++++++++++++++++++++++++++++++--------------
2 files changed, 67 insertions(+), 15 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 806def9..317cfa0 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -19,6 +19,8 @@
*/

#include <linux/time.h>
+#include <linux/mm.h>
+#include <linux/buffer_head.h>
#include "ext2.h"
#include "xattr.h"
#include "acl.h"
@@ -38,6 +40,30 @@ static int ext2_release_file (struct inode * inode, struct file * filp)
return 0;
}

+static struct vm_operations_struct ext2_file_vm_ops = {
+ .fault = filemap_fault,
+ .page_mkwrite = noalloc_page_mkwrite,
+};
+
+static struct vm_operations_struct ext2_nobh_file_vm_ops = {
+ .fault = filemap_fault,
+};
+
+static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct address_space *mapping = file->f_mapping;
+
+ if (!mapping->a_ops->readpage)
+ return -ENOEXEC;
+ file_accessed(file);
+ if (!test_opt(mapping->host->i_sb, NOBH))
+ vma->vm_ops = &ext2_file_vm_ops;
+ else
+ vma->vm_ops = &ext2_nobh_file_vm_ops;
+ vma->vm_flags |= VM_CAN_NONLINEAR;
+ return 0;
+}
+
/*
* We have mostly NULL's here: the current defaults are ok for
* the ext2 filesystem.
@@ -52,7 +78,7 @@ const struct file_operations ext2_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
#endif
- .mmap = generic_file_mmap,
+ .mmap = ext2_file_mmap,
.open = generic_file_open,
.release = ext2_release_file,
.fsync = simple_fsync,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 8ff0f44..75881dd 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -815,10 +815,35 @@ ext2_nobh_write_begin(struct file *file, struct address_space *mapping,
return ret;
}

+/*
+ * This is mostly used as a fallback function for __mpage_writepage(). We have
+ * to prepare buffers for block_write_full_page() so that all the buffers that
+ * should be written are either mapped or delay.
+ */
static int ext2_nobh_writepage(struct page *page,
- struct writeback_control *wbc)
+ struct writeback_control *wbc)
{
- return nobh_writepage(page, ext2_get_block, wbc);
+ struct inode *inode = page->mapping->host;
+ int blocksize = 1 << inode->i_blkbits;
+ loff_t i_size = i_size_read(inode), start;
+ struct buffer_head *head, *bh;
+
+ if (!page_has_buffers(page)) {
+ create_empty_buffers(page, blocksize,
+ (1 << BH_Dirty)|(1 << BH_Uptodate));
+ }
+ head = bh = page_buffers(page);
+ start = page_offset(page);
+ do {
+ if (start >= i_size)
+ break;
+ if (!buffer_mapped(bh) && buffer_dirty(bh))
+ set_buffer_delay(bh);
+ start += blocksize;
+ bh = bh->b_this_page;
+ } while (bh != head);
+
+ return block_write_full_page(page, ext2_get_block, wbc);
}

static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
@@ -850,6 +875,7 @@ ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
}

const struct address_space_operations ext2_aops = {
+ .new_writepage = 1,
.readpage = ext2_readpage,
.readpages = ext2_readpages,
.writepage = ext2_writepage,
@@ -864,11 +890,13 @@ const struct address_space_operations ext2_aops = {
};

const struct address_space_operations ext2_aops_xip = {
+ .new_writepage = 1,
.bmap = ext2_bmap,
.get_xip_mem = ext2_get_xip_mem,
};

const struct address_space_operations ext2_nobh_aops = {
+ .new_writepage = 1,
.readpage = ext2_readpage,
.readpages = ext2_readpages,
.writepage = ext2_nobh_writepage,
@@ -1167,7 +1195,7 @@ static void ext2_truncate_blocks(struct inode *inode, loff_t offset)

int ext2_setsize(struct inode *inode, loff_t newsize)
{
- loff_t oldsize;
+ loff_t oldsize = inode->i_size;
int error;

error = inode_newsize_ok(inode, newsize);
@@ -1182,21 +1210,19 @@ int ext2_setsize(struct inode *inode, loff_t newsize)
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
return -EPERM;

- if (mapping_is_xip(inode->i_mapping))
- error = xip_truncate_page(inode->i_mapping, newsize);
- else if (test_opt(inode->i_sb, NOBH))
- error = nobh_truncate_page(inode->i_mapping,
- newsize, ext2_get_block);
- else
- error = block_truncate_page(inode->i_mapping,
- newsize, ext2_get_block);
- if (error)
- return error;
+ if (newsize > oldsize) {
+ error = block_prepare_hole(inode, oldsize, newsize, 0,
+ ext2_get_block);
+ if (error)
+ return error;
+ }

- oldsize = inode->i_size;
i_size_write(inode, newsize);
truncate_pagecache(inode, oldsize, newsize);

+ if (newsize > oldsize)
+ block_finish_hole(inode, oldsize, newsize);
+
__ext2_truncate_blocks(inode, newsize);

inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC;
--
1.6.0.2


2009-09-17 15:21:43

by Jan Kara

Subject: [PATCH 3/7] ext4: Deprecate nobh mount option

This option doesn't do anything interesting for ext4 anymore, since we attach
buffers to the page in page_mkwrite and in write_begin to support delayed
allocation and to properly handle ENOSPC caused by mmapped writes.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 42 ++++++++++++------------------------------
fs/ext4/super.c | 10 +++-------
2 files changed, 15 insertions(+), 37 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f9c642b..58492ab 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2481,20 +2481,17 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
}

/*
- * This function is used as a standard get_block_t calback function
- * when there is no desire to allocate any blocks. It is used as a
- * callback function for block_prepare_write(), nobh_writepage(), and
- * block_write_full_page(). These functions should only try to map a
- * single block at a time.
+ * This function is used as a standard get_block_t calback function when there
+ * is no desire to allocate any blocks. It is used as a callback function for
+ * block_prepare_write() and block_write_full_page(). These functions should
+ * only try to map a single block at a time.
*
- * Since this function doesn't do block allocations even if the caller
- * requests it by passing in create=1, it is critically important that
- * any caller checks to make sure that any buffer heads are returned
- * by this function are either all already mapped or marked for
- * delayed allocation before calling nobh_writepage() or
- * block_write_full_page(). Otherwise, b_blocknr could be left
- * unitialized, and the page write functions will be taken by
- * surprise.
+ * Since this function doesn't do block allocations even if the caller requests
+ * it by passing in create=1, it is critically important that any caller checks
+ * to make sure that any buffer heads are returned by this function are either
+ * all already mapped or marked for delayed allocation before calling
+ * block_write_full_page(). Otherwise, b_blocknr could be left unitialized,
+ * and the page write functions will be taken by surprise.
*/
static int noalloc_get_block_write(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
@@ -2690,11 +2687,7 @@ static int ext4_writepage(struct page *page,
return __ext4_journalled_writepage(page, wbc, len);
}

- if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))
- ret = nobh_writepage(page, noalloc_get_block_write, wbc);
- else
- ret = block_write_full_page(page, noalloc_get_block_write,
- wbc);
+ ret = block_write_full_page(page, noalloc_get_block_write, wbc);

return ret;
}
@@ -3463,17 +3456,6 @@ int ext4_block_truncate_page(handle_t *handle,
length = blocksize - (offset & (blocksize - 1));
iblock = index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);

- /*
- * For "nobh" option, we can only work if we don't need to
- * read-in the page - otherwise we create buffers to do the IO.
- */
- if (!page_has_buffers(page) && test_opt(inode->i_sb, NOBH) &&
- ext4_should_writeback_data(inode) && PageUptodate(page)) {
- zero_user(page, offset, length);
- set_page_dirty(page);
- goto unlock;
- }
-
if (!page_has_buffers(page))
create_empty_buffers(page, blocksize, 0);

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8f4f079..160051f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -846,8 +846,6 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
seq_puts(seq, test_opt(sb, BARRIER) ? "1" : "0");
if (test_opt(sb, JOURNAL_ASYNC_COMMIT))
seq_puts(seq, ",journal_async_commit");
- if (test_opt(sb, NOBH))
- seq_puts(seq, ",nobh");
if (test_opt(sb, I_VERSION))
seq_puts(seq, ",i_version");
if (!test_opt(sb, DELALLOC))
@@ -2775,11 +2773,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
no_journal:

if (test_opt(sb, NOBH)) {
- if (!(test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_WRITEBACK_DATA)) {
- ext4_msg(sb, KERN_WARNING, "Ignoring nobh option - "
- "its supported only with writeback mode");
- clear_opt(sbi->s_mount_opt, NOBH);
- }
+ ext4_msg(sb, KERN_WARNING, "nobh option is deprecated. "
+ "Ignoring it.");
+ clear_opt(sbi->s_mount_opt, NOBH);
}
/*
* The jbd2_journal_load will have done any necessary log recovery,
--
1.6.0.2


2009-09-17 15:21:45

by Jan Kara

Subject: [PATCH 5/7] ext4: Convert filesystem to the new truncate calling convention

CC: [email protected]
CC: [email protected]
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/file.c | 2 +-
fs/ext4/inode.c | 166 ++++++++++++++++++++++++++++++++----------------------
2 files changed, 99 insertions(+), 69 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 3f1873f..22f49d7 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -198,7 +198,7 @@ const struct file_operations ext4_file_operations = {
};

const struct inode_operations ext4_file_inode_operations = {
- .truncate = ext4_truncate,
+ .new_truncate = 1,
.setattr = ext4_setattr,
.getattr = ext4_getattr,
#ifdef CONFIG_EXT4_FS_XATTR
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 58492ab..be25874 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3354,6 +3354,7 @@ static int ext4_journalled_set_page_dirty(struct page *page)
}

static const struct address_space_operations ext4_ordered_aops = {
+ .new_writepage = 1,
.readpage = ext4_readpage,
.readpages = ext4_readpages,
.writepage = ext4_writepage,
@@ -3369,6 +3370,7 @@ static const struct address_space_operations ext4_ordered_aops = {
};

static const struct address_space_operations ext4_writeback_aops = {
+ .new_writepage = 1,
.readpage = ext4_readpage,
.readpages = ext4_readpages,
.writepage = ext4_writepage,
@@ -3384,6 +3386,7 @@ static const struct address_space_operations ext4_writeback_aops = {
};

static const struct address_space_operations ext4_journalled_aops = {
+ .new_writepage = 1,
.readpage = ext4_readpage,
.readpages = ext4_readpages,
.writepage = ext4_writepage,
@@ -3398,6 +3401,7 @@ static const struct address_space_operations ext4_journalled_aops = {
};

static const struct address_space_operations ext4_da_aops = {
+ .new_writepage = 1,
.readpage = ext4_readpage,
.readpages = ext4_readpages,
.writepage = ext4_writepage,
@@ -4682,28 +4686,97 @@ int ext4_write_inode(struct inode *inode, int wait)
}

/*
- * ext4_setattr()
+ * ext4_setsize()
+ *
+ * This is a helper for ext4_setattr(). It sets i_size, truncates page cache
+ * and truncates inode blocks if they are over i_size.
*
- * Called from notify_change.
+ * We take care of updating i_disksize and adding inode to the orphan list.
+ * That makes sure that we can guarantee that any commit will leave the blocks
+ * being truncated in an unused state on disk. (On recovery, the inode will
+ * get truncated and the blocks will be freed, so we have a strong guarantee
+ * that no future commit will leave these blocks visible to the user.)
*
- * We want to trap VFS attempts to truncate the file as soon as
- * possible. In particular, we want to make sure that when the VFS
- * shrinks i_size, we put the inode on the orphan list and modify
- * i_disksize immediately, so that during the subsequent flushing of
- * dirty pages and freeing of disk blocks, we can guarantee that any
- * commit will leave the blocks being flushed in an unused state on
- * disk. (On recovery, the inode will get truncated and the blocks will
- * be freed, so we have a strong guarantee that no future commit will
- * leave these blocks visible to the user.)
+ * Another thing we have to assure is that if we are in ordered mode and inode
+ * is still attached to the committing transaction, we must we start writeout
+ * of all the dirty pages which are being truncated. This way we are sure that
+ * all the data written in the previous transaction are already on disk
+ * (truncate waits for pages under writeback).
+ */
+static int ext4_setsize(struct inode *inode, loff_t newsize)
+{
+ int error = 0, rc;
+ loff_t oldsize = inode->i_size;
+ handle_t *handle;
+
+ error = inode_newsize_ok(inode, newsize);
+ if (error)
+ goto out;
+ /* VFS should have checked these and return error... */
+ WARN_ON(!S_ISREG(inode->i_mode) || IS_APPEND(inode) ||
+ IS_IMMUTABLE(inode));
+
+ if (newsize < oldsize) {
+ handle = ext4_journal_start(inode, 3);
+ if (IS_ERR(handle)) {
+ error = PTR_ERR(handle);
+ goto err_out;
+ }
+
+ error = ext4_orphan_add(handle, inode);
+ EXT4_I(inode)->i_disksize = newsize;
+ rc = ext4_mark_inode_dirty(handle, inode);
+ if (!error)
+ error = rc;
+ ext4_journal_stop(handle);
+
+ if (ext4_should_order_data(inode)) {
+ error = ext4_begin_ordered_truncate(inode, newsize);
+ if (error) {
+ /* Do as much error cleanup as possible */
+ handle = ext4_journal_start(inode, 3);
+ if (IS_ERR(handle)) {
+ ext4_orphan_del(NULL, inode);
+ goto err_out;
+ }
+ ext4_orphan_del(handle, inode);
+ ext4_journal_stop(handle);
+ goto err_out;
+ }
+ }
+ } else if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) {
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+ if (newsize > sbi->s_bitmap_maxbytes) {
+ error = -EFBIG;
+ goto out;
+ }
+ }
+
+ i_size_write(inode, newsize);
+ truncate_pagecache(inode, oldsize, newsize);
+ ext4_truncate(inode);
+
+ /*
+ * If we failed to get a transaction handle at all, we need to clean up
+ * the in-core orphan list manually.
+ */
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+err_out:
+ ext4_std_error(inode->i_sb, error);
+out:
+ return error;
+}
+
+
+/*
+ * ext4_setattr()
*
- * Another thing we have to assure is that if we are in ordered mode
- * and inode is still attached to the committing transaction, we must
- * we start writeout of all the dirty pages which are being truncated.
- * This way we are sure that all the data written in the previous
- * transaction are already on disk (truncate waits for pages under
- * writeback).
+ * Handle special things ext4 needs for changing owner of the file, changing
+ * ACLs, or truncating file.
*
- * Called with inode->i_mutex down.
+ * Called from notify_change with inode->i_mutex down.
*/
int ext4_setattr(struct dentry *dentry, struct iattr *attr)
{
@@ -4743,61 +4816,18 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
}

if (attr->ia_valid & ATTR_SIZE) {
- if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) {
- struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-
- if (attr->ia_size > sbi->s_bitmap_maxbytes) {
- error = -EFBIG;
- goto err_out;
- }
- }
- }
-
- if (S_ISREG(inode->i_mode) &&
- attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
- handle_t *handle;
-
- handle = ext4_journal_start(inode, 3);
- if (IS_ERR(handle)) {
- error = PTR_ERR(handle);
- goto err_out;
- }
-
- error = ext4_orphan_add(handle, inode);
- EXT4_I(inode)->i_disksize = attr->ia_size;
- rc = ext4_mark_inode_dirty(handle, inode);
- if (!error)
- error = rc;
- ext4_journal_stop(handle);
-
- if (ext4_should_order_data(inode)) {
- error = ext4_begin_ordered_truncate(inode,
- attr->ia_size);
- if (error) {
- /* Do as much error cleanup as possible */
- handle = ext4_journal_start(inode, 3);
- if (IS_ERR(handle)) {
- ext4_orphan_del(NULL, inode);
- goto err_out;
- }
- ext4_orphan_del(handle, inode);
- ext4_journal_stop(handle);
- goto err_out;
- }
- }
+ error = ext4_setsize(inode, attr->ia_size);
+ if (error)
+ return error;
}

- rc = inode_setattr(inode, attr);
-
- /* If inode_setattr's call to ext4_truncate failed to get a
- * transaction handle at all, we need to clean up the in-core
- * orphan list manually. */
- if (inode->i_nlink)
- ext4_orphan_del(NULL, inode);
+ generic_setattr(inode, attr);

- if (!rc && (ia_valid & ATTR_MODE))
+ if (ia_valid & ATTR_MODE)
rc = ext4_acl_chmod(inode);

+ /* Mark inode dirty due to changes done by generic_setattr() */
+ mark_inode_dirty(inode);
err_out:
ext4_std_error(inode->i_sb, error);
if (!error)
--
1.6.0.2


2009-09-17 15:21:44

by Jan Kara

Subject: [PATCH 4/7] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize

page_mkwrite() is meant to be used by filesystems to allocate blocks under a
page which is becoming writeably mmapped in some process' address space. This
allows a filesystem to fail the page fault if there is not enough space
available, the user exceeds quota, or a similar problem happens, rather than
silently discarding data later when writepage is called.

On filesystems where blocksize < pagesize the situation is more complicated.
Suppose, for example, that blocksize = 1024, pagesize = 4096, and a process
does:
ftruncate(fd, 0);
pwrite(fd, buf, 1024, 0);
map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
map[0] = 'a'; ----> page_mkwrite() for index 0 is called
ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
fsync(fd); ----> writepage() for index 0 is called

At the moment page_mkwrite() is called, the filesystem can allocate only one
block for the page because i_size == 1024. Otherwise it would create blocks
beyond i_size, which is generally undesirable. But later, at writepage() time,
we would like to have blocks allocated for the whole page (and in principle we
have to allocate them because the user could have filled the page with data
after the second ftruncate()).

This patch introduces a framework which allows filesystems to handle this with
reasonable effort. We change block_write_full_page() to check all the buffers
in the page (not only those inside i_size) but write only those which are
delayed or mapped. Since block_finish_hole() write-protects the page, the
filesystem is guaranteed that no block is submitted for IO unless it went
through get_block, either from ->write_begin() or ->page_mkwrite().
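
Schematically, the extending-write path then becomes (a condensed sketch of
the block_write_begin() / generic_write_end() changes below; oldsize stands
for the i_size sampled before the write):

	/* In write_begin, before grabbing the page: */
	if (pos > inode->i_size) {
		status = block_prepare_hole(inode, inode->i_size, pos,
					    flags, get_block);
		if (status)
			goto out;
	}

	/* In write_end, once i_size has been updated: */
	if (oldsize < pos)
		block_finish_hole(inode, oldsize, pos);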

Signed-off-by: Jan Kara <[email protected]>
---
fs/buffer.c | 375 ++++++++++++++++++++++++++++++++++++++++++-
fs/libfs.c | 38 +++++
fs/mpage.c | 3 +-
include/linux/buffer_head.h | 19 ++-
include/linux/fs.h | 2 +
5 files changed, 430 insertions(+), 7 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 0eaa961..abe105a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -40,6 +40,7 @@
#include <linux/cpu.h>
#include <linux/bitops.h>
#include <linux/mpage.h>
+#include <linux/rmap.h>
#include <linux/bit_spinlock.h>

static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
@@ -1688,7 +1689,7 @@ static int __block_prepare_write(struct inode *inode, struct page *page,
if (PageUptodate(page)) {
if (!buffer_uptodate(bh))
set_buffer_uptodate(bh);
- continue;
+ continue;
}
if (!buffer_uptodate(bh) && !buffer_delay(bh) &&
!buffer_unwritten(bh) &&
@@ -1771,7 +1772,15 @@ int block_write_begin(struct file *file, struct address_space *mapping,

page = *pagep;
if (page == NULL) {
+ struct inode *inode = mapping->host;
+
ownpage = 1;
+ if (unlikely(pos > inode->i_size)) {
+ status = block_prepare_hole(inode, inode->i_size, pos,
+ flags, get_block);
+ if (status)
+ goto out;
+ }
page = grab_cache_page_write_begin(mapping, index, flags);
if (!page) {
status = -ENOMEM;
@@ -1852,6 +1861,7 @@ int generic_write_end(struct file *file, struct address_space *mapping,
{
struct inode *inode = mapping->host;
int i_size_changed = 0;
+ loff_t oldsize = inode->i_size;

copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);

@@ -1879,6 +1889,9 @@ int generic_write_end(struct file *file, struct address_space *mapping,
if (i_size_changed)
mark_inode_dirty(inode);

+ if (oldsize < pos)
+ block_finish_hole(inode, oldsize, pos);
+
return copied;
}
EXPORT_SYMBOL(generic_write_end);
@@ -2168,6 +2181,285 @@ int block_commit_write(struct page *page, unsigned from, unsigned to)
return 0;
}

+/**
+ * block_prepare_hole_bh - prepare creation of a hole, return partial bh
+ * @inode: inode where the hole is created
+ * @from: offset in bytes where the hole starts
+ * @to: offset in bytes where the hole ends.
+ * @get_block: filesystem's function for mapping blocks
+ *
+ * Prepare creation of a hole in a file caused either by extending truncate or
+ * by write starting after current i_size. The function finds a buffer head
+ * straddling @from, maps it, loads it from disk if needed and returns a
+ * reference to it. The function also zeroes the page after the returned
+ * buffer head upto @to. When @from is on a block boundary or block straddling
+ * @from is a hole, the function just zeroes the page from @from to @to
+ * and returns NULL.
+ *
+ * This function is usually called from write_begin function of filesystems
+ * which need to do some more complicated stuff before / after writing data
+ * to a block.
+ *
+ * This function must be called with i_mutex locked.
+ */
+struct buffer_head *block_prepare_hole_bh(struct inode *inode, loff_t from,
+ loff_t to, unsigned flags,
+ get_block_t *get_block)
+{
+ int bsize = 1 << inode->i_blkbits;
+ struct page *page;
+ pgoff_t index;
+ int start, len, page_off;
+ int err = 0;
+ struct buffer_head *head, *bh;
+
+ WARN_ON(!mutex_is_locked(&inode->i_mutex));
+
+ if (from >= to)
+ goto out;
+ index = from >> PAGE_CACHE_SHIFT;
+ page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
+ if (!page) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ start = from & (PAGE_CACHE_SIZE - 1);
+ /* Block boundary? No need to flush anything to disk */
+ if (!(start & (bsize - 1)))
+ goto zero_page;
+
+ if (!page_has_buffers(page))
+ create_empty_buffers(page, bsize, 0);
+
+ /* Find the buffer that contains "offset" */
+ head = bh = page_buffers(page);
+ page_off = bsize;
+ while (start >= page_off) {
+ bh = bh->b_this_page;
+ page_off += bsize;
+ }
+
+ if (!buffer_mapped(bh) && !buffer_delay(bh)) {
+ WARN_ON(bh->b_size != bsize);
+ err = get_block(inode, from >> inode->i_blkbits, bh, 0);
+ if (err)
+ goto unlock;
+ /* Unmapped? It's a hole - nothing to write */
+ if (!buffer_mapped(bh) && !buffer_delay(bh))
+ goto zero_page;
+ }
+
+ if (!buffer_uptodate(bh) && !buffer_unwritten(bh)) {
+ ll_rw_block(READ, 1, &bh);
+ wait_on_buffer(bh);
+ /* Uhhuh. Read error. Complain and punt. */
+ if (!buffer_uptodate(bh)) {
+ err = -EIO;
+ goto unlock;
+ }
+ }
+ /* Zero the page after returned buffer */
+ len = min_t(int, PAGE_CACHE_SIZE - page_off, to - from);
+ zero_user(page, page_off, len);
+ get_bh(bh);
+ unlock_page(page);
+ page_cache_release(page);
+ return bh;
+
+zero_page:
+ len = min_t(int, PAGE_CACHE_SIZE - start, to - from);
+ zero_user(page, start, len);
+unlock:
+ unlock_page(page);
+ page_cache_release(page);
+out:
+ if (err)
+ return ERR_PTR(err);
+ return NULL;
+}
+EXPORT_SYMBOL(block_prepare_hole_bh);
+
+/**
+ * block_prepare_hole - prepare creation of a hole
+ * @inode: inode where the hole is created
+ * @from: offset in bytes where the hole starts
+ * @to: offset in bytes where the hole ends.
+ * @get_block: filesystem's function for mapping blocks
+ *
+ * Prepare creation of a hole in a file caused either by extending truncate or
+ * by write starting after current i_size. We zero-out tail of the page
+ * straddling @from and also mark buffer straddling @from as dirty.
+ *
+ * This function is usually called from write_begin function.
+ *
+ * This function must be called with i_mutex locked.
+ */
+int block_prepare_hole(struct inode *inode, loff_t from, loff_t to,
+ unsigned flags, get_block_t *get_block)
+{
+ struct buffer_head *bh;
+ int bsize = 1 << inode->i_blkbits;
+ int start, len;
+
+ bh = block_prepare_hole_bh(inode, from, to, flags, get_block);
+ if (!bh)
+ return 0;
+ if (IS_ERR(bh))
+ return PTR_ERR(bh);
+
+ /* Zero the buffer we got */
+ start = from & (PAGE_CACHE_SIZE - 1);
+ len = min_t(int, bsize - (start & (bsize - 1)), to - from);
+ zero_user(bh->b_page, start, len);
+ mark_buffer_dirty(bh);
+ brelse(bh);
+
+ return 0;
+}
+EXPORT_SYMBOL(block_prepare_hole);
+
+/**
+ * nobh_prepare_hole - prepare creation of a hole without buffer heads
+ * @inode: inode where the hole is created
+ * @from: offset in bytes where the hole starts
+ * @to: offset in bytes where the hole ends.
+ * @get_block: filesystem's function for mapping blocks
+ *
+ * Prepare creation of a hole in a file caused either by extending truncate or
+ * by write starting after current i_size. We zero-out tail of the page
+ * straddling @from and also mark buffer straddling @from as dirty.
+ *
+ * This function is called from nobh_write_begin function.
+ *
+ * This function must be called with i_mutex locked.
+ */
+int nobh_prepare_hole(struct inode *inode, loff_t from, loff_t to,
+ unsigned flags, get_block_t *get_block)
+{
+ int bsize = 1 << inode->i_blkbits;
+ struct page *page;
+ pgoff_t index;
+ int start, len;
+ int err = 0;
+ struct buffer_head map_bh;
+
+ WARN_ON(!mutex_is_locked(&inode->i_mutex));
+
+ if (from >= to)
+ goto out;
+ index = from >> PAGE_CACHE_SHIFT;
+ page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
+ if (!page) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ start = from & (PAGE_CACHE_SIZE - 1);
+ len = min_t(int, PAGE_CACHE_SIZE - start, to - from);
+ /* Block boundary? No need to flush anything to disk */
+ if (!(start & (bsize - 1)))
+ goto zero_page;
+
+ if (page_has_buffers(page)) {
+has_buffers:
+ unlock_page(page);
+ page_cache_release(page);
+ return block_prepare_hole(inode, from, to, flags, get_block);
+ }
+
+ map_bh.b_size = bsize;
+ map_bh.b_state = 0;
+ err = get_block(inode, from >> inode->i_blkbits, &map_bh, 0);
+ if (err)
+ goto unlock;
+ /* Unmapped? It's a hole - nothing to write */
+ if (!buffer_mapped(&map_bh))
+ goto zero_page;
+
+ /* Ok, it's mapped. Make sure it's up-to-date */
+ if (!PageUptodate(page)) {
+ err = inode->i_mapping->a_ops->readpage(NULL, page);
+ if (err) {
+ page_cache_release(page);
+ goto out;
+ }
+ lock_page(page);
+ if (!PageUptodate(page)) {
+ err = -EIO;
+ goto unlock;
+ }
+ if (page_has_buffers(page))
+ goto has_buffers;
+ }
+ /*
+ * We have to send the boundary buffer to disk so mark the page dirty.
+ * We do it before actually zeroing the page but that does not matter
+ * since we hold the page lock.
+ */
+ set_page_dirty(page);
+zero_page:
+ zero_user(page, start, len);
+unlock:
+ unlock_page(page);
+ page_cache_release(page);
+out:
+ return err;
+}
+EXPORT_SYMBOL(nobh_prepare_hole);
+
+#ifdef CONFIG_MMU
+/**
+ * block_finish_hole - finish creation of a hole
+ * @inode: inode where the hole is created
+ * @from: offset in bytes where the hole starts
+ * @to: offset in bytes where the hole ends.
+ *
+ * Finish creation of a hole in a file either caused by extending truncate or
+ * by write starting after current i_size. We mark the page straddling @from RO
+ * so that page_mkwrite() is called on the nearest write access to the page.
+ * This way filesystem can be sure that page_mkwrite() is called on the page
+ * before user writes to the page via mmap after the i_size has been changed
+ * (we hold i_mutex here and while we hold it user has no chance finding i_size
+ * is being changed).
+ *
+ * This function must be called after i_size is updated so that page_mkwrite()
+ * happenning immediately after we unlock the page initializes it correctly.
+ */
+void block_finish_hole(struct inode *inode, loff_t from, loff_t to)
+{
+ int bsize = 1 << inode->i_blkbits;
+ loff_t rounded_from;
+ struct page *page;
+ pgoff_t index;
+
+ WARN_ON(!mutex_is_locked(&inode->i_mutex));
+ WARN_ON(to > inode->i_size);
+
+ if (from >= to || bsize == PAGE_CACHE_SIZE)
+ return;
+ /* Currently last page will not have any hole block created? */
+ rounded_from = ALIGN(from, bsize);
+ if (to <= rounded_from || !(rounded_from & (PAGE_CACHE_SIZE - 1)))
+ return;
+
+ index = from >> PAGE_CACHE_SHIFT;
+ page = find_lock_page(inode->i_mapping, index);
+ /* Page not cached? Nothing to do */
+ if (!page)
+ return;
+ /*
+ * See clear_page_dirty_for_io() for details why set_page_dirty()
+ * is needed.
+ */
+ if (page_mkclean(page))
+ set_page_dirty(page);
+ unlock_page(page);
+ page_cache_release(page);
+}
+EXPORT_SYMBOL(block_finish_hole);
+#endif
+
/*
* block_page_mkwrite() is not allowed to change the file size as it gets
* called from a page fault handler when a page is first dirtied. Hence we must
@@ -2225,6 +2517,56 @@ out:
return ret;
}

+/**
+ * noalloc_page_mkwrite - fault in the page without allocating any blocks
+ * @vma: vma where the fault happened
+ * @vmf: information about the fault
+ *
+ * This is a page_mkwrite function for filesystems that want to retain the old
+ * behavior and not allocate any blocks on page fault. It just marks all
+ * unmapped buffers in the page as delayed so that block_write_full_page()
+ * writes them.
+ */
+int noalloc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct page *page = vmf->page;
+ struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
+ int bsize;
+ loff_t size, off;
+ int ret = VM_FAULT_NOPAGE; /* make the VM retry the fault */
+ struct buffer_head *head, *bh;
+
+ lock_page(page);
+ size = i_size_read(inode);
+ bsize = 1 << inode->i_blkbits;
+ if ((page->mapping != inode->i_mapping) ||
+ (page_offset(page) > size)) {
+ /* page got truncated out from underneath us */
+ unlock_page(page);
+ goto out;
+ }
+
+ ret = VM_FAULT_LOCKED;
+ if (!page_has_buffers(page))
+ create_empty_buffers(page, bsize, 0);
+ head = bh = page_buffers(page);
+ off = ((loff_t)page->index) << PAGE_CACHE_SHIFT;
+ do {
+ /*
+ * Using BH_Delay is a slight hack but in fact it makes
+ * sence since the filesystem really delays the allocation
+ * to the moment writeout happens
+ */
+ if (!buffer_mapped(bh) && off < size)
+ set_buffer_delay(bh);
+ bh = bh->b_this_page;
+ off += bsize;
+ } while (bh != head);
+out:
+ return ret;
+}
+EXPORT_SYMBOL(noalloc_page_mkwrite);
+
/*
* nobh_write_begin()'s prereads are special: the buffer_heads are freed
* immediately, while under the page lock. So it needs a special end_io
@@ -2286,6 +2628,13 @@ int nobh_write_begin(struct file *file, struct address_space *mapping,
from = pos & (PAGE_CACHE_SIZE - 1);
to = from + len;

+ if (unlikely(pos > inode->i_size)) {
+ ret = nobh_prepare_hole(inode, inode->i_size, pos,
+ flags, get_block);
+ if (ret)
+ return ret;
+ }
+
page = grab_cache_page_write_begin(mapping, index, flags);
if (!page)
return -ENOMEM;
@@ -2416,6 +2765,8 @@ int nobh_write_end(struct file *file, struct address_space *mapping,
struct inode *inode = page->mapping->host;
struct buffer_head *head = fsdata;
struct buffer_head *bh;
+ loff_t oldsize = inode->i_size;
+
BUG_ON(fsdata != NULL && page_has_buffers(page));

if (unlikely(copied < len) && head)
@@ -2434,6 +2785,9 @@ int nobh_write_end(struct file *file, struct address_space *mapping,
unlock_page(page);
page_cache_release(page);

+ if (oldsize < pos)
+ block_finish_hole(inode, oldsize, pos);
+
while (head) {
bh = head;
head = head->b_this_page;
@@ -2477,8 +2831,8 @@ int nobh_truncate_page(struct address_space *mapping,
blocksize = 1 << inode->i_blkbits;
length = offset & (blocksize - 1);

- /* Block boundary? Nothing to do */
- if (!length)
+ /* Page boundary? Nothing to do */
+ if (!offset)
return 0;

length = blocksize - length;
@@ -2503,6 +2857,10 @@ has_buffers:
pos += blocksize;
}

+ /* Page is no longer fully mapped? */
+ if (pos < PAGE_CACHE_SIZE)
+ ClearPageMappedToDisk(page);
+
map_bh.b_size = blocksize;
map_bh.b_state = 0;
err = get_block(inode, iblock, &map_bh, 0);
@@ -2667,11 +3025,12 @@ int block_write_full_page_endio(struct page *page, get_block_t *get_block,
int nr_underway = 0;
int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
WRITE_SYNC_PLUG : WRITE);
+ int new_writepage = page->mapping->a_ops->new_writepage;

BUG_ON(!PageLocked(page));

/* Is the page fully inside i_size? */
- if (page->index < end_index)
+ if (page->index < end_index || new_writepage)
goto write_page;

/* Is the page fully outside i_size? (truncate in progress) */
@@ -2716,6 +3075,12 @@ write_page:
* handle any aliases from the underlying blockdev's mapping.
*/
do {
+ if (new_writepage) {
+ if (buffer_dirty(bh) && buffer_delay(bh))
+ goto map_it;
+ else
+ goto next;
+ }
if (block > last_block) {
/*
* mapped buffers outside i_size will occur, because
@@ -2729,6 +3094,7 @@ write_page:
set_buffer_uptodate(bh);
} else if ((!buffer_mapped(bh) || buffer_delay(bh)) &&
buffer_dirty(bh)) {
+map_it:
WARN_ON(bh->b_size != blocksize);
err = get_block(inode, block, bh, 1);
if (err)
@@ -2741,6 +3107,7 @@ write_page:
bh->b_blocknr);
}
}
+next:
bh = bh->b_this_page;
block++;
} while (bh != head);
diff --git a/fs/libfs.c b/fs/libfs.c
index 28c45b0..e3741df 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -331,6 +331,38 @@ int simple_rename(struct inode *old_dir, struct dentry *old_dentry,
}

/**
+ * simple_create_hole - handle creation of a hole on memory-backed filesystem
+ * @inode: inode where the hole is created
+ * @from: offset in bytes where the hole starts
+ * @to: offset in bytes where the hole ends.
+ *
+ * This function does necessary zeroing when a hole is created on a memory
+ * backed filesystem (non-zeros can happen to be beyond EOF because the
+ * page was written via mmap).
+ */
+void simple_create_hole(struct inode *inode, loff_t from, loff_t to)
+{
+ struct page *page;
+ pgoff_t index;
+ int start, len;
+
+ if (from >= to)
+ return;
+ index = from >> PAGE_CACHE_SHIFT;
+ page = find_lock_page(inode->i_mapping, index);
+ /* Page not cached? Nothing to do */
+ if (!page)
+ return;
+ start = from & (PAGE_CACHE_SIZE - 1);
+ len = min_t(int, PAGE_CACHE_SIZE - start, to - from);
+ zero_user(page, start, len);
+ unlock_page(page);
+ page_cache_release(page);
+ return;
+}
+EXPORT_SYMBOL(simple_create_hole);
+
+/**
* simple_setsize - handle core mm and vfs requirements for file size change
* @inode: inode
* @newsize: new file size
@@ -368,6 +400,8 @@ int simple_setsize(struct inode *inode, loff_t newsize)
oldsize = inode->i_size;
i_size_write(inode, newsize);
truncate_pagecache(inode, oldsize, newsize);
+ if (newsize > oldsize)
+ simple_create_hole(inode, oldsize, newsize);

return error;
}
@@ -440,6 +474,10 @@ int simple_write_begin(struct file *file, struct address_space *mapping,
struct page *page;
pgoff_t index;
unsigned from;
+ struct inode *inode = mapping->host;
+
+ if (pos > inode->i_size)
+ simple_create_hole(inode, inode->i_size, pos);

index = pos >> PAGE_CACHE_SHIFT;
from = pos & (PAGE_CACHE_SIZE - 1);
diff --git a/fs/mpage.c b/fs/mpage.c
index 42381bd..cb0ebee 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -470,6 +470,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
struct buffer_head map_bh;
loff_t i_size = i_size_read(inode);
int ret = 0;
+ int new_writepage = mapping->a_ops->new_writepage;

if (page_has_buffers(page)) {
struct buffer_head *head = page_buffers(page);
@@ -558,7 +559,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,

page_is_mapped:
end_index = i_size >> PAGE_CACHE_SHIFT;
- if (page->index >= end_index) {
+ if (page->index >= end_index && !new_writepage) {
/*
* The page straddles i_size. It must be zeroed out on each
* and every writepage invokation because it may be mmapped.
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 16ed028..eb79342 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -151,8 +151,7 @@ void set_bh_page(struct buffer_head *bh,
int try_to_free_buffers(struct page *);
struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
int retry);
-void create_empty_buffers(struct page *, unsigned long,
- unsigned long b_state);
+void create_empty_buffers(struct page *, unsigned long, unsigned long b_state);
void end_buffer_read_sync(struct buffer_head *bh, int uptodate);
void end_buffer_write_sync(struct buffer_head *bh, int uptodate);
void end_buffer_async_write(struct buffer_head *bh, int uptodate);
@@ -221,6 +220,20 @@ int generic_cont_expand_simple(struct inode *inode, loff_t size);
int block_commit_write(struct page *page, unsigned from, unsigned to);
int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
get_block_t get_block);
+int noalloc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
+int block_prepare_hole(struct inode *inode, loff_t from, loff_t to,
+ unsigned flags, get_block_t *get_block);
+struct buffer_head *block_prepare_hole_bh(struct inode *inode, loff_t from,
+ loff_t to, unsigned flags,
+ get_block_t *get_block);
+#ifdef CONFIG_MMU
+void block_finish_hole(struct inode *inode, loff_t from, loff_t to);
+#else
+static inline void block_finish_hole(struct inode *inode, loff_t from, loff_t to)
+{
+}
+#endif
+
void block_sync_page(struct page *);
sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *);
int block_truncate_page(struct address_space *, loff_t, get_block_t *);
@@ -232,6 +245,8 @@ int nobh_write_end(struct file *, struct address_space *,
loff_t, unsigned, unsigned,
struct page *, void *);
int nobh_truncate_page(struct address_space *, loff_t, get_block_t *);
+int nobh_prepare_hole(struct inode *inode, loff_t from, loff_t to,
+ unsigned flags, get_block_t *get_block);
int nobh_writepage(struct page *page, get_block_t *get_block,
struct writeback_control *wbc);

diff --git a/include/linux/fs.h b/include/linux/fs.h
index c9d5f89..e1cc0c2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -595,6 +595,8 @@ struct address_space_operations {
int (*launder_page) (struct page *);
int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
unsigned long);
+ int new_writepage; /* A hack until filesystems are converted to
+ * use new block_write_full_page */
};

/*
--
1.6.0.2


2009-09-17 15:21:42

by Jan Kara

Subject: [PATCH 2/7] fs: Remove zeroing from nobh_writepage

It isn't necessary to zero the page in nobh_writepage() since
mpage_writepage() takes care of that. We just have to use
block_write_full_page() instead of __block_write_full_page() to properly zero
the page in case we have to fall back to the ordinary writepage path.

Since __block_write_full_page() is then only used from
block_write_full_page_endio(), join them.
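
After this change, nobh_writepage() reduces to essentially the following
(taken from the resulting code in the diff below):

int nobh_writepage(struct page *page, get_block_t *get_block,
		   struct writeback_control *wbc)
{
	int ret;

	ret = mpage_writepage(page, get_block, wbc);
	if (ret == -EAGAIN)	/* fall back to the buffer-based path */
		ret = block_write_full_page(page, get_block, wbc);
	return ret;
}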

Signed-off-by: Jan Kara <[email protected]>
---
fs/buffer.c | 426 +++++++++++++++++++++++++++--------------------------------
1 files changed, 196 insertions(+), 230 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 67b260a..0eaa961 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1589,207 +1589,6 @@ void unmap_underlying_metadata(struct block_device *bdev, sector_t block)
EXPORT_SYMBOL(unmap_underlying_metadata);

/*
- * NOTE! All mapped/uptodate combinations are valid:
- *
- * Mapped Uptodate Meaning
- *
- * No No "unknown" - must do get_block()
- * No Yes "hole" - zero-filled
- * Yes No "allocated" - allocated on disk, not read in
- * Yes Yes "valid" - allocated and up-to-date in memory.
- *
- * "Dirty" is valid only with the last case (mapped+uptodate).
- */
-
-/*
- * While block_write_full_page is writing back the dirty buffers under
- * the page lock, whoever dirtied the buffers may decide to clean them
- * again at any time. We handle that by only looking at the buffer
- * state inside lock_buffer().
- *
- * If block_write_full_page() is called for regular writeback
- * (wbc->sync_mode == WB_SYNC_NONE) then it will redirty a page which has a
- * locked buffer. This only can happen if someone has written the buffer
- * directly, with submit_bh(). At the address_space level PageWriteback
- * prevents this contention from occurring.
- *
- * If block_write_full_page() is called with wbc->sync_mode ==
- * WB_SYNC_ALL, the writes are posted using WRITE_SYNC_PLUG; this
- * causes the writes to be flagged as synchronous writes, but the
- * block device queue will NOT be unplugged, since usually many pages
- * will be pushed to the out before the higher-level caller actually
- * waits for the writes to be completed. The various wait functions,
- * such as wait_on_writeback_range() will ultimately call sync_page()
- * which will ultimately call blk_run_backing_dev(), which will end up
- * unplugging the device queue.
- */
-static int __block_write_full_page(struct inode *inode, struct page *page,
- get_block_t *get_block, struct writeback_control *wbc,
- bh_end_io_t *handler)
-{
- int err;
- sector_t block;
- sector_t last_block;
- struct buffer_head *bh, *head;
- const unsigned blocksize = 1 << inode->i_blkbits;
- int nr_underway = 0;
- int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
- WRITE_SYNC_PLUG : WRITE);
-
- BUG_ON(!PageLocked(page));
-
- last_block = (i_size_read(inode) - 1) >> inode->i_blkbits;
-
- if (!page_has_buffers(page)) {
- create_empty_buffers(page, blocksize,
- (1 << BH_Dirty)|(1 << BH_Uptodate));
- }
-
- /*
- * Be very careful. We have no exclusion from __set_page_dirty_buffers
- * here, and the (potentially unmapped) buffers may become dirty at
- * any time. If a buffer becomes dirty here after we've inspected it
- * then we just miss that fact, and the page stays dirty.
- *
- * Buffers outside i_size may be dirtied by __set_page_dirty_buffers;
- * handle that here by just cleaning them.
- */
-
- block = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
- head = page_buffers(page);
- bh = head;
-
- /*
- * Get all the dirty buffers mapped to disk addresses and
- * handle any aliases from the underlying blockdev's mapping.
- */
- do {
- if (block > last_block) {
- /*
- * mapped buffers outside i_size will occur, because
- * this page can be outside i_size when there is a
- * truncate in progress.
- */
- /*
- * The buffer was zeroed by block_write_full_page()
- */
- clear_buffer_dirty(bh);
- set_buffer_uptodate(bh);
- } else if ((!buffer_mapped(bh) || buffer_delay(bh)) &&
- buffer_dirty(bh)) {
- WARN_ON(bh->b_size != blocksize);
- err = get_block(inode, block, bh, 1);
- if (err)
- goto recover;
- clear_buffer_delay(bh);
- if (buffer_new(bh)) {
- /* blockdev mappings never come here */
- clear_buffer_new(bh);
- unmap_underlying_metadata(bh->b_bdev,
- bh->b_blocknr);
- }
- }
- bh = bh->b_this_page;
- block++;
- } while (bh != head);
-
- do {
- if (!buffer_mapped(bh))
- continue;
- /*
- * If it's a fully non-blocking write attempt and we cannot
- * lock the buffer then redirty the page. Note that this can
- * potentially cause a busy-wait loop from pdflush and kswapd
- * activity, but those code paths have their own higher-level
- * throttling.
- */
- if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
- lock_buffer(bh);
- } else if (!trylock_buffer(bh)) {
- redirty_page_for_writepage(wbc, page);
- continue;
- }
- if (test_clear_buffer_dirty(bh)) {
- mark_buffer_async_write_endio(bh, handler);
- } else {
- unlock_buffer(bh);
- }
- } while ((bh = bh->b_this_page) != head);
-
- /*
- * The page and its buffers are protected by PageWriteback(), so we can
- * drop the bh refcounts early.
- */
- BUG_ON(PageWriteback(page));
- set_page_writeback(page);
-
- do {
- struct buffer_head *next = bh->b_this_page;
- if (buffer_async_write(bh)) {
- submit_bh(write_op, bh);
- nr_underway++;
- }
- bh = next;
- } while (bh != head);
- unlock_page(page);
-
- err = 0;
-done:
- if (nr_underway == 0) {
- /*
- * The page was marked dirty, but the buffers were
- * clean. Someone wrote them back by hand with
- * ll_rw_block/submit_bh. A rare case.
- */
- end_page_writeback(page);
-
- /*
- * The page and buffer_heads can be released at any time from
- * here on.
- */
- }
- return err;
-
-recover:
- /*
- * ENOSPC, or some other error. We may already have added some
- * blocks to the file, so we need to write these out to avoid
- * exposing stale data.
- * The page is currently locked and not marked for writeback
- */
- bh = head;
- /* Recovery: lock and submit the mapped buffers */
- do {
- if (buffer_mapped(bh) && buffer_dirty(bh) &&
- !buffer_delay(bh)) {
- lock_buffer(bh);
- mark_buffer_async_write_endio(bh, handler);
- } else {
- /*
- * The buffer may have been set dirty during
- * attachment to a dirty page.
- */
- clear_buffer_dirty(bh);
- }
- } while ((bh = bh->b_this_page) != head);
- SetPageError(page);
- BUG_ON(PageWriteback(page));
- mapping_set_error(page->mapping, err);
- set_page_writeback(page);
- do {
- struct buffer_head *next = bh->b_this_page;
- if (buffer_async_write(bh)) {
- clear_buffer_dirty(bh);
- submit_bh(write_op, bh);
- nr_underway++;
- }
- bh = next;
- } while (bh != head);
- unlock_page(page);
- goto done;
-}
-
-/*
* If a page has any new buffers, zero them out here, and mark them uptodate
* and dirty so they'll be written out (in order to prevent uninitialised
* block data from leaking). And clear the new bit.
@@ -2653,36 +2452,11 @@ EXPORT_SYMBOL(nobh_write_end);
int nobh_writepage(struct page *page, get_block_t *get_block,
struct writeback_control *wbc)
{
- struct inode * const inode = page->mapping->host;
- loff_t i_size = i_size_read(inode);
- const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
- unsigned offset;
int ret;

- /* Is the page fully inside i_size? */
- if (page->index < end_index)
- goto out;
-
- /* Is the page fully outside i_size? (truncate in progress) */
- offset = i_size & (PAGE_CACHE_SIZE-1);
- if (page->index >= end_index+1 || !offset) {
- unlock_page(page);
- return 0;
- }
-
- /*
- * The page straddles i_size. It must be zeroed out on each and every
- * writepage invocation because it may be mmapped. "A file is mapped
- * in multiples of the page size. For a file that is not a multiple of
- * the page size, the remaining memory is zeroed when mapped, and
- * writes to that region are not written out to the file."
- */
- zero_user_segment(page, offset, PAGE_CACHE_SIZE);
-out:
ret = mpage_writepage(page, get_block, wbc);
if (ret == -EAGAIN)
- ret = __block_write_full_page(inode, page, get_block, wbc,
- end_buffer_async_write);
+ ret = block_write_full_page(page, get_block, wbc);
return ret;
}
EXPORT_SYMBOL(nobh_writepage);
@@ -2841,21 +2615,64 @@ out:
}

/*
+ * NOTE! All mapped/uptodate combinations are valid:
+ *
+ * Mapped Uptodate Meaning
+ *
+ * No No "unknown" - must do get_block()
+ * No Yes "hole" - zero-filled
+ * Yes No "allocated" - allocated on disk, not read in
+ * Yes Yes "valid" - allocated and up-to-date in memory.
+ *
+ * "Dirty" is valid only with the last case (mapped+uptodate).
+ */
+
+/*
* The generic ->writepage function for buffer-backed address_spaces
* this form passes in the end_io handler used to finish the IO.
+ *
+ * While block_write_full_page is writing back the dirty buffers under
+ * the page lock, whoever dirtied the buffers may decide to clean them
+ * again at any time. We handle that by only looking at the buffer
+ * state inside lock_buffer().
+ *
+ * If block_write_full_page() is called for regular writeback
+ * (wbc->sync_mode == WB_SYNC_NONE) then it will redirty a page which has a
+ * locked buffer. This only can happen if someone has written the buffer
+ * directly, with submit_bh(). At the address_space level PageWriteback
+ * prevents this contention from occurring.
+ *
+ * If block_write_full_page() is called with wbc->sync_mode ==
+ * WB_SYNC_ALL, the writes are posted using WRITE_SYNC_PLUG; this
+ * causes the writes to be flagged as synchronous writes, but the
+ * block device queue will NOT be unplugged, since usually many pages
+ * will be pushed to the out before the higher-level caller actually
+ * waits for the writes to be completed. The various wait functions,
+ * such as wait_on_writeback_range() will ultimately call sync_page()
+ * which will ultimately call blk_run_backing_dev(), which will end up
+ * unplugging the device queue.
*/
int block_write_full_page_endio(struct page *page, get_block_t *get_block,
struct writeback_control *wbc, bh_end_io_t *handler)
{
+ int err;
struct inode * const inode = page->mapping->host;
loff_t i_size = i_size_read(inode);
const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
unsigned offset;
+ sector_t block;
+ sector_t last_block;
+ struct buffer_head *bh, *head;
+ const unsigned blocksize = 1 << inode->i_blkbits;
+ int nr_underway = 0;
+ int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
+ WRITE_SYNC_PLUG : WRITE);
+
+ BUG_ON(!PageLocked(page));

/* Is the page fully inside i_size? */
if (page->index < end_index)
- return __block_write_full_page(inode, page, get_block, wbc,
- handler);
+ goto write_page;

/* Is the page fully outside i_size? (truncate in progress) */
offset = i_size & (PAGE_CACHE_SIZE-1);
@@ -2872,7 +2689,156 @@ int block_write_full_page_endio(struct page *page, get_block_t *get_block,
* writes to that region are not written out to the file."
*/
zero_user_segment(page, offset, PAGE_CACHE_SIZE);
- return __block_write_full_page(inode, page, get_block, wbc, handler);
+write_page:
+ last_block = i_size >> inode->i_blkbits;
+
+ if (!page_has_buffers(page)) {
+ create_empty_buffers(page, blocksize,
+ (1 << BH_Dirty)|(1 << BH_Uptodate));
+ }
+
+ /*
+ * Be very careful. We have no exclusion from __set_page_dirty_buffers
+ * here, and the (potentially unmapped) buffers may become dirty at
+ * any time. If a buffer becomes dirty here after we've inspected it
+ * then we just miss that fact, and the page stays dirty.
+ *
+ * Buffers outside i_size may be dirtied by __set_page_dirty_buffers;
+ * handle that here by just cleaning them.
+ */
+
+ block = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ head = page_buffers(page);
+ bh = head;
+
+ /*
+ * Get all the dirty buffers mapped to disk addresses and
+ * handle any aliases from the underlying blockdev's mapping.
+ */
+ do {
+ if (block > last_block) {
+ /*
+ * mapped buffers outside i_size will occur, because
+ * this page can be outside i_size when there is a
+ * truncate in progress.
+ */
+ /*
+ * The buffer was zeroed by block_write_full_page()
+ */
+ clear_buffer_dirty(bh);
+ set_buffer_uptodate(bh);
+ } else if ((!buffer_mapped(bh) || buffer_delay(bh)) &&
+ buffer_dirty(bh)) {
+ WARN_ON(bh->b_size != blocksize);
+ err = get_block(inode, block, bh, 1);
+ if (err)
+ goto recover;
+ clear_buffer_delay(bh);
+ if (buffer_new(bh)) {
+ /* blockdev mappings never come here */
+ clear_buffer_new(bh);
+ unmap_underlying_metadata(bh->b_bdev,
+ bh->b_blocknr);
+ }
+ }
+ bh = bh->b_this_page;
+ block++;
+ } while (bh != head);
+
+ do {
+ if (!buffer_mapped(bh))
+ continue;
+ /*
+ * If it's a fully non-blocking write attempt and we cannot
+ * lock the buffer then redirty the page. Note that this can
+ * potentially cause a busy-wait loop from pdflush and kswapd
+ * activity, but those code paths have their own higher-level
+ * throttling.
+ */
+ if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
+ lock_buffer(bh);
+ } else if (!trylock_buffer(bh)) {
+ redirty_page_for_writepage(wbc, page);
+ continue;
+ }
+ if (test_clear_buffer_dirty(bh)) {
+ mark_buffer_async_write_endio(bh, handler);
+ } else {
+ unlock_buffer(bh);
+ }
+ } while ((bh = bh->b_this_page) != head);
+
+ /*
+ * The page and its buffers are protected by PageWriteback(), so we can
+ * drop the bh refcounts early.
+ */
+ BUG_ON(PageWriteback(page));
+ set_page_writeback(page);
+
+ do {
+ struct buffer_head *next = bh->b_this_page;
+ if (buffer_async_write(bh)) {
+ submit_bh(write_op, bh);
+ nr_underway++;
+ }
+ bh = next;
+ } while (bh != head);
+ unlock_page(page);
+
+ err = 0;
+done:
+ if (nr_underway == 0) {
+ /*
+ * The page was marked dirty, but the buffers were
+ * clean. Someone wrote them back by hand with
+ * ll_rw_block/submit_bh. A rare case.
+ */
+ end_page_writeback(page);
+
+ /*
+ * The page and buffer_heads can be released at any time from
+ * here on.
+ */
+ }
+ return err;
+
+recover:
+ /*
+ * ENOSPC, or some other error. We may already have added some
+ * blocks to the file, so we need to write these out to avoid
+ * exposing stale data.
+ * The page is currently locked and not marked for writeback
+ */
+ bh = head;
+ /* Recovery: lock and submit the mapped buffers */
+ do {
+ if (buffer_mapped(bh) && buffer_dirty(bh) &&
+ !buffer_delay(bh)) {
+ lock_buffer(bh);
+ mark_buffer_async_write_endio(bh, handler);
+ } else {
+ /*
+ * The buffer may have been set dirty during
+ * attachment to a dirty page.
+ */
+ clear_buffer_dirty(bh);
+ }
+ } while ((bh = bh->b_this_page) != head);
+ SetPageError(page);
+ BUG_ON(PageWriteback(page));
+ mapping_set_error(page->mapping, err);
+ set_page_writeback(page);
+ do {
+ struct buffer_head *next = bh->b_this_page;
+ if (buffer_async_write(bh)) {
+ clear_buffer_dirty(bh);
+ submit_bh(write_op, bh);
+ nr_underway++;
+ }
+ bh = next;
+ } while (bh != head);
+ unlock_page(page);
+ goto done;
}

/*
--
1.6.0.2
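
As a quick orientation for anyone converting a filesystem, here is a
minimal sketch of how this entry point is meant to be used with a
custom completion handler. The foo_* names are hypothetical and only
illustrate the calling convention; block_write_full_page_endio() and
the get_block_t/bh_end_io_t types are the real interfaces, and the
sketch assumes end_buffer_async_write() from fs/buffer.c is available
to the filesystem as the default per-buffer completion.

#include <linux/fs.h>
#include <linux/buffer_head.h>
#include <linux/writeback.h>

/* the filesystem's usual block-mapping callback (hypothetical) */
static int foo_get_block(struct inode *inode, sector_t iblock,
                         struct buffer_head *bh_result, int create);

/* per-buffer completion: do fs-specific bookkeeping, then the default */
static void foo_end_buffer_write(struct buffer_head *bh, int uptodate)
{
        /* filesystem-specific end_io work would go here */
        end_buffer_async_write(bh, uptodate);
}

/* .writepage implementation built on the endio variant */
static int foo_writepage(struct page *page, struct writeback_control *wbc)
{
        return block_write_full_page_endio(page, foo_get_block, wbc,
                                           foo_end_buffer_write);
}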


2009-09-17 15:21:46

by Jan Kara

[permalink] [raw]
Subject: [PATCH 6/7] ext4: Convert ext4 to new mkwrite code

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 2 -
fs/ext4/extents.c | 4 -
fs/ext4/inode.c | 239 +++++++++++++++++++++++------------------------------
3 files changed, 102 insertions(+), 143 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9714db3..9110114 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1374,8 +1374,6 @@ extern void ext4_set_aops(struct inode *inode);
extern int ext4_writepage_trans_blocks(struct inode *);
extern int ext4_meta_trans_blocks(struct inode *, int nrblocks, int idxblocks);
extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
-extern int ext4_block_truncate_page(handle_t *handle,
- struct address_space *mapping, loff_t from);
extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
extern qsize_t ext4_get_reserved_space(struct inode *inode);

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 73ebfb4..76ece1b 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2996,7 +2996,6 @@ out2:

void ext4_ext_truncate(struct inode *inode)
{
- struct address_space *mapping = inode->i_mapping;
struct super_block *sb = inode->i_sb;
ext4_lblk_t last_block;
handle_t *handle;
@@ -3010,9 +3009,6 @@ void ext4_ext_truncate(struct inode *inode)
if (IS_ERR(handle))
return;

- if (inode->i_size & (sb->s_blocksize - 1))
- ext4_block_truncate_page(handle, mapping, inode->i_size);
-
if (ext4_orphan_add(handle, inode))
goto out_stop;

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index be25874..1df027c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1451,6 +1451,53 @@ static int do_journal_get_write_access(handle_t *handle,
return ext4_journal_get_write_access(handle, bh);
}

+/*
+ * If this write creates a hole, zero out the tail of the block that
+ * partially underlies the newly created hole
+ */
+static int ext4_prepare_hole(handle_t *handle, struct inode *inode,
+ loff_t from, loff_t to, unsigned flags)
+{
+ struct buffer_head *bh;
+ unsigned hole_start, hole_len;
+ unsigned bsize = 1 << inode->i_blkbits;
+ int ret = 0; /* stays 0 unless a journal operation below fails */
+
+ /* No hole created? */
+ if (from >= to)
+ return 0;
+
+ bh = block_prepare_hole_bh(inode, from, to, flags, ext4_get_block);
+ if (IS_ERR(bh))
+ return PTR_ERR(bh);
+ if (!bh)
+ return 0;
+
+ if (ext4_should_journal_data(inode)) {
+ ret = ext4_journal_get_write_access(handle, bh);
+ if (ret)
+ goto out;
+ }
+ /* Zero from 'from' to the end of its block (or only up to 'to') */
+ hole_start = from & (PAGE_CACHE_SIZE - 1);
+ if (to > ALIGN(from, bsize))
+ hole_len = bsize - (hole_start & (bsize - 1));
+ else
+ hole_len = to - from;
+
+ zero_user(bh->b_page, hole_start, hole_len);
+ if (ext4_should_order_data(inode)) {
+ ret = ext4_jbd2_file_inode(handle, inode);
+ mark_buffer_dirty(bh);
+ } else if (ext4_should_journal_data(inode))
+ ret = ext4_handle_dirty_metadata(handle, inode, bh);
+ else
+ mark_buffer_dirty(bh);
+out:
+ brelse(bh);
+ return ret;
+}
+
static int ext4_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
@@ -1465,10 +1512,13 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,

trace_ext4_write_begin(inode, pos, len, flags);
/*
- * Reserve one block more for addition to orphan list in case
- * we allocate blocks but write fails for some reason
+ * Reserve one block for addition to the orphan list in case
+ * we allocate blocks but the write fails for some reason, and
+ * one more block for the zeroed tail block in case we create
+ * a hole (the latter is needed only if we journal data)
*/
- needed_blocks = ext4_writepage_trans_blocks(inode) + 1;
+ needed_blocks = ext4_writepage_trans_blocks(inode) + 1 +
+ (pos > inode->i_size && ext4_should_journal_data(inode));
index = pos >> PAGE_CACHE_SHIFT;
from = pos & (PAGE_CACHE_SIZE - 1);
to = from + len;
@@ -1484,6 +1534,12 @@ retry:
* started */
flags |= AOP_FLAG_NOFS;

+ ret = ext4_prepare_hole(handle, inode, inode->i_size, pos, flags);
+ if (ret) {
+ ext4_journal_stop(handle);
+ goto out;
+ }
+
page = grab_cache_page_write_begin(mapping, index, flags);
if (!page) {
ext4_journal_stop(handle);
@@ -1548,23 +1604,17 @@ static int ext4_generic_write_end(struct file *file,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata)
{
- int i_size_changed = 0;
struct inode *inode = mapping->host;
handle_t *handle = ext4_journal_current_handle();
+ loff_t old_i_size = inode->i_size;

copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);

- /*
- * No need to use i_size_read() here, the i_size
- * cannot change under us because we hold i_mutex.
- *
- * But it's important to update i_size while still holding page lock:
- * page writeout could otherwise come in and zero beyond i_size.
- */
- if (pos + copied > inode->i_size) {
+ unlock_page(page);
+ page_cache_release(page);
+
+ if (pos + copied > inode->i_size)
i_size_write(inode, pos + copied);
- i_size_changed = 1;
- }

if (pos + copied > EXT4_I(inode)->i_disksize) {
/* We need to mark inode dirty even if
@@ -1572,19 +1622,9 @@ static int ext4_generic_write_end(struct file *file,
* bu greater than i_disksize.(hint delalloc)
*/
ext4_update_i_disksize(inode, (pos + copied));
- i_size_changed = 1;
- }
- unlock_page(page);
- page_cache_release(page);
-
- /*
- * Don't mark the inode dirty under page lock. First, it unnecessarily
- * makes the holding time of page lock longer. Second, it forces lock
- * ordering of page lock and transaction start for journaling
- * filesystems.
- */
- if (i_size_changed)
ext4_mark_inode_dirty(handle, inode);
+ }
+ block_finish_hole(inode, old_i_size, pos);

return copied;
}
@@ -1691,7 +1731,7 @@ static int ext4_journalled_write_end(struct file *file,
int ret = 0, ret2;
int partial = 0;
unsigned from, to;
- loff_t new_i_size;
+ loff_t new_i_size, old_i_size = inode->i_size;

trace_ext4_journalled_write_end(inode, pos, len, copied);
from = pos & (PAGE_CACHE_SIZE - 1);
@@ -1707,6 +1747,10 @@ static int ext4_journalled_write_end(struct file *file,
to, &partial, write_end_fn);
if (!partial)
SetPageUptodate(page);
+
+ unlock_page(page);
+ page_cache_release(page);
+
new_i_size = pos + copied;
if (new_i_size > inode->i_size)
i_size_write(inode, pos+copied);
@@ -1718,8 +1762,8 @@ static int ext4_journalled_write_end(struct file *file,
ret = ret2;
}

- unlock_page(page);
- page_cache_release(page);
+ block_finish_hole(inode, old_i_size, pos);
+
if (pos + len > inode->i_size && ext4_can_truncate(inode))
/* if we have allocated more blocks and copied
* less. We will have blocks allocated outside
@@ -2967,6 +3011,12 @@ retry:
* started */
flags |= AOP_FLAG_NOFS;

+ ret = ext4_prepare_hole(handle, inode, inode->i_size, pos, flags);
+ if (ret) {
+ ext4_journal_stop(handle);
+ goto out;
+ }
+
page = grab_cache_page_write_begin(mapping, index, flags);
if (!page) {
ext4_journal_stop(handle);
@@ -3434,100 +3484,6 @@ void ext4_set_aops(struct inode *inode)
}

/*
- * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
- * up to the end of the block which corresponds to `from'.
- * This required during truncate. We need to physically zero the tail end
- * of that block so it doesn't yield old data if the file is later grown.
- */
-int ext4_block_truncate_page(handle_t *handle,
- struct address_space *mapping, loff_t from)
-{
- ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
- unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned blocksize, length, pos;
- ext4_lblk_t iblock;
- struct inode *inode = mapping->host;
- struct buffer_head *bh;
- struct page *page;
- int err = 0;
-
- page = find_or_create_page(mapping, from >> PAGE_CACHE_SHIFT,
- mapping_gfp_mask(mapping) & ~__GFP_FS);
- if (!page)
- return -EINVAL;
-
- blocksize = inode->i_sb->s_blocksize;
- length = blocksize - (offset & (blocksize - 1));
- iblock = index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
-
- if (!page_has_buffers(page))
- create_empty_buffers(page, blocksize, 0);
-
- /* Find the buffer that contains "offset" */
- bh = page_buffers(page);
- pos = blocksize;
- while (offset >= pos) {
- bh = bh->b_this_page;
- iblock++;
- pos += blocksize;
- }
-
- err = 0;
- if (buffer_freed(bh)) {
- BUFFER_TRACE(bh, "freed: skip");
- goto unlock;
- }
-
- if (!buffer_mapped(bh)) {
- BUFFER_TRACE(bh, "unmapped");
- ext4_get_block(inode, iblock, bh, 0);
- /* unmapped? It's a hole - nothing to do */
- if (!buffer_mapped(bh)) {
- BUFFER_TRACE(bh, "still unmapped");
- goto unlock;
- }
- }
-
- /* Ok, it's mapped. Make sure it's up-to-date */
- if (PageUptodate(page))
- set_buffer_uptodate(bh);
-
- if (!buffer_uptodate(bh)) {
- err = -EIO;
- ll_rw_block(READ, 1, &bh);
- wait_on_buffer(bh);
- /* Uhhuh. Read error. Complain and punt. */
- if (!buffer_uptodate(bh))
- goto unlock;
- }
-
- if (ext4_should_journal_data(inode)) {
- BUFFER_TRACE(bh, "get write access");
- err = ext4_journal_get_write_access(handle, bh);
- if (err)
- goto unlock;
- }
-
- zero_user(page, offset, length);
-
- BUFFER_TRACE(bh, "zeroed end of block");
-
- err = 0;
- if (ext4_should_journal_data(inode)) {
- err = ext4_handle_dirty_metadata(handle, inode, bh);
- } else {
- if (ext4_should_order_data(inode))
- err = ext4_jbd2_file_inode(handle, inode);
- mark_buffer_dirty(bh);
- }
-
-unlock:
- unlock_page(page);
- page_cache_release(page);
- return err;
-}
-
-/*
* Probably it should be a library function... search for first non-zero word
* or memcmp with zero_page, whatever is better for particular architecture.
* Linus?
@@ -3932,7 +3888,6 @@ void ext4_truncate(struct inode *inode)
struct ext4_inode_info *ei = EXT4_I(inode);
__le32 *i_data = ei->i_data;
int addr_per_block = EXT4_ADDR_PER_BLOCK(inode->i_sb);
- struct address_space *mapping = inode->i_mapping;
ext4_lblk_t offsets[4];
Indirect chain[4];
Indirect *partial;
@@ -3960,10 +3915,6 @@ void ext4_truncate(struct inode *inode)
last_block = (inode->i_size + blocksize-1)
>> EXT4_BLOCK_SIZE_BITS(inode->i_sb);

- if (inode->i_size & (blocksize - 1))
- if (ext4_block_truncate_page(handle, mapping, inode->i_size))
- goto out_stop;
-
n = ext4_block_to_path(inode, last_block, offsets, NULL);
if (n == 0)
goto out_stop; /* error */
@@ -4744,16 +4695,30 @@ static int ext4_setsize(struct inode *inode, loff_t newsize)
goto err_out;
}
}
- } else if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) {
- struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ } else {
+ if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) {
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);

- if (newsize > sbi->s_bitmap_maxbytes) {
- error = -EFBIG;
- goto out;
+ if (newsize > sbi->s_bitmap_maxbytes) {
+ error = -EFBIG;
+ goto out;
+ }
}
+ handle = ext4_journal_start(inode, 1);
+ if (IS_ERR(handle)) {
+ error = PTR_ERR(handle);
+ goto err_out;
+ }
+ error = ext4_prepare_hole(handle, inode, oldsize, newsize,
+ AOP_FLAG_NOFS);
+ ext4_journal_stop(handle);
+ if (error)
+ goto err_out;
}

i_size_write(inode, newsize);
+ if (newsize > oldsize)
+ block_finish_hole(inode, oldsize, newsize);
truncate_pagecache(inode, oldsize, newsize);
ext4_truncate(inode);

@@ -5301,10 +5266,10 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
}
/*
* OK, we need to fill the hole... Do write_begin write_end
- * to do block allocation/reservation.We are not holding
- * inode.i__mutex here. That allow * parallel write_begin,
- * write_end call. lock_page prevent this from happening
- * on the same page though
+ * to do block allocation/reservation. We are not holding
+ * inode->i_mutex here. That allows parallel write_begin,
+ * write_end calls. lock_page prevent this from happening
+ * on the same page though.
*/
ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
--
1.6.0.2
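
To make the hole-zeroing arithmetic in ext4_prepare_hole() easier to
follow, here is a worked example; the numbers are made up, and
block_prepare_hole_bh()/block_finish_hole() are the VFS helpers this
series introduces in patch 4 (not shown here).

/*
 * Illustration only: blocksize = 1024, PAGE_CACHE_SIZE = 4096, old
 * i_size = 1500, and a write begins at pos = 5000, so the function
 * runs with from = 1500 and to = 5000.
 *
 *      hole_start = from & (PAGE_CACHE_SIZE - 1)       = 1500
 *      ALIGN(from, bsize)                              = 2048  (< to)
 *      hole_len   = bsize - (hole_start & (bsize - 1))
 *                 = 1024 - 476                         =  548
 *
 * zero_user() thus clears page bytes 1500..2047, the tail of the
 * block straddling the old i_size, so that later writeback of this
 * block cannot expose stale data inside the hole. Had the write
 * started inside the same block (to <= 2048), only the to - from
 * bytes of the hole itself would have been zeroed. Once write_end
 * has extended i_size, the matching block_finish_hole() call
 * completes the other half of the scheme.
 */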


2009-09-22 14:36:04

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 5/7] ext4: Convert filesystem to the new truncate calling convention

On Thu, Sep 17, 2009 at 05:21:45PM +0200, Jan Kara wrote:
> CC: [email protected]
> CC: [email protected]
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext4/file.c | 2 +-
> fs/ext4/inode.c | 166 ++++++++++++++++++++++++++++++++----------------------
> 2 files changed, 99 insertions(+), 69 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 3f1873f..22f49d7 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -198,7 +198,7 @@ const struct file_operations ext4_file_operations = {
> };
>
> const struct inode_operations ext4_file_inode_operations = {
> - .truncate = ext4_truncate,
> + .new_truncate = 1,
> .setattr = ext4_setattr,
> .getattr = ext4_getattr,
> #ifdef CONFIG_EXT4_FS_XATTR
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 58492ab..be25874 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3354,6 +3354,7 @@ static int ext4_journalled_set_page_dirty(struct page *page)
> }
>
> static const struct address_space_operations ext4_ordered_aops = {
> + .new_writepage = 1,

No. We already have one half-finished series here; mixing it with another
one is not going to happen. Such flags are tolerable only as bisectability
helpers. They *must* disappear by the end of series. Before it can be
submitted for merge.

In effect, you are mixing truncate switchover with your writepage one.
Please, split and reorder.


2009-09-22 17:16:03

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 5/7] ext4: Convert filesystem to the new truncate calling convention

On Tue 22-09-09 15:36:04, Al Viro wrote:
> On Thu, Sep 17, 2009 at 05:21:45PM +0200, Jan Kara wrote:
> > CC: [email protected]
> > CC: [email protected]
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/ext4/file.c | 2 +-
> > fs/ext4/inode.c | 166 ++++++++++++++++++++++++++++++++----------------------
> > 2 files changed, 99 insertions(+), 69 deletions(-)
> >
> > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > index 3f1873f..22f49d7 100644
> > --- a/fs/ext4/file.c
> > +++ b/fs/ext4/file.c
> > @@ -198,7 +198,7 @@ const struct file_operations ext4_file_operations = {
> > };
> >
> > const struct inode_operations ext4_file_inode_operations = {
> > - .truncate = ext4_truncate,
> > + .new_truncate = 1,
> > .setattr = ext4_setattr,
> > .getattr = ext4_getattr,
> > #ifdef CONFIG_EXT4_FS_XATTR
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 58492ab..be25874 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3354,6 +3354,7 @@ static int ext4_journalled_set_page_dirty(struct page *page)
> > }
> >
> > static const struct address_space_operations ext4_ordered_aops = {
> > + .new_writepage = 1,
>
> No. We already have one half-finished series here; mixing it with another
> one is not going to happen. Such flags are tolerable only as bisectability
> helpers. They *must* disappear by the end of series. Before it can be
> submitted for merge.
>
> In effect, you are mixing truncate switchover with your writepage one.
> Please, split and reorder.
Well, this wasn't meant as a final version of those patches. It was
meant as a request for comments on whether it makes sense to fix the
problem the way I propose. If we agree on that, I'll go and convert the
rest of the filesystems so that we can remove the .new_writepage hack.
By that time I hope the new truncate sequence patches will be merged, so
that dependency should go away as well...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-22 17:23:44

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 5/7] ext4: Convert filesystem to the new truncate calling convention

On Tue, Sep 22, 2009 at 07:16:04PM +0200, Jan Kara wrote:
> > No. We already have one half-finished series here; mixing it with another
> > one is not going to happen. Such flags are tolerable only as bisectability
> > helpers. They *must* disappear by the end of series. Before it can be
> > submitted for merge.
> >
> > In effect, you are mixing truncate switchover with your writepage one.
> > Please, split and reorder.

> Well, this wasn't meant as a final version of those patches. It was
> meant as a request for comments on whether it makes sense to fix the
> problem the way I propose. If we agree on that, I'll go and convert the
> rest of the filesystems so that we can remove the .new_writepage hack.
> By that time I hope the new truncate sequence patches will be merged, so
> that dependency should go away as well...

Could you carve just the ext4 part of the truncate series out of that and
post it separately?

2009-09-22 17:37:04

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 5/7] ext4: Convert filesystem to the new truncate calling convention

On Tue 22-09-09 18:23:47, Al Viro wrote:
> On Tue, Sep 22, 2009 at 07:16:04PM +0200, Jan Kara wrote:
> > > No. We already have one half-finished series here; mixing it with another
> > > one is not going to happen. Such flags are tolerable only as bisectability
> > > helpers. They *must* disappear by the end of series. Before it can be
> > > submitted for merge.
> > >
> > > In effect, you are mixing truncate switchover with your writepage one.
> > > Please, split and reorder.
>
> > Well, this wasn't meant as a final version of those patches. It was
> > meant as a request for comments on whether it makes sense to fix the
> > problem the way I propose. If we agree on that, I'll go and convert the
> > rest of the filesystems so that we can remove the .new_writepage hack.
> > By that time I hope the new truncate sequence patches will be merged, so
> > that dependency should go away as well...
>
> Could you carve just the ext4 part of the truncate series out of that and
> post it separately?
Definitely. Actually, I've already sent that patch to Nick privately, but
you're right, I should have posted it to the list as well. Will do it in a
moment. Thanks for the reminder.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
