2009-05-27 13:01:08

by Jan Kara

Subject: [PATCH 0/11] Fix page_mkwrite() for blocksize < pagesize


The patches below are an attempt to solve the problems filesystems have with
page_mkwrite() when blocksize < pagesize (see the changelog of the third
patch for details).

The series is against 2.6.30-rc7. The first two patches are just small
cleanups and should be merged separately (Ted should have the ext4 cleanup
rebased on top of the current ext4 tree). For ext3 the fix is done in two
phases: in the first we make it correctly allocate space at page-fault time
from page_mkwrite(). This has the disadvantage that under random mmaped
writes the file gets much more fragmented and the performance of e.g.
Berkeley DB drops by ~20%. Therefore in the second phase I've implemented
delayed allocation for ext3: blocks are just reserved during page_mkwrite
time and really allocated only during writepage. This gets the performance
back to the original numbers for me.

The patches should be fairly complete and have sustained quite some
testing. OTOH the area is kind of complex, so please review them so that
they can get merged.
Thanks.

Honza


2009-05-27 13:01:09

by Jan Kara

Subject: [PATCH 08/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped

When we do delayed allocation of some buffer, we want to signal to the VFS
that the buffer is new (set buffer_new) so that it properly zeros out
everything. But we don't have the buffer mapped yet, so we cannot really
unmap underlying metadata in this state. Make the VFS avoid unmapping
metadata when the buffer is not yet mapped.
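
For illustration, here is a minimal sketch of the kind of get_block()
callback this change is meant to support (modeled on the ext3 delayed
allocation callback later in this series; reserve_blocks() is a
hypothetical stand-in for the filesystem's reservation logic):

static int sketch_da_get_block(struct inode *inode, sector_t iblock,
			       struct buffer_head *bh, int create)
{
	if (!create || buffer_delay(bh))
		return 0;
	if (reserve_blocks(inode, 1))	/* hypothetical helper */
		return -ENOSPC;
	set_buffer_delay(bh);
	/*
	 * Tell the VFS the buffer is new so the page gets zeroed out,
	 * but leave it unmapped: bh->b_blocknr is not valid yet, so
	 * unmap_underlying_metadata() must be skipped for it.
	 */
	set_buffer_new(bh);
	return 0;
}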

Signed-off-by: Jan Kara <[email protected]>
---
fs/buffer.c | 12 +++++++-----
1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index ddfcade..3d0bced 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1683,8 +1683,9 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
if (buffer_new(bh)) {
/* blockdev mappings never come here */
clear_buffer_new(bh);
- unmap_underlying_metadata(bh->b_bdev,
- bh->b_blocknr);
+ if (buffer_mapped(bh))
+ unmap_underlying_metadata(bh->b_bdev,
+ bh->b_blocknr);
}
}
bh = bh->b_this_page;
@@ -1869,8 +1870,9 @@ static int __block_prepare_write(struct inode *inode, struct page *page,
if (err)
break;
if (buffer_new(bh)) {
- unmap_underlying_metadata(bh->b_bdev,
- bh->b_blocknr);
+ if (buffer_mapped(bh))
+ unmap_underlying_metadata(bh->b_bdev,
+ bh->b_blocknr);
if (PageUptodate(page)) {
clear_buffer_new(bh);
set_buffer_uptodate(bh);
@@ -2683,7 +2685,7 @@ int nobh_write_begin(struct file *file, struct address_space *mapping,
goto failed;
if (!buffer_mapped(bh))
is_mapped_to_disk = 0;
- if (buffer_new(bh))
+ if (buffer_new(bh) && buffer_mapped(bh))
unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
if (PageUptodate(page)) {
set_buffer_uptodate(bh);
--
1.6.0.2


2009-05-27 13:01:09

by Jan Kara

Subject: [PATCH 07/11] vfs: Implement generic per-cpu counters for delayed allocation

Implement free-block and reserved-block counters for delayed
allocation. These counters are reliable in the sense that when
they return success, the subsequent conversion from reserved to
allocated blocks always succeeds (see the comments in the code for
details). This is useful for ext?-based filesystems to implement
delayed allocation, in particular for allocation in page_mkwrite.
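
A rough sketch of the intended calling sequence (sbi, nblocks and limit
are placeholders here; ext3 wires this up for real in patch 11):

	struct delalloc_counter *c = &sbi->s_alloc_counter;

	/* at page-fault time: reserve space, may fail with -ENOSPC */
	if (dac_reserve(c, nblocks, limit) < 0)
		return -ENOSPC;

	/* at writepage time: the conversion is guaranteed to succeed */
	dac_alloc_reserved(c, nblocks);

	/* or, if the page got truncated before it was written out */
	dac_cancel_reserved(c, nblocks);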

Signed-off-by: Jan Kara <[email protected]>
---
fs/Kconfig | 4 ++
fs/Makefile | 1 +
fs/delalloc_counter.c | 109 ++++++++++++++++++++++++++++++++++++++
fs/ext3/Kconfig | 1 +
include/linux/delalloc_counter.h | 63 ++++++++++++++++++++++
5 files changed, 178 insertions(+), 0 deletions(-)
create mode 100644 fs/delalloc_counter.c
create mode 100644 include/linux/delalloc_counter.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 9f7270f..923251b 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -19,6 +19,10 @@ config FS_XIP
source "fs/jbd/Kconfig"
source "fs/jbd2/Kconfig"

+config DELALLOC_COUNTER
+ bool
+ default y if EXT3_FS=y
+
config FS_MBCACHE
# Meta block cache for Extended Attributes (ext2/ext3/ext4)
tristate
diff --git a/fs/Makefile b/fs/Makefile
index af6d047..b5614fc 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -19,6 +19,7 @@ else
obj-y += no-block.o
endif

+obj-$(CONFIG_DELALLOC_COUNTER) += delalloc_counter.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o
obj-y += notify/
obj-$(CONFIG_EPOLL) += eventpoll.o
diff --git a/fs/delalloc_counter.c b/fs/delalloc_counter.c
new file mode 100644
index 0000000..0f575d5
--- /dev/null
+++ b/fs/delalloc_counter.c
@@ -0,0 +1,109 @@
+/*
+ * Per-cpu counters for delayed allocation
+ */
+#include <linux/percpu_counter.h>
+#include <linux/delalloc_counter.h>
+#include <linux/module.h>
+#include <linux/log2.h>
+
+static long dac_error(struct delalloc_counter *c)
+{
+#ifdef CONFIG_SMP
+ return c->batch * nr_cpu_ids;
+#else
+ return 0;
+#endif
+}
+
+/*
+ * Reserve blocks for delayed allocation
+ *
+ * This code is subtle because we want to avoid synchronization of processes
+ * doing allocation in the common case when there's plenty of space in the
+ * filesystem.
+ *
+ * The code maintains the following property: Among all the calls to
+ * dac_reserve() that return 0 there exists a simple sequential ordering of
+ * these calls such that the check (free - reserved >= limit) in each call
+ * succeeds. This guarantees that we never reserve blocks we don't have.
+ *
+ * The proof of the above invariant: The function can return 0 either when the
+ * first if succeeds or when both ifs fail. To the first type of callers we
+ * assign the time of read of c->reserved in the first if, to the second type
+ * of callers we assign the time of read of c->reserved in the second if. We
+ * order callers by their assigned time and claim that this is the ordering
+ * required by the invariant. Suppose that a check (free - reserved >= limit)
+ * fails for caller C in the proposed ordering. We distinguish two cases:
+ * 1) function called by C returned zero because the first if succeeded - in
+ * this case reads of counters in the first if must have seen effects of
+ * __percpu_counter_add of all the callers before C (even if their condition
+ * evaluation happened before ours). The errors accumulated in cpu-local
+ * variables are clearly < dac_error(c) and thus the condition should fail.
+ * Contradiction.
+ * 2) function called by C returned zero because the second if failed - again
+ * the read of the counters must have seen effects of __percpu_counter_add of
+ * all the callers before C and thus the condition should have succeeded.
+ * Contradiction.
+ */
+int dac_reserve(struct delalloc_counter *c, s32 amount, s64 limit)
+{
+ s64 free, reserved;
+ int ret = 0;
+
+ __percpu_counter_add(&c->reserved, amount, c->batch);
+ /*
+ * This barrier makes sure that when effects of the following read of
+ * c->reserved are observable by another CPU also effects of the
+ * previous store to c->reserved are seen.
+ */
+ smp_mb();
+ if (percpu_counter_read(&c->free) - percpu_counter_read(&c->reserved)
+ - 2 * dac_error(c) >= limit)
+ return ret;
+ /*
+ * Near the limit - sum the counter to avoid returning ENOSPC too
+ * early. Note that we can still "unnecessarily" return ENOSPC when
+ * there are several racing writers. Spinlock in this section would
+ * solve it but let's ignore it for now.
+ */
+ free = percpu_counter_sum_positive(&c->free);
+ reserved = percpu_counter_sum_positive(&c->reserved);
+ if (free - reserved < limit) {
+ __percpu_counter_add(&c->reserved, -amount, c->batch);
+ ret = -ENOSPC;
+ }
+ return ret;
+}
+EXPORT_SYMBOL(dac_reserve);
+
+/* Account reserved blocks as allocated */
+void dac_alloc_reserved(struct delalloc_counter *c, s32 amount)
+{
+ __percpu_counter_add(&c->free, -amount, c->batch);
+ /*
+ * Make sure update of free counter is seen before update of
+ * reserved counter.
+ */
+ smp_wmb();
+ __percpu_counter_add(&c->reserved, -amount, c->batch);
+}
+EXPORT_SYMBOL(dac_alloc_reserved);
+
+int dac_init(struct delalloc_counter *c, s64 amount)
+{
+ int err;
+
+ c->batch = 8*(1+ilog2(nr_cpu_ids));
+ err = percpu_counter_init(&c->free, amount);
+ if (!err)
+ err = percpu_counter_init(&c->reserved, 0);
+ return err;
+}
+EXPORT_SYMBOL(dac_init);
+
+void dac_destroy(struct delalloc_counter *c)
+{
+ percpu_counter_destroy(&c->free);
+ percpu_counter_destroy(&c->reserved);
+}
+EXPORT_SYMBOL(dac_destroy);
diff --git a/fs/ext3/Kconfig b/fs/ext3/Kconfig
index fb3c1a2..f4e122f 100644
--- a/fs/ext3/Kconfig
+++ b/fs/ext3/Kconfig
@@ -1,6 +1,7 @@
config EXT3_FS
tristate "Ext3 journalling file system support"
select JBD
+ select DELALLOC_COUNTER
help
This is the journalling version of the Second extended file system
(often called ext3), the de facto standard Linux file system
diff --git a/include/linux/delalloc_counter.h b/include/linux/delalloc_counter.h
new file mode 100644
index 0000000..9d00b6c
--- /dev/null
+++ b/include/linux/delalloc_counter.h
@@ -0,0 +1,63 @@
+#ifndef _LINUX_DELALLOC_COUNTER_H
+#define _LINUX_DELALLOC_COUNTER_H
+
+#include <linux/percpu_counter.h>
+
+struct delalloc_counter {
+ struct percpu_counter free;
+ struct percpu_counter reserved;
+ int batch;
+};
+
+int dac_reserve(struct delalloc_counter *c, s32 amount, s64 limit);
+void dac_alloc_reserved(struct delalloc_counter *c, s32 amount);
+
+static inline int dac_alloc(struct delalloc_counter *c, s32 amount, s64 limit)
+{
+ int ret = dac_reserve(c, amount, limit);
+ if (!ret)
+ dac_alloc_reserved(c, amount);
+ return ret;
+}
+
+static inline void dac_free(struct delalloc_counter *c, s32 amount)
+{
+ __percpu_counter_add(&c->free, amount, c->batch);
+}
+
+static inline void dac_cancel_reserved(struct delalloc_counter *c, s32 amount)
+{
+ __percpu_counter_add(&c->reserved, -amount, c->batch);
+}
+
+int dac_init(struct delalloc_counter *c, s64 amount);
+void dac_destroy(struct delalloc_counter *c);
+
+static inline s64 dac_get_avail(struct delalloc_counter *c)
+{
+ s64 ret = percpu_counter_read(&c->free) -
+ percpu_counter_read(&c->reserved);
+ if (ret < 0)
+ return 0;
+ return ret;
+}
+
+static inline s64 dac_get_avail_sum(struct delalloc_counter *c)
+{
+ s64 ret = percpu_counter_sum(&c->free) -
+ percpu_counter_sum(&c->reserved);
+ if (ret < 0)
+ return 0;
+ return ret;
+}
+
+static inline s64 dac_get_reserved(struct delalloc_counter *c)
+{
+ return percpu_counter_read_positive(&c->reserved);
+}
+
+static inline s64 dac_get_reserved_sum(struct delalloc_counter *c)
+{
+ return percpu_counter_sum_positive(&c->reserved);
+}
+#endif
--
1.6.0.2


2009-05-27 13:01:09

by Jan Kara

Subject: [PATCH 05/11] ext4: Make sure blocks are properly allocated under mmaped page even when blocksize < pagesize

In a situation like:
ftruncate(fd, 1024);
a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
a[0] = 'a';
ftruncate(fd, 4096);

we end up with a dirty page covering the whole 4KB, but with blocks
allocated / reserved only for the first 1KB: the second truncate extends
i_size without write-protecting the already-writable page, so page_mkwrite()
is never called again for the rest. Fix the problem by using the new VFS
infrastructure.
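
The hole-extension helpers themselves are introduced in patch 3 (not shown
in this excerpt); judging from their use below, the calling pattern is
roughly:

	/* extending write / direct IO / fallocate: */
	block_lock_hole_extend(inode, pos);
	/* ... write_begin/write_end, i_size updated via ->extend_i_size ... */
	block_unlock_hole_extend(inode);

	/* page fault: */
	block_wait_on_hole_extend(inode, page_offset(page));
	/*
	 * From here on, a truncate creating a hole under the page
	 * write-protects it, so page_mkwrite() gets called again to
	 * fill the hole.
	 */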

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/extents.c | 2 +-
fs/ext4/inode.c | 20 ++++++++++++++++++--
2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 6b0eeaf..ef587f7 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3082,7 +3082,7 @@ static void ext4_falloc_update_inode(struct inode *inode,
*/
if (!(mode & FALLOC_FL_KEEP_SIZE)) {
if (new_size > i_size_read(inode))
- i_size_write(inode, new_size);
+ block_extend_i_size(inode, new_size, 0);
if (new_size > EXT4_I(inode)->i_disksize)
ext4_update_i_disksize(inode, new_size);
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7fcceb0..6547788 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1436,7 +1436,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
index = pos >> PAGE_CACHE_SHIFT;
from = pos & (PAGE_CACHE_SIZE - 1);
to = from + len;
-
+ block_lock_hole_extend(inode, pos);
retry:
handle = ext4_journal_start(inode, needed_blocks);
if (IS_ERR(handle)) {
@@ -1480,6 +1480,8 @@ retry:
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;
out:
+ if (ret)
+ block_unlock_hole_extend(inode);
return ret;
}

@@ -1622,6 +1624,7 @@ static int ext4_journalled_write_end(struct file *file,
if (!ret)
ret = ret2;
page_cache_release(page);
+ block_unlock_hole_extend(inode);

return ret ? ret : copied;
}
@@ -2733,6 +2736,7 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
"dev %s ino %lu pos %llu len %u flags %u",
inode->i_sb->s_id, inode->i_ino,
(unsigned long long) pos, len, flags);
+ block_lock_hole_extend(inode, pos);
retry:
/*
* With delayed allocation, we don't log the i_disksize update
@@ -2775,6 +2779,8 @@ retry:
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;
out:
+ if (ret)
+ block_unlock_hole_extend(inode);
return ret;
}

@@ -3323,7 +3329,7 @@ static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
loff_t end = offset + ret;
if (end > inode->i_size) {
ei->i_disksize = end;
- i_size_write(inode, end);
+ block_extend_i_size(inode, offset, ret);
/*
* We're going to return a positive `ret'
* here due to non-zero-length I/O, so there's
@@ -3368,6 +3374,7 @@ static const struct address_space_operations ext4_ordered_aops = {
.sync_page = block_sync_page,
.write_begin = ext4_write_begin,
.write_end = ext4_ordered_write_end,
+ .extend_i_size = block_extend_i_size,
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
@@ -3383,6 +3390,7 @@ static const struct address_space_operations ext4_writeback_aops = {
.sync_page = block_sync_page,
.write_begin = ext4_write_begin,
.write_end = ext4_writeback_write_end,
+ .extend_i_size = block_extend_i_size,
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
@@ -3398,6 +3406,7 @@ static const struct address_space_operations ext4_journalled_aops = {
.sync_page = block_sync_page,
.write_begin = ext4_write_begin,
.write_end = ext4_journalled_write_end,
+ .extend_i_size = block_extend_i_size,
.set_page_dirty = ext4_journalled_set_page_dirty,
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
@@ -3413,6 +3422,7 @@ static const struct address_space_operations ext4_da_aops = {
.sync_page = block_sync_page,
.write_begin = ext4_da_write_begin,
.write_end = ext4_da_write_end,
+ .extend_i_size = block_extend_i_size,
.bmap = ext4_bmap,
.invalidatepage = ext4_da_invalidatepage,
.releasepage = ext4_releasepage,
@@ -5277,6 +5287,12 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
struct address_space *mapping = inode->i_mapping;

/*
+ * Wait for the i_size extension to finish. After this moment, a truncate
+ * or write can still create holes under us, but they write-protect our
+ * page so we'll be called again to fill the hole.
+ */
+ block_wait_on_hole_extend(inode, page_offset(page));
+ /*
* Get i_alloc_sem to stop truncates messing with the inode. We cannot
* get i_mutex because we are already holding mmap_sem.
*/
--
1.6.0.2


2009-05-27 13:01:08

by Jan Kara

Subject: [PATCH 01/11] ext3: Get rid of extend_disksize parameter of ext3_get_blocks_handle()

Get rid of the extend_disksize parameter of ext3_get_blocks_handle(). This
seems to be a relic from some old days and setting disksize in this function
does not make much sense. It was set only by ext3_getblk(). Since the
parameter has an effect only if create == 1, it is easy to check that the
three callers which end up calling ext3_getblk() with create == 1
(ext3_append, ext3_quota_write, ext3_mkdir) do the right thing and set
disksize themselves.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext3/dir.c | 3 +--
fs/ext3/inode.c | 13 +++----------
include/linux/ext3_fs.h | 2 +-
3 files changed, 5 insertions(+), 13 deletions(-)

diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index 3d724a9..373fa90 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -130,8 +130,7 @@ static int ext3_readdir(struct file * filp,
struct buffer_head *bh = NULL;

map_bh.b_state = 0;
- err = ext3_get_blocks_handle(NULL, inode, blk, 1,
- &map_bh, 0, 0);
+ err = ext3_get_blocks_handle(NULL, inode, blk, 1, &map_bh, 0);
if (err > 0) {
pgoff_t index = map_bh.b_blocknr >>
(PAGE_CACHE_SHIFT - inode->i_blkbits);
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index fcfa243..60f0feb 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -788,7 +788,7 @@ err_out:
int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
sector_t iblock, unsigned long maxblocks,
struct buffer_head *bh_result,
- int create, int extend_disksize)
+ int create)
{
int err = -EIO;
int offsets[4];
@@ -911,13 +911,6 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
if (!err)
err = ext3_splice_branch(handle, inode, iblock,
partial, indirect_blks, count);
- /*
- * i_disksize growing is protected by truncate_mutex. Don't forget to
- * protect it if you're about to implement concurrent
- * ext3_get_block() -bzzz
- */
- if (!err && extend_disksize && inode->i_size > ei->i_disksize)
- ei->i_disksize = inode->i_size;
mutex_unlock(&ei->truncate_mutex);
if (err)
goto cleanup;
@@ -972,7 +965,7 @@ static int ext3_get_block(struct inode *inode, sector_t iblock,
}

ret = ext3_get_blocks_handle(handle, inode, iblock,
- max_blocks, bh_result, create, 0);
+ max_blocks, bh_result, create);
if (ret > 0) {
bh_result->b_size = (ret << inode->i_blkbits);
ret = 0;
@@ -1005,7 +998,7 @@ struct buffer_head *ext3_getblk(handle_t *handle, struct inode *inode,
dummy.b_blocknr = -1000;
buffer_trace_init(&dummy.b_history);
err = ext3_get_blocks_handle(handle, inode, block, 1,
- &dummy, create, 1);
+ &dummy, create);
/*
* ext3_get_blocks_handle() returns number of blocks
* mapped. 0 in case of a HOLE.
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 634a5e5..7499b36 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -874,7 +874,7 @@ struct buffer_head * ext3_getblk (handle_t *, struct inode *, long, int, int *);
struct buffer_head * ext3_bread (handle_t *, struct inode *, int, int, int *);
int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
sector_t iblock, unsigned long maxblocks, struct buffer_head *bh_result,
- int create, int extend_disksize);
+ int create);

extern struct inode *ext3_iget(struct super_block *, unsigned long);
extern int ext3_write_inode (struct inode *, int);
--
1.6.0.2


2009-05-27 13:01:11

by Jan Kara

Subject: [PATCH 04/11] ext2: Allocate space for mmaped file on page fault

So far we've allocated space at ->writepage() time. This has the disadvantage
that when we hit ENOSPC or another error, we cannot do much - either throw
away the data or keep the page indefinitely (and lose the data on reboot).
So allocate space already when a page is faulted in.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext2/file.c | 26 +++++++++++++++++++++++++-
fs/ext2/inode.c | 1 +
2 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 45ed071..74b2c3d 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -19,6 +19,8 @@
*/

#include <linux/time.h>
+#include <linux/mm.h>
+#include <linux/buffer_head.h>
#include "ext2.h"
#include "xattr.h"
#include "acl.h"
@@ -38,6 +40,28 @@ static int ext2_release_file (struct inode * inode, struct file * filp)
return 0;
}

+static int ext2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return block_page_mkwrite(vma, vmf, ext2_get_block);
+}
+
+static struct vm_operations_struct ext2_file_vm_ops = {
+ .fault = filemap_fault,
+ .page_mkwrite = ext2_page_mkwrite,
+};
+
+static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct address_space *mapping = file->f_mapping;
+
+ if (!mapping->a_ops->readpage)
+ return -ENOEXEC;
+ file_accessed(file);
+ vma->vm_ops = &ext2_file_vm_ops;
+ vma->vm_flags |= VM_CAN_NONLINEAR;
+ return 0;
+}
+
/*
* We have mostly NULL's here: the current defaults are ok for
* the ext2 filesystem.
@@ -52,7 +76,7 @@ const struct file_operations ext2_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
#endif
- .mmap = generic_file_mmap,
+ .mmap = ext2_file_mmap,
.open = generic_file_open,
.release = ext2_release_file,
.fsync = ext2_sync_file,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index acf6788..8217219 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -816,6 +816,7 @@ const struct address_space_operations ext2_aops = {
.sync_page = block_sync_page,
.write_begin = ext2_write_begin,
.write_end = generic_write_end,
+ .extend_i_size = block_extend_i_size,
.bmap = ext2_bmap,
.direct_IO = ext2_direct_IO,
.writepages = ext2_writepages,
--
1.6.0.2


2009-05-27 13:01:11

by Jan Kara

Subject: [PATCH 06/11] ext3: Allocate space for mmaped file on page fault

So far we've allocated space at ->writepage() time. This has the disadvantage
that when we hit ENOSPC or another error, we cannot do much - either throw
away the data or keep the page indefinitely (and lose the data on reboot).
So allocate space already when a page is faulted in.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext3/file.c | 19 ++++-
fs/ext3/inode.c | 220 +++++++++++++++++++++--------------------------
include/linux/ext3_fs.h | 1 +
3 files changed, 116 insertions(+), 124 deletions(-)

diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 5b49704..a7dce9d 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -110,6 +110,23 @@ force_commit:
return ret;
}

+static struct vm_operations_struct ext3_file_vm_ops = {
+ .fault = filemap_fault,
+ .page_mkwrite = ext3_page_mkwrite,
+};
+
+static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct address_space *mapping = file->f_mapping;
+
+ if (!mapping->a_ops->readpage)
+ return -ENOEXEC;
+ file_accessed(file);
+ vma->vm_ops = &ext3_file_vm_ops;
+ vma->vm_flags |= VM_CAN_NONLINEAR;
+ return 0;
+}
+
const struct file_operations ext3_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
@@ -120,7 +137,7 @@ const struct file_operations ext3_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = ext3_compat_ioctl,
#endif
- .mmap = generic_file_mmap,
+ .mmap = ext3_file_mmap,
.open = generic_file_open,
.release = ext3_release_file,
.fsync = ext3_sync_file,
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 60f0feb..cc012fe 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1156,10 +1156,13 @@ static int ext3_write_begin(struct file *file, struct address_space *mapping,
from = pos & (PAGE_CACHE_SIZE - 1);
to = from + len;

+ block_lock_hole_extend(inode, pos);
retry:
page = grab_cache_page_write_begin(mapping, index, flags);
- if (!page)
- return -ENOMEM;
+ if (!page) {
+ ret = -ENOMEM;
+ goto out;
+ }
*pagep = page;

handle = ext3_journal_start(inode, needed_blocks);
@@ -1199,6 +1202,8 @@ write_begin_failed:
if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
goto retry;
out:
+ if (ret)
+ block_unlock_hole_extend(inode);
return ret;
}

@@ -1290,6 +1295,7 @@ static int ext3_ordered_write_end(struct file *file,

if (pos + len > inode->i_size)
vmtruncate(inode, inode->i_size);
+ block_unlock_hole_extend(inode);
return ret ? ret : copied;
}

@@ -1316,6 +1322,7 @@ static int ext3_writeback_write_end(struct file *file,

if (pos + len > inode->i_size)
vmtruncate(inode, inode->i_size);
+ block_unlock_hole_extend(inode);
return ret ? ret : copied;
}

@@ -1369,6 +1376,7 @@ static int ext3_journalled_write_end(struct file *file,

if (pos + len > inode->i_size)
vmtruncate(inode, inode->i_size);
+ block_unlock_hole_extend(inode);
return ret ? ret : copied;
}

@@ -1424,18 +1432,6 @@ static sector_t ext3_bmap(struct address_space *mapping, sector_t block)
return generic_block_bmap(mapping,block,ext3_get_block);
}

-static int bget_one(handle_t *handle, struct buffer_head *bh)
-{
- get_bh(bh);
- return 0;
-}
-
-static int bput_one(handle_t *handle, struct buffer_head *bh)
-{
- put_bh(bh);
- return 0;
-}
-
static int buffer_unmapped(handle_t *handle, struct buffer_head *bh)
{
return !buffer_mapped(bh);
@@ -1487,125 +1483,25 @@ static int buffer_unmapped(handle_t *handle, struct buffer_head *bh)
* We'll probably need that anyway for journalling writepage() output.
*
* We don't honour synchronous mounts for writepage(). That would be
- * disastrous. Any write() or metadata operation will sync the fs for
+ * disastrous. Any write() or metadata operation will sync the fs for
* us.
*
- * AKPM2: if all the page's buffers are mapped to disk and !data=journal,
- * we don't need to open a transaction here.
+ * Note, even though we try, we *may* end up allocating blocks here because
+ * page_mkwrite() has not allocated blocks yet but dirty buffers were created
+ * under the whole page, not just the part inside old i_size. We could just
+ * ignore writing such buffers but it would be harder to avoid it than just
+ * do it...
*/
-static int ext3_ordered_writepage(struct page *page,
- struct writeback_control *wbc)
-{
- struct inode *inode = page->mapping->host;
- struct buffer_head *page_bufs;
- handle_t *handle = NULL;
- int ret = 0;
- int err;
-
- J_ASSERT(PageLocked(page));
-
- /*
- * We give up here if we're reentered, because it might be for a
- * different filesystem.
- */
- if (ext3_journal_current_handle())
- goto out_fail;
-
- if (!page_has_buffers(page)) {
- create_empty_buffers(page, inode->i_sb->s_blocksize,
- (1 << BH_Dirty)|(1 << BH_Uptodate));
- page_bufs = page_buffers(page);
- } else {
- page_bufs = page_buffers(page);
- if (!walk_page_buffers(NULL, page_bufs, 0, PAGE_CACHE_SIZE,
- NULL, buffer_unmapped)) {
- /* Provide NULL get_block() to catch bugs if buffers
- * weren't really mapped */
- return block_write_full_page(page, NULL, wbc);
- }
- }
- handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
-
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out_fail;
- }
-
- walk_page_buffers(handle, page_bufs, 0,
- PAGE_CACHE_SIZE, NULL, bget_one);
-
- ret = block_write_full_page(page, ext3_get_block, wbc);
-
- /*
- * The page can become unlocked at any point now, and
- * truncate can then come in and change things. So we
- * can't touch *page from now on. But *page_bufs is
- * safe due to elevated refcount.
- */
-
- /*
- * And attach them to the current transaction. But only if
- * block_write_full_page() succeeded. Otherwise they are unmapped,
- * and generally junk.
- */
- if (ret == 0) {
- err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
- NULL, journal_dirty_data_fn);
- if (!ret)
- ret = err;
- }
- walk_page_buffers(handle, page_bufs, 0,
- PAGE_CACHE_SIZE, NULL, bput_one);
- err = ext3_journal_stop(handle);
- if (!ret)
- ret = err;
- return ret;
-
-out_fail:
- redirty_page_for_writepage(wbc, page);
- unlock_page(page);
- return ret;
-}
-
-static int ext3_writeback_writepage(struct page *page,
+static int ext3_common_writepage(struct page *page,
struct writeback_control *wbc)
{
struct inode *inode = page->mapping->host;
- handle_t *handle = NULL;
int ret = 0;
- int err;
-
- if (ext3_journal_current_handle())
- goto out_fail;
-
- if (page_has_buffers(page)) {
- if (!walk_page_buffers(NULL, page_buffers(page), 0,
- PAGE_CACHE_SIZE, NULL, buffer_unmapped)) {
- /* Provide NULL get_block() to catch bugs if buffers
- * weren't really mapped */
- return block_write_full_page(page, NULL, wbc);
- }
- }
-
- handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out_fail;
- }

if (test_opt(inode->i_sb, NOBH) && ext3_should_writeback_data(inode))
ret = nobh_writepage(page, ext3_get_block, wbc);
else
ret = block_write_full_page(page, ext3_get_block, wbc);
-
- err = ext3_journal_stop(handle);
- if (!ret)
- ret = err;
- return ret;
-
-out_fail:
- redirty_page_for_writepage(wbc, page);
- unlock_page(page);
return ret;
}

@@ -1752,9 +1648,11 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
if (orphan) {
int err;

+ block_lock_hole_extend(inode, offset);
/* Credits for sb + inode write */
handle = ext3_journal_start(inode, 2);
if (IS_ERR(handle)) {
+ block_unlock_hole_extend(inode);
/* This is really bad luck. We've written the data
* but cannot extend i_size. Bail out and pretend
* the write failed... */
@@ -1781,11 +1679,84 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
err = ext3_journal_stop(handle);
if (ret == 0)
ret = err;
+ block_unlock_hole_extend(inode);
}
out:
return ret;
}

+int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct page *page = vmf->page;
+ struct file *file = vma->vm_file;
+ struct address_space *mapping = file->f_mapping;
+ struct inode *inode = file->f_path.dentry->d_inode;
+ int ret = VM_FAULT_NOPAGE;
+ loff_t size;
+ int len;
+ void *fsdata;
+
+ block_wait_on_hole_extend(inode, page_offset(page));
+ /*
+ * Get i_alloc_sem to stop truncates messing with the inode. We cannot
+ * get i_mutex because we are already holding mmap_sem.
+ */
+ down_read(&inode->i_alloc_sem);
+ size = i_size_read(inode);
+ if ((page->mapping != inode->i_mapping) ||
+ (page_offset(page) > size)) {
+ /* page got truncated out from underneath us */
+ goto out_unlock;
+ }
+
+ /* page is wholly or partially inside EOF */
+ if (((page->index + 1) << PAGE_CACHE_SHIFT) > size)
+ len = size & ~PAGE_CACHE_MASK;
+ else
+ len = PAGE_CACHE_SIZE;
+
+ /*
+ * Check for the common case that everything is already mapped. We
+ * have to get the page lock so that buffers cannot be released
+ * under us.
+ */
+ lock_page(page);
+ if (page_has_buffers(page)) {
+ if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
+ buffer_unmapped)) {
+ unlock_page(page);
+ ret = 0;
+ goto out_unlock;
+ }
+ }
+ unlock_page(page);
+
+ /*
+ * OK, we may need to fill the hole... Do write_begin write_end to do
+ * block allocation/reservation. We are not holding inode.i_mutex
+ * here. That allows parallel write_begin, write_end call. lock_page
+ * prevent this from happening on the same page though.
+ */
+ ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
+ len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+ if (ret < 0)
+ goto out_unlock;
+ ret = mapping->a_ops->write_end(file, mapping, page_offset(page),
+ len, len, page, fsdata);
+ if (ret < 0)
+ goto out_unlock;
+ ret = 0;
+out_unlock:
+ if (unlikely(ret < 0)) {
+ if (ret == -ENOMEM)
+ ret = VM_FAULT_OOM;
+ else /* -ENOSPC, -EIO, etc */
+ ret = VM_FAULT_SIGBUS;
+ }
+ up_read(&inode->i_alloc_sem);
+ return ret;
+}
+
/*
* Pages can be marked dirty completely asynchronously from ext3's journalling
* activity. By filemap_sync_pte(), try_to_unmap_one(), etc. We cannot do
@@ -1808,10 +1779,11 @@ static int ext3_journalled_set_page_dirty(struct page *page)
static const struct address_space_operations ext3_ordered_aops = {
.readpage = ext3_readpage,
.readpages = ext3_readpages,
- .writepage = ext3_ordered_writepage,
+ .writepage = ext3_common_writepage,
.sync_page = block_sync_page,
.write_begin = ext3_write_begin,
.write_end = ext3_ordered_write_end,
+ .extend_i_size = block_extend_i_size,
.bmap = ext3_bmap,
.invalidatepage = ext3_invalidatepage,
.releasepage = ext3_releasepage,
@@ -1823,10 +1795,11 @@ static const struct address_space_operations ext3_ordered_aops = {
static const struct address_space_operations ext3_writeback_aops = {
.readpage = ext3_readpage,
.readpages = ext3_readpages,
- .writepage = ext3_writeback_writepage,
+ .writepage = ext3_common_writepage,
.sync_page = block_sync_page,
.write_begin = ext3_write_begin,
.write_end = ext3_writeback_write_end,
+ .extend_i_size = block_extend_i_size,
.bmap = ext3_bmap,
.invalidatepage = ext3_invalidatepage,
.releasepage = ext3_releasepage,
@@ -1842,6 +1815,7 @@ static const struct address_space_operations ext3_journalled_aops = {
.sync_page = block_sync_page,
.write_begin = ext3_write_begin,
.write_end = ext3_journalled_write_end,
+ .extend_i_size = block_extend_i_size,
.set_page_dirty = ext3_journalled_set_page_dirty,
.bmap = ext3_bmap,
.invalidatepage = ext3_invalidatepage,
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 7499b36..5051874 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -892,6 +892,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
extern void ext3_set_aops(struct inode *inode);
extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 len);
+extern int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);

/* ioctl.c */
extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
--
1.6.0.2


2009-05-27 13:01:11

by Jan Kara

Subject: [PATCH 09/11] fs: Don't clear dirty bits in block_write_full_page()

If getblock() fails in block_write_full_page(), we don't want to clear
dirty bits on buffers. Actually, we even want to redirty the page. This
way we just won't silently discard user data (written e.g. through mmap)
in case of ENOSPC, EDQUOT, EIO or another write error (which may be just
transient, e.g. because we have to commit a transaction to free up some
space). The downside of this approach is that if the error is persistent,
we have the page pinned in memory forever, and if there are lots of such
pages, we can drive the machine into OOM.

We also don't want to clear dirty bits from buffers above i_size because
that is generally the business of invalidatepage, where the filesystem
might want to do some additional work. If we cleared dirty bits already in
block_write_full_page, memory reclaim could reap the page before
invalidatepage is called on it and thus confuse the filesystem.

Signed-off-by: Jan Kara <[email protected]>
---
fs/buffer.c | 40 +++++++++++++++++-----------------------
1 files changed, 17 insertions(+), 23 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 3d0bced..6d18739 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1662,19 +1662,14 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
* handle any aliases from the underlying blockdev's mapping.
*/
do {
- if (block > last_block) {
- /*
- * mapped buffers outside i_size will occur, because
- * this page can be outside i_size when there is a
- * truncate in progress.
- */
- /*
- * The buffer was zeroed by block_write_full_page()
- */
- clear_buffer_dirty(bh);
- set_buffer_uptodate(bh);
- } else if ((!buffer_mapped(bh) || buffer_delay(bh)) &&
- buffer_dirty(bh)) {
+ /*
+ * Mapped buffers outside i_size will occur, because
+ * this page can be outside i_size when there is a
+ * truncate in progress.
+ */
+ if (block <= last_block &&
+ (!buffer_mapped(bh) || buffer_delay(bh)) &&
+ buffer_dirty(bh)) {
WARN_ON(bh->b_size != blocksize);
err = get_block(inode, block, bh, 1);
if (err)
@@ -1692,9 +1687,10 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
block++;
} while (bh != head);

+ block = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
do {
- if (!buffer_mapped(bh))
- continue;
+ if (!buffer_mapped(bh) || block > last_block)
+ goto next;
/*
* If it's a fully non-blocking write attempt and we cannot
* lock the buffer then redirty the page. Note that this can
@@ -1706,13 +1702,15 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
lock_buffer(bh);
} else if (!trylock_buffer(bh)) {
redirty_page_for_writepage(wbc, page);
- continue;
+ goto next;
}
if (test_clear_buffer_dirty(bh)) {
mark_buffer_async_write_endio(bh, handler);
} else {
unlock_buffer(bh);
}
+next:
+ block++;
} while ((bh = bh->b_this_page) != head);

/*
@@ -1753,9 +1751,11 @@ recover:
/*
* ENOSPC, or some other error. We may already have added some
* blocks to the file, so we need to write these out to avoid
- * exposing stale data.
+ * exposing stale data. We redirty the page so that we don't
+ * lose data we are unable to write.
* The page is currently locked and not marked for writeback
*/
+ redirty_page_for_writepage(wbc, page);
bh = head;
/* Recovery: lock and submit the mapped buffers */
do {
@@ -1763,12 +1763,6 @@ recover:
!buffer_delay(bh)) {
lock_buffer(bh);
mark_buffer_async_write_endio(bh, handler);
- } else {
- /*
- * The buffer may have been set dirty during
- * attachment to a dirty page.
- */
- clear_buffer_dirty(bh);
}
} while ((bh = bh->b_this_page) != head);
SetPageError(page);
--
1.6.0.2


2009-05-27 13:01:14

by Jan Kara

Subject: [PATCH 10/11] vfs: Export wakeup_pdflush

When we are running out of free space on a filesystem, we want to flush
dirty pages in the filesystem so that blocks reserved via delayed allocation
get written and unnecessary block reservations are released. Export
wakeup_pdflush() so that filesystems can start such writeback.
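
Passing 0 makes pdflush write back all dirty pages (nr_pages is then
computed from the global dirty counters, as the hunk below shows).
Patch 11 uses the export from ext3_should_retry_alloc() along these lines:

	/*
	 * Writeback converts delayed-allocation reservations into real
	 * blocks and releases the excess reservation, freeing space the
	 * journal commit alone would not.
	 */
	if (reserved_blocks)
		wakeup_pdflush(0);
	return journal_force_commit_nested(sbi->s_journal);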

Signed-off-by: Jan Kara <[email protected]>
---
mm/page-writeback.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bb553c3..43af1cf 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -730,6 +730,7 @@ int wakeup_pdflush(long nr_pages)
global_page_state(NR_UNSTABLE_NFS);
return pdflush_operation(background_writeout, nr_pages);
}
+EXPORT_SYMBOL(wakeup_pdflush);

static void wb_timer_fn(unsigned long unused);
static void laptop_timer_fn(unsigned long unused);
--
1.6.0.2


2009-05-27 13:01:11

by Jan Kara

Subject: [PATCH 11/11] ext3: Implement delayed allocation at page_mkwrite time

We don't want to really allocate blocks at page_mkwrite() time because for
random writes via mmap it results in much more fragmented files. So just
reserve enough free blocks in page_mkwrite() and do the real allocation
from writepage().
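
For a single data block, the worst case in ext3's indirect-block scheme is
the data block itself plus up to three new indirect blocks on the
triple-indirect path; that is where EXT3_DA_BLOCK_RESERVE = 4 below comes
from. The resulting lifecycle, in rough outline (names as in the patch,
with num/unused standing for the allocated and leftover block counts):

	/* page fault (ext3_da_get_block): reserve the worst case */
	vfs_dq_reserve_block(inode, EXT3_DA_BLOCK_RESERVE);
	dac_reserve(&sbi->s_alloc_counter, EXT3_DA_BLOCK_RESERVE, limit);
	set_buffer_delay(bh);

	/* writepage (ext3_get_blocks_handle): allocate for real ... */
	dac_alloc_reserved(&sbi->s_alloc_counter, num);
	vfs_dq_claim_block(inode, num);
	/* ... and give back the part of the reservation we did not use */
	dac_cancel_reserved(&sbi->s_alloc_counter, unused);
	vfs_dq_release_reservation_block(inode, unused);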

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext3/balloc.c | 96 ++++++++++++++--------
fs/ext3/ialloc.c | 2 +-
fs/ext3/inode.c | 192 +++++++++++++++++++++++++++-----------------
fs/ext3/resize.c | 2 +-
fs/ext3/super.c | 42 ++++++----
include/linux/ext3_fs.h | 9 ++-
include/linux/ext3_fs_i.h | 1 +
include/linux/ext3_fs_sb.h | 3 +-
8 files changed, 217 insertions(+), 130 deletions(-)

diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
index 225202d..ecde4ca 100644
--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -19,6 +19,8 @@
#include <linux/ext3_jbd.h>
#include <linux/quotaops.h>
#include <linux/buffer_head.h>
+#include <linux/delalloc_counter.h>
+#include <linux/writeback.h>

/*
* balloc.c contains the blocks allocation and deallocation routines
@@ -632,7 +634,7 @@ do_more:
spin_lock(sb_bgl_lock(sbi, block_group));
le16_add_cpu(&desc->bg_free_blocks_count, group_freed);
spin_unlock(sb_bgl_lock(sbi, block_group));
- percpu_counter_add(&sbi->s_freeblocks_counter, count);
+ dac_free(&sbi->s_alloc_counter, count);

/* We dirtied the bitmap block */
BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -1410,23 +1412,19 @@ out:
}

/**
- * ext3_has_free_blocks()
- * @sbi: in-core super block structure.
+ * ext3_free_blocks_limit()
+ * @sb: super block
*
* Return the number of blocks the caller must keep free when allocating
* (the root-reserved blocks, or 0 if it may consume even those).
*/
-static int ext3_has_free_blocks(struct ext3_sb_info *sbi)
+ext3_fsblk_t ext3_free_blocks_limit(struct super_block *sb)
{
- ext3_fsblk_t free_blocks, root_blocks;
+ struct ext3_sb_info *sbi = EXT3_SB(sb);

- free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
- root_blocks = le32_to_cpu(sbi->s_es->s_r_blocks_count);
- if (free_blocks < root_blocks + 1 && !capable(CAP_SYS_RESOURCE) &&
- sbi->s_resuid != current_fsuid() &&
- (sbi->s_resgid == 0 || !in_group_p (sbi->s_resgid))) {
- return 0;
- }
- return 1;
+ if (!capable(CAP_SYS_RESOURCE) && sbi->s_resuid != current_fsuid() &&
+ (sbi->s_resgid == 0 || !in_group_p(sbi->s_resgid)))
+ return le32_to_cpu(sbi->s_es->s_r_blocks_count) + 1;
+ return 0;
}

/**
@@ -1443,12 +1441,23 @@ static int ext3_has_free_blocks(struct ext3_sb_info *sbi)
*/
int ext3_should_retry_alloc(struct super_block *sb, int *retries)
{
- if (!ext3_has_free_blocks(EXT3_SB(sb)) || (*retries)++ > 3)
+ struct ext3_sb_info *sbi = EXT3_SB(sb);
+ ext3_fsblk_t free_blocks, limit, reserved_blocks;
+
+ free_blocks = dac_get_avail(&sbi->s_alloc_counter);
+ limit = ext3_free_blocks_limit(sb);
+ reserved_blocks = dac_get_reserved(&sbi->s_alloc_counter);
+ if (free_blocks + reserved_blocks < limit || (*retries)++ > 3)
return 0;

jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id);
-
- return journal_force_commit_nested(EXT3_SB(sb)->s_journal);
+ /*
+ * There's a chance commit will free some blocks and writeback can
+ * write delayed blocks so that excessive reservation gets released.
+ */
+ if (reserved_blocks)
+ wakeup_pdflush(0);
+ return journal_force_commit_nested(sbi->s_journal);
}

/**
@@ -1466,7 +1475,8 @@ int ext3_should_retry_alloc(struct super_block *sb, int *retries)
*
*/
ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
- ext3_fsblk_t goal, unsigned long *count, int *errp)
+ ext3_fsblk_t goal, unsigned long *count,
+ unsigned int *reserved, int *errp)
{
struct buffer_head *bitmap_bh = NULL;
struct buffer_head *gdp_bh;
@@ -1477,7 +1487,7 @@ ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
ext3_fsblk_t ret_block; /* filesyetem-wide allocated block */
int bgi; /* blockgroup iteration index */
int fatal = 0, err;
- int performed_allocation = 0;
+ int got_quota = 0, got_space = 0;
ext3_grpblk_t free_blocks; /* number of free blocks in a group */
struct super_block *sb;
struct ext3_group_desc *gdp;
@@ -1498,16 +1508,27 @@ ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
printk("ext3_new_block: nonexistent device");
return 0;
}
+ sbi = EXT3_SB(sb);

- /*
- * Check quota for allocation of this block.
- */
- if (vfs_dq_alloc_block(inode, num)) {
- *errp = -EDQUOT;
- return 0;
+ if (!*reserved) {
+ /*
+ * Check quota for allocation of this block.
+ */
+ if (vfs_dq_alloc_block(inode, num)) {
+ *errp = -EDQUOT;
+ goto out;
+ }
+ got_quota = 1;
+ if (dac_alloc(&sbi->s_alloc_counter, num,
+ ext3_free_blocks_limit(sb)) < 0) {
+ *errp = -ENOSPC;
+ goto out;
+ }
+ got_space = 1;
+ } else {
+ WARN_ON(*reserved < num);
}

- sbi = EXT3_SB(sb);
es = EXT3_SB(sb)->s_es;
ext3_debug("goal=%lu.\n", goal);
/*
@@ -1522,11 +1543,6 @@ ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
if (block_i && ((windowsz = block_i->rsv_window_node.rsv_goal_size) > 0))
my_rsv = &block_i->rsv_window_node;

- if (!ext3_has_free_blocks(sbi)) {
- *errp = -ENOSPC;
- goto out;
- }
-
/*
* First, test whether the goal block is free.
*/
@@ -1650,8 +1666,6 @@ allocated:
goto retry_alloc;
}

- performed_allocation = 1;
-
#ifdef CONFIG_JBD_DEBUG
{
struct buffer_head *debug_bh;
@@ -1701,7 +1715,6 @@ allocated:
spin_lock(sb_bgl_lock(sbi, group_no));
le16_add_cpu(&gdp->bg_free_blocks_count, -num);
spin_unlock(sb_bgl_lock(sbi, group_no));
- percpu_counter_sub(&sbi->s_freeblocks_counter, num);

BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
err = ext3_journal_dirty_metadata(handle, gdp_bh);
@@ -1714,7 +1727,15 @@ allocated:

*errp = 0;
brelse(bitmap_bh);
- vfs_dq_free_block(inode, *count-num);
+ if (*reserved) {
+ dac_alloc_reserved(&sbi->s_alloc_counter, num);
+ vfs_dq_claim_block(inode, num);
+ atomic_sub(num, &EXT3_I(inode)->i_reserved_blocks);
+ *reserved -= num;
+ } else {
+ dac_free(&sbi->s_alloc_counter, *count - num);
+ vfs_dq_free_block(inode, *count - num);
+ }
*count = num;
return ret_block;

@@ -1728,8 +1749,10 @@ out:
/*
* Undo the block allocation
*/
- if (!performed_allocation)
+ if (got_quota)
vfs_dq_free_block(inode, *count);
+ if (got_space)
+ dac_free(&sbi->s_alloc_counter, *count);
brelse(bitmap_bh);
return 0;
}
@@ -1738,8 +1761,9 @@ ext3_fsblk_t ext3_new_block(handle_t *handle, struct inode *inode,
ext3_fsblk_t goal, int *errp)
{
unsigned long count = 1;
+ unsigned int reserved = 0;

- return ext3_new_blocks(handle, inode, goal, &count, errp);
+ return ext3_new_blocks(handle, inode, goal, &count, &reserved, errp);
}

/**
diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index dd13d60..97b1b70 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -269,7 +269,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent)

freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
avefreei = freei / ngroups;
- freeb = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
+ freeb = dac_get_avail(&sbi->s_alloc_counter);
avefreeb = freeb / ngroups;
ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index cc012fe..d396006 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -38,6 +38,7 @@
#include <linux/bio.h>
#include <linux/fiemap.h>
#include <linux/namei.h>
+#include <linux/mount.h>
#include "xattr.h"
#include "acl.h"

@@ -516,7 +517,8 @@ static int ext3_blks_to_allocate(Indirect *branch, int k, unsigned long blks,
*/
static int ext3_alloc_blocks(handle_t *handle, struct inode *inode,
ext3_fsblk_t goal, int indirect_blks, int blks,
- ext3_fsblk_t new_blocks[4], int *err)
+ unsigned int *reserved, ext3_fsblk_t new_blocks[4],
+ int *err)
{
int target, i;
unsigned long count = 0;
@@ -537,7 +539,8 @@ static int ext3_alloc_blocks(handle_t *handle, struct inode *inode,
while (1) {
count = target;
/* allocating blocks for indirect blocks and direct blocks */
- current_block = ext3_new_blocks(handle,inode,goal,&count,err);
+ current_block = ext3_new_blocks(handle, inode, goal, &count,
+ reserved, err);
if (*err)
goto failed_out;

@@ -591,8 +594,8 @@ failed_out:
* as described above and return 0.
*/
static int ext3_alloc_branch(handle_t *handle, struct inode *inode,
- int indirect_blks, int *blks, ext3_fsblk_t goal,
- int *offsets, Indirect *branch)
+ int indirect_blks, int *blks, unsigned int *reserved,
+ ext3_fsblk_t goal, int *offsets, Indirect *branch)
{
int blocksize = inode->i_sb->s_blocksize;
int i, n = 0;
@@ -603,7 +606,7 @@ static int ext3_alloc_branch(handle_t *handle, struct inode *inode,
ext3_fsblk_t current_block;

num = ext3_alloc_blocks(handle, inode, goal, indirect_blks,
- *blks, new_blocks, &err);
+ *blks, reserved, new_blocks, &err);
if (err)
return err;

@@ -800,6 +803,7 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
int depth;
struct ext3_inode_info *ei = EXT3_I(inode);
int count = 0;
+ unsigned int reserved = 0;
ext3_fsblk_t first_block = 0;


@@ -898,8 +902,22 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
/*
* Block out ext3_truncate while we alter the tree
*/
- err = ext3_alloc_branch(handle, inode, indirect_blks, &count, goal,
- offsets + (partial - chain), partial);
+ if (buffer_delay(bh_result)) {
+ WARN_ON(maxblocks != 1);
+ WARN_ON(!bh_result->b_page || !PageLocked(bh_result->b_page));
+ reserved = EXT3_DA_BLOCK_RESERVE;
+ }
+ err = ext3_alloc_branch(handle, inode, indirect_blks, &count,
+ &reserved, goal, offsets + (partial - chain),
+ partial);
+ /* Release additional reservation we had for this block */
+ if (!err && buffer_delay(bh_result)) {
+ dac_cancel_reserved(&EXT3_SB(inode->i_sb)->s_alloc_counter,
+ reserved);
+ vfs_dq_release_reservation_block(inode, reserved);
+ atomic_sub(reserved, &ei->i_reserved_blocks);
+ clear_buffer_delay(bh_result);
+ }

/*
* The ext3_splice_branch call will free and forget any buffers
@@ -1432,9 +1450,9 @@ static sector_t ext3_bmap(struct address_space *mapping, sector_t block)
return generic_block_bmap(mapping,block,ext3_get_block);
}

-static int buffer_unmapped(handle_t *handle, struct buffer_head *bh)
+static int ext3_bh_delay(handle_t *handle, struct buffer_head *bh)
{
- return !buffer_mapped(bh);
+ return buffer_delay(bh);
}

/*
@@ -1496,12 +1514,31 @@ static int ext3_common_writepage(struct page *page,
struct writeback_control *wbc)
{
struct inode *inode = page->mapping->host;
- int ret = 0;
+ int ret = 0, delay = 1, err;
+ handle_t *handle = NULL;

+ /* Start a transaction only if the page has delayed buffers */
+ if (page_has_buffers(page))
+ delay = walk_page_buffers(NULL, page_buffers(page), 0,
+ PAGE_CACHE_SIZE, NULL, ext3_bh_delay);
+ if (delay) {
+ handle = ext3_journal_start(inode,
+ ext3_writepage_trans_blocks(inode));
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+ }
if (test_opt(inode->i_sb, NOBH) && ext3_should_writeback_data(inode))
ret = nobh_writepage(page, ext3_get_block, wbc);
else
ret = block_write_full_page(page, ext3_get_block, wbc);
+ if (delay) {
+ err = ext3_journal_stop(handle);
+ if (!ret)
+ ret = err;
+ }
+out:
return ret;
}

@@ -1576,15 +1613,37 @@ ext3_readpages(struct file *file, struct address_space *mapping,
return mpage_readpages(mapping, pages, nr_pages, ext3_get_block);
}

+
+static int truncate_delayed_bh(handle_t *handle, struct buffer_head *bh)
+{
+ if (buffer_delay(bh)) {
+ struct inode *inode = bh->b_page->mapping->host;
+ struct ext3_inode_info *ei = EXT3_I(inode);
+
+ atomic_sub(EXT3_DA_BLOCK_RESERVE, &ei->i_reserved_blocks);
+ vfs_dq_release_reservation_block(inode, EXT3_DA_BLOCK_RESERVE);
+ dac_cancel_reserved(&EXT3_SB(inode->i_sb)->s_alloc_counter,
+ EXT3_DA_BLOCK_RESERVE);
+ clear_buffer_delay(bh);
+ }
+ return 0;
+}
+
static void ext3_invalidatepage(struct page *page, unsigned long offset)
{
- journal_t *journal = EXT3_JOURNAL(page->mapping->host);
+ struct inode *inode = page->mapping->host;
+ journal_t *journal = EXT3_JOURNAL(inode);
+ int bsize = 1 << inode->i_blkbits;

/*
* If it's a full truncate we just forget about the pending dirtying
*/
if (offset == 0)
ClearPageChecked(page);
+ if (page_has_buffers(page)) {
+ walk_page_buffers(NULL, page_buffers(page), offset + bsize - 1,
+ PAGE_CACHE_SIZE, NULL, truncate_delayed_bh);
+ }

journal_invalidatepage(journal, page, offset);
}
@@ -1685,75 +1744,58 @@ out:
return ret;
}

-int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+/*
+ * Reserve block writes instead of allocation. Called only on buffer heads
+ * attached to a page (and thus for 1 block).
+ */
+static int ext3_da_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh, int create)
{
- struct page *page = vmf->page;
- struct file *file = vma->vm_file;
- struct address_space *mapping = file->f_mapping;
- struct inode *inode = file->f_path.dentry->d_inode;
- int ret = VM_FAULT_NOPAGE;
- loff_t size;
- int len;
- void *fsdata;
-
- block_wait_on_hole_extend(inode, page_offset(page));
- /*
- * Get i_alloc_sem to stop truncates messing with the inode. We cannot
- * get i_mutex because we are already holding mmap_sem.
- */
- down_read(&inode->i_alloc_sem);
- size = i_size_read(inode);
- if ((page->mapping != inode->i_mapping) ||
- (page_offset(page) > size)) {
- /* page got truncated out from underneath us */
- goto out_unlock;
- }
+ int ret, rsv;
+ struct ext3_sb_info *sbi;

- /* page is wholly or partially inside EOF */
- if (((page->index + 1) << PAGE_CACHE_SHIFT) > size)
- len = size & ~PAGE_CACHE_MASK;
- else
- len = PAGE_CACHE_SIZE;
-
- /*
- * Check for the common case that everything is already mapped. We
- * have to get the page lock so that buffers cannot be released
- * under us.
- */
- lock_page(page);
- if (page_has_buffers(page)) {
- if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
- buffer_unmapped)) {
- unlock_page(page);
- ret = 0;
- goto out_unlock;
- }
- }
- unlock_page(page);
+ /* Buffer has already blocks reserved? */
+ if (buffer_delay(bh))
+ return 0;

- /*
- * OK, we may need to fill the hole... Do write_begin write_end to do
- * block allocation/reservation. We are not holding inode.i_mutex
- * here. That allows parallel write_begin, write_end call. lock_page
- * prevent this from happening on the same page though.
- */
- ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
- len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
- if (ret < 0)
- goto out_unlock;
- ret = mapping->a_ops->write_end(file, mapping, page_offset(page),
- len, len, page, fsdata);
+ ret = ext3_get_blocks_handle(NULL, inode, iblock, 1, bh, 0);
if (ret < 0)
- goto out_unlock;
- ret = 0;
-out_unlock:
- if (unlikely(ret)) {
- if (ret == -ENOMEM)
- ret = VM_FAULT_OOM;
- else /* -ENOSPC, -EIO, etc */
- ret = VM_FAULT_SIGBUS;
+ goto out;
+ if (ret > 0 || !create) {
+ ret = 0;
+ goto out;
+ }
+ /* Upperbound on number of needed blocks */
+ rsv = EXT3_DA_BLOCK_RESERVE;
+ /* Delayed allocation needed */
+ if (vfs_dq_reserve_block(inode, rsv)) {
+ ret = -EDQUOT;
+ goto out;
+ }
+ sbi = EXT3_SB(inode->i_sb);
+ ret = dac_reserve(&sbi->s_alloc_counter, rsv,
+ ext3_free_blocks_limit(inode->i_sb));
+ if (ret < 0) {
+ vfs_dq_release_reservation_block(inode, rsv);
+ goto out;
}
- up_read(&inode->i_alloc_sem);
+ set_buffer_delay(bh);
+ set_buffer_new(bh);
+ atomic_add(rsv, &EXT3_I(inode)->i_reserved_blocks);
+out:
+ return ret;
+}
+
+int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ int retry = 0;
+ int ret;
+ struct super_block *sb = vma->vm_file->f_path.mnt->mnt_sb;
+
+ do {
+ ret = block_page_mkwrite(vma, vmf, ext3_da_get_block);
+ } while (ret == VM_FAULT_SIGBUS &&
+ ext3_should_retry_alloc(sb, &retry));
return ret;
}

diff --git a/fs/ext3/resize.c b/fs/ext3/resize.c
index 78fdf38..5a35314 100644
--- a/fs/ext3/resize.c
+++ b/fs/ext3/resize.c
@@ -928,7 +928,7 @@ int ext3_group_add(struct super_block *sb, struct ext3_new_group_data *input)
le32_add_cpu(&es->s_r_blocks_count, input->reserved_blocks);

/* Update the free space counts */
- percpu_counter_add(&sbi->s_freeblocks_counter,
+ percpu_counter_add(&sbi->s_alloc_counter.free,
input->free_blocks_count);
percpu_counter_add(&sbi->s_freeinodes_counter,
EXT3_INODES_PER_GROUP(sb));
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 599dbfe..73263eb 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -416,7 +416,7 @@ static void ext3_put_super (struct super_block * sb)
for (i = 0; i < sbi->s_gdb_count; i++)
brelse(sbi->s_group_desc[i]);
kfree(sbi->s_group_desc);
- percpu_counter_destroy(&sbi->s_freeblocks_counter);
+ dac_destroy(&sbi->s_alloc_counter);
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
brelse(sbi->s_sbh);
@@ -492,6 +492,7 @@ static void init_once(void *foo)
#ifdef CONFIG_EXT3_FS_XATTR
init_rwsem(&ei->xattr_sem);
#endif
+ atomic_set(&ei->i_reserved_blocks, 0);
mutex_init(&ei->truncate_mutex);
inode_init_once(&ei->vfs_inode);
}
@@ -515,23 +516,26 @@ static void destroy_inodecache(void)

static void ext3_clear_inode(struct inode *inode)
{
- struct ext3_block_alloc_info *rsv = EXT3_I(inode)->i_block_alloc_info;
+ struct ext3_inode_info *ei = EXT3_I(inode);
+ struct ext3_block_alloc_info *rsv = ei->i_block_alloc_info;
#ifdef CONFIG_EXT3_FS_POSIX_ACL
- if (EXT3_I(inode)->i_acl &&
- EXT3_I(inode)->i_acl != EXT3_ACL_NOT_CACHED) {
- posix_acl_release(EXT3_I(inode)->i_acl);
- EXT3_I(inode)->i_acl = EXT3_ACL_NOT_CACHED;
+ if (ei->i_acl && ei->i_acl != EXT3_ACL_NOT_CACHED) {
+ posix_acl_release(ei->i_acl);
+ ei->i_acl = EXT3_ACL_NOT_CACHED;
}
- if (EXT3_I(inode)->i_default_acl &&
- EXT3_I(inode)->i_default_acl != EXT3_ACL_NOT_CACHED) {
- posix_acl_release(EXT3_I(inode)->i_default_acl);
- EXT3_I(inode)->i_default_acl = EXT3_ACL_NOT_CACHED;
+ if (ei->i_default_acl && ei->i_default_acl != EXT3_ACL_NOT_CACHED) {
+ posix_acl_release(ei->i_default_acl);
+ ei->i_default_acl = EXT3_ACL_NOT_CACHED;
}
#endif
ext3_discard_reservation(inode);
- EXT3_I(inode)->i_block_alloc_info = NULL;
+ ei->i_block_alloc_info = NULL;
if (unlikely(rsv))
kfree(rsv);
+ if (atomic_read(&ei->i_reserved_blocks))
+ ext3_warning(inode->i_sb, __func__, "Releasing inode %lu with "
+ "%lu reserved blocks.\n", inode->i_ino,
+ (unsigned long)atomic_read(&ei->i_reserved_blocks));
}

static inline void ext3_show_quota_options(struct seq_file *seq, struct super_block *sb)
@@ -726,10 +730,19 @@ static ssize_t ext3_quota_read(struct super_block *sb, int type, char *data,
static ssize_t ext3_quota_write(struct super_block *sb, int type,
const char *data, size_t len, loff_t off);

+static qsize_t ext3_get_reserved_space(struct inode *inode)
+{
+ return atomic_read(&EXT3_I(inode)->i_reserved_blocks);
+}
+
static struct dquot_operations ext3_quota_operations = {
.initialize = dquot_initialize,
.drop = dquot_drop,
.alloc_space = dquot_alloc_space,
+ .reserve_space = dquot_reserve_space,
+ .claim_space = dquot_claim_space,
+ .release_rsv = dquot_release_reserved_space,
+ .get_reserved_space = ext3_get_reserved_space,
.alloc_inode = dquot_alloc_inode,
.free_space = dquot_free_space,
.free_inode = dquot_free_inode,
@@ -1851,8 +1864,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
get_random_bytes(&sbi->s_next_generation, sizeof(u32));
spin_lock_init(&sbi->s_next_gen_lock);

- err = percpu_counter_init(&sbi->s_freeblocks_counter,
- ext3_count_free_blocks(sb));
+ err = dac_init(&sbi->s_alloc_counter, ext3_count_free_blocks(sb));
if (!err) {
err = percpu_counter_init(&sbi->s_freeinodes_counter,
ext3_count_free_inodes(sb));
@@ -2005,7 +2017,7 @@ cantfind_ext3:
failed_mount4:
journal_destroy(sbi->s_journal);
failed_mount3:
- percpu_counter_destroy(&sbi->s_freeblocks_counter);
+ dac_destroy(&sbi->s_alloc_counter);
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
failed_mount2:
@@ -2685,7 +2697,7 @@ static int ext3_statfs (struct dentry * dentry, struct kstatfs * buf)
buf->f_type = EXT3_SUPER_MAGIC;
buf->f_bsize = sb->s_blocksize;
buf->f_blocks = le32_to_cpu(es->s_blocks_count) - sbi->s_overhead_last;
- buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter);
+ buf->f_bfree = dac_get_avail_sum(&sbi->s_alloc_counter);
es->s_free_blocks_count = cpu_to_le32(buf->f_bfree);
buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 5051874..7f28907 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -809,6 +809,11 @@ ext3_group_first_block_no(struct super_block *sb, unsigned long group_no)
#define ERR_BAD_DX_DIR -75000

/*
+ * Number of blocks we reserve in delayed allocation for one block
+ */
+#define EXT3_DA_BLOCK_RESERVE 4
+
+/*
* Function prototypes
*/

@@ -821,12 +826,14 @@ ext3_group_first_block_no(struct super_block *sb, unsigned long group_no)
# define NORET_AND noreturn,

/* balloc.c */
+extern ext3_fsblk_t ext3_free_blocks_limit(struct super_block *sb);
extern int ext3_bg_has_super(struct super_block *sb, int group);
extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group);
extern ext3_fsblk_t ext3_new_block (handle_t *handle, struct inode *inode,
ext3_fsblk_t goal, int *errp);
extern ext3_fsblk_t ext3_new_blocks (handle_t *handle, struct inode *inode,
- ext3_fsblk_t goal, unsigned long *count, int *errp);
+ ext3_fsblk_t goal, unsigned long *count,
+ unsigned int *reserved, int *errp);
extern void ext3_free_blocks (handle_t *handle, struct inode *inode,
ext3_fsblk_t block, unsigned long count);
extern void ext3_free_blocks_sb (handle_t *handle, struct super_block *sb,
diff --git a/include/linux/ext3_fs_i.h b/include/linux/ext3_fs_i.h
index 7894dd0..ef288bd 100644
--- a/include/linux/ext3_fs_i.h
+++ b/include/linux/ext3_fs_i.h
@@ -129,6 +129,7 @@ struct ext3_inode_info {

/* on-disk additional length */
__u16 i_extra_isize;
+ atomic_t i_reserved_blocks;

/*
* truncate_mutex is for serialising ext3_truncate() against
diff --git a/include/linux/ext3_fs_sb.h b/include/linux/ext3_fs_sb.h
index f07f34d..f92e162 100644
--- a/include/linux/ext3_fs_sb.h
+++ b/include/linux/ext3_fs_sb.h
@@ -21,6 +21,7 @@
#include <linux/wait.h>
#include <linux/blockgroup_lock.h>
#include <linux/percpu_counter.h>
+#include <linux/delalloc_counter.h>
#endif
#include <linux/rbtree.h>

@@ -58,7 +59,7 @@ struct ext3_sb_info {
u32 s_hash_seed[4];
int s_def_hash_version;
int s_hash_unsigned; /* 3 if hash should be signed, 0 if not */
- struct percpu_counter s_freeblocks_counter;
+ struct delalloc_counter s_alloc_counter;
struct percpu_counter s_freeinodes_counter;
struct percpu_counter s_dirs_counter;
struct blockgroup_lock *s_blockgroup_lock;
--
1.6.0.2


2009-05-27 13:01:11

by Jan Kara

[permalink] [raw]
Subject: [PATCH 02/11] ext4: Get rid of extend_disksize parameter of ext4_get_blocks_handle()

Get rid of the extend_disksize parameter of ext4_get_blocks_handle(). It seems
to be a relic from the old days and setting disksize in this function does not
make much sense. Currently it is set only by ext4_getblk(). Since the parameter
has an effect only if create == 1, it is easy to check that the three callers
which end up calling ext4_getblk() with create == 1 (ext4_append,
ext4_quota_write, ext4_mkdir) do the right thing and set disksize themselves.
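
For illustration, the pattern those callers follow is roughly the following
(a simplified sketch, not verbatim code from the tree):

	bh = ext4_bread(handle, inode, block, 1, &err);
	if (bh) {
		/* A block was created, so push i_disksize forward ourselves */
		inode->i_size += inode->i_sb->s_blocksize;
		EXT4_I(inode)->i_disksize = inode->i_size;
	}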

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/dir.c | 3 +--
fs/ext4/ext4.h | 6 ++----
fs/ext4/extents.c | 16 +++-------------
fs/ext4/inode.c | 43 ++++++++++++-------------------------------
4 files changed, 18 insertions(+), 50 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index b647899..c8e0623 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -131,8 +131,7 @@ static int ext4_readdir(struct file *filp,
struct buffer_head *bh = NULL;

map_bh.b_state = 0;
- err = ext4_get_blocks_wrap(NULL, inode, blk, 1, &map_bh,
- 0, 0, 0);
+ err = ext4_get_blocks_wrap(NULL, inode, blk, 1, &map_bh, 0, 0);
if (err > 0) {
pgoff_t index = map_bh.b_blocknr >>
(PAGE_CACHE_SHIFT - inode->i_blkbits);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d0f15ef..2b02ae0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1338,8 +1338,7 @@ extern int ext4_ext_index_trans_blocks(struct inode *inode, int nrblocks,
int chunk);
extern int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ext4_lblk_t iblock, unsigned int max_blocks,
- struct buffer_head *bh_result,
- int create, int extend_disksize);
+ struct buffer_head *bh_result, int create);
extern void ext4_ext_truncate(struct inode *);
extern void ext4_ext_init(struct super_block *);
extern void ext4_ext_release(struct super_block *);
@@ -1347,8 +1346,7 @@ extern long ext4_fallocate(struct inode *inode, int mode, loff_t offset,
loff_t len);
extern int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode,
sector_t block, unsigned int max_blocks,
- struct buffer_head *bh, int create,
- int extend_disksize, int flag);
+ struct buffer_head *bh, int create, int flag);
extern int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
__u64 start, __u64 len);

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e3a55eb..6b0eeaf 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2782,9 +2782,8 @@ fix_extent_len:
* return < 0, error case.
*/
int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
- ext4_lblk_t iblock,
- unsigned int max_blocks, struct buffer_head *bh_result,
- int create, int extend_disksize)
+ ext4_lblk_t iblock, unsigned int max_blocks,
+ struct buffer_head *bh_result, int create)
{
struct ext4_ext_path *path = NULL;
struct ext4_extent_header *eh;
@@ -2793,7 +2792,6 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
int err = 0, depth, ret, cache_type;
unsigned int allocated = 0;
struct ext4_allocation_request ar;
- loff_t disksize;

__clear_bit(BH_New, &bh_result->b_state);
ext_debug("blocks %u/%u requested for inode %u\n",
@@ -2983,14 +2981,6 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
newblock = ext_pblock(&newex);
allocated = ext4_ext_get_actual_len(&newex);
outnew:
- if (extend_disksize) {
- disksize = ((loff_t) iblock + ar.len) << inode->i_blkbits;
- if (disksize > i_size_read(inode))
- disksize = i_size_read(inode);
- if (disksize > EXT4_I(inode)->i_disksize)
- EXT4_I(inode)->i_disksize = disksize;
- }
-
set_buffer_new(bh_result);

/* Cache only when it is _not_ an uninitialized extent */
@@ -3152,7 +3142,7 @@ retry:
}
ret = ext4_get_blocks_wrap(handle, inode, block,
max_blocks, &map_bh,
- EXT4_CREATE_UNINITIALIZED_EXT, 0, 0);
+ EXT4_CREATE_UNINITIALIZED_EXT, 0);
if (ret <= 0) {
#ifdef EXT4FS_DEBUG
WARN_ON(ret <= 0);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2a9ffd5..7fcceb0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -916,8 +916,7 @@ err_out:
*/
static int ext4_get_blocks_handle(handle_t *handle, struct inode *inode,
ext4_lblk_t iblock, unsigned int maxblocks,
- struct buffer_head *bh_result,
- int create, int extend_disksize)
+ struct buffer_head *bh_result, int create)
{
int err = -EIO;
ext4_lblk_t offsets[4];
@@ -927,11 +926,8 @@ static int ext4_get_blocks_handle(handle_t *handle, struct inode *inode,
int indirect_blks;
int blocks_to_boundary = 0;
int depth;
- struct ext4_inode_info *ei = EXT4_I(inode);
int count = 0;
ext4_fsblk_t first_block = 0;
- loff_t disksize;
-

J_ASSERT(!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL));
J_ASSERT(handle != NULL || create == 0);
@@ -997,18 +993,6 @@ static int ext4_get_blocks_handle(handle_t *handle, struct inode *inode,
if (!err)
err = ext4_splice_branch(handle, inode, iblock,
partial, indirect_blks, count);
- /*
- * i_disksize growing is protected by i_data_sem. Don't forget to
- * protect it if you're about to implement concurrent
- * ext4_get_block() -bzzz
- */
- if (!err && extend_disksize) {
- disksize = ((loff_t) iblock + count) << inode->i_blkbits;
- if (disksize > i_size_read(inode))
- disksize = i_size_read(inode);
- if (disksize > ei->i_disksize)
- ei->i_disksize = disksize;
- }
if (err)
goto cleanup;

@@ -1144,7 +1128,7 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used)
*/
int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
unsigned int max_blocks, struct buffer_head *bh,
- int create, int extend_disksize, int flag)
+ int create, int flag)
{
int retval;

@@ -1158,10 +1142,10 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
down_read((&EXT4_I(inode)->i_data_sem));
if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) {
retval = ext4_ext_get_blocks(handle, inode, block, max_blocks,
- bh, 0, 0);
+ bh, 0);
} else {
retval = ext4_get_blocks_handle(handle,
- inode, block, max_blocks, bh, 0, 0);
+ inode, block, max_blocks, bh, 0);
}
up_read((&EXT4_I(inode)->i_data_sem));

@@ -1213,10 +1197,10 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
*/
if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) {
retval = ext4_ext_get_blocks(handle, inode, block, max_blocks,
- bh, create, extend_disksize);
+ bh, create);
} else {
retval = ext4_get_blocks_handle(handle, inode, block,
- max_blocks, bh, create, extend_disksize);
+ max_blocks, bh, create);

if (retval > 0 && buffer_new(bh)) {
/*
@@ -1269,7 +1253,7 @@ int ext4_get_block(struct inode *inode, sector_t iblock,
}

ret = ext4_get_blocks_wrap(handle, inode, iblock,
- max_blocks, bh_result, create, 0, 0);
+ max_blocks, bh_result, create, 0);
if (ret > 0) {
bh_result->b_size = (ret << inode->i_blkbits);
ret = 0;
@@ -1294,8 +1278,7 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
dummy.b_state = 0;
dummy.b_blocknr = -1000;
buffer_trace_init(&dummy.b_history);
- err = ext4_get_blocks_wrap(handle, inode, block, 1,
- &dummy, create, 1, 0);
+ err = ext4_get_blocks_wrap(handle, inode, block, 1, &dummy, create, 0);
/*
* ext4_get_blocks_handle() returns number of blocks
* mapped. 0 in case of a HOLE.
@@ -2002,7 +1985,7 @@ static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
handle = ext4_journal_current_handle();
BUG_ON(!handle);
ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
- bh_result, create, 0, EXT4_DELALLOC_RSVED);
+ bh_result, create, EXT4_DELALLOC_RSVED);
if (ret <= 0)
return ret;

@@ -2020,9 +2003,7 @@ static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
}

/*
- * Update on-disk size along with block allocation we don't
- * use 'extend_disksize' as size may change within already
- * allocated block -bzzz
+ * Update on-disk size along with block allocation.
*/
disksize = ((loff_t) iblock + ret) << inode->i_blkbits;
if (disksize > i_size_read(inode))
@@ -2323,7 +2304,7 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
* preallocated blocks are unmapped but should treated
* the same as allocated blocks.
*/
- ret = ext4_get_blocks_wrap(NULL, inode, iblock, 1, bh_result, 0, 0, 0);
+ ret = ext4_get_blocks_wrap(NULL, inode, iblock, 1, bh_result, 0, 0);
if ((ret == 0) && !buffer_delay(bh_result)) {
/* the block isn't (pre)allocated yet, let's reserve space */
/*
@@ -2373,7 +2354,7 @@ static int ext4_normal_get_block_write(struct inode *inode, sector_t iblock,
* so call get_block_wrap with create = 0
*/
ret = ext4_get_blocks_wrap(NULL, inode, iblock, max_blocks,
- bh_result, 0, 0, 0);
+ bh_result, 0, 0);
if (ret > 0) {
bh_result->b_size = (ret << inode->i_blkbits);
ret = 0;
--
1.6.0.2


2009-05-27 13:01:11

by Jan Kara

[permalink] [raw]
Subject: [PATCH 03/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize

page_mkwrite() is meant to be used by filesystems to allocate blocks under a
page which is becoming writably mmapped in some process' address space. This
allows a filesystem to fail the page fault if there is not enough space
available, the user exceeds quota, or a similar problem happens, rather than
silently discarding data later when writepage() is called.

On filesystems where blocksize < pagesize the situation is more complicated.
Think for example that blocksize = 1024, pagesize = 4096 and a process does:
ftruncate(fd, 0);
pwrite(fd, buf, 1024, 0);
map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
map[0] = 'a'; ----> page_mkwrite() for index 0 is called
ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
fsync(fd); ----> writepage() for index 0 is called

At the moment page_mkwrite() is called, the filesystem can allocate only one
block for the page because i_size == 1024. Otherwise it would create blocks
beyond i_size, which is generally undesirable. But later, at writepage() time,
we would like to have blocks allocated for the whole page (and in principle we
have to allocate them because the user could have filled the page with data
after the second ftruncate()). This patch introduces a framework which allows
filesystems to handle this with reasonable effort.

The idea is the following: before we extend i_size, we obtain a special lock
blocking page_mkwrite() on the page straddling i_size. Then we write-protect
the page, change i_size and unlock the special lock. This way, page_mkwrite()
is called for a page each time the number of blocks that need to be allocated
for it increases.
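
For a filesystem using the generic buffer helpers, hooking into this framework
is then mostly a matter of wiring up the new callback (a sketch; myfs_aops is
a hypothetical example, block_extend_i_size() is added by this patch):

	static const struct address_space_operations myfs_aops = {
		/* ... the usual readpage/writepage/write_begin/write_end ... */
		.extend_i_size	= block_extend_i_size,
	};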

Signed-off-by: Jan Kara <[email protected]>
---
fs/buffer.c | 143 +++++++++++++++++++++++++++++++++++++++++++
include/linux/buffer_head.h | 4 +
include/linux/fs.h | 11 +++-
mm/filemap.c | 10 +++-
mm/memory.c | 2 +-
5 files changed, 166 insertions(+), 4 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index aed2977..ddfcade 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -40,6 +40,7 @@
#include <linux/cpu.h>
#include <linux/bitops.h>
#include <linux/mpage.h>
+#include <linux/rmap.h>
#include <linux/bit_spinlock.h>

static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
@@ -1970,9 +1971,11 @@ int block_write_begin(struct file *file, struct address_space *mapping,
page = *pagep;
if (page == NULL) {
ownpage = 1;
+ block_lock_hole_extend(inode, pos);
page = grab_cache_page_write_begin(mapping, index, flags);
if (!page) {
status = -ENOMEM;
+ block_unlock_hole_extend(inode);
goto out;
}
*pagep = page;
@@ -1987,6 +1990,7 @@ int block_write_begin(struct file *file, struct address_space *mapping,
unlock_page(page);
page_cache_release(page);
*pagep = NULL;
+ block_unlock_hole_extend(inode);

/*
* prepare_write() may have instantiated a few blocks
@@ -2062,6 +2066,7 @@ int generic_write_end(struct file *file, struct address_space *mapping,

unlock_page(page);
page_cache_release(page);
+ block_unlock_hole_extend(inode);

/*
* Don't mark the inode dirty under page lock. First, it unnecessarily
@@ -2368,6 +2373,137 @@ int block_commit_write(struct page *page, unsigned from, unsigned to)
}

/*
+ * Lock inode with I_HOLE_EXTEND if the write is going to create a hole
+ * under a mmapped page. Also mark the page RO so that page_mkwrite()
+ * is called on the nearest write access to the page and clear dirty bits
+ * beyond i_size.
+ *
+ * @pos is offset to which write/truncate is happenning.
+ *
+ * Returns 1 if the lock has been acquired.
+ */
+int block_lock_hole_extend(struct inode *inode, loff_t pos)
+{
+ int bsize = 1 << inode->i_blkbits;
+ loff_t rounded_i_size;
+ struct page *page;
+ pgoff_t index;
+ struct buffer_head *head, *bh;
+ sector_t block, last_block;
+
+ /* Optimize for common case */
+ if (PAGE_CACHE_SIZE == bsize)
+ return 0;
+ /* Currently last page will not have any hole block created? */
+ rounded_i_size = (inode->i_size + bsize - 1) & ~(bsize - 1);
+ if (pos <= rounded_i_size || !(rounded_i_size & (PAGE_CACHE_SIZE - 1)))
+ return 0;
+ /*
+ * Check the mutex here so that we don't warn on things like blockdev
+ * writes which have different locking rules...
+ */
+ WARN_ON(!mutex_is_locked(&inode->i_mutex));
+ spin_lock(&inode_lock);
+ /*
+ * From now on, block_page_mkwrite() will block on the page straddling
+ * i_size. Note that the page on which it blocks changes with the
+ * change of i_size but that is fine since when new i_size is written
+ * blocks for the hole will be allocated.
+ */
+ inode->i_state |= I_HOLE_EXTEND;
+ spin_unlock(&inode_lock);
+
+ /*
+ * Make sure page_mkwrite() is called on this page before
+ * user is able to write any data beyond current i_size via
+ * mmap.
+ *
+ * See clear_page_dirty_for_io() for details why set_page_dirty()
+ * is needed.
+ */
+ index = inode->i_size >> PAGE_CACHE_SHIFT;
+ page = find_lock_page(inode->i_mapping, index);
+ if (!page)
+ return 1;
+ if (page_mkclean(page))
+ set_page_dirty(page);
+ /* Zero dirty bits beyond current i_size */
+ if (page_has_buffers(page)) {
+ bh = head = page_buffers(page);
+ last_block = (inode->i_size - 1) >> inode->i_blkbits;
+ block = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ do {
+ if (block > last_block)
+ clear_buffer_dirty(bh);
+ bh = bh->b_this_page;
+ block++;
+ } while (bh != head);
+ }
+ unlock_page(page);
+ page_cache_release(page);
+ return 1;
+}
+EXPORT_SYMBOL(block_lock_hole_extend);
+
+/* New i_size creating hole has been written, unlock the inode */
+void block_unlock_hole_extend(struct inode *inode)
+{
+ /*
+ * We want to clear the flag we could have set previously. Noone else
+ * can change the flag so lockless read is reliable.
+ */
+ if (inode->i_state & I_HOLE_EXTEND) {
+ spin_lock(&inode_lock);
+ inode->i_state &= ~I_HOLE_EXTEND;
+ spin_unlock(&inode_lock);
+ /* Prevent speculative execution through spin_unlock */
+ smp_mb();
+ wake_up_bit(&inode->i_state, __I_HOLE_EXTEND);
+ }
+}
+EXPORT_SYMBOL(block_unlock_hole_extend);
+
+void block_extend_i_size(struct inode *inode, loff_t pos, loff_t len)
+{
+ int locked;
+
+ locked = block_lock_hole_extend(inode, pos + len);
+ i_size_write(inode, pos + len);
+ if (locked)
+ block_unlock_hole_extend(inode);
+}
+EXPORT_SYMBOL(block_extend_i_size);
+
+int block_wait_on_hole_extend(struct inode *inode, loff_t pos)
+{
+ loff_t size;
+ int ret = 0;
+
+restart:
+ size = i_size_read(inode);
+ if (pos > size)
+ return -EINVAL;
+ if (pos + PAGE_CACHE_SIZE < size)
+ return ret;
+ /*
+ * This page contains EOF; make sure we see i_state from the moment
+ * after page table modification
+ */
+ smp_rmb();
+ if (inode->i_state & I_HOLE_EXTEND) {
+ wait_queue_head_t *wqh;
+ DEFINE_WAIT_BIT(wqb, &inode->i_state, __I_HOLE_EXTEND);
+
+ wqh = bit_waitqueue(&inode->i_state, __I_HOLE_EXTEND);
+ __wait_on_bit(wqh, &wqb, inode_wait, TASK_UNINTERRUPTIBLE);
+ ret = 1;
+ goto restart;
+ }
+ return ret;
+}
+EXPORT_SYMBOL(block_wait_on_hole_extend);
+
+/*
* block_page_mkwrite() is not allowed to change the file size as it gets
* called from a page fault handler when a page is first dirtied. Hence we must
* be careful to check for EOF conditions here. We set the page up correctly
@@ -2392,6 +2528,13 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
loff_t size;
int ret = VM_FAULT_NOPAGE; /* make the VM retry the fault */

+ block_wait_on_hole_extend(inode, page_offset(page));
+ /*
+ * From this moment on a write creating a hole can happen
+ * without us waiting for it. But because it writeprotects
+ * the page, user cannot really write to the page until next
+ * page_mkwrite() is called. And that one will wait.
+ */
lock_page(page);
size = i_size_read(inode);
if ((page->mapping != inode->i_mapping) ||
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 16ed028..56a0162 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -219,6 +219,10 @@ int cont_write_begin(struct file *, struct address_space *, loff_t,
get_block_t *, loff_t *);
int generic_cont_expand_simple(struct inode *inode, loff_t size);
int block_commit_write(struct page *page, unsigned from, unsigned to);
+int block_lock_hole_extend(struct inode *inode, loff_t pos);
+void block_unlock_hole_extend(struct inode *inode);
+int block_wait_on_hole_extend(struct inode *inode, loff_t pos);
+void block_extend_i_size(struct inode *inode, loff_t pos, loff_t len);
int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
get_block_t get_block);
void block_sync_page(struct page *);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3b534e5..7cbb0c2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -580,7 +580,7 @@ struct address_space_operations {
int (*write_end)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);
-
+ void (*extend_i_size)(struct inode *, loff_t pos, loff_t len);
/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidatepage) (struct page *, unsigned long);
@@ -597,6 +597,8 @@ struct address_space_operations {
unsigned long);
};

+void do_extend_i_size(struct inode *inode, loff_t pos, loff_t len);
+
/*
* pagecache_write_begin/pagecache_write_end must be used by general code
* to write into the pagecache.
@@ -1590,7 +1592,8 @@ struct super_operations {
* until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
* various stages of removing an inode.
*
- * Two bits are used for locking and completion notification, I_LOCK and I_SYNC.
+ * Three bits are used for locking and completion notification, I_LOCK,
+ * I_HOLE_EXTEND and I_SYNC.
*
* I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on
* fdatasync(). i_atime is the usual cause.
@@ -1628,6 +1631,8 @@ struct super_operations {
* of inode dirty data. Having a separate lock for this
* purpose reduces latency and prevents some filesystem-
* specific deadlocks.
+ * I_HOLE_EXTEND A lock synchronizing extension of a file which creates
+ * a hole under a mmapped page with page_mkwrite().
*
* Q: What is the difference between I_WILL_FREE and I_FREEING?
* Q: igrab() only checks on (I_FREEING|I_WILL_FREE). Should it also check on
@@ -1644,6 +1649,8 @@ struct super_operations {
#define I_LOCK (1 << __I_LOCK)
#define __I_SYNC 8
#define I_SYNC (1 << __I_SYNC)
+#define __I_HOLE_EXTEND 9
+#define I_HOLE_EXTEND (1 << __I_HOLE_EXTEND)

#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

diff --git a/mm/filemap.c b/mm/filemap.c
index 379ff0b..a227174 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2079,6 +2079,14 @@ int pagecache_write_end(struct file *file, struct address_space *mapping,
}
EXPORT_SYMBOL(pagecache_write_end);

+void do_extend_i_size(struct inode *inode, loff_t pos, loff_t len)
+{
+ if (inode->i_mapping->a_ops->extend_i_size)
+ inode->i_mapping->a_ops->extend_i_size(inode, pos, len);
+ else
+ i_size_write(inode, pos + len);
+}
+
ssize_t
generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long *nr_segs, loff_t pos, loff_t *ppos,
@@ -2139,7 +2147,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
if (written > 0) {
loff_t end = pos + written;
if (end > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
- i_size_write(inode, end);
+ do_extend_i_size(inode, pos, written);
mark_inode_dirty(inode);
}
*ppos = end;
diff --git a/mm/memory.c b/mm/memory.c
index 4126dd1..535183d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2377,7 +2377,7 @@ int vmtruncate(struct inode * inode, loff_t offset)
goto out_sig;
if (offset > inode->i_sb->s_maxbytes)
goto out_big;
- i_size_write(inode, offset);
+ do_extend_i_size(inode, offset, 0);
} else {
struct address_space *mapping = inode->i_mapping;

--
1.6.0.2


2009-05-27 14:23:41

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/11] Fix page_mkwrite() for blocksize < pagesize

On Wed, May 27, 2009 at 03:00:57PM +0200, Jan Kara wrote:
>
> The series is against 2.6.30-rc7. The first two patches are just
> small cleanup and should be merged separately (Ted should have the
> ext4 cleanup rebased on top of current ext4 tree).

So I can take the cleanup (which will be different given the changes
that are ready to get merged once the merge window opens), but it
looks like the second ext4 patch can't go in until we get
agreement that changes to the VFS layer should go in, correct? So
we'll have a merge order dependency here.

(And this one will be the 2nd merge order dependency we have in the
ext4 tree, since I also have some patches in the unstable portion of
the tree which are waiting for some tracing patches to go in.)

OK, what I'll probably do is to merge the core VFS patch plus the 2nd
ext4 patch into the ext4 patch queue in the unstable portion of the
queue for testing purposes, but I won't actually push the VFS core
changes. We can figure out later who actually pushes the ext4 patch
which utilizes the new VFS infrastructure.

- Ted

2009-05-27 14:30:15

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 05/11] ext4: Make sure blocks are properly allocated under mmaped page even when blocksize < pagesize

On Wed, May 27, 2009 at 03:01:02PM +0200, Jan Kara wrote:
> In a situation like:
> truncate(f, 1024);
> a = mmap(f, 0, 4096);
> a[0] = 'a';
> truncate(f, 4096);
>
> we end up with a dirty page which does not have all blocks allocated /
> reserved. Fix the problem by using new VFS infrastructure.
>
> Signed-off-by: Jan Kara <[email protected]>

Hi Jan,

Have you tested with -o nodelalloc? There is apparently a problem
with ext4 when blocksize < pagesize which Aneesh has been working on.
He has been able to reproduce the problem, and theorized that an
earlier version of your patch set would address the problem, but it
apparently did not. See:

http://bugzilla.kernel.org/show_bug.cgi?id=13369

... and Aneesh can provide more details. Aneesh, you might want to
try testing with this latest set and see if you can reproduce the
problem with this set.

- Ted

2009-05-27 14:52:11

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 05/11] ext4: Make sure blocks are properly allocated under mmaped page even when blocksize < pagesize

On Wed 27-05-09 10:30:06, Theodore Tso wrote:
> On Wed, May 27, 2009 at 03:01:02PM +0200, Jan Kara wrote:
> > In a situation like:
> > truncate(f, 1024);
> > a = mmap(f, 0, 4096);
> > a[0] = 'a';
> > truncate(f, 4096);
> >
> > we end up with a dirty page which does not have all blocks allocated /
> > reserved. Fix the problem by using new VFS infrastructure.
> >
> > Signed-off-by: Jan Kara <[email protected]>
>
> Hi Jan,
>
> Have you tested with -o nodelalloc? There is apparently a problem
> with ext4 when blocksize < pagesize which Aneesh has been working on.
> He has been able to reproduce the problem, and theorized that an
> earlier version of your patch set would address the problem, but it
> apparently did not. See:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=13369
>
> ... and Aneesh can provide more details. Aneesh, you might want to
> try testing with this latest set and see if you can reproduce the
> problem with this set.
Yes, I've exchanged a few emails with Aneesh privately yesterday and
today. There were bugs in my original patch set which should now be fixed.
Hopefully it also fixes the bug Aneesh sees (it's definitely the type of
bug this patch set should solve, and if it does not, I should fix it ;).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-05-27 14:59:10

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/11] Fix page_mkwrite() for blocksize < pagesize

On Wed 27-05-09 10:23:32, Theodore Tso wrote:
> On Wed, May 27, 2009 at 03:00:57PM +0200, Jan Kara wrote:
> >
> > The series is against 2.6.30-rc7. The first two patches are just
> > small cleanup and should be merged separately (Ted should have the
> > ext4 cleanup rebased on top of current ext4 tree).
>
> So I can take the cleanup (which will be different given the changes
> that are ready to get merged once the merge window opens), but it
Yesterday I sent you the version of the cleanup which should apply to the
latest patch queue. I think you can merge it right away.
In fact, this series should not depend on the ext4 cleanup at all; I've
just included it for completeness along with the ext3 cleanup of the same
problem (and the patch series depends on the ext3 cleanup).

> looks like the second ext4 patch can't go in until we get
> agreement that changes to the VFS layer should go in, correct? So
> we'll have a merge order dependency here.
Yes, we have to agree on VFS changes before ext4 changes make sense.

> (And this one will be the 2nd merge order dependency we have in the
> ext4 tree, since I also have some patches in the unstable portion of
> the tree which are waiting for some tracing patches to go in.)
>
> OK, what I'll probably do is to merge the core VFS patch plus the 2nd
> ext4 patch into the ext4 patch queue in the unstable portion of the
> queue for testing purposes, but I won't actually push the VFS core
> changes. We can figure out later who actually pushes the ext4 patch
> which utilizes the new VFS infrastructure.
Yes, that would be fine. Thanks. Luckily, the dependency is just one way
(the ext4 change has to go after the VFS patch), so it's enough to merge
that ext4 change some time after the VFS changes go in.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-05-27 15:34:09

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 0/11] Fix page_mkwrite() for blocksize < pagesize

On Wed, May 27, 2009 at 03:00:57PM +0200, Jan Kara wrote:
>
> patches below are an attempt to solve problems filesystems have with
> page_mkwrite() when blocksize < pagesize (see the changelog of the third patch
> for details).
>
> The series is against 2.6.30-rc7. The first two patches are just small cleanup
> and should be merged separately (Ted should have the ext4 cleanup rebased on
> top of current ext4 tree). For ext3 the fix is done in two phases, in the first
> we make it to correctly allocate space at page-fault time from page_mkwrite().
> This has the disadvantage that under random mmaped writes, the file gets much
> more fragmented and performance of e.g. Berkeley DB drops by ~20%. Therefore
> in the second phase I've implemented delayed allocation for ext3 and blocks
> are just reserved during page_mkwrite time and really allocated only during
> writepage. This gets the performance back to original numbers for me.
>
> The patches should be fairly complete and sustained quite some testing. OTOH
> the area is kind of complex so please review them so that they can get merged.
> Thanks.

Can you move patch 7 and patch 11 to be the last two patches? That would
make sure we can push the rest of the patches earlier. The rest of the
patches are needed for ext4 to fix some of the bugs we are seeing, e.g.:
http://bugzilla.kernel.org/show_bug.cgi?id=12624
http://bugzilla.kernel.org/show_bug.cgi?id=13369

I also have a few writepage patches on top of your last series. Having
patch 7 and patch 11 as the last two patches makes sure we can get the rest
of the patches into the ext4 patch queue and get wider testing.

-aneesh

2009-05-27 15:36:10

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 08/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped

On Wed, May 27, 2009 at 03:01:05PM +0200, Jan Kara wrote:
> When we do delayed allocation of some buffer, we want to signal to VFS that
> the buffer is new (set buffer_new) so that it properly zeros out everything.
> But we don't have the buffer mapped yet so we cannot really unmap underlying
> metadata in this state. Make VFS avoid doing unmapping of metadata when the
> buffer is not yet mapped.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/buffer.c | 12 +++++++-----
> 1 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index ddfcade..3d0bced 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1683,8 +1683,9 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
> if (buffer_new(bh)) {
> /* blockdev mappings never come here */
> clear_buffer_new(bh);
> - unmap_underlying_metadata(bh->b_bdev,
> - bh->b_blocknr);
> + if (buffer_mapped(bh))
> + unmap_underlying_metadata(bh->b_bdev,
> + bh->b_blocknr);
> }
> }
> bh = bh->b_this_page;
> @@ -1869,8 +1870,9 @@ static int __block_prepare_write(struct inode *inode, struct page *page,
> if (err)
> break;
> if (buffer_new(bh)) {
> - unmap_underlying_metadata(bh->b_bdev,
> - bh->b_blocknr);
> + if (buffer_mapped(bh))
> + unmap_underlying_metadata(bh->b_bdev,
> + bh->b_blocknr);
> if (PageUptodate(page)) {
> clear_buffer_new(bh);
> set_buffer_uptodate(bh);
> @@ -2683,7 +2685,7 @@ int nobh_write_begin(struct file *file, struct address_space *mapping,
> goto failed;
> if (!buffer_mapped(bh))
> is_mapped_to_disk = 0;
> - if (buffer_new(bh))
> + if (buffer_new(bh) && buffer_mapped(bh))
> unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
> if (PageUptodate(page)) {
> set_buffer_uptodate(bh);

Both xfs and ext4 return a mapped delay buffer_head when we do a get_block
with delayed allocation in the write_begin phase.

-aneesh

2009-05-27 15:59:49

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH 03/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize

On Wed, May 27, 2009 at 03:01:00PM +0200, Jan Kara wrote:
> page_mkwrite() is meant to be used by filesystems to allocate blocks under a
> page which is becoming writably mmapped in some process' address space. This
> allows a filesystem to fail the page fault if there is not enough space
> available, the user exceeds quota, or a similar problem happens, rather than
> silently discarding data later when writepage() is called.
>
> On filesystems where blocksize < pagesize the situation is more complicated.
> Think for example that blocksize = 1024, pagesize = 4096 and a process does:
> ftruncate(fd, 0);
> pwrite(fd, buf, 1024, 0);
> map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
> map[0] = 'a'; ----> page_mkwrite() for index 0 is called
> ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
> fsync(fd); ----> writepage() for index 0 is called
>
> At the moment page_mkwrite() is called, the filesystem can allocate only one
> block for the page because i_size == 1024. Otherwise it would create blocks
> beyond i_size, which is generally undesirable. But later, at writepage() time,
> we would like to have blocks allocated for the whole page (and in principle we
> have to allocate them because the user could have filled the page with data
> after the second ftruncate()). This patch introduces a framework which allows
> filesystems to handle this with reasonable effort.
>
> The idea is the following: before we extend i_size, we obtain a special lock
> blocking page_mkwrite() on the page straddling i_size. Then we write-protect
> the page, change i_size and unlock the special lock. This way, page_mkwrite()
> is called for a page each time the number of blocks that need to be allocated
> for it increases.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/buffer.c | 143 +++++++++++++++++++++++++++++++++++++++++++
> include/linux/buffer_head.h | 4 +
> include/linux/fs.h | 11 +++-
> mm/filemap.c | 10 +++-
> mm/memory.c | 2 +-
> 5 files changed, 166 insertions(+), 4 deletions(-)
>
> [snip]
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 4126dd1..535183d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2377,7 +2377,7 @@ int vmtruncate(struct inode * inode, loff_t offset)
> goto out_sig;
> if (offset > inode->i_sb->s_maxbytes)
> goto out_big;
> - i_size_write(inode, offset);
> + do_extend_i_size(inode, offset, 0);
> } else {
> struct address_space *mapping = inode->i_mapping;
>

Sorry if I'm being a bit dense, I'm just kind of confused. In the case you
outlined, we only allocate one block, as we should, but then the truncate
extends i_size so that when writepage() comes along, it's valid to write the
whole page out, except we didn't allocate blocks for the whole page since
blocksize < pagesize.

But if we didn't extend i_size, writepage would only have written the block's
worth of data, correct? If that's the case, why don't we just make it so that
happens in this case as well? Technically only one part of the page is dirty,
so we just need to write that out, not the whole thing. I assume there is a way
to do this, since it presumably happens in the case where i_size < pagesize.
If that's not possible then I guess this way is a good answer, it just seems
overly complicated to me.

Also, on the actual patch: you only check the return value of
block_lock_hole_extend() in block_extend_i_size(), but since
block_unlock_hole_extend() doesn't really do anything if we didn't do anything
in block_lock_hole_extend(), you might as well make it void as well, since you
unconditionally lock/unlock in every other instance. Thanks,

Josef

2009-05-27 16:45:13

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 03/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize

On Wed, May 27, 2009 at 12:00:06PM -0400, Josef Bacik wrote:
> On Wed, May 27, 2009 at 03:01:00PM +0200, Jan Kara wrote:
.....
....

> > if (offset > inode->i_sb->s_maxbytes)
> > goto out_big;
> > - i_size_write(inode, offset);
> > + do_extend_i_size(inode, offset, 0);
> > } else {
> > struct address_space *mapping = inode->i_mapping;
> >
>
> Sorry if I'm being a bit dense, I'm just kind of confused. In the case you
> outlined, we only allocate one block, as we should, but then the truncate
> extends i_size so that when writepage() comes along, it's valid to write the
> whole page out, except we didn't allocate blocks for the whole page since
> blocksize < pagesize.

We can't do block allocation in writepage because we can't handle
ENOSPC during writepage. So the patch attempts to make sure we do
all block allocation in process context. For mmap we should
do it in page_mkwrite and for write in write_begin.
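
(For illustration, a minimal sketch with hypothetical myfs_* names: the
allocation happens in the page_mkwrite handler, which runs in process context
and can report failure, while writepage never allocates.)

	static int myfs_page_mkwrite(struct vm_area_struct *vma,
				     struct vm_fault *vmf)
	{
		/* Allocate blocks now; failure surfaces as a failed fault */
		return block_page_mkwrite(vma, vmf, myfs_get_block);
	}

	static struct vm_operations_struct myfs_file_vm_ops = {
		.fault		= filemap_fault,
		.page_mkwrite	= myfs_page_mkwrite,
	};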



>
> But if we didn't extend i_size, writepage would only have written the block's
> worth of data, correct? If that's the case, why don't we just make it so that
> happens in this case as well?

That is what is done in the patch I posted
http://article.gmane.org/gmane.comp.file-systems.ext4/13468

But that alone is not enough. We also need to make sure we get a
page fault when we try to write at another offset via a mmapped address so
that we can do block allocation via page_mkwrite. To do that, all code
paths that extend i_size should mark the page write-protected.


> Technically only one part of the page is dirty,
> so we just need to write that out, not the whole thing. I assume there is a way
> to do this, since it presumably happens in the case where i_size < pagesize.
> If that's not possible then I guess this way is a good answer, it just seems
> overly complicated to me.


__block_write_full_page() already does that; it looks at last_block.
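
(Roughly, the logic in question; a simplified excerpt, not verbatim:)

	last_block = (i_size_read(inode) - 1) >> inode->i_blkbits;
	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
	bh = head;
	do {
		if (block > last_block) {
			/* Buffer lies fully beyond i_size: don't write it */
			clear_buffer_dirty(bh);
			set_buffer_uptodate(bh);
		}
		bh = bh->b_this_page;
		block++;
	} while (bh != head);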

>
> Also, on the actualy patch, you only check the return value of
> block_lock_hole_extend in block_extend_i_size, but sinze
> block_unlock_hole_extend doesn't really do anything if we didn't do anything in
> block_lock_hole_extend, you might as well make it a void as well, since you
> unconditionally lock/unlock in every other instance. Thanks,
>
> Josef

-aneesh

2009-05-27 17:06:32

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH 03/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize

On Wed, May 27, 2009 at 10:15:04PM +0530, Aneesh Kumar K.V wrote:
> On Wed, May 27, 2009 at 12:00:06PM -0400, Josef Bacik wrote:
> > On Wed, May 27, 2009 at 03:01:00PM +0200, Jan Kara wrote:
> .....
> ....
>
> > > if (offset > inode->i_sb->s_maxbytes)
> > > goto out_big;
> > > - i_size_write(inode, offset);
> > > + do_extend_i_size(inode, offset, 0);
> > > } else {
> > > struct address_space *mapping = inode->i_mapping;
> > >
> >
> > Sorry if I'm being a bit dense, I'm just kind of confused. In the case you
> > outlined, we only allocate one block, as we should, but then the truncate
> > extends i_size so that when writepage() comes along, it's valid to write the
> > whole page out, except we didn't allocate blocks for the whole page since
> > blocksize < pagesize.
>
> We can't do block allocation in writepage because we can't handle
> ENOSPC during writepage. So the patch attempts to make sure we do
> all block allocation in process context. For mmap we should
> do it in page_mkwrite and for write in write_begin.
>

Right, I get that part, it's the next part that confused me.

> >
> > But if we didn't extend i_size, writepage would only have written the block's
> > worth of data, correct? If that's the case, why don't we just make it so that
> > happens in this case as well?
>
> That is what is done in the patch i posted
> http://article.gmane.org/gmane.comp.file-systems.ext4/13468
>
> But that alone is not enough. We also need to make sure we get a
> page fault when we try to write at another offset via a mmapped address so
> that we can do block allocation via page_mkwrite. To do that, all code
> paths that extend i_size should mark the page write-protected.
>

Thank you, this is where I was confused. I didn't think about the fact that
we'd not hit page_mkwrite() on that page anymore even though we would need to
allocate more blocks to actually write out the entire page.
>
> > Technically only one part of the page is dirty,
> > so we just need to write that out, not the whole thing. I assume there is a way
> > to do this, since it presumably happens in the case where i_size < pagesize.
> > If that's not possible then I guess this way is a good answer, it just seems
> > overly complicated to me.
>
>
> __block_write_full_page() already does that; it looks at last_block.
>

Gotcha. Thanks for explaining this to me,

Josef

2009-05-28 09:36:07

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/11] Fix page_mkwrite() for blocksize < pagesize

Hi,

On Wed 27-05-09 21:03:58, Aneesh Kumar K.V wrote:
> On Wed, May 27, 2009 at 03:00:57PM +0200, Jan Kara wrote:
> >
> > patches below are an attempt to solve problems filesystems have with
> > page_mkwrite() when blocksize < pagesize (see the changelog of the third patch
> > for details).
> >
> > The series is against 2.6.30-rc7. The first two patches are just small cleanup
> > and should be merged separately (Ted should have the ext4 cleanup rebased on
> > top of current ext4 tree). For ext3 the fix is done in two phases, in the first
> > we make it to correctly allocate space at page-fault time from page_mkwrite().
> > This has the disadvantage that under random mmaped writes, the file gets much
> > more fragmented and performance of e.g. Berkeley DB drops by ~20%. Therefore
> > in the second phase I've implemented delayed allocation for ext3 and blocks
> > are just reserved during page_mkwrite time and really allocated only during
> > writepage. This gets the performance back to original numbers for me.
> >
> > The patches should be fairly complete and sustained quite some testing. OTOH
> > the area is kind of complex so please review them so that they can get merged.
> > Thanks.
>
> Can you move patch 7 and patch 11 to be the last two patches? That would
> make sure we can push the rest of the patches earlier. The rest of the
> patches are needed for ext4 to fix some of the bugs we are seeing, e.g.:
> http://bugzilla.kernel.org/show_bug.cgi?id=12624
> http://bugzilla.kernel.org/show_bug.cgi?id=13369
OK, I think I can just move patches 3 and 5 to the beginning. That's all
that should be needed for ext4.

> I also have a few writepage patches on top of your last series. Having
> patch 7 and patch 11 as the last two patches makes sure we can get the rest
> of the patches into the ext4 patch queue and get wider testing.
Good point. I think Ted did the right thing and took just patches 3 and 5
into the ext4 patch queue, but I can reorder the patches next time...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-05-28 09:44:34

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 08/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped

On Wed 27-05-09 21:05:59, Aneesh Kumar K.V wrote:
> On Wed, May 27, 2009 at 03:01:05PM +0200, Jan Kara wrote:
> > When we do delayed allocation of some buffer, we want to signal to VFS that
> > the buffer is new (set buffer_new) so that it properly zeros out everything.
> > But we don't have the buffer mapped yet so we cannot really unmap underlying
> > metadata in this state. Make VFS avoid doing unmapping of metadata when the
> > buffer is not yet mapped.
> >
...
> > @@ -2683,7 +2685,7 @@ int nobh_write_begin(struct file *file, struct address_space *mapping,
> > goto failed;
> > if (!buffer_mapped(bh))
> > is_mapped_to_disk = 0;
> > - if (buffer_new(bh))
> > + if (buffer_new(bh) && buffer_mapped(bh))
> > unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
> > if (PageUptodate(page)) {
> > set_buffer_uptodate(bh);
>
> Both xfs and ext4 return a mapped delay buffer_head when we do a get_block
> with delayed allocation in the write_begin phase.
Yeah, I knew about ext4 doing this. Thanks for pointing this out. I
wanted to trigger a separate discussion about this and similar problems -
the current state of buffer bits is quite messy (I think Ted complained
about this as well recently) and we should somehow clean it up.
In this particular case: what's the point in returning the buffer mapped?
It does not make any sense logically (the block *does not* have any physical
location assigned) and technically you have to map it to some fake block
and later remap it correctly when you do block allocation. So maybe I'm
missing some good reason, but from what I can see, it just does not make
sense...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-05-28 10:15:58

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 08/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped

On Thu, May 28, 2009 at 11:44:34AM +0200, Jan Kara wrote:
> On Wed 27-05-09 21:05:59, Aneesh Kumar K.V wrote:
> > On Wed, May 27, 2009 at 03:01:05PM +0200, Jan Kara wrote:
> > > When we do delayed allocation of some buffer, we want to signal to VFS that
> > > the buffer is new (set buffer_new) so that it properly zeros out everything.
> > > But we don't have the buffer mapped yet so we cannot really unmap underlying
> > > metadata in this state. Make VFS avoid doing unmapping of metadata when the
> > > buffer is not yet mapped.
> > >
> ...
> > > @@ -2683,7 +2685,7 @@ int nobh_write_begin(struct file *file, struct address_space *mapping,
> > > goto failed;
> > > if (!buffer_mapped(bh))
> > > is_mapped_to_disk = 0;
> > > - if (buffer_new(bh))
> > > + if (buffer_new(bh) && buffer_mapped(bh))
> > > unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
> > > if (PageUptodate(page)) {
> > > set_buffer_uptodate(bh);
> >
> > Both xfs and ext4 return mapped delay buffer_head when we do a get_block
> > with delayed allocation in write_begin phase.
> Yeah, I knew about ext4 doing this. Thanks for pointing this out. I
> wanted to trigger a separate discussion about this and similar problems -
> the current state of buffer bits is quite messy (I think Ted complained
> about this as well recently) and we should somehow clean it up.
> In this particular case: what's the point in returning the buffer mapped?
> It does not make any sense logically (the block *does not* have any physical
> location assigned) and technically you have to map it to some fake block
> and later remap it correctly when you do block allocation. So maybe I'm
> missing some good reason, but from what I can see, it just does not make
> sense...

Marking it mapped makes sure we don't do multiple get_block calls for
every write. For each write, the write_begin path does a get_block
call only if the buffer is not mapped (__block_prepare_write has more
details).
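
A simplified sketch of the per-buffer logic in question (not the literal
kernel code, just its shape):

	static int prepare_one_buffer(struct inode *inode, sector_t block,
				      struct buffer_head *bh,
				      get_block_t *get_block)
	{
		/* A buffer that is already mapped -- even a delayed one
		 * that ext4/xfs marked mapped -- skips the fs call-out. */
		if (buffer_mapped(bh))
			return 0;
		return get_block(inode, block, bh, 1);	/* 1 => create */
	}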

-aneesh

2009-05-28 13:03:20

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH 03/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize

On Wed, May 27, 2009 at 03:01:00PM +0200, Jan Kara wrote:
> page_mkwrite() is meant to be used by filesystems to allocate blocks under a
> page which is becoming writeably mmapped in some process address space. This
> allows a filesystem to return a page fault if there is not enough space
> available, user exceeds quota or similar problem happens, rather than silently
> discarding data later when writepage is called.
>
> On filesystems where blocksize < pagesize the situation is more complicated.
> Think for example that blocksize = 1024, pagesize = 4096 and a process does:
> ftruncate(fd, 0);
> pwrite(fd, buf, 1024, 0);
> map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
> map[0] = 'a'; ----> page_mkwrite() for index 0 is called
> ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
> fsync(fd); ----> writepage() for index 0 is called
>
> At the moment page_mkwrite() is called, filesystem can allocate only one block
> for the page because i_size == 1024. Otherwise it would create blocks beyond
> i_size which is generally undesirable. But later at writepage() time, we would
> like to have blocks allocated for the whole page (and in principle we have to
> allocate them because user could have filled the page with data after the
> second ftruncate()). This patch introduces a framework which allows filesystems
> to handle this with a reasonable effort.
>
> The idea is following: Before we extend i_size, we obtain a special lock blocking
> page_mkwrite() on the page straddling i_size. Then we writeprotect the page,
> change i_size and unlock the special lock. This way, page_mkwrite() is called for
> a page each time a number of blocks needed to be allocated for a page increases.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/buffer.c | 143 +++++++++++++++++++++++++++++++++++++++++++
> include/linux/buffer_head.h | 4 +
> include/linux/fs.h | 11 +++-
> mm/filemap.c | 10 +++-
> mm/memory.c | 2 +-
> 5 files changed, 166 insertions(+), 4 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index aed2977..ddfcade 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -40,6 +40,7 @@
> #include <linux/cpu.h>
> #include <linux/bitops.h>
> #include <linux/mpage.h>
> +#include <linux/rmap.h>
> #include <linux/bit_spinlock.h>
>
> static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
> @@ -1970,9 +1971,11 @@ int block_write_begin(struct file *file, struct address_space *mapping,
> page = *pagep;
> if (page == NULL) {
> ownpage = 1;
> + block_lock_hole_extend(inode, pos);
> page = grab_cache_page_write_begin(mapping, index, flags);
> if (!page) {
> status = -ENOMEM;
> + block_unlock_hole_extend(inode);
> goto out;
> }
> *pagep = page;
> @@ -1987,6 +1990,7 @@ int block_write_begin(struct file *file, struct address_space *mapping,
> unlock_page(page);
> page_cache_release(page);
> *pagep = NULL;
> + block_unlock_hole_extend(inode);
>
> /*
> * prepare_write() may have instantiated a few blocks
> @@ -2062,6 +2066,7 @@ int generic_write_end(struct file *file, struct address_space *mapping,
>
> unlock_page(page);
> page_cache_release(page);
> + block_unlock_hole_extend(inode);
>
> /*
> * Don't mark the inode dirty under page lock. First, it unnecessarily
> @@ -2368,6 +2373,137 @@ int block_commit_write(struct page *page, unsigned from, unsigned to)
> }
>
> /*
> + * Lock inode with I_HOLE_EXTEND if the write is going to create a hole
> + * under a mmapped page. Also mark the page RO so that page_mkwrite()
> + * is called on the nearest write access to the page and clear dirty bits
> + * beyond i_size.
> + *
> > + * @pos is offset to which write/truncate is happening.
> + *
> + * Returns 1 if the lock has been acquired.
> + */
> +int block_lock_hole_extend(struct inode *inode, loff_t pos)
> +{
> + int bsize = 1 << inode->i_blkbits;
> + loff_t rounded_i_size;
> + struct page *page;
> + pgoff_t index;
> + struct buffer_head *head, *bh;
> + sector_t block, last_block;
> +
> + /* Optimize for common case */
> + if (PAGE_CACHE_SIZE == bsize)
> + return 0;
> + /* Currently last page will not have any hole block created? */
> + rounded_i_size = (inode->i_size + bsize - 1) & ~(bsize - 1);
> + if (pos <= rounded_i_size || !(rounded_i_size & (PAGE_CACHE_SIZE - 1)))
> + return 0;
> + /*
> + * Check the mutex here so that we don't warn on things like blockdev
> + * writes which have different locking rules...
> + */
> + WARN_ON(!mutex_is_locked(&inode->i_mutex));
> + spin_lock(&inode_lock);
> + /*
> + * From now on, block_page_mkwrite() will block on the page straddling
> + * i_size. Note that the page on which it blocks changes with the
> + * change of i_size but that is fine since when new i_size is written
> + * blocks for the hole will be allocated.
> + */
> + inode->i_state |= I_HOLE_EXTEND;
> + spin_unlock(&inode_lock);
> +
> + /*
> + * Make sure page_mkwrite() is called on this page before
> + * user is able to write any data beyond current i_size via
> + * mmap.
> + *
> + * See clear_page_dirty_for_io() for details why set_page_dirty()
> + * is needed.
> + */
> + index = inode->i_size >> PAGE_CACHE_SHIFT;
> + page = find_lock_page(inode->i_mapping, index);
> + if (!page)
> + return 1;
> + if (page_mkclean(page))
> + set_page_dirty(page);
> + /* Zero dirty bits beyond current i_size */
> + if (page_has_buffers(page)) {
> + bh = head = page_buffers(page);
> + last_block = (inode->i_size - 1) >> inode->i_blkbits;
> + block = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
> + do {
> + if (block > last_block)
> + clear_buffer_dirty(bh);
> + bh = bh->b_this_page;
> + block++;
> + } while (bh != head);
> + }
> + unlock_page(page);
> + page_cache_release(page);
> + return 1;
> +}
> +EXPORT_SYMBOL(block_lock_hole_extend);
> +
> +/* New i_size creating hole has been written, unlock the inode */
> +void block_unlock_hole_extend(struct inode *inode)
> +{
> + /*
> > + * We want to clear the flag we could have set previously. No one else
> + * can change the flag so lockless read is reliable.
> + */
> + if (inode->i_state & I_HOLE_EXTEND) {
> + spin_lock(&inode_lock);
> + inode->i_state &= ~I_HOLE_EXTEND;
> + spin_unlock(&inode_lock);
> + /* Prevent speculative execution through spin_unlock */
> + smp_mb();
> + wake_up_bit(&inode->i_state, __I_HOLE_EXTEND);
> + }
> +}
> +EXPORT_SYMBOL(block_unlock_hole_extend);
> +
> +void block_extend_i_size(struct inode *inode, loff_t pos, loff_t len)
> +{
> + int locked;
> +
> + locked = block_lock_hole_extend(inode, pos + len);
> + i_size_write(inode, pos + len);
> + if (locked)
> + block_unlock_hole_extend(inode);
> +}
> +EXPORT_SYMBOL(block_extend_i_size);
> +
> +int block_wait_on_hole_extend(struct inode *inode, loff_t pos)
> +{
> + loff_t size;
> + int ret = 0;
> +
> +restart:
> + size = i_size_read(inode);
> + if (pos > size)
> + return -EINVAL;
> + if (pos + PAGE_CACHE_SIZE < size)
> + return ret;
> + /*
> + * This page contains EOF; make sure we see i_state from the moment
> + * after page table modification
> + */
> + smp_rmb();
> + if (inode->i_state & I_HOLE_EXTEND) {
> + wait_queue_head_t *wqh;
> + DEFINE_WAIT_BIT(wqb, &inode->i_state, __I_HOLE_EXTEND);
> +
> + wqh = bit_waitqueue(&inode->i_state, __I_HOLE_EXTEND);
> + __wait_on_bit(wqh, &wqb, inode_wait, TASK_UNINTERRUPTIBLE);
> + ret = 1;
> + goto restart;
> + }
> + return ret;
> +}
> +EXPORT_SYMBOL(block_wait_on_hole_extend);
> +
> +/*
> * block_page_mkwrite() is not allowed to change the file size as it gets
> * called from a page fault handler when a page is first dirtied. Hence we must
> * be careful to check for EOF conditions here. We set the page up correctly
> @@ -2392,6 +2528,13 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> loff_t size;
> int ret = VM_FAULT_NOPAGE; /* make the VM retry the fault */
>
> + block_wait_on_hole_extend(inode, page_offset(page));
> + /*
> + * From this moment on a write creating a hole can happen
> + * without us waiting for it. But because it writeprotects
> + * the page, user cannot really write to the page until next
> + * page_mkwrite() is called. And that one will wait.
> + */
> lock_page(page);
> size = i_size_read(inode);
> if ((page->mapping != inode->i_mapping) ||
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 16ed028..56a0162 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -219,6 +219,10 @@ int cont_write_begin(struct file *, struct address_space *, loff_t,
> get_block_t *, loff_t *);
> int generic_cont_expand_simple(struct inode *inode, loff_t size);
> int block_commit_write(struct page *page, unsigned from, unsigned to);
> +int block_lock_hole_extend(struct inode *inode, loff_t pos);
> +void block_unlock_hole_extend(struct inode *inode);
> +int block_wait_on_hole_extend(struct inode *inode, loff_t pos);
> +void block_extend_i_size(struct inode *inode, loff_t pos, loff_t len);
> int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> get_block_t get_block);
> void block_sync_page(struct page *);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3b534e5..7cbb0c2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -580,7 +580,7 @@ struct address_space_operations {
> int (*write_end)(struct file *, struct address_space *mapping,
> loff_t pos, unsigned len, unsigned copied,
> struct page *page, void *fsdata);
> -
> + void (*extend_i_size)(struct inode *, loff_t pos, loff_t len);
> /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
> sector_t (*bmap)(struct address_space *, sector_t);
> void (*invalidatepage) (struct page *, unsigned long);
> @@ -597,6 +597,8 @@ struct address_space_operations {
> unsigned long);
> };
>
> +void do_extend_i_size(struct inode *inode, loff_t pos, loff_t len);
> +
> /*
> * pagecache_write_begin/pagecache_write_end must be used by general code
> * to write into the pagecache.
> @@ -1590,7 +1592,8 @@ struct super_operations {
> * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
> * various stages of removing an inode.
> *
> - * Two bits are used for locking and completion notification, I_LOCK and I_SYNC.
> + * Three bits are used for locking and completion notification, I_LOCK,
> + * I_HOLE_EXTEND and I_SYNC.
> *
> * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on
> * fdatasync(). i_atime is the usual cause.
> @@ -1628,6 +1631,8 @@ struct super_operations {
> * of inode dirty data. Having a separate lock for this
> * purpose reduces latency and prevents some filesystem-
> * specific deadlocks.
> + * I_HOLE_EXTEND A lock synchronizing extension of a file which creates
> + * a hole under a mmapped page with page_mkwrite().
> *
> * Q: What is the difference between I_WILL_FREE and I_FREEING?
> * Q: igrab() only checks on (I_FREEING|I_WILL_FREE). Should it also check on
> @@ -1644,6 +1649,8 @@ struct super_operations {
> #define I_LOCK (1 << __I_LOCK)
> #define __I_SYNC 8
> #define I_SYNC (1 << __I_SYNC)
> +#define __I_HOLE_EXTEND 9
> +#define I_HOLE_EXTEND (1 << __I_HOLE_EXTEND)
>
> #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 379ff0b..a227174 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2079,6 +2079,14 @@ int pagecache_write_end(struct file *file, struct address_space *mapping,
> }
> EXPORT_SYMBOL(pagecache_write_end);
>
> +void do_extend_i_size(struct inode *inode, loff_t pos, loff_t len)
> +{
> + if (inode->i_mapping->a_ops->extend_i_size)
> + inode->i_mapping->a_ops->extend_i_size(inode, pos, len);
> + else
> + i_size_write(inode, pos + len);
> +}
> +
> ssize_t
> generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
> unsigned long *nr_segs, loff_t pos, loff_t *ppos,
> @@ -2139,7 +2147,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
> if (written > 0) {
> loff_t end = pos + written;
> if (end > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
> - i_size_write(inode, end);
> + do_extend_i_size(inode, pos, written);
> mark_inode_dirty(inode);
> }
> *ppos = end;
> diff --git a/mm/memory.c b/mm/memory.c
> index 4126dd1..535183d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2377,7 +2377,7 @@ int vmtruncate(struct inode * inode, loff_t offset)
> goto out_sig;
> if (offset > inode->i_sb->s_maxbytes)
> goto out_big;
> - i_size_write(inode, offset);
> + do_extend_i_size(inode, offset, 0);
> } else {
> struct address_space *mapping = inode->i_mapping;
>
> --
> 1.6.0.2
>

Sorry, another possibly braindead question. When we extend the i_size, you
arrange for the page to be faulted on its next write via page_mkclean, which
from what I can tell is done via pte_wrprotect. The problem with this is
that on the next write to the page we do pte_mkwrite, which means we won't
fault on subsequent writes, correct? So if I were to do

ftruncate(fd, 0);
pwrite(fd, buf, 1024, 0);
map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
map[0] = 'a';
ftruncate(fd, 10000);
map[0] = 'b'; --> causes a page fault again
memset(map, 'a', 4096); --> wouldn't cause a pagefault
writepage at some point

We'd still be in the situation you are trying to solve, correct? We will
have dirtied the rest of the page without calling mkwrite for its other
sections, which would result in unallocated blocks when we hit writepage.
Thanks,

Josef

2009-05-28 13:10:19

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 03/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize

On Thu, May 28, 2009 at 09:03:41AM -0400, Josef Bacik wrote:
> On Wed, May 27, 2009 at 03:01:00PM +0200, Jan Kara wrote:
> >
>
> Sorry, another possibly braindead question. When we extend the i_size you make
> it so that the page will be faulted the next time it's written to via
> page_mkclean, which from what I can tell is done via pte_wrprotect. The problem
> with this is the next time we write to the page, we do pte_mkwrite, which makes
> it so that we won't fault the next time we write, correct? So if I were to do
>
> ftruncate(fd, 0);
> pwrite(fd, buf, 1024, 0);
> map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
> map[0] = 'a';
> ftruncate(fd, 10000);
> map[0] = 'b'; --> causes a page fault again


This will do block allocation for the entire page within i_size (i.e.
multiple blocks). ext4_page_mkwrite has:

	size = i_size_read(inode);
	if (page->index == size >> PAGE_CACHE_SHIFT)
		len = size & ~PAGE_CACHE_MASK;
	else
		len = PAGE_CACHE_SIZE;
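
So in Josef's sequence above, after ftruncate(fd, 10000) we have
i_size >> PAGE_CACHE_SHIFT == 2, the faulting page at index 0 no longer
straddles i_size, and len = PAGE_CACHE_SIZE -- i.e. all four 1k blocks
under that page are allocated at page_mkwrite() time.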


> memset(map, 'a', 4096); --> wouldn't cause a pagefault
> writepage at some point
>
> We'd still be in the position that you are trying to solve, correct, since we
> will have dirtied the rest of the page without calling mkwrite for the other
> sections of it, which would result in unallocated blocks when we hit writepage.
> Thanks,
>

-aneesh

2009-05-28 13:50:07

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 08/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped

On Thu 28-05-09 15:45:54, Aneesh Kumar K.V wrote:
> On Thu, May 28, 2009 at 11:44:34AM +0200, Jan Kara wrote:
> > On Wed 27-05-09 21:05:59, Aneesh Kumar K.V wrote:
> > > On Wed, May 27, 2009 at 03:01:05PM +0200, Jan Kara wrote:
> > > > When we do delayed allocation of some buffer, we want to signal to VFS that
> > > > the buffer is new (set buffer_new) so that it properly zeros out everything.
> > > > But we don't have the buffer mapped yet so we cannot really unmap underlying
> > > > metadata in this state. Make VFS avoid doing unmapping of metadata when the
> > > > buffer is not yet mapped.
> > > >
> > ...
> > > > @@ -2683,7 +2685,7 @@ int nobh_write_begin(struct file *file, struct address_space *mapping,
> > > > goto failed;
> > > > if (!buffer_mapped(bh))
> > > > is_mapped_to_disk = 0;
> > > > - if (buffer_new(bh))
> > > > + if (buffer_new(bh) && buffer_mapped(bh))
> > > > unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
> > > > if (PageUptodate(page)) {
> > > > set_buffer_uptodate(bh);
> > >
> > > Both xfs and ext4 return mapped delay buffer_head when we do a get_block
> > > with delayed allocation in write_begin phase.
> > Yeah, I knew about ext4 doing this. Thanks for pointing this out. I
> > wanted to trigger a separate discussion about this and similar problems -
> > the current state of buffer bits is quite messy (I think Ted complained
> > about this as well recently) and we should somehow clean it up.
> > In this particular case: What's the point in returning the buffer mapped?
> > It does not make any sense logically (the block *does not* have any physical
> > location assigned) and technically you have to map it to some fake block
> > and later remap it correctly when you do block allocation. So maybe I'm
> > missing some good reason but from what I can see, it just does not make
> > sense...
>
> Marking it mapped makes sure we don't do multiple get_block calls for
> every write. For each write, the write_begin path does a get_block
> call only if the buffer is not mapped (__block_prepare_write has more
> details).
Thanks for the explanation. But you can always return immediately from
get_block() when you see that the buffer has buffer_delay() set, if you do
not want to do an allocation from there (e.g. in ext3 I want to do an
allocation in such a case for compatibility reasons, but that's a separate
story). Alternatively, if the get_block() call + immediate return bothers
you, we could introduce something like AOP_FLAG_DELALLOC and make
__block_prepare_write call get_block only if the buffer is !mapped and
!delay when this flag is set. IMO that would probably be best. What do you
think?
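
A rough sketch of the two options (AOP_FLAG_DELALLOC is only a proposal
here, it does not exist yet):

	/* Option 1: the filesystem's get_block() bails out early when it
	 * sees a delayed buffer and does not want to allocate from the
	 * write_begin path. */
	static int da_get_block(struct inode *inode, sector_t iblock,
				struct buffer_head *bh, int create)
	{
		if (buffer_delay(bh))
			return 0;	/* space was already reserved */
		/* normal block lookup / allocation would go here */
		return 0;
	}

	/* Option 2: teach __block_prepare_write() to skip the call-out
	 * itself when the proposed flag is passed in. */
	static int buffer_needs_get_block(struct buffer_head *bh,
					  unsigned int flags)
	{
		if (buffer_mapped(bh))
			return 0;
		if ((flags & AOP_FLAG_DELALLOC) && buffer_delay(bh))
			return 0;
		return 1;
	}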

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-05-30 11:23:38

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 03/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize

Hi!

> On filesystems where blocksize < pagesize the situation is more complicated.
> Think for example that blocksize = 1024, pagesize = 4096 and a process does:
> ftruncate(fd, 0);
> pwrite(fd, buf, 1024, 0);
> map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
> map[0] = 'a'; ----> page_mkwrite() for index 0 is called
> ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
> fsync(fd); ----> writepage() for index 0 is called
>
> At the moment page_mkwrite() is called, filesystem can allocate only one block
> for the page because i_size == 1024. Otherwise it would create blocks beyond
> i_size which is generally undesirable. But later at writepage() time, we would
> like to have blocks allocated for the whole page (and in principle we have to
> allocate them because user could have filled the page with data after the
> second ftruncate()). This patch introduces a framework which allows filesystems
> to handle this with a reasonable effort.

What happens when you do the above sequence on today's kernels? An oops?
3000 bytes of random junk in the file? ...?


--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-06-01 09:44:01

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 03/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize

On Sat 30-05-09 13:23:24, Pavel Machek wrote:
> Hi!
>
> > On filesystems where blocksize < pagesize the situation is more complicated.
> > Think for example that blocksize = 1024, pagesize = 4096 and a process does:
> > ftruncate(fd, 0);
> > pwrite(fd, buf, 1024, 0);
> > map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
> > map[0] = 'a'; ----> page_mkwrite() for index 0 is called
> > ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
> > fsync(fd); ----> writepage() for index 0 is called
> >
> > At the moment page_mkwrite() is called, filesystem can allocate only one block
> > for the page because i_size == 1024. Otherwise it would create blocks beyond
> > i_size which is generally undesirable. But later at writepage() time, we would
> > like to have blocks allocated for the whole page (and in principle we have to
> > allocate them because user could have filled the page with data after the
> > second ftruncate()). This patch introduces a framework which allows filesystems
> > to handle this with a reasonable effort.
>
> What happens when you do the above sequence on today's kernels? An oops?
> 3000 bytes of random junk in the file? ...?
Depends on the filesystem. For example on ext4, you'll see a WARN_ON and the
data won't be written. Some filesystems may just try to map blocks and
possibly hit a deadlock or something like that. Filesystems like ext2 / ext3 /
reiserfs generally don't care because so far they allocate blocks at writepage
time (which has the problem that you can write data via mmap and the kernel
will later discard it because it hits ENOSPC or a quota limit). That's
actually what I was trying to fix originally.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-06-01 11:33:09

by Goswin von Brederlow

[permalink] [raw]
Subject: Re: [PATCH 03/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize

Jan Kara <[email protected]> writes:

> On Sat 30-05-09 13:23:24, Pavel Machek wrote:
>> Hi!
>>
>> > On filesystems where blocksize < pagesize the situation is more complicated.
>> > Think for example that blocksize = 1024, pagesize = 4096 and a process does:
>> > ftruncate(fd, 0);
>> > pwrite(fd, buf, 1024, 0);
>> > map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
>> > map[0] = 'a'; ----> page_mkwrite() for index 0 is called
>> > ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
>> > fsync(fd); ----> writepage() for index 0 is called
>> >
>> > At the moment page_mkwrite() is called, filesystem can allocate only one block
>> > for the page because i_size == 1024. Otherwise it would create blocks beyond
>> > i_size which is generally undesirable. But later at writepage() time, we would
>> > like to have blocks allocated for the whole page (and in principle we have to
>> > allocate them because user could have filled the page with data after the
>> > second ftruncate()). This patch introduces a framework which allows filesystems
>> > to handle this with a reasonable effort.
>>
>> What happens when you do above sequence on today's kernels? Oops? 3000
>> bytes of random junk in file? ...?
> Depends on the filesystem. For example on ext4, you'll see a WARN_ON and the data
> won't be written. Some filesystems may just try to map blocks and possibly
> hit deadlock or something like that. Filesystems like ext2 / ext3 /
> reiserfs generally don't care because so far they allocate blocks on writepage
> time (which has the problem that you can write data via mmap and kernel
> will later discard them because it hits ENOSPC or quota limit). That's
> actually what I was trying to fix originally.
>
> Honza

man mmap:
A file is mapped in multiples of the page size. For a file that is not
a multiple of the page size, the remaining memory is zeroed when
mapped, and writes to that region are not written out to the file. The
effect of changing the size of the underlying file of a mapping on the
pages that correspond to added or removed regions of the file is
unspecified.

Whatever happens happens. The above code is just wrong, as in
unspecified behaviour.

What happens if you ftruncate() before mmap()?

MfG
Goswin

2009-06-01 14:00:30

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 03/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize

On Mon 01-06-09 13:33:08, Goswin von Brederlow wrote:
> Jan Kara <[email protected]> writes:
>
> > On Sat 30-05-09 13:23:24, Pavel Machek wrote:
> >> Hi!
> >>
> >> > On filesystems where blocksize < pagesize the situation is more complicated.
> >> > Think for example that blocksize = 1024, pagesize = 4096 and a process does:
> >> > ftruncate(fd, 0);
> >> > pwrite(fd, buf, 1024, 0);
> >> > map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
> >> > map[0] = 'a'; ----> page_mkwrite() for index 0 is called
> >> > ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
> >> > fsync(fd); ----> writepage() for index 0 is called
> >> >
> >> > At the moment page_mkwrite() is called, filesystem can allocate only one block
> >> > for the page because i_size == 1024. Otherwise it would create blocks beyond
> >> > i_size which is generally undesirable. But later at writepage() time, we would
> >> > like to have blocks allocated for the whole page (and in principle we have to
> >> > allocate them because user could have filled the page with data after the
> >> > second ftruncate()). This patch introduces a framework which allows filesystems
> >> > to handle this with a reasonable effort.
> >>
> >> What happens when you do above sequence on today's kernels? Oops? 3000
> >> bytes of random junk in file? ...?
> > Depends on the filesystem. For example on ext4, you'll see a WARN_ON and the data
> > won't be written. Some filesystems may just try to map blocks and possibly
> > hit deadlock or something like that. Filesystems like ext2 / ext3 /
> > reiserfs generally don't care because so far they allocate blocks on writepage
> > time (which has the problem that you can write data via mmap and kernel
> > will later discard them because it hits ENOSPC or quota limit). That's
> > actually what I was trying to fix originally.
> >
> > Honza
>
> man mmap:
> A file is mapped in multiples of the page size. For a file that is not
> a multiple of the page size, the remaining memory is zeroed when
> mapped, and writes to that region are not written out to the file. The
> effect of changing the size of the underlying file of a mapping on the
> pages that correspond to added or removed regions of the file is
> unspecified.
>
> Whatever happens happens. The above code is just wrong, as in
> unspecified behaviour.
> What happens if you ftruncate() before mmap()?
OK, I admit I didn't realize mmap() has such weak requirements. Doing mmap
after ftruncate() should work fine because page_mkwrite() will be called
before you write via that new mapping anyway.
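
I.e. something like the following ordering is fine (a userspace sketch of
what we are discussing):

	ftruncate(fd, 10000);
	map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
	map[0] = 'a';	/* faults; page_mkwrite() runs with i_size == 10000
			 * and can allocate all blocks under the page */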
So what we could do alternatively is just discard dirty bits from
buffers that don't have underlying blocks allocated. That would satisfy the
specification as well. But I have to say I'm a bit afraid of discarding
dirty bits like that. Also we'd have to handle the case where someone does
mremap() after ftruncate().
What do other memory management people think?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-06-01 14:46:28

by Goswin von Brederlow

[permalink] [raw]
Subject: Re: [PATCH 03/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize

Jan Kara <[email protected]> writes:

> On Mon 01-06-09 13:33:08, Goswin von Brederlow wrote:
>> Jan Kara <[email protected]> writes:
>>
>> > On Sat 30-05-09 13:23:24, Pavel Machek wrote:
>> >> Hi!
>> >>
>> >> > On filesystems where blocksize < pagesize the situation is more complicated.
>> >> > Think for example that blocksize = 1024, pagesize = 4096 and a process does:
>> >> > ftruncate(fd, 0);
>> >> > pwrite(fd, buf, 1024, 0);
>> >> > map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
>> >> > map[0] = 'a'; ----> page_mkwrite() for index 0 is called
>> >> > ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
>> >> > fsync(fd); ----> writepage() for index 0 is called
>> >> >
>> >> > At the moment page_mkwrite() is called, filesystem can allocate only one block
>> >> > for the page because i_size == 1024. Otherwise it would create blocks beyond
>> >> > i_size which is generally undesirable. But later at writepage() time, we would
>> >> > like to have blocks allocated for the whole page (and in principle we have to
>> >> > allocate them because user could have filled the page with data after the
>> >> > second ftruncate()). This patch introduces a framework which allows filesystems
>> >> > to handle this with a reasonable effort.
>> >>
>> >> What happens when you do above sequence on today's kernels? Oops? 3000
>> >> bytes of random junk in file? ...?
>> > Depends on the filesystem. For example on ext4, you'll see a WARN_ON and the data
>> > won't be written. Some filesystems may just try to map blocks and possibly
>> > hit deadlock or something like that. Filesystems like ext2 / ext3 /
>> > reiserfs generally don't care because so far they allocate blocks on writepage
>> > time (which has the problem that you can write data via mmap and kernel
>> > will later discard them because it hits ENOSPC or quota limit). That's
>> > actually what I was trying to fix originally.
>> >
>> > Honza
>>
>> man mmap:
>> A file is mapped in multiples of the page size. For a file that is not
>> a multiple of the page size, the remaining memory is zeroed when
>> mapped, and writes to that region are not written out to the file. The
>> effect of changing the size of the underlying file of a mapping on the
>> pages that correspond to added or removed regions of the file is
>> unspecified.
>>
>> Whatever happens happens. The above code is just wrong, as in
>> unspecified behaviour.
>> What happens if you ftruncate() before mmap()?
> OK, I admit I didn't realize mmap() has so weak requirements. Doing mmap
> after ftruncate() should work fine because before you write via that new
> mmap page_mkwrite() will be called anyway.

But the ftruncate would only allocate a block at position 10000. The
file still has a big hole from 1024-4095.

> So what we could alternatively do is that we just discard dirty bits from
> buffers that don't have underlying blocks allocated. That would satisfy the
> specification as well. But I have to say I'm a bit afraid of discarding
> dirty bits like that. Also we'd have to handle the case where someone does
> mremap() after ftruncate().
> What other memory management people think?

As said above, the file still has a big hole after ftruncate. So not
having underlying blocks allocated can't be the deciding factor.

If possible I would make ftruncate() after mmap() work just like
ftruncate() before mmap(). That is, write any dirty page completely up
to the current file size. Allocate disk blocks for the file as needed
(you need to do that anyway). Wouldn't it be more work to remember the
file size at the time of the mmap(), to limit updates to that, than using
the current file size?

MfG
Goswin

2009-06-01 15:02:17

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 03/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize

On Mon 01-06-09 16:46:28, Goswin von Brederlow wrote:
> Jan Kara <[email protected]> writes:
> > On Mon 01-06-09 13:33:08, Goswin von Brederlow wrote:
> >> Jan Kara <[email protected]> writes:
> >>
> >> > On Sat 30-05-09 13:23:24, Pavel Machek wrote:
> >> >> Hi!
> >> >>
> >> >> > On filesystems where blocksize < pagesize the situation is more complicated.
> >> >> > Think for example that blocksize = 1024, pagesize = 4096 and a process does:
> >> >> > ftruncate(fd, 0);
> >> >> > pwrite(fd, buf, 1024, 0);
> >> >> > map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
> >> >> > map[0] = 'a'; ----> page_mkwrite() for index 0 is called
> >> >> > ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
> >> >> > fsync(fd); ----> writepage() for index 0 is called
> >> >> >
> >> >> > At the moment page_mkwrite() is called, filesystem can allocate only one block
> >> >> > for the page because i_size == 1024. Otherwise it would create blocks beyond
> >> >> > i_size which is generally undesirable. But later at writepage() time, we would
> >> >> > like to have blocks allocated for the whole page (and in principle we have to
> >> >> > allocate them because user could have filled the page with data after the
> >> >> > second ftruncate()). This patch introduces a framework which allows filesystems
> >> >> > to handle this with a reasonable effort.
> >> >>
> >> >> What happens when you do above sequence on today's kernels? Oops? 3000
> >> >> bytes of random junk in file? ...?
> >> > Depends on the filesystem. For example on ext4, you'll see a WARN_ON and the data
> >> > won't be written. Some filesystems may just try to map blocks and possibly
> >> > hit deadlock or something like that. Filesystems like ext2 / ext3 /
> >> > reiserfs generally don't care because so far they allocate blocks on writepage
> >> > time (which has the problem that you can write data via mmap and kernel
> >> > will later discard them because it hits ENOSPC or quota limit). That's
> >> > actually what I was trying to fix originally.
> >> >
> >> > Honza
> >>
> >> man mmap:
> >> A file is mapped in multiples of the page size. For a file that is not
> >> a multiple of the page size, the remaining memory is zeroed when
> >> mapped, and writes to that region are not written out to the file. The
> >> effect of changing the size of the underlying file of a mapping on the
> >> pages that correspond to added or removed regions of the file is
> >> unspecified.
> >>
> >> Whatever happens happens. The above code is just wrong, as in
> >> unspecified behaviour.
> >> What happens if you ftruncate() before mmap()?
> > OK, I admit I didn't realize mmap() has so weak requirements. Doing mmap
> > after ftruncate() should work fine because before you write via that new
> > mmap page_mkwrite() will be called anyway.
>
> But the ftruncate would only allocate a block at position 10000. The
> file still has a big hole from 1024-4095.
ftruncate() actually allocates no blocks. It just updates the file size (at
least for most filesystems). The hole remains wherever you do not write.

> > So what we could alternatively do is that we just discard dirty bits from
> > buffers that don't have underlying blocks allocated. That would satisfy the
> > specification as well. But I have to say I'm a bit afraid of discarding
> > dirty bits like that. Also we'd have to handle the case where someone does
> > mremap() after ftruncate().
> > What other memory management people think?
>
> As said above the file still has a big hole after ftruncate. So not
> having underlying blocks allocated can't be the deciding factor.
I'm not sure I understand here. Do you mean that we should not decide
about discarding dirty bits depending on whether the buffers have
underlying blocks or not? In my opinion that would be a correct option
because, from what the man page says, the user has no guarantee about what
happens when the file size is extended to 10000 and he tries to write from
offset 1024 (the old i_size) onwards... Anyway, I'm not defending doing
that :-) I'm just trying to understand what you mean.

> If possible I would make ftruncate() after mmap() work just like
> ftruncate() before mmap(). That is, write any dirty page completely up
> to the current file size. Allocate disk blocks for the file as needed
> (you need to do that anyway). Wouldn't it be more work to remember the
This is the thing my patches try to achieve :). At truncate time (or
generally just before i_size is extended), we check the last page and
propagate dirty bits to buffers inside the old i_size (and clear the ones
beyond the old i_size). We also writeprotect the page so that if someone
tries to write to it via mmap in the future, we get a page fault,
page_mkwrite() is called, and the filesystem allocates all blocks it needs
with the new i_size.

> filesize at the time of the mmap() to limit updates to that than using
> the current file size?
Yes, that would be more work. But I never intended to do this...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-06-01 15:35:49

by Goswin von Brederlow

[permalink] [raw]
Subject: Re: [PATCH 03/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize

Jan Kara <[email protected]> writes:

> On Mon 01-06-09 16:46:28, Goswin von Brederlow wrote:
>> Jan Kara <[email protected]> writes:
>> > On Mon 01-06-09 13:33:08, Goswin von Brederlow wrote:
>> >> Jan Kara <[email protected]> writes:
>> >>
>> >> > On Sat 30-05-09 13:23:24, Pavel Machek wrote:
>> >> >> Hi!
>> >> >>
>> >> >> > On filesystems where blocksize < pagesize the situation is more complicated.
>> >> >> > Think for example that blocksize = 1024, pagesize = 4096 and a process does:
>> >> >> > ftruncate(fd, 0);
>> >> >> > pwrite(fd, buf, 1024, 0);
>> >> >> > map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
>> >> >> > map[0] = 'a'; ----> page_mkwrite() for index 0 is called
>> >> >> > ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
>> >> >> > fsync(fd); ----> writepage() for index 0 is called
>> >> >> >
>> >> >> > At the moment page_mkwrite() is called, filesystem can allocate only one block
>> >> >> > for the page because i_size == 1024. Otherwise it would create blocks beyond
>> >> >> > i_size which is generally undesirable. But later at writepage() time, we would
>> >> >> > like to have blocks allocated for the whole page (and in principle we have to
>> >> >> > allocate them because user could have filled the page with data after the
>> >> >> > second ftruncate()). This patch introduces a framework which allows filesystems
>> >> >> > to handle this with a reasonable effort.
>> >> >>
>> >> >> What happens when you do above sequence on today's kernels? Oops? 3000
>> >> >> bytes of random junk in file? ...?
>> >> > Depends on the filesystem. For example on ext4, you'll see a WARN_ON and the data
>> >> > won't be written. Some filesystems may just try to map blocks and possibly
>> >> > hit deadlock or something like that. Filesystems like ext2 / ext3 /
>> >> > reiserfs generally don't care because so far they allocate blocks on writepage
>> >> > time (which has the problem that you can write data via mmap and kernel
>> >> > will later discard them because it hits ENOSPC or quota limit). That's
>> >> > actually what I was trying to fix originally.
>> >> >
>> >> > Honza
>> >>
>> >> man mmap:
>> >> A file is mapped in multiples of the page size. For a file that is not
>> >> a multiple of the page size, the remaining memory is zeroed when
>> >> mapped, and writes to that region are not written out to the file. The
>> >> effect of changing the size of the underlying file of a mapping on the
>> >> pages that correspond to added or removed regions of the file is
>> >> unspecified.
>> >>
>> >> Whatever happens happens. The above code is just wrong, as in
>> >> unspecified behaviour.
>> >> What happens if you ftruncate() before mmap()?
>> > OK, I admit I didn't realize mmap() has so weak requirements. Doing mmap
>> > after ftruncate() should work fine because before you write via that new
>> > mmap page_mkwrite() will be called anyway.
>>
>> But the ftruncate would only allocate a block at position 10000. The
>> file still has a big hole from 1024-4095.
> ftruncate() actually allocates no blocks. It just updates file size (at
> least for most filesystems). The hole is created as you write.
>
>> > So what we could alternatively do is that we just discard dirty bits from
>> > buffers that don't have underlying blocks allocated. That would satisfy the
>> > specification as well. But I have to say I'm a bit afraid of discarding
>> > dirty bits like that. Also we'd have to handle the case where someone does
>> > mremap() after ftruncate().
>> > What other memory management people think?
>>
>> As said above the file still has a big hole after ftruncate. So not
>> having underlying blocks allocated can't be the deciding factor.
> I'm not sure I understand here. Do you mean that we should not decide
> about discarding dirty bits depending on whether the buffers have
> underlying blocks or not? In my opinion that should be correct option
> because from what the man page says, user is not guaranteed what happens
> when the file size is extended to 10000 and he tries to write from offset
> 1024 (old i_size) further... Anyway, I'm not defending doing that :-) I'm
> just trying to understand what you mean.

I mean that the file can lack underlying blocks below its old
i_size. So that can't be the deciding factor.

>> If possible I would make ftruncate() after mmap() work just like
>> ftruncate() before mmap(). That is, write any dirty page completely up
>> to the current file size. Allocate disk blocks for the file as needed
>> (you need to do that anyway). Wouldn't it be more work to remember the
> This is the thing my patches try to achieve :). At truncate time (or
> generally just before i_size is extended), we check the last page and
> propagate dirty bits to buffers inside old i_size (and clear the ones
> beyond old i_size). We also writeprotect the page so that if someone tries
> to write to it via mmap in future, we get page fault, page_mkwrite() is
> called and a filesystem allocates all blocks it needs with the new i_size.

Sounds right.

>> filesize at the time of the mmap() to limit updates to that than using
>> the current file size?
> Yes, that would be more work. But I never intended to do this...

MfG
Goswin

2009-06-04 14:09:53

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 05/11] ext4: Make sure blocks are properly allocated under mmaped page even when blocksize < pagesize

On Wed, May 27, 2009 at 03:01:02PM +0200, Jan Kara wrote:
> In a situation like:
> truncate(f, 1024);
> a = mmap(f, 0, 4096);
> a[0] = 'a';
> truncate(f, 4096);
>
> we end up with a dirty page which does not have all blocks allocated /
> reserved. Fix the problem by using new VFS infrastructure.
>
> Signed-off-by: Jan Kara <[email protected]>

One warning --- this patch will have a slight patch conflict with the
patch ext4-convert-markers-to-tracepoints in the ext4 patch queue.

The ext4-convert-markers-to-tracepoints patch is also in the unstable
part of the patch queue, since it is waiting for patches allowing
tracepoints to be used in modules to be merged into mainline at the
next merge window. It's a very minor, easy fix-up, but depending on
which patch series gets merged first, there might be a minor fixup
required.

- Ted

2009-06-04 17:11:45

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/11] Fix page_mkwrite() for blocksize < pagesize

On Wed, May 27, 2009 at 04:59:11PM +0200, Jan Kara wrote:
> Yes, that would be fine. Thanks. Luckily, the dependence is just one way
> (ext4 change has to go after the VFS patch) so it's just enough to merge
> that ext4 change some time after VFS changes go in.

ext4: Add WARN_ON on unmapped dirty buffer_heads in writepage
fs: Don't clear dirty bits in block_write_full_page()
vfs: Unmap underlying metadata of new data buffers only when buffer is m
ext4: Make sure blocks are properly allocated under mmaped page even whe
vfs: Add better VFS support for page_mkwrite when blocksize < pagesize

FYI, I've added the above patches to the unstable portion of the patch
queue, and have started running tests (fsx, fsstress, and dbench)
using 1k blocksizes with a variety of ext4 mount options. So far,
things look good.

How goes the reviews/comments? Are we planning on pushing this to
Linus at the beginning of the next merge window?

- Ted

2009-06-05 23:23:06

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/11] Fix page_mkwrite() for blocksize < pagesize

On Thu 04-06-09 13:11:42, Theodore Tso wrote:
> On Wed, May 27, 2009 at 04:59:11PM +0200, Jan Kara wrote:
> > Yes, that would be fine. Thanks. Luckily, the dependence is just one way
> > (ext4 change has to go after the VFS patch) so it's just enough to merge
> > that ext4 change some time after VFS changes go in.
>
> ext4: Add WARN_ON on unmapped dirty buffer_heads in writepage
> fs: Don't clear dirty bits in block_write_full_page()
> vfs: Unmap underlying metadata of new data buffers only when buffer is m
> ext4: Make sure blocks are properly allocated under mmaped page even whe
> vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
>
> FYI, I've added the following to the unstable portion of the patch
> queue, and have started running tests (fsx, fsstress, and dbench)
> using 1k blocksizes, using a variety of ext4 mount options. So far,
> things look good.
Thanks for the merge and testing!

> How goes the reviews/comments? Are we planning on pushing this to
> Linus at the beginning of the next merge window?
I'd like to get some VFS / MM people to review the generic part of the
change... I guess I'll post a request to linux-mm to attract more attention.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR