2010-10-09 00:13:42

by Jan Kara

Subject: [PATCH RFC 0/3] Block reservation for ext3


Hi,

currently, when an mmapped write is done to a file backed by ext3, the
filesystem does nothing to make sure blocks will be available when we need
to write them out. This has two nasty consequences:
1) When the flusher thread does writeback of the mmapped data, allocation
happens in the context of the flusher thread (i.e., as root). Thus a user
can effectively exceed quota limits arbitrarily or use space reserved
only for the sysadmin.
2) When the filesystem runs out of space, we just silently drop data on the
floor (the same happens when writeout is performed in the context of
the user and he hits his quota limit).

I think these problems are serious enough (especially (1), about which I was
notified lately) that we should try to fix them in ext3. The trouble is that
the fix is non-trivial. We basically have to implement delayed allocation for
ext3: block reservation happens at page_mkwrite() time and we convert the
reservation into a real allocation during writeout. Note: Of course, we could
use a much simpler solution and just do block allocation directly at
page_mkwrite() time, but that really cripples performance for loads that do
random writes via mmap - I've tried that some time ago.
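
As a rough illustration of the reserve-then-convert idea described above, here
is a minimal userspace sketch. It is not code from the series: struct toy_fs,
toy_page_mkwrite() and toy_writepage() are hypothetical names, and the model
ignores quota, indirect blocks and concurrency.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical toy model of a filesystem's block accounting. */
struct toy_fs {
	long free_blocks;     /* blocks not yet allocated on disk */
	long reserved_blocks; /* blocks promised to dirty mmapped pages */
};

/* page_mkwrite() time: promise a block, but do not pick one yet. */
static int toy_page_mkwrite(struct toy_fs *fs)
{
	if (fs->free_blocks - fs->reserved_blocks < 1)
		return -ENOSPC; /* the writer learns about ENOSPC here */
	fs->reserved_blocks++;
	return 0;
}

/* Writeout time: convert the promise into a real allocation. */
static void toy_writepage(struct toy_fs *fs)
{
	/* A reservation guarantees that the block is still available. */
	assert(fs->reserved_blocks > 0 && fs->free_blocks > 0);
	fs->reserved_blocks--;
	fs->free_blocks--;
}
```

In this model the failure is reported at fault time, in the context of the
writing process, so writeout never has to drop data for lack of space.
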
The patches in this series (against 2.6.26-rc7) implement this. They
survived some testing with fsx-linux, I'll do more and also some performance
testing.
So I'd like to hear what other people think about this. Reviews or
testing are welcome ;)

Honza


2010-10-09 00:13:42

by Jan Kara

Subject: [PATCH 1/3] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped

When we do delayed allocation of some buffer, we want to signal to VFS that
the buffer is new (set buffer_new) so that it properly zeros out everything.
But we don't have the buffer mapped yet so we cannot really unmap underlying
metadata in this state. Make VFS avoid doing unmapping of metadata when the
buffer is not yet mapped.
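
The guard added by this patch can be sketched outside the kernel as follows.
This is only a toy model: struct toy_bh, the TOY_BH_* bits and
toy_should_unmap_metadata() are hypothetical stand-ins for buffer_head,
buffer_new()/buffer_mapped() and the call sites in fs/buffer.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's buffer_head state bits. */
enum { TOY_BH_NEW = 1 << 0, TOY_BH_MAPPED = 1 << 1 };

struct toy_bh {
	unsigned int state;
	unsigned long blocknr; /* only meaningful when TOY_BH_MAPPED is set */
};

/*
 * A delayed-allocated buffer is "new" but has no block number yet, so
 * there is nothing to unmap until the buffer is also mapped.
 */
static bool toy_should_unmap_metadata(const struct toy_bh *bh)
{
	return (bh->state & TOY_BH_NEW) && (bh->state & TOY_BH_MAPPED);
}
```

A buffer with only the "new" bit set models the delayed-allocation case: it
still gets zeroed, but its undefined block number is never used.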

Signed-off-by: Jan Kara <[email protected]>
---
fs/buffer.c | 12 +++++++-----
1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 3e7dca2..0f2ba29 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1688,8 +1688,9 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
if (buffer_new(bh)) {
/* blockdev mappings never come here */
clear_buffer_new(bh);
- unmap_underlying_metadata(bh->b_bdev,
- bh->b_blocknr);
+ if (buffer_mapped(bh))
+ unmap_underlying_metadata(bh->b_bdev,
+ bh->b_blocknr);
}
}
bh = bh->b_this_page;
@@ -1875,8 +1876,9 @@ int block_prepare_write(struct page *page, unsigned from, unsigned to,
if (err)
break;
if (buffer_new(bh)) {
- unmap_underlying_metadata(bh->b_bdev,
- bh->b_blocknr);
+ if (buffer_mapped(bh))
+ unmap_underlying_metadata(bh->b_bdev,
+ bh->b_blocknr);
if (PageUptodate(page)) {
clear_buffer_new(bh);
set_buffer_uptodate(bh);
@@ -2514,7 +2516,7 @@ int nobh_write_begin(struct address_space *mapping,
goto failed;
if (!buffer_mapped(bh))
is_mapped_to_disk = 0;
- if (buffer_new(bh))
+ if (buffer_new(bh) && buffer_mapped(bh))
unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
if (PageUptodate(page)) {
set_buffer_uptodate(bh);
--
1.6.4.2


2010-10-09 00:13:43

by Jan Kara

Subject: [PATCH 2/3] vfs: Implement generic per-cpu counters for delayed allocation

Implement free blocks and reserved blocks counters for delayed allocation.
These counters are reliable in the sense that when they return success, the
subsequent conversion from reserved to allocated blocks always succeeds (see
comments in the code for details). This is useful for the ext3 filesystem to
implement delayed allocation, in particular for allocation in page_mkwrite.
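
The reliability property can be seen in a single-threaded model of the two
counters. This is a simplified sketch, not the patch's implementation: the
toy_dac_* names are hypothetical, plain integers replace the per-cpu counters,
and no memory barriers or races are modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy, single-threaded stand-in for struct delalloc_counter. */
struct toy_dac {
	int64_t free;     /* blocks not allocated on disk */
	int64_t reserved; /* blocks promised but not yet allocated */
};

/* Reserve first, then check; undo the reservation if over the limit. */
static int toy_dac_reserve(struct toy_dac *c, int32_t amount, int64_t limit)
{
	c->reserved += amount;
	if (c->free - c->reserved < limit) {
		c->reserved -= amount;
		return -ENOSPC;
	}
	return 0;
}

/* Convert reserved blocks into allocated ones; this cannot fail. */
static void toy_dac_alloc_reserved(struct toy_dac *c, int32_t amount)
{
	c->free -= amount;
	c->reserved -= amount;
}

/* Drop a reservation that was never used (e.g. after truncate). */
static void toy_dac_cancel_reserved(struct toy_dac *c, int32_t amount)
{
	c->reserved -= amount;
}
```

Because every successful reservation keeps free - reserved at or above the
limit, the later conversion only moves blocks that were already accounted for.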

Signed-off-by: Jan Kara <[email protected]>
---
fs/Kconfig | 4 ++
fs/Makefile | 1 +
fs/delalloc_counter.c | 109 ++++++++++++++++++++++++++++++++++++++
fs/ext3/Kconfig | 1 +
include/linux/delalloc_counter.h | 73 +++++++++++++++++++++++++
5 files changed, 188 insertions(+), 0 deletions(-)
create mode 100644 fs/delalloc_counter.c
create mode 100644 include/linux/delalloc_counter.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 3d18530..4432d53 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -19,6 +19,10 @@ config FS_XIP
source "fs/jbd/Kconfig"
source "fs/jbd2/Kconfig"

+config DELALLOC_COUNTER
+ bool
+ default n
+
config FS_MBCACHE
# Meta block cache for Extended Attributes (ext2/ext3/ext4)
tristate
diff --git a/fs/Makefile b/fs/Makefile
index e6ec1d3..31b22e9 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -19,6 +19,7 @@ else
obj-y += no-block.o
endif

+obj-$(CONFIG_DELALLOC_COUNTER) += delalloc_counter.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o
obj-y += notify/
obj-$(CONFIG_EPOLL) += eventpoll.o
diff --git a/fs/delalloc_counter.c b/fs/delalloc_counter.c
new file mode 100644
index 0000000..0f575d5
--- /dev/null
+++ b/fs/delalloc_counter.c
@@ -0,0 +1,109 @@
+/*
+ * Per-cpu counters for delayed allocation
+ */
+#include <linux/percpu_counter.h>
+#include <linux/delalloc_counter.h>
+#include <linux/module.h>
+#include <linux/log2.h>
+
+static long dac_error(struct delalloc_counter *c)
+{
+#ifdef CONFIG_SMP
+ return c->batch * nr_cpu_ids;
+#else
+ return 0;
+#endif
+}
+
+/*
+ * Reserve blocks for delayed allocation
+ *
+ * This code is subtle because we want to avoid synchronization of processes
+ * doing allocation in the common case when there's plenty of space in the
+ * filesystem.
+ *
+ * The code maintains the following property: Among all the calls to
+ * dac_reserve() that return 0 there exists a simple sequential ordering of
+ * these calls such that the check (free - reserved >= limit) in each call
+ * succeeds. This guarantees that we never reserve blocks we don't have.
+ *
+ * The proof of the above invariant: The function can return 0 either when the
+ * first if succeeds or when both ifs fail. To the first type of callers we
+ * assign the time of read of c->reserved in the first if, to the second type
+ * of callers we assign the time of read of c->reserved in the second if. We
+ * order callers by their assigned time and claim that this is the ordering
+ * required by the invariant. Suppose that a check (free - reserved >= limit)
+ * fails for caller C in the proposed ordering. We distinguish two cases:
+ * 1) function called by C returned zero because the first if succeeded - in
+ * this case reads of counters in the first if must have seen effects of
+ * __percpu_counter_add of all the callers before C (even their condition
+ * evaluation happened before ours). The errors accumulated in cpu-local
+ * variables are clearly < dac_error(c) and thus the condition should fail.
+ * Contradiction.
+ * 2) function called by C returned zero because the second if failed - again
+ * the read of the counters must have seen effects of __percpu_counter_add of
+ * all the callers before C and thus the condition should have succeeded.
+ * Contradiction.
+ */
+int dac_reserve(struct delalloc_counter *c, s32 amount, s64 limit)
+{
+ s64 free, reserved;
+ int ret = 0;
+
+ __percpu_counter_add(&c->reserved, amount, c->batch);
+ /*
+ * This barrier makes sure that when effects of the following read of
+ * c->reserved are observable by another CPU also effects of the
+ * previous store to c->reserved are seen.
+ */
+ smp_mb();
+ if (percpu_counter_read(&c->free) - percpu_counter_read(&c->reserved)
+ - 2 * dac_error(c) >= limit)
+ return ret;
+ /*
+ * Near the limit - sum the counter to avoid returning ENOSPC too
+ * early. Note that we can still "unnecessarily" return ENOSPC when
+ * there are several racing writers. Spinlock in this section would
+ * solve it but let's ignore it for now.
+ */
+ free = percpu_counter_sum_positive(&c->free);
+ reserved = percpu_counter_sum_positive(&c->reserved);
+ if (free - reserved < limit) {
+ __percpu_counter_add(&c->reserved, -amount, c->batch);
+ ret = -ENOSPC;
+ }
+ return ret;
+}
+EXPORT_SYMBOL(dac_reserve);
+
+/* Account reserved blocks as allocated */
+void dac_alloc_reserved(struct delalloc_counter *c, s32 amount)
+{
+ __percpu_counter_add(&c->free, -amount, c->batch);
+ /*
+ * Make sure update of free counter is seen before update of
+ * reserved counter.
+ */
+ smp_wmb();
+ __percpu_counter_add(&c->reserved, -amount, c->batch);
+}
+EXPORT_SYMBOL(dac_alloc_reserved);
+
+int dac_init(struct delalloc_counter *c, s64 amount)
+{
+ int err;
+
+ c->batch = 8*(1+ilog2(nr_cpu_ids));
+ err = percpu_counter_init(&c->free, amount);
+ if (!err)
+ err = percpu_counter_init(&c->reserved, 0);
+ return err;
+}
+EXPORT_SYMBOL(dac_init);
+
+void dac_destroy(struct delalloc_counter *c)
+{
+ percpu_counter_destroy(&c->free);
+ percpu_counter_destroy(&c->reserved);
+}
+EXPORT_SYMBOL(dac_destroy);
diff --git a/fs/ext3/Kconfig b/fs/ext3/Kconfig
index e8c6ba0..20418f3 100644
--- a/fs/ext3/Kconfig
+++ b/fs/ext3/Kconfig
@@ -1,6 +1,7 @@
config EXT3_FS
tristate "Ext3 journalling file system support"
select JBD
+ select DELALLOC_COUNTER
help
This is the journalling version of the Second extended file system
(often called ext3), the de facto standard Linux file system
diff --git a/include/linux/delalloc_counter.h b/include/linux/delalloc_counter.h
new file mode 100644
index 0000000..599fffc
--- /dev/null
+++ b/include/linux/delalloc_counter.h
@@ -0,0 +1,73 @@
+#ifndef _LINUX_DELALLOC_COUNTER_H
+#define _LINUX_DELALLOC_COUNTER_H
+
+#include <linux/percpu_counter.h>
+
+struct delalloc_counter {
+ struct percpu_counter free;
+ struct percpu_counter reserved;
+ int batch;
+};
+
+int dac_reserve(struct delalloc_counter *c, s32 amount, s64 limit);
+void dac_alloc_reserved(struct delalloc_counter *c, s32 amount);
+
+static inline int dac_alloc(struct delalloc_counter *c, s32 amount, s64 limit)
+{
+ int ret = dac_reserve(c, amount, limit);
+ if (!ret)
+ dac_alloc_reserved(c, amount);
+ return ret;
+}
+
+static inline void dac_free(struct delalloc_counter *c, s32 amount)
+{
+ __percpu_counter_add(&c->free, amount, c->batch);
+}
+
+static inline void dac_cancel_reserved(struct delalloc_counter *c, s32 amount)
+{
+ __percpu_counter_add(&c->reserved, -amount, c->batch);
+}
+
+int dac_init(struct delalloc_counter *c, s64 amount);
+void dac_destroy(struct delalloc_counter *c);
+
+static inline s64 dac_get_avail(struct delalloc_counter *c)
+{
+ s64 ret = percpu_counter_read(&c->free) -
+ percpu_counter_read(&c->reserved);
+ if (ret < 0)
+ return 0;
+ return ret;
+}
+
+static inline s64 dac_get_avail_sum(struct delalloc_counter *c)
+{
+ s64 ret = percpu_counter_sum(&c->free) -
+ percpu_counter_sum(&c->reserved);
+ if (ret < 0)
+ return 0;
+ return ret;
+}
+
+static inline s64 dac_get_reserved(struct delalloc_counter *c)
+{
+ return percpu_counter_read_positive(&c->reserved);
+}
+
+static inline s64 dac_get_reserved_sum(struct delalloc_counter *c)
+{
+ return percpu_counter_sum_positive(&c->reserved);
+}
+
+static inline s64 dac_get_free(struct delalloc_counter *c)
+{
+ return percpu_counter_read_positive(&c->free);
+}
+
+static inline s64 dac_get_free_sum(struct delalloc_counter *c)
+{
+ return percpu_counter_sum_positive(&c->free);
+}
+#endif
--
1.6.4.2
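
To get a feel for the numbers behind the fast path in dac_reserve() above, the
batch size from dac_init() and the SMP error bound from dac_error() can be
recomputed in userspace. This is a sketch under assumptions: the toy_* helpers
are hypothetical, toy_ilog2() merely imitates the kernel's ilog2(), and the
factor of two is presumably there because two approximate counters are read.

```c
#include <assert.h>
#include <stdint.h>

/* Floor of log2(n) for n >= 1; a userspace stand-in for ilog2(). */
static int toy_ilog2(unsigned int n)
{
	int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/* Same formula as dac_init(): 8 * (1 + ilog2(nr_cpu_ids)). */
static int toy_batch(unsigned int nr_cpus)
{
	return 8 * (1 + toy_ilog2(nr_cpus));
}

/* SMP branch of dac_error(): worst-case drift of one approximate read. */
static int64_t toy_error(unsigned int nr_cpus)
{
	return (int64_t)toy_batch(nr_cpus) * nr_cpus;
}

/* Nonzero if the approximate reads alone prove there is enough space. */
static int toy_fast_path_ok(int64_t free_approx, int64_t reserved_approx,
			    int64_t limit, unsigned int nr_cpus)
{
	return free_approx - reserved_approx - 2 * toy_error(nr_cpus) >= limit;
}
```

With 4 CPUs the batch is 24 and the error bound is 96 blocks, so in this model
the fast path only succeeds while at least 192 blocks of slack remain above
the limit; closer to the limit the exact sums are needed.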


2010-10-09 00:13:43

by Jan Kara

Subject: [PATCH 3/3] ext3: Implement delayed allocation on page_mkwrite time

We don't want to really allocate blocks at page_mkwrite() time because for
random writes via mmap it results in much more fragmented files. So just
reserve enough free blocks in page_mkwrite() and do the real allocation from
writepage().

It's however not so simple because we do not want to overestimate the necessary
number of indirect blocks too badly in the presence of lots of delayed allocated
buffers. Thus we track which indirect blocks already have a reservation pending
and do not reserve space for them again.
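
The reservation arithmetic described above can be sketched in userspace. The
toy_* names are hypothetical, and the constants assume 4 KiB blocks with
4-byte block pointers (1024 pointers per indirect block); the rb-tree lookup
of the patch is reduced to an already_tracked flag.

```c
#include <assert.h>

#define TOY_NDIR_BLOCKS 12         /* direct block slots in the inode */
#define TOY_ADDR_PER_BLOCK_BITS 10 /* assumes 4 KiB blocks, 4-byte pointers */

/* Number of indirect levels needed to reach logical block i_block. */
static int toy_indirect_depth(long i_block)
{
	if (i_block < TOY_NDIR_BLOCKS)
		return 0;
	i_block -= TOY_NDIR_BLOCKS;
	if (i_block < (1L << TOY_ADDR_PER_BLOCK_BITS))
		return 1;
	i_block -= 1L << TOY_ADDR_PER_BLOCK_BITS;
	if (i_block < (1L << (2 * TOY_ADDR_PER_BLOCK_BITS)))
		return 2;
	return 3;
}

/* Key identifying the last-level indirect block that covers i_block. */
static long toy_indirect_key(long i_block)
{
	return (i_block - TOY_NDIR_BLOCKS) &
	       ~((1L << TOY_ADDR_PER_BLOCK_BITS) - 1);
}

/* Blocks to reserve for one delayed write to i_block. */
static int toy_blocks_to_reserve(long i_block, int already_tracked)
{
	int depth = toy_indirect_depth(i_block);

	/* Only the data block when no new indirect blocks are needed. */
	if (depth == 0 || already_tracked)
		return 1;
	return depth + 1; /* data block plus the whole indirect chain */
}
```

Two delayed blocks with the same key share one tracked reservation, so only
the first of them pays for the indirect chain.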

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext3/balloc.c | 103 +++++++++-----
fs/ext3/file.c | 19 +++-
fs/ext3/ialloc.c | 2 +-
fs/ext3/inode.c | 346 +++++++++++++++++++++++++++++++++++++++++---
fs/ext3/resize.c | 2 +-
fs/ext3/super.c | 23 +++-
include/linux/ext3_fs.h | 5 +-
include/linux/ext3_fs_i.h | 20 +++
include/linux/ext3_fs_sb.h | 3 +-
9 files changed, 458 insertions(+), 65 deletions(-)

diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
index 4a32511..bf3f607 100644
--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -20,6 +20,8 @@
#include <linux/ext3_jbd.h>
#include <linux/quotaops.h>
#include <linux/buffer_head.h>
+#include <linux/delalloc_counter.h>
+#include <linux/writeback.h>

/*
* balloc.c contains the blocks allocation and deallocation routines
@@ -633,7 +635,7 @@ do_more:
spin_lock(sb_bgl_lock(sbi, block_group));
le16_add_cpu(&desc->bg_free_blocks_count, group_freed);
spin_unlock(sb_bgl_lock(sbi, block_group));
- percpu_counter_add(&sbi->s_freeblocks_counter, count);
+ dac_free(&sbi->s_alloc_counter, count);

/* We dirtied the bitmap block */
BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -1411,23 +1413,19 @@ out:
}

/**
- * ext3_has_free_blocks()
- * @sbi: in-core super block structure.
+ * ext3_free_blocks_limit()
+ * @sb: super block
*
* Check if filesystem has at least 1 free block available for allocation.
*/
-static int ext3_has_free_blocks(struct ext3_sb_info *sbi)
+ext3_fsblk_t ext3_free_blocks_limit(struct super_block *sb)
{
- ext3_fsblk_t free_blocks, root_blocks;
+ struct ext3_sb_info *sbi = EXT3_SB(sb);

- free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
- root_blocks = le32_to_cpu(sbi->s_es->s_r_blocks_count);
- if (free_blocks < root_blocks + 1 && !capable(CAP_SYS_RESOURCE) &&
- sbi->s_resuid != current_fsuid() &&
- (sbi->s_resgid == 0 || !in_group_p (sbi->s_resgid))) {
- return 0;
- }
- return 1;
+ if (!capable(CAP_SYS_RESOURCE) && sbi->s_resuid != current_fsuid() &&
+ (sbi->s_resgid == 0 || !in_group_p(sbi->s_resgid)))
+ return le32_to_cpu(sbi->s_es->s_r_blocks_count) + 1;
+ return 0;
}

/**
@@ -1444,12 +1442,21 @@ static int ext3_has_free_blocks(struct ext3_sb_info *sbi)
*/
int ext3_should_retry_alloc(struct super_block *sb, int *retries)
{
- if (!ext3_has_free_blocks(EXT3_SB(sb)) || (*retries)++ > 3)
+ struct ext3_sb_info *sbi = EXT3_SB(sb);
+ ext3_fsblk_t limit;
+
+ limit = ext3_free_blocks_limit(sb);
+ if (dac_get_free(&sbi->s_alloc_counter) < limit || (*retries)++ > 3)
return 0;

jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id);
-
- return journal_force_commit_nested(EXT3_SB(sb)->s_journal);
+ /*
+ * There's a chance commit will free some blocks and writeback can
+ * write delayed blocks so that excessive reservation gets released.
+ */
+ if (dac_get_reserved(&sbi->s_alloc_counter))
+ writeback_inodes_sb_if_idle(sb);
+ return journal_force_commit_nested(sbi->s_journal);
}

/**
@@ -1458,6 +1465,7 @@ int ext3_should_retry_alloc(struct super_block *sb, int *retries)
* @inode: file inode
* @goal: given target block(filesystem wide)
* @count: target number of blocks to allocate
+ * @reserved: number of reserved blocks
* @errp: error code
*
* ext3_new_blocks uses a goal block to assist allocation. It tries to
@@ -1465,9 +1473,13 @@ int ext3_should_retry_alloc(struct super_block *sb, int *retries)
* fails, it will try to allocate block(s) from other block groups without
* any specific goal block.
*
+ * If there is some number of blocks reserved for the allocation, we first
+ * allocate non-reserved blocks and only when we have enough of them, we start
+ * using the reserved ones.
*/
ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
- ext3_fsblk_t goal, unsigned long *count, int *errp)
+ ext3_fsblk_t goal, unsigned long *count,
+ unsigned int reserved, int *errp)
{
struct buffer_head *bitmap_bh = NULL;
struct buffer_head *gdp_bh;
@@ -1478,7 +1490,7 @@ ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
ext3_fsblk_t ret_block; /* filesyetem-wide allocated block */
int bgi; /* blockgroup iteration index */
int fatal = 0, err;
- int performed_allocation = 0;
+ int got_quota = 0, got_space = 0;
ext3_grpblk_t free_blocks; /* number of free blocks in a group */
struct super_block *sb;
struct ext3_group_desc *gdp;
@@ -1499,17 +1511,28 @@ ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
printk("ext3_new_block: nonexistent device");
return 0;
}
+ sbi = EXT3_SB(sb);

/*
* Check quota for allocation of this block.
*/
- err = dquot_alloc_block(inode, num);
- if (err) {
- *errp = err;
- return 0;
+ if (dquot_alloc_block(inode, num - reserved)) {
+ *errp = -EDQUOT;
+ goto out;
}
+ got_quota = 1;
+ /*
+ * We need not succeed in allocating all these blocks but we have to
+ * check & update delalloc counter before allocating blocks. That
+ * guarantees that reserved blocks are always possible to allocate...
+ */
+ if (dac_alloc(&sbi->s_alloc_counter, num - reserved,
+ ext3_free_blocks_limit(sb)) < 0) {
+ *errp = -ENOSPC;
+ goto out;
+ }
+ got_space = 1;

- sbi = EXT3_SB(sb);
es = EXT3_SB(sb)->s_es;
ext3_debug("goal=%lu.\n", goal);
/*
@@ -1524,11 +1547,6 @@ ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
if (block_i && ((windowsz = block_i->rsv_window_node.rsv_goal_size) > 0))
my_rsv = &block_i->rsv_window_node;

- if (!ext3_has_free_blocks(sbi)) {
- *errp = -ENOSPC;
- goto out;
- }
-
/*
* First, test whether the goal block is free.
*/
@@ -1658,8 +1676,6 @@ allocated:
goto retry_alloc;
}

- performed_allocation = 1;
-
#ifdef CONFIG_JBD_DEBUG
{
struct buffer_head *debug_bh;
@@ -1709,7 +1725,6 @@ allocated:
spin_lock(sb_bgl_lock(sbi, group_no));
le16_add_cpu(&gdp->bg_free_blocks_count, -num);
spin_unlock(sb_bgl_lock(sbi, group_no));
- percpu_counter_sub(&sbi->s_freeblocks_counter, num);

BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
err = ext3_journal_dirty_metadata(handle, gdp_bh);
@@ -1721,7 +1736,23 @@ allocated:

*errp = 0;
brelse(bitmap_bh);
- dquot_free_block(inode, *count-num);
+ /* Used some of the reserved blocks? */
+ if (*count - reserved < num) {
+ unsigned int used_rsv = num - (*count - reserved);
+
+ dac_alloc_reserved(&sbi->s_alloc_counter, used_rsv);
+ dquot_claim_block(inode, used_rsv);
+ } else {
+ unsigned int missing_blocks = *count - reserved - num;
+
+ /*
+ * We didn't succeed in allocating all non-reserved blocks.
+ * Update counters to fix overestimation we did at the
+ * beginning of this function
+ */
+ dac_free(&sbi->s_alloc_counter, missing_blocks);
+ dquot_free_block(inode, missing_blocks);
+ }
*count = num;
return ret_block;

@@ -1735,8 +1766,10 @@ out:
/*
* Undo the block allocation
*/
- if (!performed_allocation)
- dquot_free_block(inode, *count);
+ if (got_quota)
+ dquot_free_block(inode, *count - reserved);
+ if (got_space)
+ dac_free(&sbi->s_alloc_counter, *count - reserved);
brelse(bitmap_bh);
return 0;
}
@@ -1746,7 +1779,7 @@ ext3_fsblk_t ext3_new_block(handle_t *handle, struct inode *inode,
{
unsigned long count = 1;

- return ext3_new_blocks(handle, inode, goal, &count, errp);
+ return ext3_new_blocks(handle, inode, goal, &count, 0, errp);
}

/**
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index f55df0e..249597d 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -52,6 +52,23 @@ static int ext3_release_file (struct inode * inode, struct file * filp)
return 0;
}

+static const struct vm_operations_struct ext3_file_vm_ops = {
+ .fault = filemap_fault,
+ .page_mkwrite = ext3_page_mkwrite,
+};
+
+static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct address_space *mapping = file->f_mapping;
+
+ if (!mapping->a_ops->readpage)
+ return -ENOEXEC;
+ file_accessed(file);
+ vma->vm_ops = &ext3_file_vm_ops;
+ vma->vm_flags |= VM_CAN_NONLINEAR;
+ return 0;
+}
+
const struct file_operations ext3_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
@@ -62,7 +79,7 @@ const struct file_operations ext3_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = ext3_compat_ioctl,
#endif
- .mmap = generic_file_mmap,
+ .mmap = ext3_file_mmap,
.open = dquot_file_open,
.release = ext3_release_file,
.fsync = ext3_sync_file,
diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index 4ab72db..481f63c 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -257,7 +257,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent)

freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
avefreei = freei / ngroups;
- freeb = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
+ freeb = dac_get_avail(&sbi->s_alloc_counter);
avefreeb = freeb / ngroups;
ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 5e0faf4..2ee6df7 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -38,6 +38,7 @@
#include <linux/bio.h>
#include <linux/fiemap.h>
#include <linux/namei.h>
+#include <linux/mount.h>
#include "xattr.h"
#include "acl.h"

@@ -195,6 +196,7 @@ static int truncate_restart_transaction(handle_t *handle, struct inode *inode)
void ext3_evict_inode (struct inode *inode)
{
struct ext3_block_alloc_info *rsv;
+ struct ext3_inode_info *ei;
handle_t *handle;
int want_delete = 0;

@@ -205,9 +207,10 @@ void ext3_evict_inode (struct inode *inode)

truncate_inode_pages(&inode->i_data, 0);

+ ei = EXT3_I(inode);
ext3_discard_reservation(inode);
- rsv = EXT3_I(inode)->i_block_alloc_info;
- EXT3_I(inode)->i_block_alloc_info = NULL;
+ rsv = ei->i_block_alloc_info;
+ ei->i_block_alloc_info = NULL;
if (unlikely(rsv))
kfree(rsv);

@@ -239,7 +242,7 @@ void ext3_evict_inode (struct inode *inode)
* (Well, we could do this if we need to, but heck - it works)
*/
ext3_orphan_del(handle, inode);
- EXT3_I(inode)->i_dtime = get_seconds();
+ ei->i_dtime = get_seconds();

/*
* One subtle ordering requirement: if anything has gone wrong
@@ -260,10 +263,194 @@ void ext3_evict_inode (struct inode *inode)
ext3_free_inode(handle, inode);
}
ext3_journal_stop(handle);
+out_check:
+ if (ei->i_reserved_quota)
+ ext3_warning(inode->i_sb, __func__, "Releasing inode %lu with "
+ "%lu reserved blocks.\n", inode->i_ino,
+ (unsigned long)ei->i_reserved_quota);
return;
no_delete:
end_writeback(inode);
dquot_drop(inode);
+ goto out_check;
+}
+
+/*
+ * Find indirect block structure for given block offset. If the structure
+ * does not exist, return NULL and fill parentp (provided it's != NULL) with
+ * a pointer to the parent node in rb_tree.
+ */
+static struct ext3_da_indirect *ext3_find_da_indirect(struct inode *inode,
+ long i_block, struct rb_node **parentp)
+{
+ struct rb_node *n = EXT3_I(inode)->i_da_indirect.rb_node;
+ struct rb_node *parent = NULL;
+ struct ext3_da_indirect *ind;
+
+ if (i_block < EXT3_NDIR_BLOCKS)
+ return NULL;
+ i_block = (i_block - EXT3_NDIR_BLOCKS) &
+ ~(EXT3_ADDR_PER_BLOCK(inode->i_sb) - 1);
+ while (n) {
+ ind = rb_entry(n, struct ext3_da_indirect, node);
+
+ parent = n;
+ if (i_block < ind->offset)
+ n = n->rb_left;
+ else if (i_block > ind->offset)
+ n = n->rb_right;
+ else
+ return ind;
+ }
+ if (parentp)
+ *parentp = parent;
+ return NULL;
+}
+
+static struct ext3_da_indirect *ext3_add_da_indirect(struct inode *inode,
+ long i_block, struct rb_node *parent_node)
+{
+ struct ext3_da_indirect *ind;
+ struct rb_node **np;
+
+ ind = kmalloc(sizeof(struct ext3_da_indirect), GFP_NOFS);
+ if (!ind)
+ return NULL;
+
+ ind->offset = (i_block - EXT3_NDIR_BLOCKS) &
+ ~(EXT3_ADDR_PER_BLOCK(inode->i_sb) - 1);
+ ind->data_blocks = 1;
+ ind->flags = 0;
+ if (parent_node) {
+ struct ext3_da_indirect *parent = rb_entry(
+ parent_node, struct ext3_da_indirect, node);
+
+ if (ind->offset < parent->offset)
+ np = &parent_node->rb_left;
+ else
+ np = &parent_node->rb_right;
+ } else
+ np = &EXT3_I(inode)->i_da_indirect.rb_node;
+ rb_link_node(&ind->node, parent_node, np);
+ rb_insert_color(&ind->node, &EXT3_I(inode)->i_da_indirect);
+ return ind;
+}
+
+static int ext3_calc_indirect_depth(struct inode *inode, long i_block)
+{
+ int apbb = EXT3_ADDR_PER_BLOCK_BITS(inode->i_sb);
+
+ if (i_block < EXT3_NDIR_BLOCKS)
+ return 0;
+ i_block -= EXT3_NDIR_BLOCKS;
+ if (i_block < (1 << apbb))
+ return 1;
+ i_block -= (1 << apbb);
+ if (i_block < (1 << 2*apbb))
+ return 2;
+ return 3;
+}
+
+static int ext3_reserve_blocks(struct inode *inode, unsigned int count)
+{
+ int ret;
+
+ if (dquot_reserve_block(inode, count))
+ return -EDQUOT;
+ ret = dac_reserve(&EXT3_SB(inode->i_sb)->s_alloc_counter, count,
+ ext3_free_blocks_limit(inode->i_sb));
+ if (ret < 0) {
+ dquot_release_reservation_block(inode, count);
+ return ret;
+ }
+ return 0;
+}
+
+static void ext3_cancel_rsv_blocks(struct inode *inode, unsigned int count)
+{
+ dac_cancel_reserved(&EXT3_SB(inode->i_sb)->s_alloc_counter, count);
+ dquot_release_reservation_block(inode, count);
+}
+
+/*
+ * Reserve appropriate amount of space (and quota) for future allocation.
+ * Record the fact in inode's tree of reserved indirect blocks.
+ */
+static int ext3_rsv_da_block(struct inode *inode, long i_block)
+{
+ int depth = ext3_calc_indirect_depth(inode, i_block);
+ struct rb_node *parent_node;
+ struct ext3_da_indirect *ind;
+ int ret;
+
+ /* No indirect blocks needed? */
+ if (depth == 0)
+ return ext3_reserve_blocks(inode, 1);
+
+ mutex_lock(&EXT3_I(inode)->truncate_mutex);
+ ind = ext3_find_da_indirect(inode, i_block, &parent_node);
+ /* If indirect block is already reserved, we need just the data block */
+ if (ind)
+ depth = 1;
+ else
+ depth++;
+
+ ret = ext3_reserve_blocks(inode, depth);
+ if (ret < 0)
+ goto out;
+
+ if (!ind) {
+ ind = ext3_add_da_indirect(inode, i_block, parent_node);
+ if (!ind) {
+ ext3_cancel_rsv_blocks(inode, depth);
+ ret = -ENOMEM;
+ goto out;
+ }
+ } else
+ ind->data_blocks++;
+out:
+ mutex_unlock(&EXT3_I(inode)->truncate_mutex);
+ return ret;
+}
+
+/*
+ * Cancel reservation of delayed allocated block and corresponding metadata
+ */
+static void ext3_cancel_da_block(struct inode *inode, long i_block)
+{
+ struct ext3_da_indirect *ind;
+ int unrsv = 1;
+
+ if (i_block < EXT3_NDIR_BLOCKS) {
+ ext3_cancel_rsv_blocks(inode, 1);
+ return;
+ }
+
+ mutex_lock(&EXT3_I(inode)->truncate_mutex);
+ ind = ext3_find_da_indirect(inode, i_block, NULL);
+ if (ind && !--ind->data_blocks) {
+ if (!(ind->flags & EXT3_DA_ALLOC_FL))
+ unrsv += ext3_calc_indirect_depth(inode, i_block);
+ rb_erase(&ind->node, &EXT3_I(inode)->i_da_indirect);
+ kfree(ind);
+ }
+ mutex_unlock(&EXT3_I(inode)->truncate_mutex);
+ ext3_cancel_rsv_blocks(inode, unrsv);
+}
+
+static void ext3_allocated_da_block(struct inode *inode,
+ struct ext3_da_indirect *ind,
+ int bh_delayed, unsigned int unrsv_blocks)
+{
+ if (!(ind->flags & EXT3_DA_ALLOC_FL)) {
+ /* Cancel unused indirect blocks reservation */
+ ext3_cancel_rsv_blocks(inode, unrsv_blocks);
+ ind->flags |= EXT3_DA_ALLOC_FL;
+ }
+ if (bh_delayed && !--ind->data_blocks) {
+ rb_erase(&ind->node, &EXT3_I(inode)->i_da_indirect);
+ kfree(ind);
+ }
}

typedef struct {
@@ -537,8 +724,10 @@ static int ext3_blks_to_allocate(Indirect *branch, int k, unsigned long blks,

/**
* ext3_alloc_blocks: multiple allocate blocks needed for a branch
+ * @goal: goal block for the allocation
* @indirect_blks: the number of blocks need to allocate for indirect
* blocks
+ * @reserved: is the data block reserved?
*
* @new_blocks: on return it will store the new block numbers for
* the indirect blocks(if needed) and the first direct block,
@@ -547,7 +736,8 @@ static int ext3_blks_to_allocate(Indirect *branch, int k, unsigned long blks,
*/
static int ext3_alloc_blocks(handle_t *handle, struct inode *inode,
ext3_fsblk_t goal, int indirect_blks, int blks,
- ext3_fsblk_t new_blocks[4], int *err)
+ unsigned int reserved, ext3_fsblk_t new_blocks[4],
+ int *err)
{
int target, i;
unsigned long count = 0;
@@ -568,11 +758,15 @@ static int ext3_alloc_blocks(handle_t *handle, struct inode *inode,
while (1) {
count = target;
/* allocating blocks for indirect blocks and direct blocks */
- current_block = ext3_new_blocks(handle,inode,goal,&count,err);
+ current_block = ext3_new_blocks(handle, inode, goal, &count,
+ reserved, err);
if (*err)
goto failed_out;

target -= count;
+ /* Used some reserved blocks? */
+ if (target < reserved)
+ reserved = target;
/* allocate blocks for indirect blocks */
while (index < indirect_blks && count) {
new_blocks[index++] = current_block++;
@@ -601,6 +795,8 @@ failed_out:
* @inode: owner
* @indirect_blks: number of allocated indirect blocks
* @blks: number of allocated direct blocks
+ * @reserved: is the data block reserved?
+ * @goal: goal block for the allocation
* @offsets: offsets (in the blocks) to store the pointers to next.
* @branch: place to store the chain in.
*
@@ -622,8 +818,8 @@ failed_out:
* as described above and return 0.
*/
static int ext3_alloc_branch(handle_t *handle, struct inode *inode,
- int indirect_blks, int *blks, ext3_fsblk_t goal,
- int *offsets, Indirect *branch)
+ int indirect_blks, int *blks, unsigned int reserved,
+ ext3_fsblk_t goal, int *offsets, Indirect *branch)
{
int blocksize = inode->i_sb->s_blocksize;
int i, n = 0;
@@ -634,7 +830,7 @@ static int ext3_alloc_branch(handle_t *handle, struct inode *inode,
ext3_fsblk_t current_block;

num = ext3_alloc_blocks(handle, inode, goal, indirect_blks,
- *blks, new_blocks, &err);
+ *blks, reserved, new_blocks, &err);
if (err)
return err;

@@ -834,7 +1030,9 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
int depth;
struct ext3_inode_info *ei = EXT3_I(inode);
int count = 0;
+ unsigned int reserved = 0;
ext3_fsblk_t first_block = 0;
+ struct ext3_da_indirect *ind = NULL;


J_ASSERT(handle != NULL || create == 0);
@@ -924,16 +1122,48 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
indirect_blks = (chain + depth) - partial - 1;

/*
- * Next look up the indirect map to count the totoal number of
+ * Next look up the indirect map to count the total number of
* direct blocks to allocate for this branch.
*/
count = ext3_blks_to_allocate(partial, indirect_blks,
maxblocks, blocks_to_boundary);
- /*
- * Block out ext3_truncate while we alter the tree
- */
- err = ext3_alloc_branch(handle, inode, indirect_blks, &count, goal,
- offsets + (partial - chain), partial);
+ if (indirect_blks || buffer_delay(bh_result)) {
+ ind = ext3_find_da_indirect(inode, iblock, NULL);
+ if (ind) {
+ if (!(ind->flags & EXT3_DA_ALLOC_FL))
+ reserved = indirect_blks;
+ else if (indirect_blks)
+ ext3_warning(inode->i_sb, __func__,
+ "Block %lu of inode %lu needs "
+ "allocating %d indirect blocks but all "
+ "should be already allocated.",
+ (unsigned long)iblock, inode->i_ino,
+ indirect_blks);
+ }
+ if (buffer_delay(bh_result)) {
+ WARN_ON(maxblocks != 1 || !bh_result->b_page);
+ if (!ind && depth > 1)
+ ext3_warning(inode->i_sb, __func__,
+ "Delayed block %lu of inode %lu is "
+ "missing reservation for %d indirect "
+ "blocks.", (unsigned long)iblock,
+ inode->i_ino, indirect_blks);
+ reserved++; /* For data block */
+ }
+ }
+ err = ext3_alloc_branch(handle, inode, indirect_blks, &count, reserved,
+ goal, offsets + (partial - chain), partial);
+ if (!err) {
+ if (ind)
+ ext3_allocated_da_block(inode, ind,
+ buffer_delay(bh_result),
+ ext3_calc_indirect_depth(inode, iblock) -
+ indirect_blks);
+ if (buffer_delay(bh_result))
+ clear_buffer_delay(bh_result);
+ else
+ set_buffer_new(bh_result);
+ }

/*
* The ext3_splice_branch call will free and forget any buffers
@@ -948,8 +1178,6 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
mutex_unlock(&ei->truncate_mutex);
if (err)
goto cleanup;
-
- set_buffer_new(bh_result);
got_it:
map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
if (count > blocks_to_boundary)
@@ -1744,15 +1972,39 @@ ext3_readpages(struct file *file, struct address_space *mapping,
return mpage_readpages(mapping, pages, nr_pages, ext3_get_block);
}

+
+static int truncate_delayed_bh(handle_t *handle, struct buffer_head *bh)
+{
+ if (buffer_delay(bh)) {
+ struct inode *inode = bh->b_page->mapping->host;
+
+ /*
+ * We cheat here a bit since we do not add a block-in-page
+ * offset but that does not matter for identifying indirect
+ * block
+ */
+ ext3_cancel_da_block(inode, bh->b_page->index <<
+ (PAGE_CACHE_SHIFT - inode->i_blkbits));
+ clear_buffer_delay(bh);
+ }
+ return 0;
+}
+
static void ext3_invalidatepage(struct page *page, unsigned long offset)
{
- journal_t *journal = EXT3_JOURNAL(page->mapping->host);
+ struct inode *inode = page->mapping->host;
+ journal_t *journal = EXT3_JOURNAL(inode);
+ int bsize = 1 << inode->i_blkbits;

/*
* If it's a full truncate we just forget about the pending dirtying
*/
if (offset == 0)
ClearPageChecked(page);
+ if (page_has_buffers(page)) {
+ walk_page_buffers(NULL, page_buffers(page), offset + bsize - 1,
+ PAGE_CACHE_SIZE, NULL, truncate_delayed_bh);
+ }

journal_invalidatepage(journal, page, offset);
}
@@ -2044,6 +2296,7 @@ static inline int all_zeroes(__le32 *p, __le32 *q)
/**
* ext3_find_shared - find the indirect blocks for partial truncation.
* @inode: inode in question
+ * @iblock: number of the first truncated block
* @depth: depth of the affected branch
* @offsets: offsets of pointers in that branch (see ext3_block_to_path)
* @chain: place to store the pointers to partial indirect blocks
@@ -2076,8 +2329,8 @@ static inline int all_zeroes(__le32 *p, __le32 *q)
* c) free the subtrees growing from the inode past the @chain[0].
* (no partially truncated stuff there). */

-static Indirect *ext3_find_shared(struct inode *inode, int depth,
- int offsets[4], Indirect chain[4], __le32 *top)
+static Indirect *ext3_find_shared(struct inode *inode, sector_t iblock,
+ int depth, int offsets[4], Indirect chain[4], __le32 *top)
{
Indirect *partial, *p;
int k, err;
@@ -2097,8 +2350,22 @@ static Indirect *ext3_find_shared(struct inode *inode, int depth,
if (!partial->key && *partial->p)
/* Writer: end */
goto no_top;
+ /*
+ * If we don't truncate the whole indirect block and there are some
+ * delay allocated blocks in it (must be before the truncation point
+ * as ext3_invalidatepage() has been already run for others), we must
+ * keep the indirect block as reservation has been already spent on
+ * its allocation.
+ */
+ if (partial == chain + depth - 1 &&
+ ext3_find_da_indirect(inode, iblock, NULL)) {
+ p = partial;
+ goto shared_ind_found;
+ }
+
for (p=partial; p>chain && all_zeroes((__le32*)p->bh->b_data,p->p); p--)
;
+shared_ind_found:
/*
* OK, we've found the last block that must survive. The rest of our
* branch should be detached before unlocking. However, if that rest
@@ -2516,7 +2783,7 @@ void ext3_truncate(struct inode *inode)
goto do_indirects;
}

- partial = ext3_find_shared(inode, n, offsets, chain, &nr);
+ partial = ext3_find_shared(inode, last_block, n, offsets, chain, &nr);
/* Kill the top of shared branch (not detached) */
if (nr) {
if (partial == chain) {
@@ -3493,3 +3760,42 @@ int ext3_change_inode_journal_flag(struct inode *inode, int val)

return err;
}
+
+/*
+ * Reserve block writes instead of allocation. Called only on buffer heads
+ * attached to a page (and thus for 1 block).
+ */
+static int ext3_da_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh, int create)
+{
+ int ret;
+
+ /* Buffer has already blocks reserved? */
+ if (buffer_delay(bh))
+ return 0;
+
+ ret = ext3_get_blocks_handle(NULL, inode, iblock, 1, bh, 0);
+ if (ret < 0)
+ return ret;
+ if (ret > 0 || !create)
+ return 0;
+ ret = ext3_rsv_da_block(inode, iblock);
+ if (ret < 0)
+ return ret;
+ set_buffer_delay(bh);
+ set_buffer_new(bh);
+ return 0;
+}
+
+int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ int retry = 0;
+ int ret;
+ struct super_block *sb = vma->vm_file->f_path.mnt->mnt_sb;
+
+ do {
+ ret = block_page_mkwrite(vma, vmf, ext3_da_get_block);
+ } while (ret == VM_FAULT_SIGBUS &&
+ ext3_should_retry_alloc(sb, &retry));
+ return ret;
+}
diff --git a/fs/ext3/resize.c b/fs/ext3/resize.c
index 0ccd7b1..91d1ae1 100644
--- a/fs/ext3/resize.c
+++ b/fs/ext3/resize.c
@@ -929,7 +929,7 @@ int ext3_group_add(struct super_block *sb, struct ext3_new_group_data *input)
le32_add_cpu(&es->s_r_blocks_count, input->reserved_blocks);

/* Update the free space counts */
- percpu_counter_add(&sbi->s_freeblocks_counter,
+ percpu_counter_add(&sbi->s_alloc_counter.free,
input->free_blocks_count);
percpu_counter_add(&sbi->s_freeinodes_counter,
EXT3_INODES_PER_GROUP(sb));
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 5dbf4db..c5b7f39 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -431,7 +431,7 @@ static void ext3_put_super (struct super_block * sb)
for (i = 0; i < sbi->s_gdb_count; i++)
brelse(sbi->s_group_desc[i]);
kfree(sbi->s_group_desc);
- percpu_counter_destroy(&sbi->s_freeblocks_counter);
+ dac_destroy(&sbi->s_alloc_counter);
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
brelse(sbi->s_sbh);
@@ -482,6 +482,11 @@ static struct inode *ext3_alloc_inode(struct super_block *sb)
ei->vfs_inode.i_version = 1;
atomic_set(&ei->i_datasync_tid, 0);
atomic_set(&ei->i_sync_tid, 0);
+#ifdef CONFIG_QUOTA
+ ei->i_reserved_quota = 0;
+#endif
+ ei->i_da_indirect = RB_ROOT;
+
return &ei->vfs_inode;
}

@@ -742,8 +747,17 @@ static ssize_t ext3_quota_read(struct super_block *sb, int type, char *data,
size_t len, loff_t off);
static ssize_t ext3_quota_write(struct super_block *sb, int type,
const char *data, size_t len, loff_t off);
+#ifdef CONFIG_QUOTA
+qsize_t *ext3_get_reserved_space(struct inode *inode)
+{
+ return &EXT3_I(inode)->i_reserved_quota;
+}
+#endif

static const struct dquot_operations ext3_quota_operations = {
+#ifdef CONFIG_QUOTA
+ .get_reserved_space = ext3_get_reserved_space,
+#endif
.write_dquot = ext3_write_dquot,
.acquire_dquot = ext3_acquire_dquot,
.release_dquot = ext3_release_dquot,
@@ -1946,8 +1960,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
"mounting ext3 over ext2?");
goto failed_mount2;
}
- err = percpu_counter_init(&sbi->s_freeblocks_counter,
- ext3_count_free_blocks(sb));
+ err = dac_init(&sbi->s_alloc_counter, ext3_count_free_blocks(sb));
if (!err) {
err = percpu_counter_init(&sbi->s_freeinodes_counter,
ext3_count_free_inodes(sb));
@@ -2036,7 +2049,7 @@ cantfind_ext3:
goto failed_mount;

failed_mount3:
- percpu_counter_destroy(&sbi->s_freeblocks_counter);
+ dac_destroy(&sbi->s_alloc_counter);
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
journal_destroy(sbi->s_journal);
@@ -2723,7 +2736,7 @@ static int ext3_statfs (struct dentry * dentry, struct kstatfs * buf)
buf->f_type = EXT3_SUPER_MAGIC;
buf->f_bsize = sb->s_blocksize;
buf->f_blocks = le32_to_cpu(es->s_blocks_count) - sbi->s_overhead_last;
- buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter);
+ buf->f_bfree = dac_get_avail_sum(&sbi->s_alloc_counter);
buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
buf->f_bavail = 0;
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 6ce1bca..e24a355 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -837,12 +837,14 @@ ext3_group_first_block_no(struct super_block *sb, unsigned long group_no)
# define NORET_AND noreturn,

/* balloc.c */
+extern ext3_fsblk_t ext3_free_blocks_limit(struct super_block *sb);
extern int ext3_bg_has_super(struct super_block *sb, int group);
extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group);
extern ext3_fsblk_t ext3_new_block (handle_t *handle, struct inode *inode,
ext3_fsblk_t goal, int *errp);
extern ext3_fsblk_t ext3_new_blocks (handle_t *handle, struct inode *inode,
- ext3_fsblk_t goal, unsigned long *count, int *errp);
+ ext3_fsblk_t goal, unsigned long *count,
+ unsigned int reserved, int *errp);
extern void ext3_free_blocks (handle_t *handle, struct inode *inode,
ext3_fsblk_t block, unsigned long count);
extern void ext3_free_blocks_sb (handle_t *handle, struct super_block *sb,
@@ -908,6 +910,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
extern void ext3_set_aops(struct inode *inode);
extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 len);
+extern int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);

/* ioctl.c */
extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
diff --git a/include/linux/ext3_fs_i.h b/include/linux/ext3_fs_i.h
index f42c098..10e7703 100644
--- a/include/linux/ext3_fs_i.h
+++ b/include/linux/ext3_fs_i.h
@@ -64,6 +64,20 @@ struct ext3_block_alloc_info {
#define rsv_start rsv_window._rsv_start
#define rsv_end rsv_window._rsv_end

+
+#define EXT3_DA_ALLOC_FL 0x0001 /* Indirect block is allocated */
+/*
+ * Structure recording information about indirect block with delayed allocated
+ * data blocks beneath.
+ */
+struct ext3_da_indirect {
+ struct rb_node node;
+ __u32 offset; /* Offset of indirect block */
+ unsigned short data_blocks; /* Number of delayed allocated data
+ * blocks below this indirect block */
+ unsigned short flags;
+};
+
/*
* third extended file system inode data in memory
*/
@@ -92,6 +106,9 @@ struct ext3_inode_info {
/* block reservation info */
struct ext3_block_alloc_info *i_block_alloc_info;

+ /* RB-tree with information about delayed-allocated indirect blocks */
+ struct rb_root i_da_indirect;
+
__u32 i_dir_start_lookup;
#ifdef CONFIG_EXT3_FS_XATTR
/*
@@ -125,6 +142,9 @@ struct ext3_inode_info {

/* on-disk additional length */
__u16 i_extra_isize;
+#ifdef CONFIG_QUOTA
+ qsize_t i_reserved_quota;
+#endif

/*
* truncate_mutex is for serialising ext3_truncate() against
diff --git a/include/linux/ext3_fs_sb.h b/include/linux/ext3_fs_sb.h
index 258088a..54909d0 100644
--- a/include/linux/ext3_fs_sb.h
+++ b/include/linux/ext3_fs_sb.h
@@ -21,6 +21,7 @@
#include <linux/wait.h>
#include <linux/blockgroup_lock.h>
#include <linux/percpu_counter.h>
+#include <linux/delalloc_counter.h>
#endif
#include <linux/rbtree.h>

@@ -58,7 +59,7 @@ struct ext3_sb_info {
u32 s_hash_seed[4];
int s_def_hash_version;
int s_hash_unsigned; /* 3 if hash should be signed, 0 if not */
- struct percpu_counter s_freeblocks_counter;
+ struct delalloc_counter s_alloc_counter;
struct percpu_counter s_freeinodes_counter;
struct percpu_counter s_dirs_counter;
struct blockgroup_lock *s_blockgroup_lock;
--
1.6.4.2


2010-10-09 07:44:50

by Christoph Hellwig

Subject: Re: [PATCH 2/3] vfs: Implement generic per-cpu counters for delayed allocation

On Sat, Oct 09, 2010 at 02:12:26AM +0200, Jan Kara wrote:
> Implement free blocks and reserved blocks counters for delayed allocation.
> These counters are reliable in the sense that when they return success, the
> subsequent conversion from reserved to allocated blocks always succeeds (see
> comments in the code for details). This is useful for ext3 filesystem to
> implement delayed allocation in particular for allocation in page_mkwrite.

This doesn't really look like generic code that should go into the core
kernel. I'd just add it to ext3 directly.
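
The guarantee described in the quoted changelog (a successful reservation can always be converted into an allocation later) can be sketched in userspace C. This is only a single-threaded model under stated assumptions: the names da_counter, da_reserve, da_convert and da_cancel are made up for illustration and are not the patch's API, and the real code uses per-cpu counters rather than plain longs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a delayed-allocation counter (not the patch's code). */
struct da_counter {
	long free;	/* blocks not yet allocated on disk */
	long reserved;	/* blocks promised to delayed allocations */
};

/* Reserve n blocks; refuse if the promise could not be honoured later. */
bool da_reserve(struct da_counter *c, long n)
{
	if (c->free - c->reserved < n)
		return false;
	c->reserved += n;
	return true;
}

/* Turn n previously reserved blocks into real allocations. */
void da_convert(struct da_counter *c, long n)
{
	/* Holds whenever every reservation went through da_reserve(). */
	assert(n <= c->reserved && n <= c->free);
	c->reserved -= n;
	c->free -= n;
}

/* Return n reserved blocks that turned out not to be needed. */
void da_cancel(struct da_counter *c, long n)
{
	assert(n <= c->reserved);
	c->reserved -= n;
}
```

Because da_reserve() never lets reserved exceed free, da_convert() cannot run out of blocks as long as all allocations are accounted through the same counter.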


2010-10-09 18:04:03

by Theodore Ts'o

Subject: Re: [PATCH RFC 0/3] Block reservation for ext3

On Sat, Oct 09, 2010 at 02:12:24AM +0200, Jan Kara wrote:
>
> currently, when mmapped write is done to a file backed by ext3, the
> filesystem does nothing to make sure blocks will be available when we need
> to write them out.

Hmm, you've done all of this work already, so this isn't the best time
to suggest this, but I wonder if we've explored all of the
alternatives that might allow for a less drastic set of changes to
ext3, just out of stability's sake.

How often do legitimate workloads mmap a sparse file then write into
it? As I recall, the original POSIX.1 spec didn't allow mmap beyond
the end of the file; this I believe was lifted later on (at least I
don't see it in SUSv3 spec).

If it's not all that common, then other options are:

1) Fail an mmap with EINVAL if there is an attempt to map a file
region which is either sparse or extends beyond the end of a file.
This is probably not a great alternative, but it's a possibility.

2) Allocate all of the pages that are not allocated at mmap time.
Since ext3 doesn't have space for an uninitialized bit, we'd have to
either (2a) force a disk write out for all of the newly initialized
pages, or (2b) keep track of the allocated disk blocks in memory, but
don't actually write the block mappings to the indirect blocks until
the blocks are actually written out. (This last might be just as
complex, alas).

3) Keep a global counter of sparse blocks which are mapped at mmap()
time, and update it as blocks are allocated, or when the region is
freed at munmap() time.

#3 might be much simpler, at the end of the day. Note that there are
some Japanese customers that really freaked with ext4 just because it
was *different*, and begged a distribution not to ship ext4 because it
might destabilize their customers. Not that I think we are obliged to
listen to some of the more extremely conservative customers, but there
was something nice about telling people (well, if you want something
which is nice and stable and conservative, you can pick ext3).

Do we really have legitimate and common workloads which are allocating
blocks by writing into an mmapped region? I wasn't aware of such
beasts, but maybe they are out there...

- Ted

2010-10-11 14:29:12

by Jan Kara

Subject: Re: [PATCH RFC 0/3] Block reservation for ext3

On Sat 09-10-10 14:03:58, Ted Ts'o wrote:
> On Sat, Oct 09, 2010 at 02:12:24AM +0200, Jan Kara wrote:
> >
> > currently, when mmapped write is done to a file backed by ext3, the
> > filesystem does nothing to make sure blocks will be available when we need
> > to write them out.
>
> Hmm, you've done all of this work already, so this isn't the best time
> to suggest this, but I wonder if we've explored all of the
> alternatives that might allow for a less drastic set of changes to
> ext3, just out of stability's sake.
Yeah, I understand that, and I've also been thinking for some time about
whether I can avoid implementing block reservation, but I haven't come up with
anything really acceptable. Moreover, unless we write via mmap to a sparse
file, the code paths taken are changed only a little (only when and how
we account for allocated blocks)...

> How often do legitimate workloads mmap a sparse file then write into
> it? As I recall, the original POSIX.1 spec didn't allow mmap beyond
> the end of the file; this I believe was lifted later on (at least I
> don't see it in SUSv3 spec).
Well, mmap beyond EOF is still undefined AFAIK (although Linux
traditionally supports it) but mmap of sparse files was always supposed
to work. My favorite user of sparse-file mmap is Berkeley DB, some torrent
clients do that as well and I believe there are others. So it's not the most
common thing but it happens often enough.

> If it's not all that common, then other options are:
>
> 1) Fail an mmap with EINVAL if there is an attempt to map a file
> region which is either sparse or extends beyond the end of a file.
> This is probably not a great alternative, but it's a possibility.
This is no-go IMHO. We would surely get lots of users complaining...

> 2) Allocate all of the pages that are not allocated at mmap time.
> Since ext3 doesn't have space for an uninitialized bit, we'd have to
> either (2a) force a disk write out for all of the newly initialized
> pages, or (2b) keep track of the allocated disk blocks in memory, but
> don't actually write the block mappings to the indirect blocks until
> the blocks are actually written out. (This last might be just as
> complex, alas).
Doing allocation at mmap time does not really work - on each mmap we
would have to map blocks for the whole file which would make mmap a really
expensive operation. Doing it at page-fault time as you suggest in (2a) works
(that's the second plausible option IMO) but the increased fragmentation
and thus loss of performance is rather noticeable. I don't have current
numbers but when I tried that last year Berkeley DB was like two or three
times slower.
In your (2b) suggestion, I don't see how we would avoid leaking allocated
blocks when we crash before writing allocation to indirect block. Also the
fragmentation problem which seems to be the main source of performance
issues would stay the same.

> 3) Keep a global counter of sparse blocks which are mapped at mmap()
> time, and update it as blocks are allocated, or when the region is
> freed at munmap() time.
Here again I see the problem that mapping all file blocks at mmap time
is rather expensive and so does not seem viable to me. Also the
overestimation of needed blocks could be rather huge.

> #3 might be much simpler, at the end of the day. Note that there are
> some Japanese customers that really freaked with ext4 just because it
> was *different*, and begged a distribution not to ship ext4 because it
> might destabilize their customers. Not that I think we are obliged to
> listen to some of the more extremely conservative customers, but there
> was something nice about telling people (well, if you want something
> which is nice and stable and conservative, you can pick ext3).
I'm aware of this. Actually, the user observable differences should be
rather minimal. The only one I'm aware of is that you can get SIGSEGV at
page fault time because the filesystem runs out of disk space (or out of
disk quota) which seems better than throwing away the data later. Also I
don't think anybody serious runs systems close to ENOSPC regularly and if
that happens accidentally, manual intervention is usually needed anyway...
Thanks for your ideas!

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-10-11 22:00:21

by Andrew Morton

Subject: Re: [PATCH RFC 0/3] Block reservation for ext3

On Mon, 11 Oct 2010 16:28:13 +0200
Jan Kara <[email protected]> wrote:

> On Sat 09-10-10 14:03:58, Ted Ts'o wrote:
> > On Sat, Oct 09, 2010 at 02:12:24AM +0200, Jan Kara wrote:
> > >
> > > currently, when mmapped write is done to a file backed by ext3, the
> > > filesystem does nothing to make sure blocks will be available when we need
> > > to write them out.

I thought we'd actually fixed this. I guess we didn't. I think what
we did do was to ensure that a subsequent fsync()/msync() would
reliably report the data loss (has anyone tested this in the past few
years??). This is something, but it's quite lame.

> > Hmm, you've done all of this work already, so this isn't the best time
> > to suggest this, but I wonder if we've explored all of the
> > alternatives that might allow for a less drastic set of changes to
> > ext3, just out of stability's sake.
> Yeah, I understand that, and I've also been thinking for some time about
> whether I can avoid implementing block reservation, but I haven't come up with
> anything really acceptable. Moreover, unless we write via mmap to a sparse
> file, the code paths taken are changed only a little (only when and how
> we account for allocated blocks)...
>
> > How often do legitimate workloads mmap a sparse file then write into
> > it? As I recall, the original POSIX.1 spec didn't allow mmap beyond
> > the end of the file; this I believe was lifted later on (at least I
> > don't see it in SUSv3 spec).
> Well, mmap beyond EOF is still undefined AFAIK (although Linux
> traditionally supports it) but mmap of sparse files was always supposed
> to work. My favorite user of sparse-file mmap is Berkeley DB, some torrent
> clients do that as well and I believe there are others. So it's not the most
> common thing but it happens often enough.

Yes, people do this. With a 64-bit address space they create a
gargantuan mmap of the entire database and just populate teeny bits of
it simply with CPU stores. They'd be unhappy if the kernel started
instantiating every block within the mmap()!

> > If it's not all that common, then other options are:
> >
> > 1) Fail an mmap with EINVAL if there is an attempt to map a file
> > region which is either sparse or extends beyond the end of a file.
> > This is probably not a great alternative, but it's a possibility.
> This is no-go IMHO. We would surely get lots of users complaining...
>
> > 2) Allocate all of the pages that are not allocated at mmap time.
> > Since ext3 doesn't have space for an uninitialized bit, we'd have to
> > either (2a) force a disk write out for all of the newly initialized
> > pages, or (2b) keep track of the allocated disk blocks in memory, but
> > don't actually write the block mappings to the indirect blocks until
> > the blocks are actually written out. (This last might be just as
> > complex, alas).
> Doing allocation at mmap time does not really work - on each mmap we
> would have to map blocks for the whole file which would make mmap a really
> expensive operation. Doing it at page-fault time as you suggest in (2a) works
> (that's the second plausible option IMO) but the increased fragmentation
> and thus loss of performance is rather noticeable. I don't have current
> numbers but when I tried that last year Berkeley DB was like two or three
> times slower.

ouch.

Can we fix the layout problem? Are reservation windows of no use here?

> In your (2b) suggestion, I don't see how we would avoid leaking allocated
> blocks when we crash before writing allocation to indirect block. Also the
> fragmentation problem which seems to be the main source of performance
> issues would stay the same.
>
> > 3) Keep a global counter of sparse blocks which are mapped at mmap()
> > time, and update it as blocks are allocated, or when the region is
> > freed at munmap() time.
> Here again I see the problem that mapping all file blocks at mmap time
> is rather expensive and so does not seem viable to me. Also the
> overestimation of needed blocks could be rather huge.

When I did ext2 delayed allocation back in, err, 2001 I had
considerable trouble working out how many blocks to actually reserve
for a file block, because it also had to reserve the indirect blocks.
One file block allocation can result in reserving four disk blocks!
And iirc it was not possible with existing in-core data structures to
work out whether all four blocks needed reserving until the actual
block allocation had occurred. So I ended up reserving the worst-case
number of indirects, based upon the file offset. If the disk ran out
of "space" I'd do a forced writeback to empty all the reservations and
would then take a look to see if the disk was _really_ out of space.

Is all of this an issue with this work? If so, what approach did you
take?
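
The worst-case estimate described above (reserve every indirect block on the path, based only on the file offset) can be sketched as follows. This is an illustration, not code from the ext2 delalloc work or from this patch set: the function names are hypothetical, and it assumes the classic ext2/ext3 layout of 12 direct pointers followed by single, double and triple indirect trees.

```c
#include <assert.h>

#define NDIR_BLOCKS 12	/* direct block pointers in an ext2/ext3 inode */

/*
 * Worst-case number of indirect blocks needed to map file block iblock,
 * assuming none of them exists yet. ptrs is the number of block pointers
 * per indirect block (block size / 4). Returns -1 if iblock lies beyond
 * the triple-indirect range.
 */
int worst_case_indirect(unsigned long long iblock, unsigned long long ptrs)
{
	assert(ptrs > 0);
	if (iblock < NDIR_BLOCKS)
		return 0;
	iblock -= NDIR_BLOCKS;
	if (iblock < ptrs)
		return 1;
	iblock -= ptrs;
	if (iblock < ptrs * ptrs)
		return 2;
	iblock -= ptrs * ptrs;
	if (iblock < ptrs * ptrs * ptrs)
		return 3;
	return -1;
}

/* Blocks to reserve for one delayed data block: data plus the whole path. */
int worst_case_reservation(unsigned long long iblock, unsigned long long ptrs)
{
	int ind = worst_case_indirect(iblock, ptrs);

	return ind < 0 ? -1 : ind + 1;
}
```

With 4 KiB blocks (ptrs = 1024), a block in the triple-indirect range yields the four reserved disk blocks mentioned above.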

> > #3 might be much simpler, at the end of the day. Note that there are
> > some Japanese customers that really freaked with ext4 just because it
> > was *different*, and begged a distribution not to ship ext4 because it
> > might destabilize their customers. Not that I think we are obliged to
> > listen to some of the more extremely conservative customers, but there
> > was something nice about telling people (well, if you want something
> > which is nice and stable and conservative, you can pick ext3).
> I'm aware of this. Actually, the user observable differences should be
> rather minimal. The only one I'm aware of is that you can get SIGSEGV at
> page fault time because the filesystem runs out of disk space (or out of
> disk quota) which seems better than throwing away the data later. Also I
> don't think anybody serious runs systems close to ENOSPC regularly and if
> that happens accidentally, manual intervention is usually needed anyway...

Gee. I remember people having issues with forcing the SEGV at
pagefault time. It _is_ a behaviour change: the application might be
about to free up some disk space, so the msync() would have succeeded
anyway.

iirc another issue was that the standards (posix?) don't anticipate
getting a SEGV in response to ENOSPC. There might have been other
concerns - it's all foggy now.


Our general answer to this overall problem is: "run msync() and check
the result". That's a bit weaselly, but it's not a _bad_ answer.
After all, there might be an EIO as well! So a good application should
be checking for both ENOSPC and EIO. Your patches only address the
ENOSPC.
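
For reference, the "run msync() and check the result" pattern looks roughly like this from userspace. It is a minimal sketch, assuming a POSIX system; the helper name sparse_mmap_store and its return convention are made up for illustration, and whether a given kernel reports ENOSPC reliably at this point is exactly what is being discussed in this thread.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Store one byte into a sparse file through a shared mapping and flush it.
 * Returns 0 on success or a negative errno value.
 */
int sparse_mmap_store(const char *path, off_t size, off_t pos, char value)
{
	char *map;
	int fd, ret = 0;

	if (size <= 0 || pos < 0 || pos >= size)
		return -EINVAL;
	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -errno;
	/* ftruncate() extends the file without allocating data blocks. */
	if (ftruncate(fd, size) < 0) {
		ret = -errno;
		goto out_close;
	}
	map = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		ret = -errno;
		goto out_close;
	}
	map[pos] = value;	/* plain CPU store into a hole */
	/* MS_SYNC waits for writeback, so a writeback error can surface here. */
	if (msync(map, (size_t)size, MS_SYNC) < 0)
		ret = -errno;
	if (munmap(map, (size_t)size) < 0 && ret == 0)
		ret = -errno;
out_close:
	if (close(fd) < 0 && ret == 0)
		ret = -errno;
	return ret;
}
```

An application that cares about its data would treat any nonzero return as possible data loss, whether the underlying cause was ENOSPC, a quota limit or EIO.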


2010-10-12 23:15:09

by Jan Kara

Subject: Re: [PATCH RFC 0/3] Block reservation for ext3

On Mon 11-10-10 14:59:45, Andrew Morton wrote:
> On Mon, 11 Oct 2010 16:28:13 +0200
> Jan Kara <[email protected]> wrote:
>
> > On Sat 09-10-10 14:03:58, Ted Ts'o wrote:
> > > On Sat, Oct 09, 2010 at 02:12:24AM +0200, Jan Kara wrote:
> > > >
> > > > currently, when mmapped write is done to a file backed by ext3, the
> > > > filesystem does nothing to make sure blocks will be available when we need
> > > > to write them out.
>
> I thought we'd actually fixed this. I guess we didn't. I think what
> we did do was to ensure that a subsequent fsync()/msync() would
> reliably report the data loss (has anyone tested this in the past few
> years??). This is something, but it's quite lame.
Yes, that's what we do these days - we set a bit in the address space in
generic_writepages() and the nearest syncing function (in fact the first
caller of filemap_fdatawait()) will get the error. It's kind of suboptimal
that if e.g. sys_sync() runs before you manage to call fsync(), you've just
lost the chance to see a possible error. So I agree the current interface is
lame (not that I know of anything better, at least for EIO handling)...

> > > 2) Allocate all of the pages that are not allocated at mmap time.
> > > Since ext3 doesn't have space for an uninitialized bit, we'd have to
> > > either (2a) force a disk write out for all of the newly initialized
> > > pages, or (2b) keep track of the allocated disk blocks in memory, but
> > > don't actually write the block mappings to the indirect blocks until
> > > the blocks are actually written out. (This last might be just as
> > > complex, alas).
> > Doing allocation at mmap time does not really work - on each mmap we
> > would have to map blocks for the whole file which would make mmap a really
> > expensive operation. Doing it at page-fault time as you suggest in (2a) works
> > (that's the second plausible option IMO) but the increased fragmentation
> > and thus loss of performance is rather noticeable. I don't have current
> > numbers but when I tried that last year Berkeley DB was like two or three
> > times slower.
>
> ouch.
>
> Can we fix the layout problem? Are reservation windows of no use here?
Reservation windows do not work for this load. The reason is that the
page-fault order is completely random so we just spend time creating and
removing tiny reservation windows because the next page fault doing
allocation is rarely close enough to fall into the small window.
The logic in ext3_find_goal() ends up picking blocks close together for
blocks belonging to the same indirect block if we are lucky but they
definitely won't be sequentially ordered. For Berkeley DB the situation is
made worse by the fact that there are several database files and their
blocks end up interleaved.
So we could improve the layout but we'd have to tweak the reservation
logic and allocator and it's not completely clear to me how.
One thing to note is that currently, ext3 *is* in fact doing delayed
allocation for writes via mmap. We just never called it like that and never
bothered to do proper space estimation...

> > > 3) Keep a global counter of sparse blocks which are mapped at mmap()
> > > time, and update it as blocks are allocated, or when the region is
> > > freed at munmap() time.
> > Here again I see the problem that mapping all file blocks at mmap time
> > is rather expensive and so does not seem viable to me. Also the
> > overestimation of needed blocks could be rather huge.
>
> When I did ext2 delayed allocation back in, err, 2001 I had
> considerable trouble working out how many blocks to actually reserve
> for a file block, because it also had to reserve the indirect blocks.
> One file block allocation can result in reserving four disk blocks!
> And iirc it was not possible with existing in-core data structures to
> work out whether all four blocks needed reserving until the actual
> block allocation had occurred. So I ended up reserving the worst-case
> number of indirects, based upon the file offset. If the disk ran out
> of "space" I'd do a forced writeback to empty all the reservations and
> would then take a look to see if the disk was _really_ out of space.
>
> Is all of this an issue with this work? If so, what approach did you
> take?
Yeah, I've spotted exactly the same problem. The way I decided to solve it in
the end is that in memory we keep track of each indirect block that has a
delay-allocated buffer under it. This allows us to reserve space for each
indirect block at most once (I didn't bother with making the accounting
precise for double or triple indirect blocks, so when I need to reserve
space for an indirect block, I reserve the whole path just to be sure). This
pushes the error in estimation into a rather acceptable range for reasonably
common workloads - the error can still be 50% for workloads which use just
one data block in each indirect block, but even in this case the absolute
number of blocks falsely reserved is small.
The cost is of course increased complexity of the code, the memory
spent on tracking those indirect blocks (32 bytes per indirect block), and
some time for lookups in the RB-tree of the structures. At least the nice
thing is that when there are no delay-allocated blocks, there isn't any
overhead (the tree is empty).
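
The bookkeeping described above can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the patch's code: it uses a linked list where the patch uses an RB-tree, the names da_inode, da_reserve_block and da_cancel_block are hypothetical, the caller is assumed never to reserve the same block twice (the patch checks buffer_delay() for that), and releasing the path reservation when the last delayed block goes away is a simplification of this model.

```c
#include <assert.h>
#include <stdlib.h>

#define NDIR_BLOCKS 12	/* direct block pointers in the inode */

/* One record per leaf indirect block with delayed data blocks below it. */
struct da_indirect {
	struct da_indirect *next;
	unsigned long index;		/* which leaf indirect block */
	unsigned int data_blocks;	/* delayed data blocks below it */
	int path_len;			/* indirect blocks reserved with it */
};

struct da_inode {
	struct da_indirect *list;	/* the patch uses an RB-tree instead */
	unsigned long ptrs;		/* block pointers per indirect block */
	long reserved;			/* blocks currently reserved */
};

/* Return the link pointing at the record for index, or at the list tail. */
static struct da_indirect **da_find(struct da_inode *in, unsigned long index)
{
	struct da_indirect **pp = &in->list;

	while (*pp && (*pp)->index != index)
		pp = &(*pp)->next;
	return pp;
}

/* Reserve space for one delayed data block; returns blocks reserved or -1. */
long da_reserve_block(struct da_inode *in, unsigned long iblock)
{
	long need = 1;			/* the data block itself */

	if (iblock >= NDIR_BLOCKS) {
		unsigned long index = (iblock - NDIR_BLOCKS) / in->ptrs;
		struct da_indirect **pp = da_find(in, index);

		if (!*pp) {
			struct da_indirect *ind = calloc(1, sizeof(*ind));

			if (!ind)
				return -1;
			ind->index = index;
			/* Whole path: single, double or triple indirect. */
			ind->path_len = index == 0 ? 1 : index <= in->ptrs ? 2 : 3;
			*pp = ind;	/* append at the tail */
			need += ind->path_len;
		}
		(*pp)->data_blocks++;
	}
	in->reserved += need;
	return need;
}

/* Drop the reservation of one delayed data block; returns blocks freed or -1. */
long da_cancel_block(struct da_inode *in, unsigned long iblock)
{
	long freed = 1;

	if (iblock >= NDIR_BLOCKS) {
		unsigned long index = (iblock - NDIR_BLOCKS) / in->ptrs;
		struct da_indirect **pp = da_find(in, index);
		struct da_indirect *ind = *pp;

		if (!ind)
			return -1;	/* nothing reserved under this block */
		if (--ind->data_blocks == 0) {
			freed += ind->path_len;
			*pp = ind->next;
			free(ind);
		}
	}
	assert(in->reserved >= freed);
	in->reserved -= freed;
	return freed;
}
```

The second delayed block under the same indirect block costs only one reserved block, which is where the scheme saves space compared with reserving the worst case for every data block.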

> > > #3 might be much simpler, at the end of the day. Note that there are
> > > some Japanese customers that really freaked with ext4 just because it
> > > was *different*, and begged a distribution not to ship ext4 because it
> > > might destabilize their customers. Not that I think we are obliged to
> > > listen to some of the more extremely conservative customers, but there
> > > was something nice about telling people (well, if you want something
> > > which is nice and stable and conservative, you can pick ext3).
> > I'm aware of this. Actually, the user observable differences should be
> > rather minimal. The only one I'm aware of is that you can get SIGSEGV at
> > page fault time because the filesystem runs out of disk space (or out of
> > disk quota) which seems better than throwing away the data later. Also I
> > don't think anybody serious runs systems close to ENOSPC regularly and if
> > that happens accidentally, manual intervention is usually needed anyway...
>
> Gee. I remember people having issues with forcing the SEGV at
> pagefault time. It _is_ a behaviour change: the application might be
> about to free up some disk space, so the msync() would have succeeded
> anyway.
>
> iirc another issue was that the standards (posix?) don't anticipate
> getting a SEGV in response to ENOSPC. There might have been other
> concerns - it's all foggy now.
>
> Our general answer to this overall problem is: "run msync() and check
> the result". That's a bit weaselly, but it's not a _bad_ answer.
> After all, there might be an EIO as well! So a good application should
> be checking for both ENOSPC and EIO. Your patches only address the
> ENOSPC.
Yes, here my main concern is that the patch set is not only about ENOSPC
(I can imagine we could live with that, as we have lived with it up to now)
but also about the quota problem. To reiterate - if the allocation happens
during writeback, we don't know who originally did the write and thus
whether he was allowed to exceed the quota limit or not. Currently, since
flusher threads run as root, we always ignore quota limits and thus a user
can write an arbitrary amount of data by writing via mmap. Sysadmins don't
like that... BTW the same problem happens with checking reserved space for
root in ext? filesystems.
I don't see a different solution than to check quotas at page fault
because that is the only moment when we know the identity of the writer and
if quota check fails we have to refuse the fault - SIGSEGV is the only
option I know about. And when I have to do all the reservation because of
quotas, ENOSPC handling is a nice bonus.
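
For comparison, the "run msync() and check the result" advice quoted above
could look roughly like this in an application. This is a minimal user-space
sketch, not code from the patch series: the helper name mmap_write_checked()
and its parameters are made up for illustration. It only reports errors that
surface at writeback time; a reservation failure delivered as a signal at
page-fault time, as proposed in this series, would not show up here.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: fill `len` bytes of `path` with `fill` through a
 * shared mapping and report writeback errors via msync().
 * Returns 0 on success or a negative errno (e.g. -ENOSPC, -EDQUOT, -EIO).
 */
static int mmap_write_checked(const char *path, size_t len, char fill)
{
	int fd, ret = 0;
	char *map;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -errno;
	if (ftruncate(fd, (off_t)len) < 0) {	/* sparse file, no data blocks yet */
		ret = -errno;
		goto out_close;
	}
	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		ret = -errno;
		goto out_close;
	}
	memset(map, fill, len);			/* dirties the pages via write faults */
	if (msync(map, len, MS_SYNC) < 0)	/* forces writeback and reports its result */
		ret = -errno;
	munmap(map, len);
out_close:
	if (close(fd) < 0 && ret == 0)
		ret = -errno;
	return ret;
}
```

A caller would treat any nonzero return as "the data may not be on disk" and
handle -ENOSPC, -EDQUOT and -EIO the same way it handles a failed write().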

IMHO there are three separate questions:
a) Do we want to fix the quota problem?
- I'm convinced that yes.
b) Can we solve it without behavior change of sending SIGSEGV on error?
- I don't see how but maybe you have some bright idea...
c) When we decide some reservation scheme is unavoidable, there is question
how to estimate amount of indirect blocks. My scheme is one possibility,
but there is a wider variety of tradeoffs between complexity and
accuracy. A special low effort, low impact possibility here might be to
just ignore the ENOSPC problem as we did so far, reserve only quota for
data block on page fault, and rely on the fact that there isn't going to
be that much metadata so user cannot exceed his quota limit by too
much... But when we already have the interface change, it seems a bit
stupid not to fix it properly and also handle ENOSPC with it.
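
To make question c) concrete, the simplest estimate is the worst case
based on the file offset alone: one data block plus every indirect block on
its path. The sketch below assumes the classic ext2/ext3 block map (12
direct pointers, 32-bit block pointers); the function names are invented
for illustration and do not come from the patches.

```c
#include <stdint.h>

#define NDIR_BLOCKS 12	/* direct block pointers in an ext2/ext3 inode */

/*
 * Number of indirect blocks on the path to logical block `iblock`:
 * 0 for direct blocks, 1..3 for single/double/triple indirect,
 * -1 if the block lies beyond what the block map can address.
 * `block_size` is the filesystem block size in bytes.
 */
static int indirect_depth(uint64_t iblock, uint32_t block_size)
{
	uint64_t ptrs = block_size / 4;	/* pointers per indirect block */
	uint64_t span = ptrs;		/* blocks covered at this depth */
	int depth;

	if (iblock < NDIR_BLOCKS)
		return 0;
	iblock -= NDIR_BLOCKS;
	for (depth = 1; depth <= 3; depth++) {
		if (iblock < span)
			return depth;
		iblock -= span;
		span *= ptrs;
	}
	return -1;
}

/* Worst case: the data block plus every indirect block on its path. */
static int worst_case_reservation(uint64_t iblock, uint32_t block_size)
{
	int depth = indirect_depth(iblock, block_size);

	return depth < 0 ? -1 : 1 + depth;
}
```

With 4 KiB blocks this yields at most four blocks for a single file block in
the triple-indirect range. Tracking which indirect blocks already have a
reservation trades this overestimate for extra bookkeeping.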

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-10-13 00:17:48

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH RFC 0/3] Block reservation for ext3

On Wed, Oct 13, 2010 at 01:14:08AM +0200, Jan Kara wrote:
> c) When we decide some reservation scheme is unavoidable, there is question
> how to estimate amount of indirect blocks. My scheme is one possibility,
> but there is a wider variety of tradeoffs between complexity and
> accuracy. A special low effort, low impact possibility here might be to
> just ignore the ENOSPC problem as we did so far, reserve only quota for
> data block on page fault, and rely on the fact that there isn't going to
> be that much metadata so user cannot exceed his quota limit by too
> much... But when we already have the interface change, it seems a bit
> stupid not to fix it properly and also handle ENOSPC with it.

We ultimately decided to do two different things for ENOSPC versus
EDQUOTA in ext4. For quota overflow we just assume that the number of
metadata blocks won't be that many, and just allow them to go over
quota. For ENOSPC, we would force writeback to see if it would free
space, and ultimately we would drop out of delayed allocation mode
when we were close to running out of space (and for non-root users we
would depend on the 5% blocks reserved for root).

Yeah, that means if root application mmap's a huge 100GB sparse
region, and we only have 2GB free in the file system, and then the
application proceeds to write to all 100GB of mmap'ed region, there's
a chance data might get silently lost when we drop out of delalloc
mode and we then really do completely run out of space. But really,
what are we supposed to do? Unless you have the kernel break out in
hysterical laughter and reject the mmap at allocation time, I suppose
the only other thing we could do, if silently dropping data is
unacceptable, is we can send the SEGV early even though we might have
a few blocks left. That way the data loss isn't silent (the
application will probably drop core and die instead), so it's no
longer our problem. :-)
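
The ENOSPC policy described above (reserve while there is room, force
writeback when space gets tight, drop out of delayed allocation when nearly
full) can be sketched as a small decision function. The thresholds, names
and the enum below are assumptions chosen for illustration; they are not
the actual ext4 code or its tunables.

```c
#include <stdint.h>

enum da_action {
	DA_RESERVE,		/* plenty of room: reserve now, allocate at writeback */
	DA_FLUSH_THEN_RESERVE,	/* getting tight: force writeback so reservations
				 * become real allocations, then reserve */
	DA_ALLOCATE_NOW,	/* nearly full: leave delalloc mode, allocate at once */
};

/*
 * Hypothetical policy: `free_blocks` is the free space visible to the
 * caller, `reserved_blocks` the outstanding delayed reservations and
 * `watermark` a safety margin. All three are in filesystem blocks.
 */
static enum da_action choose_da_action(uint64_t free_blocks,
				       uint64_t reserved_blocks,
				       uint64_t watermark)
{
	/* Less than `watermark` blocks left beyond the reservations. */
	if (free_blocks < reserved_blocks ||
	    free_blocks - reserved_blocks < watermark)
		return DA_ALLOCATE_NOW;
	/* Reservations exceed half of the free space. */
	if (free_blocks / 2 < reserved_blocks)
		return DA_FLUSH_THEN_RESERVE;
	return DA_RESERVE;
}
```

Once a function like this returns DA_ALLOCATE_NOW, a failed allocation can
be reported to the writer immediately, which is the "send the SEGV early"
variant mentioned above.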

- Ted

2010-10-13 08:49:17

by Amir G.

[permalink] [raw]
Subject: Re: [PATCH RFC 0/3] Block reservation for ext3

---------- Forwarded message ----------
From: Amir Goldstein <[email protected]>
Date: Wed, Oct 13, 2010 at 10:44 AM
Subject: Re: [PATCH RFC 0/3] Block reservation for ext3
To: Jan Kara <[email protected]>
Cc: Andrew Morton <[email protected]>, Ted Ts'o
<[email protected]>, [email protected]




On Wed, Oct 13, 2010 at 1:14 AM, Jan Kara <[email protected]> wrote:
>
> On Mon 11-10-10 14:59:45, Andrew Morton wrote:
> > On Mon, 11 Oct 2010 16:28:13 +0200 Jan Kara <[email protected]> wrote:
> >
> > >   Doing allocation at mmap time does not really work - on each mmap we
> > > would have to map blocks for the whole file which would make mmap really
> > > expensive operation. Doing it at page-fault as you suggest in (2a) works
> > > (that's the second plausible option IMO) but the increased fragmentation
> > > and thus loss of performance is rather noticeable. I don't have current
> > > numbers but when I tried that last year Berkeley DB was like two or three
> > > times slower.
> >
> > ouch.
> >
> > Can we fix the layout problem?  Are reservation windows of no use here?
>  Reservation windows do not work for this load. The reason is that the
> page-fault order is completely random so we just spend time creating and
> removing tiny reservation windows because the next page fault doing
> allocation is scarcely close enough to fall into the small window.
>  The logic in ext3_find_goal() ends up picking blocks close together for
> blocks belonging to the same indirect block if we are lucky but they
> definitely won't be sequentially ordered. For Berkeley DB the situation is
> made worse by the fact that there are several database files and their
> blocks end up interleaved.
>  So we could improve the layout but we'd have to tweak the reservation
> logic and allocator and it's not completely clear to me how.
>  One thing to note is that currently, ext3 *is* in fact doing delayed
> allocation for writes via mmap. We just never called it like that and never
> bothered to do proper space estimation...
>
> > > > 3) Keep a global counter of sparse blocks which are mapped at mmap()
> > > > time, and update it as blocks are allocated, or when the region is
> > > > freed at munmap() time.
> > >   Here again I see the problem that mapping all file blocks at mmap time
> > > is rather expensive and so does not seem viable to me. Also the
> > > overestimation of needed blocks could be rather huge.
> >
> > When I did ext2 delayed allocation back in, err, 2001 I had
> > considerable trouble working out how many blocks to actually reserve
> > for a file block, because it also had to reserve the indirect blocks.
> > One file block allocation can result in reserving four disk blocks!
> > And iirc it was not possible with existing in-core data structures to
> > work out whether all four blocks needed reserving until the actual
> > block allocation had occurred.  So I ended up reserving the worst-case
> > number of indirects, based upon the file offset.  If the disk ran out
> > of "space" I'd do a forced writeback to empty all the reservations and
> > would then take a look to see if the disk was _really_ out of space.
> >
> > Is all of this an issue with this work?  If so, what approach did you
> > take?
>  Yeah, I've spotted exactly the same problem. How I decided to solve it in
> the end is that in memory we keep track of each indirect block that has
> delay-allocated buffer under it. This allows us to reserve space for each
> indirect block at most once (I didn't bother with making the accounting
> precise for double or triple indirect blocks so when I need to reserve
> space for indirect block, I reserve the whole path just to be sure). This
> pushes the error in estimation to rather acceptable range for reasonably
> common workloads - the error can still be 50% for workloads which use just
> one data block in each indirect block but even in this case the absolute
> number of blocks falsely reserved is small.
>  The cost is of course increased complexity of the code, the memory
> spent for tracking those indirect blocks (32 bytes per indirect block), and
> some time for lookups in the RB-tree of the structures. At least the nice
> thing is that when there are no delay-allocated blocks, there isn't any
> overhead (tree is empty).
>

How about allocating *only* the indirect blocks on page fault.
IMHO it seems like a fair mixture of high quota accuracy, low
complexity of the accounting code and low file fragmentation (only
indirect may be a bit further away from data).

In my snapshot patches I use the @create arg to get_blocks_handle() to
pass commands just like "allocate only indirect blocks".
The patch is rather simple. I can prepare it for ext3 if you like.

Amir.

2010-10-13 16:14:10

by Amir G.

[permalink] [raw]
Subject: Re: [PATCH RFC 0/3] Block reservation for ext3

On Wed, Oct 13, 2010 at 10:49 AM, Amir G.
<[email protected]> wrote:
> On Wed, Oct 13, 2010 at 1:14 AM, Jan Kara <[email protected]> wrote:
>>
>> On Mon 11-10-10 14:59:45, Andrew Morton wrote:
>> > On Mon, 11 Oct 2010 16:28:13 +0200 Jan Kara <[email protected]> wrote:
>> >
>> > >   Doing allocation at mmap time does not really work - on each mmap we
>> > > would have to map blocks for the whole file which would make mmap really
>> > > expensive operation. Doing it at page-fault as you suggest in (2a) works
>> > > (that's the second plausible option IMO) but the increased fragmentation
>> > > and thus loss of performance is rather noticeable. I don't have current
>> > > numbers but when I tried that last year Berkeley DB was like two or three
>> > > times slower.
>> >
>> > ouch.
>> >
>> > Can we fix the layout problem?  Are reservation windows of no use here?
>>  Reservation windows do not work for this load. The reason is that the
>> page-fault order is completely random so we just spend time creating and
>> removing tiny reservation windows because the next page fault doing
>> allocation is scarcely close enough to fall into the small window.
>>  The logic in ext3_find_goal() ends up picking blocks close together for
>> blocks belonging to the same indirect block if we are lucky but they
>> definitely won't be sequentially ordered. For Berkeley DB the situation is
>> made worse by the fact that there are several database files and their
>> blocks end up interleaved.
>>  So we could improve the layout but we'd have to tweak the reservation
>> logic and allocator and it's not completely clear to me how.
>>  One thing to note is that currently, ext3 *is* in fact doing delayed
>> allocation for writes via mmap. We just never called it like that and never
>> bothered to do proper space estimation...
>>
>> > > > 3) Keep a global counter of sparse blocks which are mapped at mmap()
>> > > > time, and update it as blocks are allocated, or when the region is
>> > > > freed at munmap() time.
>> > >   Here again I see the problem that mapping all file blocks at mmap time
>> > > is rather expensive and so does not seem viable to me. Also the
>> > > overestimation of needed blocks could be rather huge.
>> >
>> > When I did ext2 delayed allocation back in, err, 2001 I had
>> > considerable trouble working out how many blocks to actually reserve
>> > for a file block, because it also had to reserve the indirect blocks.
>> > One file block allocation can result in reserving four disk blocks!
>> > And iirc it was not possible with existing in-core data structures to
>> > work out whether all four blocks needed reserving until the actual
>> > block allocation had occurred.  So I ended up reserving the worst-case
>> > number of indirects, based upon the file offset.  If the disk ran out
>> > of "space" I'd do a forced writeback to empty all the reservations and
>> > would then take a look to see if the disk was _really_ out of space.
>> >
>> > Is all of this an issue with this work?  If so, what approach did you
>> > take?
>>  Yeah, I've spotted exactly the same problem. How I decided to solve it in
>> the end is that in memory we keep track of each indirect block that has
>> delay-allocated buffer under it. This allows us to reserve space for each
>> indirect block at most once (I didn't bother with making the accounting
>> precise for double or triple indirect blocks so when I need to reserve
>> space for indirect block, I reserve the whole path just to be sure). This
>> pushes the error in estimation to rather acceptable range for reasonably
>> common workloads - the error can still be 50% for workloads which use just
>> one data block in each indirect block but even in this case the absolute
>> number of blocks falsely reserved is small.
>>  The cost is of course increased complexity of the code, the memory
>> spent for tracking those indirect blocks (32 bytes per indirect block), and
>> some time for lookups in the RB-tree of the structures. At least the nice
>> thing is that when there are no delay-allocated blocks, there isn't any
>> overhead (tree is empty).
>>
>
> How about allocating *only* the indirect blocks on page fault.
> IMHO it seems like a fair mixture of high quota accuracy, low
> complexity of the accounting code and low file fragmentation (only
> indirect may be a bit further away from data).
>
> In my snapshot patches I use the @create arg to get_blocks_handle() to
> pass commands just like "allocate only indirect blocks".
> The patch is rather simple. I can prepare it for ext3 if you like.
>
> Amir.
>

Here is the indirect allocation patch.
The following debugfs dump shows the difference between a 1G file
allocated by dd (indirect interleaved with data) and by mmap (indirect
sequential before data).
(I did not test performance and I ignored SIGBUS at the end of mmap
file allocation)

debugfs: stat dd.1G
Inode: 15 Type: regular Mode: 0644 Flags: 0x0
Generation: 4035748347 Version: 0x00000000
User: 0 Group: 0 Size: 1073741824
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 2099208
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4cb5d1a9 -- Wed Oct 13 17:35:05 2010
atime: 0x4cb5d199 -- Wed Oct 13 17:34:49 2010
mtime: 0x4cb5d1a9 -- Wed Oct 13 17:35:05 2010
Size of extra inode fields: 4
BLOCKS:
(0-11):16384-16395, (IND):16396, (12-134):16397-16519,
(135-641):17037-17543, (642-1035):17792-18185, (DIND):18186,
(IND):18187, (1036-1279):18188-18431, (1280-1284):1311235-1311239,
(1285-2059):1334197-1334971, (IND):1334972,
(2060-3083):1334973-1335996, (IND):1335997,
(3084-4107):1335998-1337021, (IND):1337022,
(4108-5131):1337023-1338046, (IND):1338047,
(5132-6155):1338048-1339071, (IND):1339072,
(6156-7179):1339073-1340096, (IND):1340097,
(7180-8203):1340098-1341121, (IND):1341122,
(8204-9227):1341123-1342146, (IND):1342147,
(9228-10251):1342148-1343171, (IND):1343172,
(10252-10566):1343173-1343487, (10567-11275):1344008-1344716,
(IND):1344717, (11276-12299):1344718-1345741, (IND):1345742,
(12300-13323):1345743-1346766, (IND):1346767,
(13324-14347):1346768-1347791, (IND):1347792,
(14348-15371):1347793-1348816, (IND):1348817,
(15372-16395):1348818-1349841, (IND):1349842,
(16396-17419):1349843-1350866, (IND):
...
debugfs:
debugfs: stat mmap.1G
Inode: 14 Type: regular Mode: 0644 Flags: 0x0
Generation: 1442185090 Version: 0x00000000
User: 0 Group: 0 Size: 1073741824
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 1968016
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4cb5d044 -- Wed Oct 13 17:29:08 2010
atime: 0x4cb5bf81 -- Wed Oct 13 16:17:37 2010
mtime: 0x4cb5d025 -- Wed Oct 13 17:28:37 2010
Size of extra inode fields: 4
BLOCKS:
(DIND):14336, (IND):14337, (16384-16395):14360-14371, (IND):14338,
(16396-16481):14372-14457, (16482-17153):14600-15271,
(17154-17419):16520-16785, (IND):14339, (17420-17670):16786-17036,
(17671-17675):1081859-1081863, (17676-18443):1086089-1086856,
(IND):14340, (18444-19467):1086857-1087880, (IND):14341,
(19468-20491):1087881-1088904, (IND):14342,
(20492-21515):1088905-1089928, (IND):14343,
(21516-22539):1089929-1090952, (IND):14344,
(22540-23563):1090953-1091976, (IND):14345,
(23564-24587):1091977-1093000, (IND):14346,
(24588-25611):1093001-1094024, (IND):14347,
(25612-26635):1094025-1095048, (IND):14348,
(26636-27659):1095049-1096072, (IND):14349,
(27660-28683):1096073-1097096, (IND):14472,
(28684-29707):1097097-1098120, (IND):15496,
(29708-30731):1098121-1099144, (IND):17544,
(30732-31755):1099145-1100168, (IND):17545,
(31756-32779):1100169-1101192, (IND):17546,
(32780-33803):1101193-1102216, (IND):17547, (
...
debugfs:



Allocate file indirect blocks on page_mkwrite().

This is a sample patch to be merged with Jan Kara's ext3 delayed allocation
patches. Some of the code was taken from Jan's patches for testing only.
This patch is for kernel 2.6.35.6.

On page_mkwrite(), we allocate indirect blocks if needed and leave the data
blocks unallocated. Jan's patches take care of reserving space for the data.

Signed-off-by: Amir Goldstein <[email protected]>

--------------------------------------------------------------------------------
diff -Nuarp a/fs/buffer.c b/fs/buffer.c
--- a/fs/buffer.c 2010-10-13 17:42:06.472252298 +0200
+++ b/fs/buffer.c 2010-10-13 17:42:14.772244244 +0200
@@ -1687,8 +1687,9 @@ static int __block_write_full_page(struc
if (buffer_new(bh)) {
/* blockdev mappings never come here */
clear_buffer_new(bh);
- unmap_underlying_metadata(bh->b_bdev,
- bh->b_blocknr);
+ if (buffer_mapped(bh))
+ unmap_underlying_metadata(bh->b_bdev,
+ bh->b_blocknr);
}
}
bh = bh->b_this_page;
@@ -1873,7 +1874,8 @@ static int __block_prepare_write(struct
if (err)
break;
if (buffer_new(bh)) {
- unmap_underlying_metadata(bh->b_bdev,
+ if (buffer_mapped(bh))
+ unmap_underlying_metadata(bh->b_bdev,
bh->b_blocknr);
if (PageUptodate(page)) {
clear_buffer_new(bh);
@@ -2592,7 +2594,7 @@ int nobh_write_begin_newtrunc(struct fil
goto failed;
if (!buffer_mapped(bh))
is_mapped_to_disk = 0;
- if (buffer_new(bh))
+ if (buffer_new(bh) && buffer_mapped(bh))
unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
if (PageUptodate(page)) {
set_buffer_uptodate(bh);
diff -Nuarp a/fs/ext3/file.c b/fs/ext3/file.c
--- a/fs/ext3/file.c 2010-10-13 17:41:52.962541253 +0200
+++ b/fs/ext3/file.c 2010-10-13 17:42:25.083163556 +0200
@@ -52,6 +52,23 @@ static int ext3_release_file (struct ino
return 0;
}

+static const struct vm_operations_struct ext3_file_vm_ops = {
+ .fault = filemap_fault,
+ .page_mkwrite = ext3_page_mkwrite,
+};
+
+static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct address_space *mapping = file->f_mapping;
+
+ if (!mapping->a_ops->readpage)
+ return -ENOEXEC;
+ file_accessed(file);
+ vma->vm_ops = &ext3_file_vm_ops;
+ vma->vm_flags |= VM_CAN_NONLINEAR;
+ return 0;
+}
+
const struct file_operations ext3_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
@@ -62,7 +79,7 @@ const struct file_operations ext3_file_o
#ifdef CONFIG_COMPAT
.compat_ioctl = ext3_compat_ioctl,
#endif
- .mmap = generic_file_mmap,
+ .mmap = ext3_file_mmap,
.open = dquot_file_open,
.release = ext3_release_file,
.fsync = ext3_sync_file,
diff -Nuarp a/fs/ext3/inode.c b/fs/ext3/inode.c
--- a/fs/ext3/inode.c 2010-10-13 17:41:42.722231144 +0200
+++ b/fs/ext3/inode.c 2010-10-13 17:41:12.973180756 +0200
@@ -38,6 +38,7 @@
#include <linux/bio.h>
#include <linux/fiemap.h>
#include <linux/namei.h>
+#include <linux/mount.h>
#include "xattr.h"
#include "acl.h"

@@ -562,10 +563,17 @@ static int ext3_alloc_blocks(handle_t *h
count--;
}

- if (count > 0)
+ if (index == indirect_blks)
break;
}

+ if (blks == 0) {
+ /* blks == 0 when allocating only indirect blocks */
+ new_blocks[index] = 0;
+ *err = 0;
+ return 0;
+ }
+
/* save the new block number for the first direct block */
new_blocks[index] = current_block;

@@ -676,7 +684,9 @@ failed:
for (i = 0; i <indirect_blks; i++)
ext3_free_blocks(handle, inode, new_blocks[i], 1);

- ext3_free_blocks(handle, inode, new_blocks[i], num);
+ if (num > 0)
+ /* num == 0 when allocating only indirect blocks */
+ ext3_free_blocks(handle, inode, new_blocks[i], num);

return err;
}
@@ -735,7 +745,8 @@ static int ext3_splice_branch(handle_t *
* in i_block_alloc_info, to assist find the proper goal block for next
* allocation
*/
- if (block_i) {
+ if (block_i && blks > 0) {
+ /* blks == 0 when allocating only indirect blocks */
block_i->last_alloc_logical_block = block + blks - 1;
block_i->last_alloc_physical_block =
le32_to_cpu(where[num].key) + blks - 1;
@@ -778,7 +789,9 @@ err_out:
ext3_journal_forget(handle, where[i].bh);
ext3_free_blocks(handle,inode,le32_to_cpu(where[i-1].key),1);
}
- ext3_free_blocks(handle, inode, le32_to_cpu(where[num].key), blks);
+ if (blks > 0)
+ /* blks == 0 when allocating only indirect blocks */
+ ext3_free_blocks(handle, inode, le32_to_cpu(where[num].key), blks);

return err;
}
@@ -905,6 +918,11 @@ int ext3_get_blocks_handle(handle_t *han

/* the number of blocks need to allocate for [d,t]indirect blocks */
indirect_blks = (chain + depth) - partial - 1;
+ if (indirect_blks + maxblocks == 0) {
+ /* maxblocks == 0 when allocating only indirect blocks */
+ mutex_unlock(&ei->truncate_mutex);
+ goto cleanup;
+ }

/*
* Next look up the indirect map to count the totoal number of
@@ -929,7 +947,8 @@ int ext3_get_blocks_handle(handle_t *han
err = ext3_splice_branch(handle, inode, iblock,
partial, indirect_blks, count);
mutex_unlock(&ei->truncate_mutex);
- if (err)
+ if (err || count == 0)
+ /* count == 0 when allocating only indirect blocks */
goto cleanup;

set_buffer_new(bh_result);
@@ -981,6 +1000,9 @@ static int ext3_get_block(struct inode *
started = 1;
}

+ if (create < 0)
+ /* create < 0 when allocating only indirect blocks */
+ max_blocks = 0;
ret = ext3_get_blocks_handle(handle, inode, iblock,
max_blocks, bh_result, create);
if (ret > 0) {
@@ -1827,6 +1849,43 @@ out:
}

/*
+ * Reserve block writes instead of allocation. Called only on buffer heads
+ * attached to a page (and thus for 1 block).
+ */
+static int ext3_da_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh, int create)
+{
+ int ret;
+
+ /* Buffer has already blocks reserved? */
+ if (buffer_delay(bh))
+ return 0;
+
+ /* passing -1 to allocate only indirect blocks */
+ ret = ext3_get_block(inode, iblock, bh, -1);
+ if (ret < 0)
+ return ret;
+ if (ret > 0 || !create)
+ return 0;
+ set_buffer_delay(bh);
+ set_buffer_new(bh);
+ return 0;
+}
+
+int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ int retry = 0;
+ int ret;
+ struct super_block *sb = vma->vm_file->f_path.mnt->mnt_sb;
+
+ do {
+ ret = block_page_mkwrite(vma, vmf, ext3_da_get_block);
+ } while (ret == VM_FAULT_SIGBUS &&
+ ext3_should_retry_alloc(sb, &retry));
+ return ret;
+}
+
+/*
* Pages can be marked dirty completely asynchronously from ext3's journalling
* activity. By filemap_sync_pte(), try_to_unmap_one(), etc. We cannot do
* much here because ->set_page_dirty is called under VFS locks. The page is
diff -Nuarp a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
--- a/include/linux/ext3_fs.h 2010-10-13 17:49:26.892439258 +0200
+++ b/include/linux/ext3_fs.h 2010-10-13 17:49:10.662493115 +0200
@@ -909,6 +909,7 @@ extern void ext3_get_inode_flags(struct
extern void ext3_set_aops(struct inode *inode);
extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 len);
+extern int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);

/* ioctl.c */
extern long ext3_ioctl(struct file *, unsigned int, unsigned long);

2010-10-14 15:58:37

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC 0/3] Block reservation for ext3

On Wed 13-10-10 18:14:08, Amir G. wrote:
> On Wed, Oct 13, 2010 at 10:49 AM, Amir G.
> <[email protected]> wrote:
> > On Wed, Oct 13, 2010 at 1:14 AM, Jan Kara <[email protected]> wrote:
> >> > When I did ext2 delayed allocation back in, err, 2001 I had
> >> > considerable trouble working out how many blocks to actually reserve
> >> > for a file block, because it also had to reserve the indirect blocks.
> >> > One file block allocation can result in reserving four disk blocks!
> >> > And iirc it was not possible with existing in-core data structures to
> >> > work out whether all four blocks needed reserving until the actual
> >> > block allocation had occurred.  So I ended up reserving the worst-case
> >> > number of indirects, based upon the file offset.  If the disk ran out
> >> > of "space" I'd do a forced writeback to empty all the reservations and
> >> > would then take a look to see if the disk was _really_ out of space.
> >> >
> >> > Is all of this an issue with this work?  If so, what approach did you
> >> > take?
> >>  Yeah, I've spotted exactly the same problem. How I decided to solve it in
> >> the end is that in memory we keep track of each indirect block that has
> >> delay-allocated buffer under it. This allows us to reserve space for each
> >> indirect block at most once (I didn't bother with making the accounting
> >> precise for double or triple indirect blocks so when I need to reserve
> >> space for indirect block, I reserve the whole path just to be sure). This
> >> pushes the error in estimation to rather acceptable range for reasonably
> >> common workloads - the error can still be 50% for workloads which use just
> >> one data block in each indirect block but even in this case the absolute
> >> number of blocks falsely reserved is small.
> >>  The cost is of course increased complexity of the code, the memory
> >> spent for tracking those indirect blocks (32 bytes per indirect block), and
> >> some time for lookups in the RB-tree of the structures. At least the nice
> >> thing is that when there are no delay-allocated blocks, there isn't any
> >> overhead (tree is empty).
> >>
> >
> > How about allocating *only* the indirect blocks on page fault.
> > IMHO it seems like a fair mixture of high quota accuracy, low
> > complexity of the accounting code and low file fragmentation (only
> > indirect may be a bit further away from data).
> >
> > In my snapshot patches I use the @create arg to get_blocks_handle() to
> > pass commands just like "allocate only indirect blocks".
> > The patch is rather simple. I can prepare it for ext3 if you like.
>
> Here is the indirect allocation patch.
> The following debugfs dump shows the difference between a 1G file
> allocated by dd (indirect interleaved with data) and by mmap (indirect
> sequential before data).
> (I did not test performance and I ignored SIGBUS at the end of mmap
> file allocation)
...
>
> Allocate file indirect blocks on page_mkwrite().
>
> This is a sample patch to be merged with Jan Kara's ext3 delayed allocation
> patches. Some of the code was taken from Jan's patches for testing only.
> This patch is for kernel 2.6.35.6.
>
> On page_mkwrite(), we allocate indirect blocks if needed and leave the data
> blocks unallocated. Jan's patches take care of reserving space for the data.
Thanks for the idea and the patch.

Yes, this is one of the trade-off options. But it's not that simple.
There's a problem with truncate coming after page_mkwrite but before
the allocation happens. See:

ext3_page_mkwrite() for page index 70
-> allocates the indirect block (or just sees that it's allocated
and does nothing)
- marks buffer as delayed
...
truncate to index 80
- sees indirect block has no more blocks allocated and removes it

ext3_writepage() for index 70
- would like to allocate block for index 70 but indirect block does
not exist. Bugger.

So you have to somehow track that indirect block has some delayed
allocation pending - and that is the most complex part of my patch. The
rest is rather simple...
Actually, there are also other ways to track that indirect block has the
delayed allocation pending. For example if we could do a disk format
change, we could simply reserve say block number 1 or -1 to indicate delayed
allocated block and everything would be much simpler. But I don't think we
really can (it would be an incompatible disk format change).
Hmm, using radix tree dirty tag could also work for that -- when the
range covered by an indirect block has some dirty pages we know that we
shouldn't delete it because it's going to be used soon. But it's subtle
because we rely on the fact that radix tree dirty tag is cleared only in
set_page_writeback() which is after the get_block() call while page dirty
flag is already cleared before the writepage() (and thus get_block()) call
-- I originally discarded this idea because I forgot about this tag handling
subtlety.
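
The tracking requirement from the truncate scenario above (an indirect block
must not be freed while delayed buffers are still pending under it) can be
sketched in user space as follows. This is an illustration only: the series
itself is described as using an RB-tree of per-indirect-block structures,
while the sketch swaps in a small linear table, and all names below
(ind_tracker, track_delayed() and so on) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NDIR_BLOCKS	12	/* direct pointers in an ext3 inode */
#define MAX_TRACKED	64	/* arbitrary table size for this sketch */

struct ind_pending {
	uint64_t ind_index;	/* leaf indirect block, counted from 0 */
	unsigned int delayed;	/* delayed buffers currently under it */
};

struct ind_tracker {
	struct ind_pending slot[MAX_TRACKED];
	size_t used;
};

/* Leaf indirect block covering logical block `iblock` (>= NDIR_BLOCKS). */
static uint64_t leaf_indirect_of(uint64_t iblock, uint64_t ptrs)
{
	return (iblock - NDIR_BLOCKS) / ptrs;
}

static struct ind_pending *find_slot(struct ind_tracker *t, uint64_t ind_index)
{
	size_t i;

	for (i = 0; i < t->used; i++)
		if (t->slot[i].ind_index == ind_index)
			return &t->slot[i];
	return NULL;
}

/*
 * Called when a buffer becomes delayed (page_mkwrite time). Returns 1 if
 * the indirect block is newly tracked (space for it must be reserved),
 * 0 if nothing new needs reserving, -1 if the table is full.
 */
static int track_delayed(struct ind_tracker *t, uint64_t iblock, uint64_t ptrs)
{
	struct ind_pending *p;
	uint64_t idx;

	if (iblock < NDIR_BLOCKS)
		return 0;	/* direct block: no indirect block involved */
	idx = leaf_indirect_of(iblock, ptrs);
	p = find_slot(t, idx);
	if (p) {
		p->delayed++;
		return 0;
	}
	if (t->used == MAX_TRACKED)
		return -1;
	t->slot[t->used].ind_index = idx;
	t->slot[t->used].delayed = 1;
	t->used++;
	return 1;
}

/* Called once writeback has really allocated the block. */
static void untrack_delayed(struct ind_tracker *t, uint64_t iblock, uint64_t ptrs)
{
	struct ind_pending *p;

	if (iblock < NDIR_BLOCKS)
		return;
	p = find_slot(t, leaf_indirect_of(iblock, ptrs));
	if (p && --p->delayed == 0)
		*p = t->slot[--t->used];	/* drop the entry, keep table dense */
}

/* Truncate would ask this before freeing an apparently empty indirect block. */
static int indirect_has_pending(struct ind_tracker *t, uint64_t ind_index)
{
	return find_slot(t, ind_index) != NULL;
}
```

In the scenario above, page_mkwrite for index 70 would call track_delayed(),
the truncate to index 80 would see indirect_has_pending() return nonzero and
leave the indirect block alone, and writepage would call untrack_delayed()
after allocating the block.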

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR