2008-01-22 03:02:43

by Theodore Ts'o

Subject: ext4 merge plans for 2.6.25

The following patches have been in the -mm tree for a while, and I
plan to push them to Linus when the 2.6.25 merge window opens. With
this patch series, it is expected that ext4 format should be settling
down. We still have delayed allocation and online defrag which aren't
quite ready to merge, but those shouldn't affect the on-disk format.

I don't expect any other on-disk format changes to show up after this
point, but I've been wrong before.... any such changes would have to
have a Really Good Reason, though. (No, Abhishek Rai's changes
wouldn't count as an on-disk change, since they change layout choices,
but not anything that e2fsck would actually care about. We may try
merging those into ext4 and see how they play out in the -mm tree;
we'll see.)

- Ted

P.S. Yes, the currently released e2fsprogs won't support all of these
format changes yet; again, ext4 shouldn't be deployed to production
systems yet, although we do salute those who are willing to be guinea
pigs and play with this code! Never fear, I'll be working to get
e2fsprogs caught up Real Soon Now.

Adrian Bunk (1):
ext4/super.c: fix #ifdef's (CONFIG_EXT4_* -> CONFIG_EXT4DEV_*)

Alex Tomas (2):
ext4: Add new functions for searching extent tree
ext4: Add multi block allocator for ext4

Aneesh Kumar K.V (23):
ext4: Introduce ext4_lblk_t
ext4: Introduce ext4_update_*_feature
ext4: Fix sparse warnings.
ext4: Rename i_file_acl to i_file_acl_lo
ext4: Rename i_dir_acl to i_size_high
ext4: Add support for 48 bit inode i_blocks.
ext4: Support large files
ext2: Fix the max file size for ext2 file system.
ext3: Fix the max file size for ext3 file system.
ext4: Return after ext4_error in case of failures
ext4: Change the default behaviour on error
Add buffer head related helper functions
ext4: add block bitmap validation
ext4: Check for the correct error return from
ext4: Make ext4_get_blocks_wrap take the truncate_mutex early.
ext4: Convert truncate_mutex to read write semaphore.
ext4: Take read lock during overwrite case.
ext4: Add EXT4_IOC_MIGRATE ioctl
ext4: Fix ext4_show_options to show the correct mount options.
ext4: Add ext4_find_next_bit()
ext4: Enable the multiblock allocator by default
ext4: Check for return value from sb_set_blocksize
ext4: Use the ext4_ext_actual_len() helper function

Avantika Mathur (2):
ext4: add ext4_group_t, and change all group variables to this type.
ext4: fixes block group number being set to a negative value

Chris Snook (1):
jbd2: Remove printk from J_ASSERT to preserve registers during BUG

Coly Li (1):
ext4: sync up block group descriptor with e2fsprogs.

Dmitry Monakhov (1):
ext4: fix uniniatilized extent splitting error

Eric Sandeen (6):
ext4 extents: remove unneeded casts
ext4: different maxbytes functions for bitmap & extent files
ext4: export iov_shorten from kernel for ext4's use
ext4: store maxbytes for bitmapped files and return EFBIG as appropriate
ext4: fix oops on corrupted ext4 mount
ext4: fix up EXT4FS_DEBUG builds

Girish Shilamkar (1):
ext4: Add the journal checksum feature

Jan Kara (2):
ext4: Avoid rec_len overflow with 64KB block size
jbd2: Fix assertion failure in fs/jbd2/checkpoint.c

Jean Noel Cordenner (2):
vfs: Add 64 bit i_version support
ext4: Add inode version support in ext4

Johann Lombardi (1):
jbd2: jbd2 stats through procfs

Mariusz Kozlowski (1):
ext4: remove unused code from ext4_find_entry()

Mingming Cao (4):
jbd2: add lockdep support
jbd2: Mark jbd2 slabs as SLAB_TEMPORARY
jbd2: Use round-jiffies() function for the "5 second" ext4/jbd2 wakeup
jbd2: sparse pointer use of zero as null

Takashi Sato (1):
ext4: Support large blocksize up to PAGESIZE


2008-01-22 03:03:13

by Theodore Ts'o

Subject: [PATCH 15/49] ext4: store maxbytes for bitmapped files and return EFBIG as appropriate

From: Eric Sandeen <[email protected]>

Calculate & store the max offset for bitmapped files, and
catch too-large seeks, truncates, and writes in ext4, shortening
or rejecting as appropriate.

Signed-off-by: Eric Sandeen <[email protected]>
---
fs/ext4/file.c | 19 ++++++++++++++++++-
fs/ext4/inode.c | 16 +++++++++++++++-
fs/ext4/super.c | 1 +
include/linux/ext4_fs_sb.h | 1 +
4 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1a81cd6..a6b2aa1 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -56,8 +56,25 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
ssize_t ret;
int err;

- ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
+ /*
+ * If we have encountered a bitmap-format file, the size limit
+ * is smaller than s_maxbytes, which is for extent-mapped files.
+ */
+
+ if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) {
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ size_t length = iov_length(iov, nr_segs);

+ if (pos > sbi->s_bitmap_maxbytes)
+ return -EFBIG;
+
+ if (pos + length > sbi->s_bitmap_maxbytes) {
+ nr_segs = iov_shorten((struct iovec *)iov, nr_segs,
+ sbi->s_bitmap_maxbytes - pos);
+ }
+ }
+
+ ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
/*
* Skip flushing if there was an error, or if nothing was written.
*/
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9cf8572..eaace13 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -314,7 +314,10 @@ static int ext4_block_to_path(struct inode *inode,
offsets[n++] = i_block & (ptrs - 1);
final = ptrs;
} else {
- ext4_warning(inode->i_sb, "ext4_block_to_path", "block > big");
+ ext4_warning(inode->i_sb, "ext4_block_to_path",
+ "block %u > max",
+ i_block + direct_blocks +
+ indirect_blocks + double_blocks);
}
if (boundary)
*boundary = final - 1 - (i_block & (ptrs - 1));
@@ -3092,6 +3095,17 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
ext4_journal_stop(handle);
}

+ if (attr->ia_valid & ATTR_SIZE) {
+ if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) {
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+ if (attr->ia_size > sbi->s_bitmap_maxbytes) {
+ error = -EFBIG;
+ goto err_out;
+ }
+ }
+ }
+
if (S_ISREG(inode->i_mode) &&
attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
handle_t *handle;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c79e46b..0931831 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1922,6 +1922,7 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)
}
}

+ sbi->s_bitmap_maxbytes = ext4_max_bitmap_size(sb->s_blocksize_bits);
sb->s_maxbytes = ext4_max_size(sb->s_blocksize_bits);

if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV) {
diff --git a/include/linux/ext4_fs_sb.h b/include/linux/ext4_fs_sb.h
index f15821c..38a47ec 100644
--- a/include/linux/ext4_fs_sb.h
+++ b/include/linux/ext4_fs_sb.h
@@ -38,6 +38,7 @@ struct ext4_sb_info {
ext4_group_t s_groups_count; /* Number of groups in the fs */
unsigned long s_overhead_last; /* Last calculated overhead */
unsigned long s_blocks_last; /* Last seen block count */
+ loff_t s_bitmap_maxbytes; /* max bytes for bitmap files */
struct buffer_head * s_sbh; /* Buffer containing the super block */
struct ext4_super_block * s_es; /* Pointer to the super block in the buffer */
struct buffer_head ** s_group_desc;
--
1.5.4.rc3.31.g1271-dirty
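
The write-path check in the patch above boils down to a reject-or-shorten
decision. A minimal userspace sketch of that decision follows;
clamp_bitmap_write() is a hypothetical name used only for illustration, and
the kernel code operates on an iovec via iov_shorten() rather than on a
plain length.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical model of the bitmap-file write check: returns how many
 * bytes may be written at pos, or -EFBIG if pos lies past the limit.
 */
static int64_t clamp_bitmap_write(int64_t pos, int64_t len, int64_t maxbytes)
{
	assert(pos >= 0 && len >= 0 && maxbytes >= 0);

	if (pos > maxbytes)
		return -EFBIG;          /* starts beyond the limit: reject */
	if (len > maxbytes - pos)
		return maxbytes - pos;  /* crosses the limit: shorten */
	return len;                     /* fits entirely */
}
```

Comparing len against maxbytes - pos, instead of pos + len against
maxbytes, keeps the sketch free of signed overflow for very large offsets.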

2008-01-22 03:03:34

by Theodore Ts'o

Subject: [PATCH 40/49] ext4: Add new functions for searching extent tree

From: Alex Tomas <[email protected]>

Add the functions ext4_ext_search_left() and ext4_ext_search_right(),
which are used by mballoc during ext4_ext_get_blocks to decide whether
to merge extent information.

Signed-off-by: Alex Tomas <[email protected]>
Signed-off-by: Andreas Dilger <[email protected]>
Signed-off-by: Johann Lombardi <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/extents.c | 142 +++++++++++++++++++++++++++++++++++++++
include/linux/ext4_fs_extents.h | 4 +
2 files changed, 146 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 03d1bbb..a60227c 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1017,6 +1017,148 @@ out:
}

/*
+ * search the closest allocated block to the left for *logical
+ * and returns it at @logical + its physical address at @phys
+ * if *logical is the smallest allocated block, the function
+ * returns 0 at @phys
+ * return value contains 0 (success) or error code
+ */
+int
+ext4_ext_search_left(struct inode *inode, struct ext4_ext_path *path,
+ ext4_lblk_t *logical, ext4_fsblk_t *phys)
+{
+ struct ext4_extent_idx *ix;
+ struct ext4_extent *ex;
+ int depth;
+
+ BUG_ON(path == NULL);
+ depth = path->p_depth;
+ *phys = 0;
+
+ if (depth == 0 && path->p_ext == NULL)
+ return 0;
+
+ /* usually extent in the path covers blocks smaller
+ * than *logical, but it can be that extent is the
+ * first one in the file */
+
+ ex = path[depth].p_ext;
+ if (*logical < le32_to_cpu(ex->ee_block)) {
+ BUG_ON(EXT_FIRST_EXTENT(path[depth].p_hdr) != ex);
+ while (--depth >= 0) {
+ ix = path[depth].p_idx;
+ BUG_ON(ix != EXT_FIRST_INDEX(path[depth].p_hdr));
+ }
+ return 0;
+ }
+
+ BUG_ON(*logical < le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len));
+
+ *logical = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1;
+ *phys = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - 1;
+ return 0;
+}
+
+/*
+ * search the closest allocated block to the right for *logical
+ * and returns it at @logical + its physical address at @phys
+ * if *logical is the largest allocated block, the function
+ * returns 0 at @phys
+ * return value contains 0 (success) or error code
+ */
+int
+ext4_ext_search_right(struct inode *inode, struct ext4_ext_path *path,
+ ext4_lblk_t *logical, ext4_fsblk_t *phys)
+{
+ struct buffer_head *bh = NULL;
+ struct ext4_extent_header *eh;
+ struct ext4_extent_idx *ix;
+ struct ext4_extent *ex;
+ ext4_fsblk_t block;
+ int depth;
+
+ BUG_ON(path == NULL);
+ depth = path->p_depth;
+ *phys = 0;
+
+ if (depth == 0 && path->p_ext == NULL)
+ return 0;
+
+ /* usually extent in the path covers blocks smaller
+ * than *logical, but it can be that extent is the
+ * first one in the file */
+
+ ex = path[depth].p_ext;
+ if (*logical < le32_to_cpu(ex->ee_block)) {
+ BUG_ON(EXT_FIRST_EXTENT(path[depth].p_hdr) != ex);
+ while (--depth >= 0) {
+ ix = path[depth].p_idx;
+ BUG_ON(ix != EXT_FIRST_INDEX(path[depth].p_hdr));
+ }
+ *logical = le32_to_cpu(ex->ee_block);
+ *phys = ext_pblock(ex);
+ return 0;
+ }
+
+ BUG_ON(*logical < le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len));
+
+ if (ex != EXT_LAST_EXTENT(path[depth].p_hdr)) {
+ /* next allocated block in this leaf */
+ ex++;
+ *logical = le32_to_cpu(ex->ee_block);
+ *phys = ext_pblock(ex);
+ return 0;
+ }
+
+ /* go up and search for index to the right */
+ while (--depth >= 0) {
+ ix = path[depth].p_idx;
+ if (ix != EXT_LAST_INDEX(path[depth].p_hdr))
+ break;
+ }
+
+ if (depth < 0) {
+ /* we've gone up to the root and
+ * found no index to the right */
+ return 0;
+ }
+
+ /* we've found index to the right, let's
+ * follow it and find the closest allocated
+ * block to the right */
+ ix++;
+ block = idx_pblock(ix);
+ while (++depth < path->p_depth) {
+ bh = sb_bread(inode->i_sb, block);
+ if (bh == NULL)
+ return -EIO;
+ eh = ext_block_hdr(bh);
+ if (ext4_ext_check_header(inode, eh, depth)) {
+ brelse(bh);
+ return -EIO;
+ }
+ ix = EXT_FIRST_INDEX(eh);
+ block = idx_pblock(ix);
+ brelse(bh);
+ }
+
+ bh = sb_bread(inode->i_sb, block);
+ if (bh == NULL)
+ return -EIO;
+ eh = ext_block_hdr(bh);
+ if (ext4_ext_check_header(inode, eh, path->p_depth - depth)) {
+ brelse(bh);
+ return -EIO;
+ }
+ ex = EXT_FIRST_EXTENT(eh);
+ *logical = le32_to_cpu(ex->ee_block);
+ *phys = ext_pblock(ex);
+ brelse(bh);
+ return 0;
+
+}
+
+/*
* ext4_ext_next_allocated_block:
* returns allocated block in subsequent extent or EXT_MAX_BLOCK.
* NOTE: it considers block number from index entry as
diff --git a/include/linux/ext4_fs_extents.h b/include/linux/ext4_fs_extents.h
index 023683b..56d0ec6 100644
--- a/include/linux/ext4_fs_extents.h
+++ b/include/linux/ext4_fs_extents.h
@@ -221,5 +221,9 @@ extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *,
extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
extern struct ext4_ext_path *ext4_ext_find_extent(struct inode *, ext4_lblk_t,
struct ext4_ext_path *);
+extern int ext4_ext_search_left(struct inode *, struct ext4_ext_path *,
+ ext4_lblk_t *, ext4_fsblk_t *);
+extern int ext4_ext_search_right(struct inode *, struct ext4_ext_path *,
+ ext4_lblk_t *, ext4_fsblk_t *);
#endif /* _LINUX_EXT4_EXTENTS */

--
1.5.4.rc3.31.g1271-dirty
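
The neighbour search in the patch above walks an on-disk extent tree. The
idea is easier to see on a flat, sorted array of extents, so here is a
minimal sketch under that assumption. The struct and function names are
hypothetical, and *logical is assumed to fall in a hole between extents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat model of one extent (not the on-disk layout). */
struct extent_sketch {
	uint32_t lblk;  /* first logical block covered */
	uint16_t len;   /* number of blocks, assumed > 0 */
	uint64_t pblk;  /* first physical block */
};

/* Nearest allocated block to the left of *logical; *phys = 0 if none. */
static void search_left_sketch(const struct extent_sketch *ext, size_t n,
			       uint32_t *logical, uint64_t *phys)
{
	assert(logical != NULL && phys != NULL);
	*phys = 0;
	for (size_t i = n; i-- > 0; ) {
		uint32_t last = ext[i].lblk + ext[i].len - 1;

		if (last < *logical) {
			/* last block of the closest extent on the left */
			*logical = last;
			*phys = ext[i].pblk + ext[i].len - 1;
			return;
		}
	}
}

/* Nearest allocated block to the right of *logical; *phys = 0 if none. */
static void search_right_sketch(const struct extent_sketch *ext, size_t n,
				uint32_t *logical, uint64_t *phys)
{
	assert(logical != NULL && phys != NULL);
	*phys = 0;
	for (size_t i = 0; i < n; i++) {
		if (ext[i].lblk > *logical) {
			/* first block of the closest extent on the right */
			*logical = ext[i].lblk;
			*phys = ext[i].pblk;
			return;
		}
	}
}
```

As in the patch, a result of *phys == 0 means there is no allocated
neighbour on that side, and *logical is left untouched in that case.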

2008-01-22 03:03:48

by Theodore Ts'o

Subject: [PATCH 01/49] ext4: Support large blocksize up to PAGESIZE

From: Takashi Sato <[email protected]>

This patch set supports large block sizes (>4k, <=64k) in ext4 by
simply enlarging the block size limit. However, it is NOT possible to
have a 64kB blocksize on ext4 without some changes to the directory
handling code. The reason is that an empty 64kB directory block would
have a rec_len == (__u16)2^16 == 0, and this would cause an error to be
hit in the filesystem. The proposed solution is to represent a 64k
rec_len with an otherwise impossible value such as rec_len = 0xffff.

The Patch-set consists of the following 2 patches.
[1/2] ext4: enlarge blocksize
- Allow blocksize up to pagesize

[2/2] ext4: fix rec_len overflow
- prevent rec_len from overflow with 64KB blocksize

With this patch set applied, a ppc64 box with 64k pages can create a
64k block size ext4dev filesystem and handle empty directory blocks.

Signed-off-by: Takashi Sato <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---
fs/ext4/super.c | 5 +++++
include/linux/ext4_fs.h | 4 ++--
2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 1ca0f54..ab7010d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1624,6 +1624,11 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)
goto out_fail;
}

+ if (!sb_set_blocksize(sb, blocksize)) {
+ printk(KERN_ERR "EXT4-fs: bad blocksize %d.\n", blocksize);
+ goto out_fail;
+ }
+
/*
* The ext4 superblock will not be buffer aligned for other than 1kB
* block sizes. We need to calculate the offset from buffer start.
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index 97dd409..dfe4487 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -73,8 +73,8 @@
* Macro-instructions used to manage several block sizes
*/
#define EXT4_MIN_BLOCK_SIZE 1024
-#define EXT4_MAX_BLOCK_SIZE 4096
-#define EXT4_MIN_BLOCK_LOG_SIZE 10
+#define EXT4_MAX_BLOCK_SIZE 65536
+#define EXT4_MIN_BLOCK_LOG_SIZE 10
#ifdef __KERNEL__
# define EXT4_BLOCK_SIZE(s) ((s)->s_blocksize)
#else
--
1.5.4.rc3.31.g1271-dirty
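
The rec_len overflow described in the message above, and the 0xffff
workaround it proposes, can be sketched as a pair of conversion helpers.
The names below are hypothetical and endianness conversion is left out.
0xffff can serve as a sentinel because a real rec_len is always a multiple
of 4.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_REC_LEN_SKETCH 0xffffu  /* hypothetical sentinel meaning 65536 */

/* Encode an in-memory record length into the 16-bit on-disk field. */
static uint16_t rec_len_to_disk_sketch(uint32_t len)
{
	assert(len <= 65536 && len != MAX_REC_LEN_SKETCH);

	if (len == 65536)
		return MAX_REC_LEN_SKETCH;  /* 65536 does not fit in 16 bits */
	return (uint16_t)len;
}

/* Decode the 16-bit on-disk field back into a record length. */
static uint32_t rec_len_from_disk_sketch(uint16_t dlen)
{
	if (dlen == MAX_REC_LEN_SKETCH)
		return 65536;
	return dlen;
}
```

Without the sentinel, (uint16_t)65536 truncates to 0, which is exactly the
invalid rec_len an empty 64kB directory block would otherwise carry.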

2008-01-22 03:04:07

by Theodore Ts'o

Subject: [PATCH 44/49] ext4: fix uniniatilized extent splitting error

From: Dmitry Monakhov <[email protected]>

Fix a bug, reported by Dmitry Monakhov, caused by a lost error code.

Testcase:

blksize = 0x1000;
fd = open(argv[1], O_RDWR|O_CREAT, 0700);
unsigned long long sz = 0x10000000UL;
/* allocate a big chunk of blocks */
syscall(__NR_fallocate, fd, 0, 0UL, sz);

/* grab all other available filesystem space */
tfd = open("tmp", O_RDWR|O_CREAT|O_DIRECT, 0700);
while (write(tfd, buf, 4096) > 0); /* loop until ENOSPC */
fsync(fd); /* just in case */
while (pos < sz) {
/* each seek + write operation splits the uninitialized extent
into three extents. Splitting may require a new extent allocation,
which will probably fail because of ENOSPC */

lseek(fd, blksize*2 - 1, SEEK_CUR);
if ((ret = write(fd, "a", 1)) != 1)
exit(1);
pos += blksize * 2;
}

Signed-off-by: Dmitry Monakhov <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/extents.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 8cf5545..13e3e8c 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2373,9 +2373,10 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ret = ext4_ext_convert_to_initialized(handle, inode,
path, iblock,
max_blocks);
- if (ret <= 0)
+ if (ret <= 0) {
+ err = ret;
goto out2;
- else
+ } else
allocated = ret;
goto outnew;
}
--
1.5.4.rc3.31.g1271-dirty
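
The lost-error pattern fixed by the patch above can be modelled in a few
lines. This is only a sketch: convert_sketch() and get_blocks_sketch() are
hypothetical stand-ins, and the "return err ? err : allocated" tail is an
assumption about how the real function reports its result.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper: returns a block count, or -errno on failure. */
static int convert_sketch(int fail)
{
	return fail ? -ENOSPC : 3;
}

/*
 * propagate == 0 models the old code (error dropped on the floor),
 * propagate == 1 models the fixed code (error copied into err).
 */
static int get_blocks_sketch(int fail, int propagate)
{
	int err = 0;
	int allocated = 0;
	int ret = convert_sketch(fail);

	assert(propagate == 0 || propagate == 1);
	if (ret <= 0) {
		if (propagate)
			err = ret;      /* the fix: remember the error */
		goto out;
	}
	allocated = ret;
out:
	return err ? err : allocated;
}
```

With propagate == 0 a failed conversion comes back as 0, so the caller
cannot tell ENOSPC apart from "nothing mapped".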

2008-01-22 03:04:27

by Theodore Ts'o

Subject: [PATCH 48/49] jbd2: Use round-jiffies() function for the "5 second" ext4/jbd2 wakeup

From: Mingming Cao <[email protected]>

While "every 5 seconds" doesn't sound like a problem, there can be many
of these (and these timers do add up over all the kernel). The "5
second" wakeup isn't really timing sensitive; in addition, even with
rounding it'll still happen every 5 seconds (with the exception of the
very first time, which is likely to be rounded up to somewhere closer
to 6 seconds).

(Ported from similar JBD patch made by Arjan van de Ven to
fs/jbd/transaction.c)

Cc: Arjan van de Ven <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/jbd2/transaction.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 70b3199..0c8adab 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -54,7 +54,7 @@ jbd2_get_transaction(journal_t *journal, transaction_t *transaction)
spin_lock_init(&transaction->t_handle_lock);

/* Set up the commit timer for the new transaction. */
- journal->j_commit_timer.expires = transaction->t_expires;
+ journal->j_commit_timer.expires = round_jiffies(transaction->t_expires);
add_timer(&journal->j_commit_timer);

J_ASSERT(journal->j_running_transaction == NULL);
--
1.5.4.rc3.31.g1271-dirty
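
To illustrate why rounding lets timers batch together, here is a
simplified userspace model of rounding an expiry to a whole-second
boundary. It is a sketch, not the kernel implementation: the function name
is hypothetical, the one-quarter-second round-down threshold is an
assumption, and any per-CPU skew the real round_jiffies() applies is left
out.

```c
#include <assert.h>

/*
 * Round expiry j (in ticks) to a multiple of hz, so that timers which
 * expire at roughly the same time fire on the same tick.
 */
static unsigned long round_jiffies_sketch(unsigned long j, unsigned long now,
					  unsigned long hz)
{
	unsigned long rem, rounded;

	assert(hz > 0);
	rem = j % hz;
	if (rem < hz / 4)
		rounded = j - rem;       /* just past a boundary: round down */
	else
		rounded = j - rem + hz;  /* otherwise round up */

	/* never hand back an expiry that is already in the past */
	return rounded > now ? rounded : j;
}
```

Because most expiries round up, the first wakeup after a transaction
starts tends to land later than the nominal 5 seconds, as the commit
message notes.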

2008-01-22 03:04:41

by Theodore Ts'o

Subject: [PATCH 39/49] ext4: Add ext4_find_next_bit()

From: Aneesh Kumar K.V <[email protected]>

This function is used by the ext4 multi block allocator patches.

Also add generic_find_next_le_bit

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
include/asm-arm/bitops.h | 2 +
include/asm-generic/bitops/ext2-non-atomic.h | 2 +
include/asm-generic/bitops/le.h | 4 ++
include/asm-m68k/bitops.h | 2 +
include/asm-m68knommu/bitops.h | 2 +
include/asm-powerpc/bitops.h | 4 ++
include/asm-s390/bitops.h | 2 +
include/linux/ext4_fs.h | 1 +
lib/find_next_bit.c | 43 ++++++++++++++++++++++++++
9 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/include/asm-arm/bitops.h b/include/asm-arm/bitops.h
index 47a6b08..5c60bfc 100644
--- a/include/asm-arm/bitops.h
+++ b/include/asm-arm/bitops.h
@@ -310,6 +310,8 @@ static inline int constant_fls(int x)
_find_first_zero_bit_le(p,sz)
#define ext2_find_next_zero_bit(p,sz,off) \
_find_next_zero_bit_le(p,sz,off)
+#define ext2_find_next_bit(p, sz, off) \
+ _find_next_bit_le(p, sz, off)

/*
* Minix is defined to use little-endian byte ordering.
diff --git a/include/asm-generic/bitops/ext2-non-atomic.h b/include/asm-generic/bitops/ext2-non-atomic.h
index 1697404..63cf822 100644
--- a/include/asm-generic/bitops/ext2-non-atomic.h
+++ b/include/asm-generic/bitops/ext2-non-atomic.h
@@ -14,5 +14,7 @@
generic_find_first_zero_le_bit((unsigned long *)(addr), (size))
#define ext2_find_next_zero_bit(addr, size, off) \
generic_find_next_zero_le_bit((unsigned long *)(addr), (size), (off))
+#define ext2_find_next_bit(addr, size, off) \
+ generic_find_next_le_bit((unsigned long *)(addr), (size), (off))

#endif /* _ASM_GENERIC_BITOPS_EXT2_NON_ATOMIC_H_ */
diff --git a/include/asm-generic/bitops/le.h b/include/asm-generic/bitops/le.h
index b9c7e5d..80e3bf1 100644
--- a/include/asm-generic/bitops/le.h
+++ b/include/asm-generic/bitops/le.h
@@ -20,6 +20,8 @@
#define generic___test_and_clear_le_bit(nr, addr) __test_and_clear_bit(nr, addr)

#define generic_find_next_zero_le_bit(addr, size, offset) find_next_zero_bit(addr, size, offset)
+#define generic_find_next_le_bit(addr, size, offset) \
+ find_next_bit(addr, size, offset)

#elif defined(__BIG_ENDIAN)

@@ -42,6 +44,8 @@

extern unsigned long generic_find_next_zero_le_bit(const unsigned long *addr,
unsigned long size, unsigned long offset);
+extern unsigned long generic_find_next_le_bit(const unsigned long *addr,
+ unsigned long size, unsigned long offset);

#else
#error "Please fix <asm/byteorder.h>"
diff --git a/include/asm-m68k/bitops.h b/include/asm-m68k/bitops.h
index 2976b5d..83d1f28 100644
--- a/include/asm-m68k/bitops.h
+++ b/include/asm-m68k/bitops.h
@@ -410,6 +410,8 @@ static inline int ext2_find_next_zero_bit(const void *vaddr, unsigned size,
res = ext2_find_first_zero_bit (p, size - 32 * (p - addr));
return (p - addr) * 32 + res;
}
+#define ext2_find_next_bit(addr, size, off) \
+ generic_find_next_le_bit((unsigned long *)(addr), (size), (off))

#endif /* __KERNEL__ */

diff --git a/include/asm-m68knommu/bitops.h b/include/asm-m68knommu/bitops.h
index f8dfb7b..f43afe1 100644
--- a/include/asm-m68knommu/bitops.h
+++ b/include/asm-m68knommu/bitops.h
@@ -294,6 +294,8 @@ found_middle:
return result + ffz(__swab32(tmp));
}

+#define ext2_find_next_bit(addr, size, off) \
+ generic_find_next_le_bit((unsigned long *)(addr), (size), (off))
#include <asm-generic/bitops/minix.h>

#endif /* __KERNEL__ */
diff --git a/include/asm-powerpc/bitops.h b/include/asm-powerpc/bitops.h
index 733b4af..220d9a7 100644
--- a/include/asm-powerpc/bitops.h
+++ b/include/asm-powerpc/bitops.h
@@ -359,6 +359,8 @@ static __inline__ int test_le_bit(unsigned long nr,
unsigned long generic_find_next_zero_le_bit(const unsigned long *addr,
unsigned long size, unsigned long offset);

+unsigned long generic_find_next_le_bit(const unsigned long *addr,
+ unsigned long size, unsigned long offset);
/* Bitmap functions for the ext2 filesystem */

#define ext2_set_bit(nr,addr) \
@@ -378,6 +380,8 @@ unsigned long generic_find_next_zero_le_bit(const unsigned long *addr,
#define ext2_find_next_zero_bit(addr, size, off) \
generic_find_next_zero_le_bit((unsigned long*)addr, size, off)

+#define ext2_find_next_bit(addr, size, off) \
+ generic_find_next_le_bit((unsigned long *)addr, size, off)
/* Bitmap functions for the minix filesystem. */

#define minix_test_and_set_bit(nr,addr) \
diff --git a/include/asm-s390/bitops.h b/include/asm-s390/bitops.h
index 34d9a63..dba6fec 100644
--- a/include/asm-s390/bitops.h
+++ b/include/asm-s390/bitops.h
@@ -772,6 +772,8 @@ static inline int sched_find_first_bit(unsigned long *b)
test_and_clear_bit((nr)^(__BITOPS_WORDSIZE - 8), (unsigned long *)addr)
#define ext2_test_bit(nr, addr) \
test_bit((nr)^(__BITOPS_WORDSIZE - 8), (unsigned long *)addr)
+#define ext2_find_next_bit(addr, size, off) \
+ generic_find_next_le_bit((unsigned long *)(addr), (size), (off))

#ifndef __s390x__

diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index 213974f..d0b7ca9 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -493,6 +493,7 @@ do { \
#define ext4_test_bit ext2_test_bit
#define ext4_find_first_zero_bit ext2_find_first_zero_bit
#define ext4_find_next_zero_bit ext2_find_next_zero_bit
+#define ext4_find_next_bit ext2_find_next_bit

/*
* Maximal mount counts between two filesystem checks
diff --git a/lib/find_next_bit.c b/lib/find_next_bit.c
index bda0d71..78ccd73 100644
--- a/lib/find_next_bit.c
+++ b/lib/find_next_bit.c
@@ -178,4 +178,47 @@ found_middle_swap:

EXPORT_SYMBOL(generic_find_next_zero_le_bit);

+unsigned long generic_find_next_le_bit(const unsigned long *addr, unsigned
+ long size, unsigned long offset)
+{
+ const unsigned long *p = addr + BITOP_WORD(offset);
+ unsigned long result = offset & ~(BITS_PER_LONG - 1);
+ unsigned long tmp;
+
+ if (offset >= size)
+ return size;
+ size -= result;
+ offset &= (BITS_PER_LONG - 1UL);
+ if (offset) {
+ tmp = ext2_swabp(p++);
+ tmp &= (~0UL << offset);
+ if (size < BITS_PER_LONG)
+ goto found_first;
+ if (tmp)
+ goto found_middle;
+ size -= BITS_PER_LONG;
+ result += BITS_PER_LONG;
+ }
+
+ while (size & ~(BITS_PER_LONG - 1)) {
+ tmp = *(p++);
+ if (tmp)
+ goto found_middle_swap;
+ result += BITS_PER_LONG;
+ size -= BITS_PER_LONG;
+ }
+ if (!size)
+ return result;
+ tmp = ext2_swabp(p);
+found_first:
+ tmp &= (~0UL >> (BITS_PER_LONG - size));
+ if (tmp == 0UL) /* Are any bits set? */
+ return result + size; /* Nope. */
+found_middle:
+ return result + __ffs(tmp);
+
+found_middle_swap:
+ return result + __ffs(ext2_swab(tmp));
+}
+EXPORT_SYMBOL(generic_find_next_le_bit);
#endif /* __BIG_ENDIAN */
--
1.5.4.rc3.31.g1271-dirty
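
The word-at-a-time, byte-swapping loop in the patch above is an optimised
form of a simple contract: return the index of the next set bit at or after
offset in a little-endian bitmap, or size if there is none. A bit-at-a-time
sketch of that contract follows (hypothetical name, not tuned for speed).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Little-endian bitmap: bit n lives in byte n / 8 at position n % 8,
 * regardless of the host CPU's word endianness.
 */
static unsigned long find_next_le_bit_sketch(const unsigned char *map,
					     unsigned long size,
					     unsigned long offset)
{
	assert(map != NULL || size == 0);

	for (unsigned long n = offset; n < size; n++) {
		if (map[n / 8] & (1u << (n % 8)))
			return n;       /* found a set bit */
	}
	return size;                    /* no set bit in [offset, size) */
}
```

Addressing the bitmap by bytes is what makes the result independent of
host endianness; the kernel version achieves the same by swapping each
long on big-endian machines before testing it.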

2008-01-22 03:04:59

by Theodore Ts'o

Subject: [PATCH 20/49] ext4/super.c: fix #ifdef's (CONFIG_EXT4_* -> CONFIG_EXT4DEV_*)

From: Adrian Bunk <[email protected]>

Based on a report by Robert P. J. Day.

Signed-off-by: Adrian Bunk <[email protected]>
---
fs/ext4/super.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0931831..1484a08 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -706,7 +706,7 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
seq_puts(seq, ",debug");
if (test_opt(sb, OLDALLOC))
seq_puts(seq, ",oldalloc");
-#ifdef CONFIG_EXT4_FS_XATTR
+#ifdef CONFIG_EXT4DEV_FS_XATTR
if (test_opt(sb, XATTR_USER))
seq_puts(seq, ",user_xattr");
if (!test_opt(sb, XATTR_USER) &&
@@ -714,7 +714,7 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
seq_puts(seq, ",nouser_xattr");
}
#endif
-#ifdef CONFIG_EXT4_FS_POSIX_ACL
+#ifdef CONFIG_EXT4DEV_FS_POSIX_ACL
if (test_opt(sb, POSIX_ACL))
seq_puts(seq, ",acl");
if (!test_opt(sb, POSIX_ACL) && (def_mount_opts & EXT4_DEFM_ACL))
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:05:31

by Theodore Ts'o

Subject: [PATCH 41/49] ext4: Add multi block allocator for ext4

From: Alex Tomas <[email protected]>

Signed-off-by: Alex Tomas <[email protected]>
Signed-off-by: Andreas Dilger <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Eric Sandeen <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/Makefile | 2 +-
fs/ext4/balloc.c | 67 +-
fs/ext4/extents.c | 45 +-
fs/ext4/inode.c | 15 +-
fs/ext4/mballoc.c | 4551 ++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/migrate.c | 10 +-
fs/ext4/super.c | 61 +-
fs/ext4/xattr.c | 4 +-
include/linux/ext4_fs.h | 76 +-
include/linux/ext4_fs_i.h | 4 +
include/linux/ext4_fs_sb.h | 52 +
11 files changed, 4850 insertions(+), 37 deletions(-)
create mode 100644 fs/ext4/mballoc.c

diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index d5fd80b..ac6fa8c 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_EXT4DEV_FS) += ext4dev.o

ext4dev-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o \
- ext4_jbd2.o migrate.o
+ ext4_jbd2.o migrate.o mballoc.o

ext4dev-$(CONFIG_EXT4DEV_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext4dev-$(CONFIG_EXT4DEV_FS_POSIX_ACL) += acl.o
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 54d3da7..643046b 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -577,6 +577,8 @@ void ext4_discard_reservation(struct inode *inode)
struct ext4_reserve_window_node *rsv;
spinlock_t *rsv_lock = &EXT4_SB(inode->i_sb)->s_rsv_window_lock;

+ ext4_mb_discard_inode_preallocations(inode);
+
if (!block_i)
return;

@@ -785,19 +787,29 @@ error_return:
* @inode: inode
* @block: start physical block to free
* @count: number of blocks to count
+ * @metadata: Are these metadata blocks
*/
void ext4_free_blocks(handle_t *handle, struct inode *inode,
- ext4_fsblk_t block, unsigned long count)
+ ext4_fsblk_t block, unsigned long count,
+ int metadata)
{
struct super_block * sb;
unsigned long dquot_freed_blocks;

+ /* this isn't the right place to decide whether block is metadata
+ * inode.c/extents.c knows better, but for safety ... */
+ if (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode) ||
+ ext4_should_journal_data(inode))
+ metadata = 1;
+
sb = inode->i_sb;
- if (!sb) {
- printk ("ext4_free_blocks: nonexistent device");
- return;
- }
- ext4_free_blocks_sb(handle, sb, block, count, &dquot_freed_blocks);
+
+ if (!test_opt(sb, MBALLOC) || !EXT4_SB(sb)->s_group_info)
+ ext4_free_blocks_sb(handle, sb, block, count,
+ &dquot_freed_blocks);
+ else
+ ext4_mb_free_blocks(handle, inode, block, count,
+ metadata, &dquot_freed_blocks);
if (dquot_freed_blocks)
DQUOT_FREE_BLOCK(inode, dquot_freed_blocks);
return;
@@ -1576,7 +1588,7 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries)
}

/**
- * ext4_new_blocks() -- core block(s) allocation function
+ * ext4_new_blocks_old() -- core block(s) allocation function
* @handle: handle to this transaction
* @inode: file inode
* @goal: given target block(filesystem wide)
@@ -1589,7 +1601,7 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries)
* any specific goal block.
*
*/
-ext4_fsblk_t ext4_new_blocks(handle_t *handle, struct inode *inode,
+ext4_fsblk_t ext4_new_blocks_old(handle_t *handle, struct inode *inode,
ext4_fsblk_t goal, unsigned long *count, int *errp)
{
struct buffer_head *bitmap_bh = NULL;
@@ -1849,13 +1861,46 @@ out:
}

ext4_fsblk_t ext4_new_block(handle_t *handle, struct inode *inode,
- ext4_fsblk_t goal, int *errp)
+ ext4_fsblk_t goal, int *errp)
+{
+ struct ext4_allocation_request ar;
+ ext4_fsblk_t ret;
+
+ if (!test_opt(inode->i_sb, MBALLOC)) {
+ unsigned long count = 1;
+ ret = ext4_new_blocks_old(handle, inode, goal, &count, errp);
+ return ret;
+ }
+
+ memset(&ar, 0, sizeof(ar));
+ ar.inode = inode;
+ ar.goal = goal;
+ ar.len = 1;
+ ret = ext4_mb_new_blocks(handle, &ar, errp);
+ return ret;
+}
+
+ext4_fsblk_t ext4_new_blocks(handle_t *handle, struct inode *inode,
+ ext4_fsblk_t goal, unsigned long *count, int *errp)
{
- unsigned long count = 1;
+ struct ext4_allocation_request ar;
+ ext4_fsblk_t ret;

- return ext4_new_blocks(handle, inode, goal, &count, errp);
+ if (!test_opt(inode->i_sb, MBALLOC)) {
+ ret = ext4_new_blocks_old(handle, inode, goal, count, errp);
+ return ret;
+ }
+
+ memset(&ar, 0, sizeof(ar));
+ ar.inode = inode;
+ ar.goal = goal;
+ ar.len = *count;
+ ret = ext4_mb_new_blocks(handle, &ar, errp);
+ *count = ar.len;
+ return ret;
}

+
/**
* ext4_count_free_blocks() -- count filesystem free blocks
* @sb: superblock
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index a60227c..8cf5545 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -853,7 +853,7 @@ cleanup:
for (i = 0; i < depth; i++) {
if (!ablocks[i])
continue;
- ext4_free_blocks(handle, inode, ablocks[i], 1);
+ ext4_free_blocks(handle, inode, ablocks[i], 1, 1);
}
}
kfree(ablocks);
@@ -1698,7 +1698,7 @@ static int ext4_ext_rm_idx(handle_t *handle, struct inode *inode,
ext_debug("index is empty, remove it, free block %llu\n", leaf);
bh = sb_find_get_block(inode->i_sb, leaf);
ext4_forget(handle, 1, inode, bh, leaf);
- ext4_free_blocks(handle, inode, leaf, 1);
+ ext4_free_blocks(handle, inode, leaf, 1, 1);
return err;
}

@@ -1759,8 +1759,10 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
{
struct buffer_head *bh;
unsigned short ee_len = ext4_ext_get_actual_len(ex);
- int i;
+ int i, metadata = 0;

+ if (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode))
+ metadata = 1;
#ifdef EXTENTS_STATS
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -1789,7 +1791,7 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
bh = sb_find_get_block(inode->i_sb, start + i);
ext4_forget(handle, 0, inode, bh, start + i);
}
- ext4_free_blocks(handle, inode, start, num);
+ ext4_free_blocks(handle, inode, start, num, metadata);
} else if (from == le32_to_cpu(ex->ee_block)
&& to <= le32_to_cpu(ex->ee_block) + ee_len - 1) {
printk(KERN_INFO "strange request: removal %u-%u from %u:%u\n",
@@ -2287,6 +2289,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ext4_fsblk_t goal, newblock;
int err = 0, depth, ret;
unsigned long allocated = 0;
+ struct ext4_allocation_request ar;

__clear_bit(BH_New, &bh_result->b_state);
ext_debug("blocks %u/%lu requested for inode %u\n",
@@ -2397,8 +2400,15 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
if (S_ISREG(inode->i_mode) && (!EXT4_I(inode)->i_block_alloc_info))
ext4_init_block_alloc_info(inode);

- /* allocate new block */
- goal = ext4_ext_find_goal(inode, path, iblock);
+ /* find neighbour allocated blocks */
+ ar.lleft = iblock;
+ err = ext4_ext_search_left(inode, path, &ar.lleft, &ar.pleft);
+ if (err)
+ goto out2;
+ ar.lright = iblock;
+ err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright);
+ if (err)
+ goto out2;

/*
* See if request is beyond maximum number of blocks we can have in
@@ -2421,7 +2431,18 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
allocated = le16_to_cpu(newex.ee_len);
else
allocated = max_blocks;
- newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err);
+
+ /* allocate new block */
+ ar.inode = inode;
+ ar.goal = ext4_ext_find_goal(inode, path, iblock);
+ ar.logical = iblock;
+ ar.len = allocated;
+ if (S_ISREG(inode->i_mode))
+ ar.flags = EXT4_MB_HINT_DATA;
+ else
+ /* disable in-core preallocation for non-regular files */
+ ar.flags = 0;
+ newblock = ext4_mb_new_blocks(handle, &ar, &err);
if (!newblock)
goto out2;
ext_debug("allocate new block: goal %llu, found %llu/%lu\n",
@@ -2429,14 +2450,17 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,

/* try to insert new extent into found leaf and return */
ext4_ext_store_pblock(&newex, newblock);
- newex.ee_len = cpu_to_le16(allocated);
+ newex.ee_len = cpu_to_le16(ar.len);
if (create == EXT4_CREATE_UNINITIALIZED_EXT) /* Mark uninitialized */
ext4_ext_mark_uninitialized(&newex);
err = ext4_ext_insert_extent(handle, inode, path, &newex);
if (err) {
/* free data blocks we just allocated */
+ /* not a good idea to call discard here directly,
+ * but otherwise we'd need to call it every free() */
+ ext4_mb_discard_inode_preallocations(inode);
ext4_free_blocks(handle, inode, ext_pblock(&newex),
- le16_to_cpu(newex.ee_len));
+ le16_to_cpu(newex.ee_len), 0);
goto out2;
}

@@ -2445,6 +2469,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,

/* previous routine could use block we allocated */
newblock = ext_pblock(&newex);
+ allocated = le16_to_cpu(newex.ee_len);
outnew:
__set_bit(BH_New, &bh_result->b_state);

@@ -2496,6 +2521,8 @@ void ext4_ext_truncate(struct inode * inode, struct page *page)
down_write(&EXT4_I(inode)->i_data_sem);
ext4_ext_invalidate_cache(inode);

+ ext4_mb_discard_inode_preallocations(inode);
+
/*
* TODO: optimization is possible here.
* Probably we need not scan at all,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3c013e5..2947800 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -551,7 +551,7 @@ static int ext4_alloc_blocks(handle_t *handle, struct inode *inode,
return ret;
failed_out:
for (i = 0; i <index; i++)
- ext4_free_blocks(handle, inode, new_blocks[i], 1);
+ ext4_free_blocks(handle, inode, new_blocks[i], 1, 0);
return ret;
}

@@ -650,9 +650,9 @@ failed:
ext4_journal_forget(handle, branch[i].bh);
}
for (i = 0; i <indirect_blks; i++)
- ext4_free_blocks(handle, inode, new_blocks[i], 1);
+ ext4_free_blocks(handle, inode, new_blocks[i], 1, 0);

- ext4_free_blocks(handle, inode, new_blocks[i], num);
+ ext4_free_blocks(handle, inode, new_blocks[i], num, 0);

return err;
}
@@ -749,9 +749,10 @@ err_out:
for (i = 1; i <= num; i++) {
BUFFER_TRACE(where[i].bh, "call jbd2_journal_forget");
ext4_journal_forget(handle, where[i].bh);
- ext4_free_blocks(handle,inode,le32_to_cpu(where[i-1].key),1);
+ ext4_free_blocks(handle, inode,
+ le32_to_cpu(where[i-1].key), 1, 0);
}
- ext4_free_blocks(handle, inode, le32_to_cpu(where[num].key), blks);
+ ext4_free_blocks(handle, inode, le32_to_cpu(where[num].key), blks, 0);

return err;
}
@@ -2051,7 +2052,7 @@ static void ext4_clear_blocks(handle_t *handle, struct inode *inode,
}
}

- ext4_free_blocks(handle, inode, block_to_free, count);
+ ext4_free_blocks(handle, inode, block_to_free, count, 0);
}

/**
@@ -2224,7 +2225,7 @@ static void ext4_free_branches(handle_t *handle, struct inode *inode,
ext4_journal_test_restart(handle, inode);
}

- ext4_free_blocks(handle, inode, nr, 1);
+ ext4_free_blocks(handle, inode, nr, 1, 1);

if (parent_bh) {
/*
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
new file mode 100644
index 0000000..0398aa0
--- /dev/null
+++ b/fs/ext4/mballoc.c
@@ -0,0 +1,4551 @@
+/*
+ * Copyright (c) 2003-2006, Cluster File Systems, Inc, [email protected]
+ * Written by Alex Tomas <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-
+ */
+
+
+/*
+ * mballoc.c contains the multiblocks allocation routines
+ */
+
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/ext4_jbd2.h>
+#include <linux/ext4_fs.h>
+#include <linux/quotaops.h>
+#include <linux/buffer_head.h>
+#include <linux/module.h>
+#include <linux/swap.h>
+#include <linux/proc_fs.h>
+#include <linux/pagemap.h>
+#include <linux/seq_file.h>
+#include <linux/version.h>
+#include "group.h"
+
+/*
+ * MUSTDO:
+ * - test ext4_ext_search_left() and ext4_ext_search_right()
+ * - search for metadata in few groups
+ *
+ * TODO v4:
+ * - normalization should take into account whether file is still open
+ * - discard preallocations if no free space left (policy?)
+ * - don't normalize tails
+ * - quota
+ * - reservation for superuser
+ *
+ * TODO v3:
+ * - bitmap read-ahead (proposed by Oleg Drokin aka green)
+ * - track min/max extents in each group for better group selection
+ * - mb_mark_used() may allocate chunk right after splitting buddy
+ * - tree of groups sorted by number of free blocks
+ * - error handling
+ */
+
+/*
+ * An allocation request asks for multiple blocks near the specified
+ * goal block.
+ *
+ * During the initialization phase of the allocator we decide whether to
+ * use group preallocation or inode preallocation depending on the size of
+ * the file. The size used is the larger of the current file size and the
+ * file size that would result from the allocation. If the size is
+ * less than sbi->s_mb_stream_request we select group
+ * preallocation. The default value of s_mb_stream_request is 16
+ * blocks. This can also be tuned via
+ * /proc/fs/ext4/<partition>/stream_req. The value is expressed as a
+ * number of blocks.
+ *
+ * The main motivation for having small files use group preallocation is to
+ * keep small files close together on disk.
+ *
+ * In the first stage the allocator looks at the inode prealloc list.
+ * ext4_inode_info->i_prealloc_list contains the list of prealloc spaces
+ * for this particular inode. An inode prealloc space is represented as:
+ *
+ * pa_lstart -> the logical start block for this prealloc space
+ * pa_pstart -> the physical start block for this prealloc space
+ * pa_len -> length of this prealloc space
+ * pa_free -> free space available in this prealloc space
+ *
+ * The inode preallocation space is selected by looking at the _logical_
+ * start block. We consume a particular prealloc space only if the logical
+ * file block falls within its range. This makes sure that contiguous
+ * file blocks are backed by contiguous physical blocks.
+ *
+ * The important thing to note about an inode prealloc space is that
+ * we don't modify any of its values except
+ * pa_free.
+ *
+ * If we are not able to find blocks in the inode prealloc space and
+ * the group allocation flag is set, we look at the locality group
+ * prealloc space. These are per-CPU prealloc lists represented as
+ *
+ * ext4_sb_info.s_locality_groups[smp_processor_id()]
+ *
+ * The reason for having a per-CPU locality group is to reduce contention
+ * between CPUs. It is possible to get scheduled at this point.
+ *
+ * A locality group prealloc space is used if it has
+ * enough free space (pa_free) within the prealloc space.
+ *
+ * If we can't allocate blocks via the inode prealloc and/or locality group
+ * prealloc, we look at the buddy cache. The buddy cache is represented
+ * by ext4_sb_info.s_buddy_cache (struct inode), whose file offsets are
+ * mapped to the buddy and bitmap information of the different
+ * groups. The buddy information is attached to the buddy cache inode so that
+ * we can access it through the page cache. The information for
+ * each group is loaded via ext4_mb_load_buddy. It consists of the
+ * block bitmap and the buddy information, and is stored in the
+ * inode as:
+ *
+ * { page }
+ * [ group 0 buddy][ group 0 bitmap] [group 1][ group 1]...
+ *
+ *
+ * One block each is used for the bitmap and the buddy information, so each
+ * group takes up 2 blocks. A page can contain blocks_per_page
+ * (PAGE_CACHE_SIZE / blocksize) blocks, so it can hold information for
+ * groups_per_page groups, which is blocks_per_page/2.
+ *
+ * The buddy cache inode is not stored on disk. The inode is thrown
+ * away when the filesystem is unmounted.
+ *
+ * We look for count blocks in the buddy cache. If we are able
+ * to locate that many free blocks, we return along with information
+ * about the rest of the contiguous physical blocks available.
+ *
+ * Before allocating blocks via the buddy cache we normalize the request.
+ * This ensures that we ask for more blocks than we need. The extra
+ * blocks that we get after allocation are added to the respective prealloc
+ * list. For inode preallocation we follow a list of heuristics
+ * based on file size; these can be found in ext4_mb_normalize_request. If
+ * we are doing a group prealloc we try to normalize the request to
+ * sbi->s_mb_group_prealloc. The default value of s_mb_group_prealloc is
+ * 512 blocks. This can be tuned via
+ * /proc/fs/ext4/<partition>/group_prealloc. The value is expressed as a
+ * number of blocks. If the file system is mounted with the -o
+ * stripe=<value> option, the group prealloc request is normalized to the
+ * stripe value (sbi->s_stripe).
+ *
+ * The regular allocator (using the buddy cache) supports a few tunables:
+ *
+ * /proc/fs/ext4/<partition>/min_to_scan
+ * /proc/fs/ext4/<partition>/max_to_scan
+ * /proc/fs/ext4/<partition>/order2_req
+ *
+ * The regular allocator uses the buddy scan only if the request length is
+ * a power of 2 blocks and the order of allocation is >=
+ * sbi->s_mb_order2_reqs. The value of s_mb_order2_reqs can be tuned via
+ * /proc/fs/ext4/<partition>/order2_req. If the request length is equal to
+ * the stripe size (sbi->s_stripe), we search for contiguous blocks in
+ * stripe-size units. This should result in better allocation on RAID
+ * setups. Otherwise we search the specific group's bitmap for the best
+ * extents. The tunables min_to_scan and max_to_scan control the behaviour
+ * here. min_to_scan indicates how long mballoc __must__ look for a best
+ * extent and max_to_scan indicates how long mballoc __can__ look for a
+ * best extent among the found extents. Searching for blocks starts with
+ * the group specified as the goal value in the allocation context via
+ * ac_g_ex. Each group is first checked to see whether it
+ * can be used for allocation. ext4_mb_good_group explains how the groups
+ * are checked.
+ *
+ * Both kinds of prealloc space are populated as described above. The first
+ * request hits the buddy cache, which results in the prealloc
+ * space getting filled. The prealloc space is then used for
+ * subsequent requests.
+ */
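The buddy cache layout described in the comment above (two blocks per group, blocks_per_page blocks per page) can be sketched in user space as follows. This is only an illustration under stated assumptions: `struct buddy_cache_pos` and `buddy_cache_locate()` are hypothetical names that do not appear in the patch, and the sketch deliberately does not say which of a group's two blocks holds the bitmap and which holds the buddy data.

```c
#include <assert.h>

/* Hypothetical helper type; not part of the patch. */
struct buddy_cache_pos {
	unsigned long page;	/* index of the page in the buddy cache inode */
	unsigned long offset;	/* byte offset of the block inside that page */
};

/*
 * Locate block `slot` (0 or 1) of `group` inside the buddy cache inode,
 * assuming each group owns two consecutive blocks as the comment states.
 */
static struct buddy_cache_pos
buddy_cache_locate(unsigned long group, unsigned int slot,
		   unsigned long page_size, unsigned long block_size)
{
	unsigned long blocks_per_page = page_size / block_size;
	unsigned long block = group * 2 + slot;
	struct buddy_cache_pos pos;

	assert(slot < 2);
	assert(blocks_per_page > 0);

	pos.page = block / blocks_per_page;
	pos.offset = (block % blocks_per_page) * block_size;
	return pos;
}
```

With 4096-byte pages and 1024-byte blocks, four blocks fit in a page, so one page holds the information for two groups. With 4096-byte blocks each group needs two full pages.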
+
+/*
+ * mballoc operates on the following data:
+ * - on-disk bitmap
+ * - in-core buddy (actually includes buddy and bitmap)
+ * - preallocation descriptors (PAs)
+ *
+ * there are two types of preallocations:
+ * - inode
+ * assigned to a specific inode and can be used for this inode only.
+ * it describes part of the inode's space preallocated to specific
+ * physical blocks. any block from that preallocation can be used
+ * independently. the descriptor just tracks the number of blocks left
+ * unused. so, before taking some block from the descriptor, one must
+ * make sure the corresponding logical block isn't allocated yet. this
+ * also means that freeing any block within the descriptor's range
+ * must discard all preallocated blocks.
+ * - locality group
+ * assigned to specific locality group which does not translate to
+ * permanent set of inodes: inode can join and leave group. space
+ * from this type of preallocation can be used for any inode. thus
+ * it's consumed from the beginning to the end.
+ *
+ * relation between them can be expressed as:
+ * in-core buddy = on-disk bitmap + preallocation descriptors
+ *
+ * this means the blocks mballoc considers used are:
+ * - allocated blocks (persistent)
+ * - preallocated blocks (non-persistent)
+ *
+ * consistency in mballoc world means that at any time a block is either
+ * free or used in ALL structures. notice: "any time" should not be read
+ * literally -- time is discrete and delimited by locks.
+ *
+ * to keep it simple, we don't use block numbers, instead we count number of
+ * blocks: how many blocks marked used/free in on-disk bitmap, buddy and PA.
+ *
+ * all operations can be expressed as:
+ * - init buddy: buddy = on-disk + PAs
+ * - new PA: buddy += N; PA = N
+ * - use inode PA: on-disk += N; PA -= N
+ * - discard inode PA buddy -= on-disk - PA; PA = 0
+ * - use locality group PA on-disk += N; PA -= N
+ * - discard locality group PA buddy -= PA; PA = 0
+ * note: 'buddy -= on-disk - PA' is used to show that on-disk bitmap
+ * is used in real operation because we can't know actual used
+ * bits from PA, only from on-disk bitmap
+ *
+ * if we follow this strict logic, then all operations above should be atomic.
+ * given some of them can block, we'd have to use something like semaphores
+ * killing performance on high-end SMP hardware. let's try to relax it using
+ * the following knowledge:
+ * 1) if buddy is referenced, it's already initialized
+ * 2) while block is used in buddy and the buddy is referenced,
+ * nobody can re-allocate that block
+ * 3) we work on bitmaps and '+' actually means 'set bits'. if on-disk has
+ * bit set and PA claims same block, it's OK. IOW, one can set a bit in
+ * the on-disk bitmap if buddy has the same bit set and/or a PA covers the
+ * corresponding block
+ *
+ * so, now we're building a concurrency table:
+ * - init buddy vs.
+ * - new PA
+ * blocks for PA are allocated in the buddy, buddy must be referenced
+ * until PA is linked to allocation group to avoid concurrent buddy init
+ * - use inode PA
+ * we need to make sure that either on-disk bitmap or PA has uptodate data
+ * given (3) we care that PA-=N operation doesn't interfere with init
+ * - discard inode PA
+ * the simplest way would be to have buddy initialized by the discard
+ * - use locality group PA
+ * again PA-=N must be serialized with init
+ * - discard locality group PA
+ * the simplest way would be to have buddy initialized by the discard
+ * - new PA vs.
+ * - use inode PA
+ * i_data_sem serializes them
+ * - discard inode PA
+ * discard process must wait until PA isn't used by another process
+ * - use locality group PA
+ * some mutex should serialize them
+ * - discard locality group PA
+ * discard process must wait until PA isn't used by another process
+ * - use inode PA
+ * - use inode PA
+ * i_data_sem or another mutex should serialize them
+ * - discard inode PA
+ * discard process must wait until PA isn't used by another process
+ * - use locality group PA
+ * nothing wrong here -- they're different PAs covering different blocks
+ * - discard locality group PA
+ * discard process must wait until PA isn't used by another process
+ *
+ * now we're ready to draw a few conclusions:
+ * - while a PA is referenced, no discard is possible
+ * - a PA stays referenced until its blocks are marked in the on-disk bitmap
+ * - PA changes only after on-disk bitmap
+ * - discard must not compete with init. either init is done before
+ * any discard or they're serialized somehow
+ * - buddy init as sum of on-disk bitmap and PAs is done atomically
+ *
+ * a special case when we've used PA to emptiness. no need to modify buddy
+ * in this case, but we should care about concurrent init
+ *
+ */
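The counting rules in the comment above ("buddy = on-disk + PAs" and the operations that follow) can be checked with a toy model. The sketch below is not mballoc code: `struct mb_counts` and the `mb_*` helpers are hypothetical names, it tracks a single PA only, and it ignores locking entirely. It only shows that each listed operation preserves the stated relation.

```c
#include <assert.h>

/* Toy counters: number of blocks marked used in each structure. */
struct mb_counts {
	unsigned int on_disk;	/* blocks set in the on-disk bitmap */
	unsigned int buddy;	/* blocks set in the in-core buddy */
	unsigned int pa;	/* blocks still held by the single PA */
};

/* The relation from the comment: buddy = on-disk + PAs. */
static int mb_consistent(const struct mb_counts *c)
{
	return c->buddy == c->on_disk + c->pa;
}

/* init buddy: buddy = on-disk + PAs */
static void mb_init_buddy(struct mb_counts *c)
{
	c->buddy = c->on_disk + c->pa;
}

/* new PA: buddy += N; PA = N (the model allows one PA at a time) */
static void mb_new_pa(struct mb_counts *c, unsigned int n)
{
	assert(c->pa == 0);
	c->buddy += n;
	c->pa = n;
}

/* use PA: on-disk += N; PA -= N (buddy is left untouched) */
static void mb_use_pa(struct mb_counts *c, unsigned int n)
{
	assert(n <= c->pa);
	c->on_disk += n;
	c->pa -= n;
}

/* discard PA: buddy -= PA; PA = 0 */
static void mb_discard_pa(struct mb_counts *c)
{
	c->buddy -= c->pa;
	c->pa = 0;
}
```

Note that using a PA never touches the buddy counter: the blocks were already accounted as used there when the PA was created, which is why only the on-disk bitmap and the PA change.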
+
+ /*
+ * Logic in few words:
+ *
+ * - allocation:
+ * load group
+ * find blocks
+ * mark bits in on-disk bitmap
+ * release group
+ *
+ * - use preallocation:
+ * find proper PA (per-inode or group)
+ * load group
+ * mark bits in on-disk bitmap
+ * release group
+ * release PA
+ *
+ * - free:
+ * load group
+ * mark bits in on-disk bitmap
+ * release group
+ *
+ * - discard preallocations in group:
+ * mark PAs deleted
+ * move them onto local list
+ * load on-disk bitmap
+ * load group
+ * remove PA from object (inode or locality group)
+ * mark free blocks in-core
+ *
+ * - discard inode's preallocations:
+ */
+
+/*
+ * Locking rules
+ *
+ * Locks:
+ * - bitlock on a group (group)
+ * - object (inode/locality) (object)
+ * - per-pa lock (pa)
+ *
+ * Paths:
+ * - new pa
+ * object
+ * group
+ *
+ * - find and use pa:
+ * pa
+ *
+ * - release consumed pa:
+ * pa
+ * group
+ * object
+ *
+ * - generate in-core bitmap:
+ * group
+ * pa
+ *
+ * - discard all for given object (inode, locality group):
+ * object
+ * pa
+ * group
+ *
+ * - discard all for given group:
+ * group
+ * pa
+ * group
+ * object
+ *
+ */
+
+/*
+ * with AGGRESSIVE_CHECK allocator runs consistency checks over
+ * structures. these checks slow things down a lot
+ */
+#define AGGRESSIVE_CHECK__
+
+/*
+ * with DOUBLE_CHECK defined mballoc creates persistent in-core
+ * bitmaps, maintains and uses them to check for double allocations
+ */
+#define DOUBLE_CHECK__
+
+/*
+ */
+#define MB_DEBUG__
+#ifdef MB_DEBUG
+#define mb_debug(fmt, a...) printk(fmt, ##a)
+#else
+#define mb_debug(fmt, a...)
+#endif
+
+/*
+ * with EXT4_MB_HISTORY mballoc stores last N allocations in memory
+ * and you can monitor it in /proc/fs/ext4/<dev>/mb_history
+ */
+#define EXT4_MB_HISTORY
+#define EXT4_MB_HISTORY_ALLOC 1 /* allocation */
+#define EXT4_MB_HISTORY_PREALLOC 2 /* preallocated blocks used */
+#define EXT4_MB_HISTORY_DISCARD 4 /* preallocation discarded */
+#define EXT4_MB_HISTORY_FREE 8 /* free */
+
+#define EXT4_MB_HISTORY_DEFAULT (EXT4_MB_HISTORY_ALLOC | \
+ EXT4_MB_HISTORY_PREALLOC)
+
+/*
+ * How long mballoc can look for a best extent (in found extents)
+ */
+#define MB_DEFAULT_MAX_TO_SCAN 200
+
+/*
+ * How long mballoc must look for a best extent
+ */
+#define MB_DEFAULT_MIN_TO_SCAN 10
+
+/*
+ * How many groups mballoc will scan looking for the best chunk
+ */
+#define MB_DEFAULT_MAX_GROUPS_TO_SCAN 5
+
+/*
+ * with 'ext4_mb_stats' allocator will collect stats that will be
+ * shown at umount. The collecting costs though!
+ */
+#define MB_DEFAULT_STATS 1
+
+/*
+ * files smaller than MB_DEFAULT_STREAM_THRESHOLD are served
+ * by the stream allocator, whose purpose is to pack requests
+ * as close to each other as possible to produce smooth I/O traffic.
+ * We use locality group prealloc space for stream requests.
+ * This can be tuned via /proc/fs/ext4/<partition>/stream_req
+ */
+#define MB_DEFAULT_STREAM_THRESHOLD 16 /* 64K */
+
+/*
+ * for which requests use 2^N search using buddies
+ */
+#define MB_DEFAULT_ORDER2_REQS 2
+
+/*
+ * default group prealloc size 512 blocks
+ */
+#define MB_DEFAULT_GROUP_PREALLOC 512
+
+static struct kmem_cache *ext4_pspace_cachep;
+
+#ifdef EXT4_BB_MAX_BLOCKS
+#undef EXT4_BB_MAX_BLOCKS
+#endif
+#define EXT4_BB_MAX_BLOCKS 30
+
+struct ext4_free_metadata {
+ ext4_group_t group;
+ unsigned short num;
+ ext4_grpblk_t blocks[EXT4_BB_MAX_BLOCKS];
+ struct list_head list;
+};
+
+struct ext4_group_info {
+ unsigned long bb_state;
+ unsigned long bb_tid;
+ struct ext4_free_metadata *bb_md_cur;
+ unsigned short bb_first_free;
+ unsigned short bb_free;
+ unsigned short bb_fragments;
+ struct list_head bb_prealloc_list;
+#ifdef DOUBLE_CHECK
+ void *bb_bitmap;
+#endif
+ unsigned short bb_counters[];
+};
+
+#define EXT4_GROUP_INFO_NEED_INIT_BIT 0
+#define EXT4_GROUP_INFO_LOCKED_BIT 1
+
+#define EXT4_MB_GRP_NEED_INIT(grp) \
+ (test_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &((grp)->bb_state)))
+
+
+struct ext4_prealloc_space {
+ struct list_head pa_inode_list;
+ struct list_head pa_group_list;
+ union {
+ struct list_head pa_tmp_list;
+ struct rcu_head pa_rcu;
+ } u;
+ spinlock_t pa_lock;
+ atomic_t pa_count;
+ unsigned pa_deleted;
+ ext4_fsblk_t pa_pstart; /* phys. block */
+ ext4_lblk_t pa_lstart; /* log. block */
+ unsigned short pa_len; /* len of preallocated chunk */
+ unsigned short pa_free; /* how many blocks are free */
+ unsigned short pa_linear; /* consumed in one direction
+ * strictly, for grp prealloc */
+ spinlock_t *pa_obj_lock;
+ struct inode *pa_inode; /* hack, for history only */
+};
+
+
+struct ext4_free_extent {
+ ext4_lblk_t fe_logical;
+ ext4_grpblk_t fe_start;
+ ext4_group_t fe_group;
+ int fe_len;
+};
+
+/*
+ * Locality group:
+ * we try to group all related changes together
+ * so that writeback can flush/allocate them together as well
+ */
+struct ext4_locality_group {
+ /* for allocator */
+ struct semaphore lg_sem; /* to serialize allocates */
+ struct list_head lg_prealloc_list;/* list of preallocations */
+ spinlock_t lg_prealloc_lock;
+};
+
+struct ext4_allocation_context {
+ struct inode *ac_inode;
+ struct super_block *ac_sb;
+
+ /* original request */
+ struct ext4_free_extent ac_o_ex;
+
+ /* goal request (after normalization) */
+ struct ext4_free_extent ac_g_ex;
+
+ /* the best found extent */
+ struct ext4_free_extent ac_b_ex;
+
+ /* copy of the best found extent taken before preallocation efforts */
+ struct ext4_free_extent ac_f_ex;
+
+ /* number of iterations done. we have to track to limit searching */
+ unsigned long ac_ex_scanned;
+ __u16 ac_groups_scanned;
+ __u16 ac_found;
+ __u16 ac_tail;
+ __u16 ac_buddy;
+ __u16 ac_flags; /* allocation hints */
+ __u8 ac_status;
+ __u8 ac_criteria;
+ __u8 ac_repeats;
+ __u8 ac_2order; /* if request is to allocate 2^N blocks and
+ * N > 0, the field stores N, otherwise 0 */
+ __u8 ac_op; /* operation, for history only */
+ struct page *ac_bitmap_page;
+ struct page *ac_buddy_page;
+ struct ext4_prealloc_space *ac_pa;
+ struct ext4_locality_group *ac_lg;
+};
+
+#define AC_STATUS_CONTINUE 1
+#define AC_STATUS_FOUND 2
+#define AC_STATUS_BREAK 3
+
+struct ext4_mb_history {
+ struct ext4_free_extent orig; /* orig allocation */
+ struct ext4_free_extent goal; /* goal allocation */
+ struct ext4_free_extent result; /* result allocation */
+ unsigned pid;
+ unsigned ino;
+ __u16 found; /* how many extents have been found */
+ __u16 groups; /* how many groups have been scanned */
+ __u16 tail; /* what tail broke some buddy */
+ __u16 buddy; /* buddy the tail ^^^ broke */
+ __u16 flags;
+ __u8 cr:3; /* which phase the result extent was found at */
+ __u8 op:4;
+ __u8 merged:1;
+};
+
+struct ext4_buddy {
+ struct page *bd_buddy_page;
+ void *bd_buddy;
+ struct page *bd_bitmap_page;
+ void *bd_bitmap;
+ struct ext4_group_info *bd_info;
+ struct super_block *bd_sb;
+ __u16 bd_blkbits;
+ ext4_group_t bd_group;
+};
+#define EXT4_MB_BITMAP(e4b) ((e4b)->bd_bitmap)
+#define EXT4_MB_BUDDY(e4b) ((e4b)->bd_buddy)
+
+#ifndef EXT4_MB_HISTORY
+#define ext4_mb_store_history(ac)
+#else
+static void ext4_mb_store_history(struct ext4_allocation_context *ac);
+#endif
+
+#define in_range(b, first, len) ((b) >= (first) && (b) <= (first) + (len) - 1)
+
+static struct proc_dir_entry *proc_root_ext4;
+struct buffer_head *read_block_bitmap(struct super_block *, ext4_group_t);
+ext4_fsblk_t ext4_new_blocks_old(handle_t *handle, struct inode *inode,
+ ext4_fsblk_t goal, unsigned long *count, int *errp);
+
+static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
+ ext4_group_t group);
+static void ext4_mb_poll_new_transaction(struct super_block *, handle_t *);
+static void ext4_mb_free_committed_blocks(struct super_block *);
+static void ext4_mb_return_to_preallocation(struct inode *inode,
+ struct ext4_buddy *e4b, sector_t block,
+ int count);
+static void ext4_mb_put_pa(struct ext4_allocation_context *,
+ struct super_block *, struct ext4_prealloc_space *pa);
+static int ext4_mb_init_per_dev_proc(struct super_block *sb);
+static int ext4_mb_destroy_per_dev_proc(struct super_block *sb);
+
+
+static inline void ext4_lock_group(struct super_block *sb, ext4_group_t group)
+{
+ struct ext4_group_info *grinfo = ext4_get_group_info(sb, group);
+
+ bit_spin_lock(EXT4_GROUP_INFO_LOCKED_BIT, &(grinfo->bb_state));
+}
+
+static inline void ext4_unlock_group(struct super_block *sb,
+ ext4_group_t group)
+{
+ struct ext4_group_info *grinfo = ext4_get_group_info(sb, group);
+
+ bit_spin_unlock(EXT4_GROUP_INFO_LOCKED_BIT, &(grinfo->bb_state));
+}
+
+static inline int ext4_is_group_locked(struct super_block *sb,
+ ext4_group_t group)
+{
+ struct ext4_group_info *grinfo = ext4_get_group_info(sb, group);
+
+ return bit_spin_is_locked(EXT4_GROUP_INFO_LOCKED_BIT,
+ &(grinfo->bb_state));
+}
+
+static ext4_fsblk_t ext4_grp_offs_to_block(struct super_block *sb,
+ struct ext4_free_extent *fex)
+{
+ ext4_fsblk_t block;
+
+ block = (ext4_fsblk_t) fex->fe_group * EXT4_BLOCKS_PER_GROUP(sb)
+ + fex->fe_start
+ + le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
+ return block;
+}
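The conversion done by ext4_grp_offs_to_block() can be tried out in user space. The sketch below is a minimal stand-in, assuming 32-bit group and offset values and a 64-bit block number; `grp_offs_to_block()` and its parameters are hypothetical names. It shows why the group number is widened before the multiplication: on large filesystems the product no longer fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Absolute block number of offset `start` within `group`.
 * The cast widens `group` first so the multiplication is done in 64 bits.
 */
static uint64_t grp_offs_to_block(uint32_t group, uint32_t start,
				  uint32_t blocks_per_group,
				  uint32_t first_data_block)
{
	return (uint64_t)group * blocks_per_group + start + first_data_block;
}
```

For example, group 200000 with 32768 blocks per group starts at block 6553600000, which is above 2^32 and would wrap if the multiplication were done in 32-bit arithmetic.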
+
+#if BITS_PER_LONG == 64
+#define mb_correct_addr_and_bit(bit, addr) \
+{ \
+ bit += ((unsigned long) addr & 7UL) << 3; \
+ addr = (void *) ((unsigned long) addr & ~7UL); \
+}
+#elif BITS_PER_LONG == 32
+#define mb_correct_addr_and_bit(bit, addr) \
+{ \
+ bit += ((unsigned long) addr & 3UL) << 3; \
+ addr = (void *) ((unsigned long) addr & ~3UL); \
+}
+#else
+#error "how many bits you are?!"
+#endif
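The mb_correct_addr_and_bit() macros above round the address down to a long-word boundary and fold the dropped bytes into the bit index, so that the word-based bit operations always see an aligned pointer. A user-space sketch of the same idea follows; `correct_addr_and_bit()` is a hypothetical function, it uses sizeof(unsigned long) instead of the BITS_PER_LONG branches, and it assumes bits are numbered byte by byte from the lowest address.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Align *addr down to a multiple of sizeof(unsigned long) and increase
 * *bit by eight for every byte the address moved back.
 */
static void correct_addr_and_bit(unsigned int *bit, const unsigned char **addr)
{
	uintptr_t a = (uintptr_t)*addr;
	uintptr_t mask = sizeof(unsigned long) - 1;

	*bit += (unsigned int)(a & mask) << 3;
	*addr = (const unsigned char *)(a & ~mask);
}
```

After the correction the (address, bit) pair still names the same byte and the same bit within that byte; only the base pointer has moved to an aligned position.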
+
+static inline int mb_test_bit(int bit, void *addr)
+{
+ mb_correct_addr_and_bit(bit, addr);
+ return ext4_test_bit(bit, addr);
+}
+
+static inline void mb_set_bit(int bit, void *addr)
+{
+ mb_correct_addr_and_bit(bit, addr);
+ ext4_set_bit(bit, addr);
+}
+
+static inline void mb_set_bit_atomic(spinlock_t *lock, int bit, void *addr)
+{
+ mb_correct_addr_and_bit(bit, addr);
+ ext4_set_bit_atomic(lock, bit, addr);
+}
+
+static inline void mb_clear_bit(int bit, void *addr)
+{
+ mb_correct_addr_and_bit(bit, addr);
+ ext4_clear_bit(bit, addr);
+}
+
+static inline void mb_clear_bit_atomic(spinlock_t *lock, int bit, void *addr)
+{
+ mb_correct_addr_and_bit(bit, addr);
+ ext4_clear_bit_atomic(lock, bit, addr);
+}
+
+static inline void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
+{
+ char *bb;
+
+ /* FIXME!! is this needed */
+ BUG_ON(EXT4_MB_BITMAP(e4b) == EXT4_MB_BUDDY(e4b));
+ BUG_ON(max == NULL);
+
+ if (order > e4b->bd_blkbits + 1) {
+ *max = 0;
+ return NULL;
+ }
+
+ /* at order 0 we see each particular block */
+ *max = 1 << (e4b->bd_blkbits + 3);
+ if (order == 0)
+ return EXT4_MB_BITMAP(e4b);
+
+ bb = EXT4_MB_BUDDY(e4b) + EXT4_SB(e4b->bd_sb)->s_mb_offsets[order];
+ *max = EXT4_SB(e4b->bd_sb)->s_mb_maxs[order];
+
+ return bb;
+}
+
+#ifdef DOUBLE_CHECK
+static void mb_free_blocks_double(struct inode *inode, struct ext4_buddy *e4b,
+ int first, int count)
+{
+ int i;
+ struct super_block *sb = e4b->bd_sb;
+
+ if (unlikely(e4b->bd_info->bb_bitmap == NULL))
+ return;
+ BUG_ON(!ext4_is_group_locked(sb, e4b->bd_group));
+ for (i = 0; i < count; i++) {
+ if (!mb_test_bit(first + i, e4b->bd_info->bb_bitmap)) {
+ ext4_fsblk_t blocknr;
+ blocknr = (ext4_fsblk_t) e4b->bd_group * EXT4_BLOCKS_PER_GROUP(sb);
+ blocknr += first + i;
+ blocknr +=
+ le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
+
+ ext4_error(sb, __FUNCTION__, "double-free of inode"
+ " %lu's block %llu(bit %u in group %lu)\n",
+ inode ? inode->i_ino : 0, blocknr,
+ first + i, e4b->bd_group);
+ }
+ mb_clear_bit(first + i, e4b->bd_info->bb_bitmap);
+ }
+}
+
+static void mb_mark_used_double(struct ext4_buddy *e4b, int first, int count)
+{
+ int i;
+
+ if (unlikely(e4b->bd_info->bb_bitmap == NULL))
+ return;
+ BUG_ON(!ext4_is_group_locked(e4b->bd_sb, e4b->bd_group));
+ for (i = 0; i < count; i++) {
+ BUG_ON(mb_test_bit(first + i, e4b->bd_info->bb_bitmap));
+ mb_set_bit(first + i, e4b->bd_info->bb_bitmap);
+ }
+}
+
+static void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap)
+{
+ if (memcmp(e4b->bd_info->bb_bitmap, bitmap, e4b->bd_sb->s_blocksize)) {
+ unsigned char *b1, *b2;
+ int i;
+ b1 = (unsigned char *) e4b->bd_info->bb_bitmap;
+ b2 = (unsigned char *) bitmap;
+ for (i = 0; i < e4b->bd_sb->s_blocksize; i++) {
+ if (b1[i] != b2[i]) {
+ printk("corruption in group %lu at byte %u(%u):"
+ " %x in copy != %x on disk/prealloc\n",
+ e4b->bd_group, i, i * 8, b1[i], b2[i]);
+ BUG();
+ }
+ }
+ }
+}
+
+#else
+#define mb_free_blocks_double(a, b, c, d)
+#define mb_mark_used_double(a, b, c)
+#define mb_cmp_bitmaps(a, b)
+#endif
+
+#ifdef AGGRESSIVE_CHECK
+
+#define MB_CHECK_ASSERT(assert) \
+do { \
+ if (!(assert)) { \
+ printk(KERN_EMERG \
+ "Assertion failure in %s() at %s:%d: \"%s\"\n", \
+ function, file, line, # assert); \
+ BUG(); \
+ } \
+} while (0)
+
+static int __mb_check_buddy(struct ext4_buddy *e4b, char *file,
+ const char *function, int line)
+{
+ struct super_block *sb = e4b->bd_sb;
+ int order = e4b->bd_blkbits + 1;
+ int max;
+ int max2;
+ int i;
+ int j;
+ int k;
+ int count;
+ struct ext4_group_info *grp;
+ int fragments = 0;
+ int fstart;
+ struct list_head *cur;
+ void *buddy;
+ void *buddy2;
+
+ if (!test_opt(sb, MBALLOC))
+ return 0;
+
+ {
+ static int mb_check_counter;
+ if (mb_check_counter++ % 100 != 0)
+ return 0;
+ }
+
+ while (order > 1) {
+ buddy = mb_find_buddy(e4b, order, &max);
+ MB_CHECK_ASSERT(buddy);
+ buddy2 = mb_find_buddy(e4b, order - 1, &max2);
+ MB_CHECK_ASSERT(buddy2);
+ MB_CHECK_ASSERT(buddy != buddy2);
+ MB_CHECK_ASSERT(max * 2 == max2);
+
+ count = 0;
+ for (i = 0; i < max; i++) {
+
+ if (mb_test_bit(i, buddy)) {
+ /* at most one bit in buddy2 may be 0 */
+ if (!mb_test_bit(i << 1, buddy2)) {
+ MB_CHECK_ASSERT(
+ mb_test_bit((i<<1)+1, buddy2));
+ } else if (!mb_test_bit((i << 1) + 1, buddy2)) {
+ MB_CHECK_ASSERT(
+ mb_test_bit(i << 1, buddy2));
+ }
+ continue;
+ }
+
+ /* both bits in buddy2 must be 1 */
+ MB_CHECK_ASSERT(mb_test_bit(i << 1, buddy2));
+ MB_CHECK_ASSERT(mb_test_bit((i << 1) + 1, buddy2));
+
+ for (j = 0; j < (1 << order); j++) {
+ k = (i * (1 << order)) + j;
+ MB_CHECK_ASSERT(
+ !mb_test_bit(k, EXT4_MB_BITMAP(e4b)));
+ }
+ count++;
+ }
+ MB_CHECK_ASSERT(e4b->bd_info->bb_counters[order] == count);
+ order--;
+ }
+
+ fstart = -1;
+ buddy = mb_find_buddy(e4b, 0, &max);
+ for (i = 0; i < max; i++) {
+ if (!mb_test_bit(i, buddy)) {
+ MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free);
+ if (fstart == -1) {
+ fragments++;
+ fstart = i;
+ }
+ continue;
+ }
+ fstart = -1;
+ /* check used bits only */
+ for (j = 0; j < e4b->bd_blkbits + 1; j++) {
+ buddy2 = mb_find_buddy(e4b, j, &max2);
+ k = i >> j;
+ MB_CHECK_ASSERT(k < max2);
+ MB_CHECK_ASSERT(mb_test_bit(k, buddy2));
+ }
+ }
+ MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info));
+ MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments);
+
+ grp = ext4_get_group_info(sb, e4b->bd_group);
+ buddy = mb_find_buddy(e4b, 0, &max);
+ list_for_each(cur, &grp->bb_prealloc_list) {
+ ext4_group_t groupnr;
+ struct ext4_prealloc_space *pa;
+ pa = list_entry(cur, struct ext4_prealloc_space, group_list);
+ ext4_get_group_no_and_offset(sb, pa->pstart, &groupnr, &k);
+ MB_CHECK_ASSERT(groupnr == e4b->bd_group);
+ for (i = 0; i < pa->len; i++)
+ MB_CHECK_ASSERT(mb_test_bit(k + i, buddy));
+ }
+ return 0;
+}
+#undef MB_CHECK_ASSERT
+#define mb_check_buddy(e4b) __mb_check_buddy(e4b, \
+ __FILE__, __FUNCTION__, __LINE__)
+#else
+#define mb_check_buddy(e4b)
+#endif
+
+/* find most significant bit */
+static int fmsb(unsigned short word)
+{
+ int order;
+
+ if (word > 255) {
+ order = 7;
+ word >>= 8;
+ } else {
+ order = -1;
+ }
+
+ do {
+ order++;
+ word >>= 1;
+ } while (word != 0);
+
+ return order;
+}
+
+/* FIXME!! need more doc */
+static void ext4_mb_mark_free_simple(struct super_block *sb,
+ void *buddy, unsigned first, int len,
+ struct ext4_group_info *grp)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ unsigned short min;
+ unsigned short max;
+ unsigned short chunk;
+ unsigned short border;
+
+ BUG_ON(len >= EXT4_BLOCKS_PER_GROUP(sb));
+
+ border = 2 << sb->s_blocksize_bits;
+
+ while (len > 0) {
+ /* find how many blocks can be covered starting at this position */
+ max = ffs(first | border) - 1;
+
+ /* find how many blocks of power 2 we need to mark */
+ min = fmsb(len);
+
+ if (max < min)
+ min = max;
+ chunk = 1 << min;
+
+ /* mark multiblock chunks only */
+ grp->bb_counters[min]++;
+ if (min > 0)
+ mb_clear_bit(first >> min,
+ buddy + sbi->s_mb_offsets[min]);
+
+ len -= chunk;
+ first += chunk;
+ }
+}
+
+static void ext4_mb_generate_buddy(struct super_block *sb,
+ void *buddy, void *bitmap, ext4_group_t group)
+{
+ struct ext4_group_info *grp = ext4_get_group_info(sb, group);
+ unsigned short max = EXT4_BLOCKS_PER_GROUP(sb);
+ unsigned short i = 0;
+ unsigned short first;
+ unsigned short len;
+ unsigned free = 0;
+ unsigned fragments = 0;
+ unsigned long long period = get_cycles();
+
+ /* initialize buddy from bitmap, which is the aggregation
+ * of the on-disk bitmap and preallocations */
+ i = ext4_find_next_zero_bit(bitmap, max, 0);
+ grp->bb_first_free = i;
+ while (i < max) {
+ fragments++;
+ first = i;
+ i = ext4_find_next_bit(bitmap, max, i);
+ len = i - first;
+ free += len;
+ if (len > 1)
+ ext4_mb_mark_free_simple(sb, buddy, first, len, grp);
+ else
+ grp->bb_counters[0]++;
+ if (i < max)
+ i = ext4_find_next_zero_bit(bitmap, max, i);
+ }
+ grp->bb_fragments = fragments;
+
+ if (free != grp->bb_free) {
+ printk(KERN_DEBUG
+ "EXT4-fs: group %lu: %u blocks in bitmap, %u in gd\n",
+ group, free, grp->bb_free);
+ grp->bb_free = free;
+ }
+
+ clear_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state));
+
+ period = get_cycles() - period;
+ spin_lock(&EXT4_SB(sb)->s_bal_lock);
+ EXT4_SB(sb)->s_mb_buddies_generated++;
+ EXT4_SB(sb)->s_mb_generation_time += period;
+ spin_unlock(&EXT4_SB(sb)->s_bal_lock);
+}
+
+/* The buddy information is attached to the buddy cache inode
+ * for convenience. The information regarding each group
+ * is loaded via ext4_mb_load_buddy. The information involves
+ * the block bitmap and buddy information. The information is
+ * stored in the inode as
+ *
+ * { page }
+ * [ group 0 buddy][ group 0 bitmap] [group 1][ group 1]...
+ *
+ *
+ * one block each for bitmap and buddy information.
+ * So for each group we take up 2 blocks. A page can
+ * contain blocks_per_page (PAGE_CACHE_SIZE / blocksize) blocks.
+ * So it can have information regarding groups_per_page which
+ * is blocks_per_page/2
+ */
+
+static int ext4_mb_init_cache(struct page *page, char *incore)
+{
+ int blocksize;
+ int blocks_per_page;
+ int groups_per_page;
+ int err = 0;
+ int i;
+ ext4_group_t first_group;
+ int first_block;
+ struct super_block *sb;
+ struct buffer_head *bhs;
+ struct buffer_head **bh;
+ struct inode *inode;
+ char *data;
+ char *bitmap;
+
+ mb_debug("init page %lu\n", page->index);
+
+ inode = page->mapping->host;
+ sb = inode->i_sb;
+ blocksize = 1 << inode->i_blkbits;
+ blocks_per_page = PAGE_CACHE_SIZE / blocksize;
+
+ groups_per_page = blocks_per_page >> 1;
+ if (groups_per_page == 0)
+ groups_per_page = 1;
+
+ /* allocate buffer_heads to read bitmaps */
+ if (groups_per_page > 1) {
+ err = -ENOMEM;
+ i = sizeof(struct buffer_head *) * groups_per_page;
+ bh = kmalloc(i, GFP_NOFS);
+ if (bh == NULL)
+ goto out;
+ memset(bh, 0, i);
+ } else
+ bh = &bhs;
+
+ first_group = page->index * blocks_per_page / 2;
+
+ /* read all groups the page covers into the cache */
+ for (i = 0; i < groups_per_page; i++) {
+ struct ext4_group_desc *desc;
+
+ if (first_group + i >= EXT4_SB(sb)->s_groups_count)
+ break;
+
+ err = -EIO;
+ desc = ext4_get_group_desc(sb, first_group + i, NULL);
+ if (desc == NULL)
+ goto out;
+
+ err = -ENOMEM;
+ bh[i] = sb_getblk(sb, ext4_block_bitmap(sb, desc));
+ if (bh[i] == NULL)
+ goto out;
+
+ if (buffer_uptodate(bh[i]))
+ continue;
+
+ lock_buffer(bh[i]);
+ if (buffer_uptodate(bh[i])) {
+ unlock_buffer(bh[i]);
+ continue;
+ }
+
+ if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
+ ext4_init_block_bitmap(sb, bh[i],
+ first_group + i, desc);
+ set_buffer_uptodate(bh[i]);
+ unlock_buffer(bh[i]);
+ continue;
+ }
+ get_bh(bh[i]);
+ bh[i]->b_end_io = end_buffer_read_sync;
+ submit_bh(READ, bh[i]);
+ mb_debug("read bitmap for group %lu\n", first_group + i);
+ }
+
+ /* wait for I/O completion */
+ for (i = 0; i < groups_per_page && bh[i]; i++)
+ wait_on_buffer(bh[i]);
+
+ err = -EIO;
+ for (i = 0; i < groups_per_page && bh[i]; i++)
+ if (!buffer_uptodate(bh[i]))
+ goto out;
+
+ first_block = page->index * blocks_per_page;
+ for (i = 0; i < blocks_per_page; i++) {
+ int group;
+ struct ext4_group_info *grinfo;
+
+ group = (first_block + i) >> 1;
+ if (group >= EXT4_SB(sb)->s_groups_count)
+ break;
+
+ /*
+ * data carry information regarding this
+ * particular group in the format specified
+ * above
+ *
+ */
+ data = page_address(page) + (i * blocksize);
+ bitmap = bh[group - first_group]->b_data;
+
+ /*
+ * We place the buddy block and bitmap block
+ * close together
+ */
+ if ((first_block + i) & 1) {
+ /* this is block of buddy */
+ BUG_ON(incore == NULL);
+ mb_debug("put buddy for group %u in page %lu/%x\n",
+ group, page->index, i * blocksize);
+ memset(data, 0xff, blocksize);
+ grinfo = ext4_get_group_info(sb, group);
+ grinfo->bb_fragments = 0;
+ memset(grinfo->bb_counters, 0,
+ sizeof(unsigned short)*(sb->s_blocksize_bits+2));
+ /*
+ * incore got set to the group block bitmap below
+ */
+ ext4_mb_generate_buddy(sb, data, incore, group);
+ incore = NULL;
+ } else {
+ /* this is block of bitmap */
+ BUG_ON(incore != NULL);
+ mb_debug("put bitmap for group %u in page %lu/%x\n",
+ group, page->index, i * blocksize);
+
+ /* see comments in ext4_mb_put_pa() */
+ ext4_lock_group(sb, group);
+ memcpy(data, bitmap, blocksize);
+
+ /* mark all preallocated blks used in in-core bitmap */
+ ext4_mb_generate_from_pa(sb, data, group);
+ ext4_unlock_group(sb, group);
+
+ /* set incore so that the buddy information can be
+ * generated using this
+ */
+ incore = data;
+ }
+ }
+ SetPageUptodate(page);
+
+out:
+ if (bh) {
+ for (i = 0; i < groups_per_page && bh[i]; i++)
+ brelse(bh[i]);
+ if (bh != &bhs)
+ kfree(bh);
+ }
+ return err;
+}
+
+static int ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
+ struct ext4_buddy *e4b)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct inode *inode = sbi->s_buddy_cache;
+ int blocks_per_page;
+ int block;
+ int pnum;
+ int poff;
+ struct page *page;
+
+ mb_debug("load group %lu\n", group);
+
+ blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
+
+ e4b->bd_blkbits = sb->s_blocksize_bits;
+ e4b->bd_info = ext4_get_group_info(sb, group);
+ e4b->bd_sb = sb;
+ e4b->bd_group = group;
+ e4b->bd_buddy_page = NULL;
+ e4b->bd_bitmap_page = NULL;
+
+ /*
+ * the buddy cache inode stores the block bitmap
+ * and buddy information in consecutive blocks.
+ * So for each group we need two blocks.
+ */
+ block = group * 2;
+ pnum = block / blocks_per_page;
+ poff = block % blocks_per_page;
+
+ /* we could use find_or_create_page(), but it locks the page,
+ * which we'd like to avoid in the fast path ... */
+ page = find_get_page(inode->i_mapping, pnum);
+ if (page == NULL || !PageUptodate(page)) {
+ if (page)
+ page_cache_release(page);
+ page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+ if (page) {
+ BUG_ON(page->mapping != inode->i_mapping);
+ if (!PageUptodate(page)) {
+ ext4_mb_init_cache(page, NULL);
+ mb_cmp_bitmaps(e4b, page_address(page) +
+ (poff * sb->s_blocksize));
+ }
+ unlock_page(page);
+ }
+ }
+ if (page == NULL || !PageUptodate(page))
+ goto err;
+ e4b->bd_bitmap_page = page;
+ e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize);
+ mark_page_accessed(page);
+
+ block++;
+ pnum = block / blocks_per_page;
+ poff = block % blocks_per_page;
+
+ page = find_get_page(inode->i_mapping, pnum);
+ if (page == NULL || !PageUptodate(page)) {
+ if (page)
+ page_cache_release(page);
+ page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+ if (page) {
+ BUG_ON(page->mapping != inode->i_mapping);
+ if (!PageUptodate(page))
+ ext4_mb_init_cache(page, e4b->bd_bitmap);
+
+ unlock_page(page);
+ }
+ }
+ if (page == NULL || !PageUptodate(page))
+ goto err;
+ e4b->bd_buddy_page = page;
+ e4b->bd_buddy = page_address(page) + (poff * sb->s_blocksize);
+ mark_page_accessed(page);
+
+ BUG_ON(e4b->bd_bitmap_page == NULL);
+ BUG_ON(e4b->bd_buddy_page == NULL);
+
+ return 0;
+
+err:
+ if (e4b->bd_bitmap_page)
+ page_cache_release(e4b->bd_bitmap_page);
+ if (e4b->bd_buddy_page)
+ page_cache_release(e4b->bd_buddy_page);
+ e4b->bd_buddy = NULL;
+ e4b->bd_bitmap = NULL;
+ return -EIO;
+}
+
+static void ext4_mb_release_desc(struct ext4_buddy *e4b)
+{
+ if (e4b->bd_bitmap_page)
+ page_cache_release(e4b->bd_bitmap_page);
+ if (e4b->bd_buddy_page)
+ page_cache_release(e4b->bd_buddy_page);
+}
+
+
+static int mb_find_order_for_block(struct ext4_buddy *e4b, int block)
+{
+ int order = 1;
+ void *bb;
+
+ BUG_ON(EXT4_MB_BITMAP(e4b) == EXT4_MB_BUDDY(e4b));
+ BUG_ON(block >= (1 << (e4b->bd_blkbits + 3)));
+
+ bb = EXT4_MB_BUDDY(e4b);
+ while (order <= e4b->bd_blkbits + 1) {
+ block = block >> 1;
+ if (!mb_test_bit(block, bb)) {
+ /* this block is part of buddy of order 'order' */
+ return order;
+ }
+ bb += 1 << (e4b->bd_blkbits - order);
+ order++;
+ }
+ return 0;
+}
+
+static void mb_clear_bits(spinlock_t *lock, void *bm, int cur, int len)
+{
+ __u32 *addr;
+
+ len = cur + len;
+ while (cur < len) {
+ if ((cur & 31) == 0 && (len - cur) >= 32) {
+ /* fast path: clear whole word at once */
+ addr = bm + (cur >> 3);
+ *addr = 0;
+ cur += 32;
+ continue;
+ }
+ mb_clear_bit_atomic(lock, cur, bm);
+ cur++;
+ }
+}
+
+static void mb_set_bits(spinlock_t *lock, void *bm, int cur, int len)
+{
+ __u32 *addr;
+
+ len = cur + len;
+ while (cur < len) {
+ if ((cur & 31) == 0 && (len - cur) >= 32) {
+ /* fast path: set whole word at once */
+ addr = bm + (cur >> 3);
+ *addr = 0xffffffff;
+ cur += 32;
+ continue;
+ }
+ mb_set_bit_atomic(lock, cur, bm);
+ cur++;
+ }
+}
+
+static int mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
+ int first, int count)
+{
+ int block = 0;
+ int max = 0;
+ int order;
+ void *buddy;
+ void *buddy2;
+ struct super_block *sb = e4b->bd_sb;
+
+ BUG_ON(first + count > (sb->s_blocksize << 3));
+ BUG_ON(!ext4_is_group_locked(sb, e4b->bd_group));
+ mb_check_buddy(e4b);
+ mb_free_blocks_double(inode, e4b, first, count);
+
+ e4b->bd_info->bb_free += count;
+ if (first < e4b->bd_info->bb_first_free)
+ e4b->bd_info->bb_first_free = first;
+
+ /* let's maintain fragments counter */
+ if (first != 0)
+ block = !mb_test_bit(first - 1, EXT4_MB_BITMAP(e4b));
+ if (first + count < EXT4_SB(sb)->s_mb_maxs[0])
+ max = !mb_test_bit(first + count, EXT4_MB_BITMAP(e4b));
+ if (block && max)
+ e4b->bd_info->bb_fragments--;
+ else if (!block && !max)
+ e4b->bd_info->bb_fragments++;
+
+ /* let's maintain buddy itself */
+ while (count-- > 0) {
+ block = first++;
+ order = 0;
+
+ if (!mb_test_bit(block, EXT4_MB_BITMAP(e4b))) {
+ ext4_fsblk_t blocknr;
+ blocknr = e4b->bd_group * EXT4_BLOCKS_PER_GROUP(sb);
+ blocknr += block;
+ blocknr +=
+ le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
+
+ ext4_error(sb, __FUNCTION__, "double-free of inode"
+ " %lu's block %llu(bit %u in group %lu)\n",
+ inode ? inode->i_ino : 0, blocknr, block,
+ e4b->bd_group);
+ }
+ mb_clear_bit(block, EXT4_MB_BITMAP(e4b));
+ e4b->bd_info->bb_counters[order]++;
+
+ /* start of the buddy */
+ buddy = mb_find_buddy(e4b, order, &max);
+
+ do {
+ block &= ~1UL;
+ if (mb_test_bit(block, buddy) ||
+ mb_test_bit(block + 1, buddy))
+ break;
+
+ /* both the buddies are free, try to coalesce them */
+ buddy2 = mb_find_buddy(e4b, order + 1, &max);
+
+ if (!buddy2)
+ break;
+
+ if (order > 0) {
+ /* for special purposes, we don't set
+ * free bits in bitmap */
+ mb_set_bit(block, buddy);
+ mb_set_bit(block + 1, buddy);
+ }
+ e4b->bd_info->bb_counters[order]--;
+ e4b->bd_info->bb_counters[order]--;
+
+ block = block >> 1;
+ order++;
+ e4b->bd_info->bb_counters[order]++;
+
+ mb_clear_bit(block, buddy2);
+ buddy = buddy2;
+ } while (1);
+ }
+ mb_check_buddy(e4b);
+
+ return 0;
+}
+
+static int mb_find_extent(struct ext4_buddy *e4b, int order, int block,
+ int needed, struct ext4_free_extent *ex)
+{
+ int next = block;
+ int max;
+ int ord;
+ void *buddy;
+
+ BUG_ON(!ext4_is_group_locked(e4b->bd_sb, e4b->bd_group));
+ BUG_ON(ex == NULL);
+
+ buddy = mb_find_buddy(e4b, order, &max);
+ BUG_ON(buddy == NULL);
+ BUG_ON(block >= max);
+ if (mb_test_bit(block, buddy)) {
+ ex->fe_len = 0;
+ ex->fe_start = 0;
+ ex->fe_group = 0;
+ return 0;
+ }
+
+ /* FIXME drop order completely ? */
+ if (likely(order == 0)) {
+ /* find actual order */
+ order = mb_find_order_for_block(e4b, block);
+ block = block >> order;
+ }
+
+ ex->fe_len = 1 << order;
+ ex->fe_start = block << order;
+ ex->fe_group = e4b->bd_group;
+
+ /* calc difference from given start */
+ next = next - ex->fe_start;
+ ex->fe_len -= next;
+ ex->fe_start += next;
+
+ while (needed > ex->fe_len &&
+ (buddy = mb_find_buddy(e4b, order, &max))) {
+
+ if (block + 1 >= max)
+ break;
+
+ next = (block + 1) * (1 << order);
+ if (mb_test_bit(next, EXT4_MB_BITMAP(e4b)))
+ break;
+
+ ord = mb_find_order_for_block(e4b, next);
+
+ order = ord;
+ block = next >> order;
+ ex->fe_len += 1 << order;
+ }
+
+ BUG_ON(ex->fe_start + ex->fe_len > (1 << (e4b->bd_blkbits + 3)));
+ return ex->fe_len;
+}
+
+static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
+{
+ int ord;
+ int mlen = 0;
+ int max = 0;
+ int cur;
+ int start = ex->fe_start;
+ int len = ex->fe_len;
+ unsigned ret = 0;
+ int len0 = len;
+ void *buddy;
+
+ BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3));
+ BUG_ON(e4b->bd_group != ex->fe_group);
+ BUG_ON(!ext4_is_group_locked(e4b->bd_sb, e4b->bd_group));
+ mb_check_buddy(e4b);
+ mb_mark_used_double(e4b, start, len);
+
+ e4b->bd_info->bb_free -= len;
+ if (e4b->bd_info->bb_first_free == start)
+ e4b->bd_info->bb_first_free += len;
+
+ /* let's maintain fragments counter */
+ if (start != 0)
+ mlen = !mb_test_bit(start - 1, EXT4_MB_BITMAP(e4b));
+ if (start + len < EXT4_SB(e4b->bd_sb)->s_mb_maxs[0])
+ max = !mb_test_bit(start + len, EXT4_MB_BITMAP(e4b));
+ if (mlen && max)
+ e4b->bd_info->bb_fragments++;
+ else if (!mlen && !max)
+ e4b->bd_info->bb_fragments--;
+
+ /* let's maintain buddy itself */
+ while (len) {
+ ord = mb_find_order_for_block(e4b, start);
+
+ if (((start >> ord) << ord) == start && len >= (1 << ord)) {
+ /* the whole chunk may be allocated at once! */
+ mlen = 1 << ord;
+ buddy = mb_find_buddy(e4b, ord, &max);
+ BUG_ON((start >> ord) >= max);
+ mb_set_bit(start >> ord, buddy);
+ e4b->bd_info->bb_counters[ord]--;
+ start += mlen;
+ len -= mlen;
+ BUG_ON(len < 0);
+ continue;
+ }
+
+ /* store for history */
+ if (ret == 0)
+ ret = len | (ord << 16);
+
+ /* we have to split large buddy */
+ BUG_ON(ord <= 0);
+ buddy = mb_find_buddy(e4b, ord, &max);
+ mb_set_bit(start >> ord, buddy);
+ e4b->bd_info->bb_counters[ord]--;
+
+ ord--;
+ cur = (start >> ord) & ~1U;
+ buddy = mb_find_buddy(e4b, ord, &max);
+ mb_clear_bit(cur, buddy);
+ mb_clear_bit(cur + 1, buddy);
+ e4b->bd_info->bb_counters[ord]++;
+ e4b->bd_info->bb_counters[ord]++;
+ }
+
+ mb_set_bits(sb_bgl_lock(EXT4_SB(e4b->bd_sb), ex->fe_group),
+ EXT4_MB_BITMAP(e4b), ex->fe_start, len0);
+ mb_check_buddy(e4b);
+
+ return ret;
+}
+
+/*
+ * Must be called under group lock!
+ */
+static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
+ struct ext4_buddy *e4b)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ int ret;
+
+ BUG_ON(ac->ac_b_ex.fe_group != e4b->bd_group);
+ BUG_ON(ac->ac_status == AC_STATUS_FOUND);
+
+ ac->ac_b_ex.fe_len = min(ac->ac_b_ex.fe_len, ac->ac_g_ex.fe_len);
+ ac->ac_b_ex.fe_logical = ac->ac_g_ex.fe_logical;
+ ret = mb_mark_used(e4b, &ac->ac_b_ex);
+
+ /* preallocation can change ac_b_ex, thus we store actually
+ * allocated blocks for history */
+ ac->ac_f_ex = ac->ac_b_ex;
+
+ ac->ac_status = AC_STATUS_FOUND;
+ ac->ac_tail = ret & 0xffff;
+ ac->ac_buddy = ret >> 16;
+
+ /* XXXXXXX: SUCH A HORRIBLE **CK */
+ /*FIXME!! Why ? */
+ ac->ac_bitmap_page = e4b->bd_bitmap_page;
+ get_page(ac->ac_bitmap_page);
+ ac->ac_buddy_page = e4b->bd_buddy_page;
+ get_page(ac->ac_buddy_page);
+
+ /* store last allocated for subsequent stream allocation */
+ if ((ac->ac_flags & EXT4_MB_HINT_DATA)) {
+ spin_lock(&sbi->s_md_lock);
+ sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
+ sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
+ spin_unlock(&sbi->s_md_lock);
+ }
+}
+
+/*
+ * regular allocator, for general-purpose allocation
+ */
+
+static void ext4_mb_check_limits(struct ext4_allocation_context *ac,
+ struct ext4_buddy *e4b,
+ int finish_group)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ struct ext4_free_extent *bex = &ac->ac_b_ex;
+ struct ext4_free_extent *gex = &ac->ac_g_ex;
+ struct ext4_free_extent ex;
+ int max;
+
+ /*
+ * We don't want to scan for a whole year
+ */
+ if (ac->ac_found > sbi->s_mb_max_to_scan &&
+ !(ac->ac_flags & EXT4_MB_HINT_FIRST)) {
+ ac->ac_status = AC_STATUS_BREAK;
+ return;
+ }
+
+ /*
+ * Haven't found good chunk so far, let's continue
+ */
+ if (bex->fe_len < gex->fe_len)
+ return;
+
+ if ((finish_group || ac->ac_found > sbi->s_mb_min_to_scan)
+ && bex->fe_group == e4b->bd_group) {
+ /* recheck chunk's availability - we don't know
+ * when it was found (within this lock-unlock
+ * period or not) */
+ max = mb_find_extent(e4b, 0, bex->fe_start, gex->fe_len, &ex);
+ if (max >= gex->fe_len) {
+ ext4_mb_use_best_found(ac, e4b);
+ return;
+ }
+ }
+}
+
+/*
+ * The routine checks whether found extent is good enough. If it is,
+ * then the extent gets marked used and flag is set to the context
+ * to stop scanning. Otherwise, the extent is compared with the
+ * previous found extent and if new one is better, then it's stored
+ * in the context. Later, the best found extent will be used, if
+ * mballoc can't find good enough extent.
+ *
+ * FIXME: real allocation policy is to be designed yet!
+ */
+static void ext4_mb_measure_extent(struct ext4_allocation_context *ac,
+ struct ext4_free_extent *ex,
+ struct ext4_buddy *e4b)
+{
+ struct ext4_free_extent *bex = &ac->ac_b_ex;
+ struct ext4_free_extent *gex = &ac->ac_g_ex;
+
+ BUG_ON(ex->fe_len <= 0);
+ BUG_ON(ex->fe_len >= EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
+ BUG_ON(ex->fe_start >= EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
+ BUG_ON(ac->ac_status != AC_STATUS_CONTINUE);
+
+ ac->ac_found++;
+
+ /*
+ * The special case - take what you catch first
+ */
+ if (unlikely(ac->ac_flags & EXT4_MB_HINT_FIRST)) {
+ *bex = *ex;
+ ext4_mb_use_best_found(ac, e4b);
+ return;
+ }
+
+ /*
+ * Let's check whether the chunk is good enough
+ */
+ if (ex->fe_len == gex->fe_len) {
+ *bex = *ex;
+ ext4_mb_use_best_found(ac, e4b);
+ return;
+ }
+
+ /*
+ * If this is first found extent, just store it in the context
+ */
+ if (bex->fe_len == 0) {
+ *bex = *ex;
+ return;
+ }
+
+ /*
+ * If new found extent is better, store it in the context
+ */
+ if (bex->fe_len < gex->fe_len) {
+ /* if the request isn't satisfied, any found extent
+ * larger than previous best one is better */
+ if (ex->fe_len > bex->fe_len)
+ *bex = *ex;
+ } else if (ex->fe_len > gex->fe_len) {
+ /* if the request is satisfied, then we try to find
+ * an extent that still satisfies the request, but is
+ * smaller than the previous one */
+ if (ex->fe_len < bex->fe_len)
+ *bex = *ex;
+ }
+
+ ext4_mb_check_limits(ac, e4b, 0);
+}
+
+static int ext4_mb_try_best_found(struct ext4_allocation_context *ac,
+ struct ext4_buddy *e4b)
+{
+ struct ext4_free_extent ex = ac->ac_b_ex;
+ ext4_group_t group = ex.fe_group;
+ int max;
+ int err;
+
+ BUG_ON(ex.fe_len <= 0);
+ err = ext4_mb_load_buddy(ac->ac_sb, group, e4b);
+ if (err)
+ return err;
+
+ ext4_lock_group(ac->ac_sb, group);
+ max = mb_find_extent(e4b, 0, ex.fe_start, ex.fe_len, &ex);
+
+ if (max > 0) {
+ ac->ac_b_ex = ex;
+ ext4_mb_use_best_found(ac, e4b);
+ }
+
+ ext4_unlock_group(ac->ac_sb, group);
+ ext4_mb_release_desc(e4b);
+
+ return 0;
+}
+
+static int ext4_mb_find_by_goal(struct ext4_allocation_context *ac,
+ struct ext4_buddy *e4b)
+{
+ ext4_group_t group = ac->ac_g_ex.fe_group;
+ int max;
+ int err;
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ struct ext4_super_block *es = sbi->s_es;
+ struct ext4_free_extent ex;
+
+ if (!(ac->ac_flags & EXT4_MB_HINT_TRY_GOAL))
+ return 0;
+
+ err = ext4_mb_load_buddy(ac->ac_sb, group, e4b);
+ if (err)
+ return err;
+
+ ext4_lock_group(ac->ac_sb, group);
+ max = mb_find_extent(e4b, 0, ac->ac_g_ex.fe_start,
+ ac->ac_g_ex.fe_len, &ex);
+
+ if (max >= ac->ac_g_ex.fe_len && ac->ac_g_ex.fe_len == sbi->s_stripe) {
+ ext4_fsblk_t start;
+
+ start = (e4b->bd_group * EXT4_BLOCKS_PER_GROUP(ac->ac_sb)) +
+ ex.fe_start + le32_to_cpu(es->s_first_data_block);
+ /* use do_div to get remainder (would be 64-bit modulo) */
+ if (do_div(start, sbi->s_stripe) == 0) {
+ ac->ac_found++;
+ ac->ac_b_ex = ex;
+ ext4_mb_use_best_found(ac, e4b);
+ }
+ } else if (max >= ac->ac_g_ex.fe_len) {
+ BUG_ON(ex.fe_len <= 0);
+ BUG_ON(ex.fe_group != ac->ac_g_ex.fe_group);
+ BUG_ON(ex.fe_start != ac->ac_g_ex.fe_start);
+ ac->ac_found++;
+ ac->ac_b_ex = ex;
+ ext4_mb_use_best_found(ac, e4b);
+ } else if (max > 0 && (ac->ac_flags & EXT4_MB_HINT_MERGE)) {
+ /* Sometimes, caller may want to merge even small
+ * number of blocks to an existing extent */
+ BUG_ON(ex.fe_len <= 0);
+ BUG_ON(ex.fe_group != ac->ac_g_ex.fe_group);
+ BUG_ON(ex.fe_start != ac->ac_g_ex.fe_start);
+ ac->ac_found++;
+ ac->ac_b_ex = ex;
+ ext4_mb_use_best_found(ac, e4b);
+ }
+ ext4_unlock_group(ac->ac_sb, group);
+ ext4_mb_release_desc(e4b);
+
+ return 0;
+}
+
+/*
+ * The routine scans buddy structures (not bitmap!) from given order
+ * to max order and tries to find big enough chunk to satisfy the req
+ */
+static void ext4_mb_simple_scan_group(struct ext4_allocation_context *ac,
+ struct ext4_buddy *e4b)
+{
+ struct super_block *sb = ac->ac_sb;
+ struct ext4_group_info *grp = e4b->bd_info;
+ void *buddy;
+ int i;
+ int k;
+ int max;
+
+ BUG_ON(ac->ac_2order <= 0);
+ for (i = ac->ac_2order; i <= sb->s_blocksize_bits + 1; i++) {
+ if (grp->bb_counters[i] == 0)
+ continue;
+
+ buddy = mb_find_buddy(e4b, i, &max);
+ BUG_ON(buddy == NULL);
+
+ k = ext4_find_next_zero_bit(buddy, max, 0);
+ BUG_ON(k >= max);
+
+ ac->ac_found++;
+
+ ac->ac_b_ex.fe_len = 1 << i;
+ ac->ac_b_ex.fe_start = k << i;
+ ac->ac_b_ex.fe_group = e4b->bd_group;
+
+ ext4_mb_use_best_found(ac, e4b);
+
+ BUG_ON(ac->ac_b_ex.fe_len != ac->ac_g_ex.fe_len);
+
+ if (EXT4_SB(sb)->s_mb_stats)
+ atomic_inc(&EXT4_SB(sb)->s_bal_2orders);
+
+ break;
+ }
+}
+
+/*
+ * The routine scans the group and measures all found extents.
+ * In order to optimize scanning, caller must pass number of
+ * free blocks in the group, so the routine can know upper limit.
+ */
+static void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,
+ struct ext4_buddy *e4b)
+{
+ struct super_block *sb = ac->ac_sb;
+ void *bitmap = EXT4_MB_BITMAP(e4b);
+ struct ext4_free_extent ex;
+ int i;
+ int free;
+
+ free = e4b->bd_info->bb_free;
+ BUG_ON(free <= 0);
+
+ i = e4b->bd_info->bb_first_free;
+
+ while (free && ac->ac_status == AC_STATUS_CONTINUE) {
+ i = ext4_find_next_zero_bit(bitmap,
+ EXT4_BLOCKS_PER_GROUP(sb), i);
+ if (i >= EXT4_BLOCKS_PER_GROUP(sb)) {
+ BUG_ON(free != 0);
+ break;
+ }
+
+ mb_find_extent(e4b, 0, i, ac->ac_g_ex.fe_len, &ex);
+ BUG_ON(ex.fe_len <= 0);
+ BUG_ON(free < ex.fe_len);
+
+ ext4_mb_measure_extent(ac, &ex, e4b);
+
+ i += ex.fe_len;
+ free -= ex.fe_len;
+ }
+
+ ext4_mb_check_limits(ac, e4b, 1);
+}
+
+/*
+ * This is a special case for storage such as RAID5:
+ * we try to find stripe-aligned chunks for stripe-size requests.
+ * XXX should do so at least for multiples of stripe size as well
+ */
+static void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
+ struct ext4_buddy *e4b)
+{
+ struct super_block *sb = ac->ac_sb;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ void *bitmap = EXT4_MB_BITMAP(e4b);
+ struct ext4_free_extent ex;
+ ext4_fsblk_t first_group_block;
+ ext4_fsblk_t a;
+ ext4_grpblk_t i;
+ int max;
+
+ BUG_ON(sbi->s_stripe == 0);
+
+ /* find first stripe-aligned block in group */
+ first_group_block = e4b->bd_group * EXT4_BLOCKS_PER_GROUP(sb)
+ + le32_to_cpu(sbi->s_es->s_first_data_block);
+ a = first_group_block + sbi->s_stripe - 1;
+ do_div(a, sbi->s_stripe);
+ i = (a * sbi->s_stripe) - first_group_block;
+
+ while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
+ if (!mb_test_bit(i, bitmap)) {
+ max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
+ if (max >= sbi->s_stripe) {
+ ac->ac_found++;
+ ac->ac_b_ex = ex;
+ ext4_mb_use_best_found(ac, e4b);
+ break;
+ }
+ }
+ i += sbi->s_stripe;
+ }
+}
+
+static int ext4_mb_good_group(struct ext4_allocation_context *ac,
+ ext4_group_t group, int cr)
+{
+ unsigned free, fragments;
+ unsigned i, bits;
+ struct ext4_group_desc *desc;
+ struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
+
+ BUG_ON(cr < 0 || cr >= 4);
+ BUG_ON(EXT4_MB_GRP_NEED_INIT(grp));
+
+ free = grp->bb_free;
+ fragments = grp->bb_fragments;
+ if (free == 0)
+ return 0;
+ if (fragments == 0)
+ return 0;
+
+ switch (cr) {
+ case 0:
+ BUG_ON(ac->ac_2order == 0);
+ /* If this group is uninitialized, skip it initially */
+ desc = ext4_get_group_desc(ac->ac_sb, group, NULL);
+ if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
+ return 0;
+
+ bits = ac->ac_sb->s_blocksize_bits + 1;
+ for (i = ac->ac_2order; i <= bits; i++)
+ if (grp->bb_counters[i] > 0)
+ return 1;
+ break;
+ case 1:
+ if ((free / fragments) >= ac->ac_g_ex.fe_len)
+ return 1;
+ break;
+ case 2:
+ if (free >= ac->ac_g_ex.fe_len)
+ return 1;
+ break;
+ case 3:
+ return 1;
+ default:
+ BUG();
+ }
+
+ return 0;
+}
+
+static int ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
+{
+ ext4_group_t group;
+ ext4_group_t i;
+ int cr;
+ int err = 0;
+ int bsbits;
+ struct ext4_sb_info *sbi;
+ struct super_block *sb;
+ struct ext4_buddy e4b;
+ loff_t size, isize;
+
+ sb = ac->ac_sb;
+ sbi = EXT4_SB(sb);
+ BUG_ON(ac->ac_status == AC_STATUS_FOUND);
+
+ /* first, try the goal */
+ err = ext4_mb_find_by_goal(ac, &e4b);
+ if (err || ac->ac_status == AC_STATUS_FOUND)
+ goto out;
+
+ if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
+ goto out;
+
+ /*
+ * ac->ac_2order is set only if fe_len is a power of 2.
+ * If ac_2order is set we also set criteria to 0 so that we
+ * try exact allocation using buddy.
+ */
+ i = fls(ac->ac_g_ex.fe_len);
+ ac->ac_2order = 0;
+ /*
+ * We search using buddy data only if the order of the request
+ * is greater than or equal to sbi->s_mb_order2_reqs.
+ * You can tune it via /proc/fs/ext4/<partition>/order2_req
+ */
+ if (i >= sbi->s_mb_order2_reqs) {
+ /*
+ * This should tell if fe_len is exactly power of 2
+ */
+ if ((ac->ac_g_ex.fe_len & (~(1 << (i - 1)))) == 0)
+ ac->ac_2order = i - 1;
+ }
+
+ bsbits = ac->ac_sb->s_blocksize_bits;
+ /* if stream allocation is enabled, use global goal */
+ size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
+ isize = i_size_read(ac->ac_inode) >> bsbits;
+ if (size < isize)
+ size = isize;
+
+ if (size < sbi->s_mb_stream_request &&
+ (ac->ac_flags & EXT4_MB_HINT_DATA)) {
+ /* TBD: may be hot point */
+ spin_lock(&sbi->s_md_lock);
+ ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
+ ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
+ spin_unlock(&sbi->s_md_lock);
+ }
+
+ /* searching for the right group start from the goal value specified */
+ group = ac->ac_g_ex.fe_group;
+
+ /* Let's just scan groups to find more or less suitable blocks */
+ cr = ac->ac_2order ? 0 : 1;
+ /*
+ * cr == 0 try to get exact allocation,
+ * cr == 3 try to get anything
+ */
+repeat:
+ for (; cr < 4 && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
+ ac->ac_criteria = cr;
+ for (i = 0; i < EXT4_SB(sb)->s_groups_count; group++, i++) {
+ struct ext4_group_info *grp;
+ struct ext4_group_desc *desc;
+
+ if (group == EXT4_SB(sb)->s_groups_count)
+ group = 0;
+
+ /* quick check to skip empty groups */
+ grp = ext4_get_group_info(ac->ac_sb, group);
+ if (grp->bb_free == 0)
+ continue;
+
+ /*
+ * if the group is already init we check whether it is
+ * a good group and if not we don't load the buddy
+ */
+ if (EXT4_MB_GRP_NEED_INIT(grp)) {
+ /*
+ * we need full data about the group
+ * to make a good selection
+ */
+ err = ext4_mb_load_buddy(sb, group, &e4b);
+ if (err)
+ goto out;
+ ext4_mb_release_desc(&e4b);
+ }
+
+ /*
+ * If the particular group doesn't satisfy our
+ * criteria we continue with the next group
+ */
+ if (!ext4_mb_good_group(ac, group, cr))
+ continue;
+
+ err = ext4_mb_load_buddy(sb, group, &e4b);
+ if (err)
+ goto out;
+
+ ext4_lock_group(sb, group);
+ if (!ext4_mb_good_group(ac, group, cr)) {
+ /* someone did allocation from this group */
+ ext4_unlock_group(sb, group);
+ ext4_mb_release_desc(&e4b);
+ continue;
+ }
+
+ ac->ac_groups_scanned++;
+ desc = ext4_get_group_desc(sb, group, NULL);
+ if (cr == 0 || (desc->bg_flags &
+ cpu_to_le16(EXT4_BG_BLOCK_UNINIT) &&
+ ac->ac_2order != 0))
+ ext4_mb_simple_scan_group(ac, &e4b);
+ else if (cr == 1 &&
+ ac->ac_g_ex.fe_len == sbi->s_stripe)
+ ext4_mb_scan_aligned(ac, &e4b);
+ else
+ ext4_mb_complex_scan_group(ac, &e4b);
+
+ ext4_unlock_group(sb, group);
+ ext4_mb_release_desc(&e4b);
+
+ if (ac->ac_status != AC_STATUS_CONTINUE)
+ break;
+ }
+ }
+
+ if (ac->ac_b_ex.fe_len > 0 && ac->ac_status != AC_STATUS_FOUND &&
+ !(ac->ac_flags & EXT4_MB_HINT_FIRST)) {
+ /*
+ * We've been searching too long. Let's try to allocate
+ * the best chunk we've found so far
+ */
+
+ ext4_mb_try_best_found(ac, &e4b);
+ if (ac->ac_status != AC_STATUS_FOUND) {
+ /*
+ * Someone luckier has already allocated it.
+ * The only thing we can do is just take first
+ * found block(s)
+ printk(KERN_DEBUG "EXT4-fs: someone won our chunk\n");
+ */
+ ac->ac_b_ex.fe_group = 0;
+ ac->ac_b_ex.fe_start = 0;
+ ac->ac_b_ex.fe_len = 0;
+ ac->ac_status = AC_STATUS_CONTINUE;
+ ac->ac_flags |= EXT4_MB_HINT_FIRST;
+ cr = 3;
+ atomic_inc(&sbi->s_mb_lost_chunks);
+ goto repeat;
+ }
+ }
+out:
+ return err;
+}
+
+#ifdef EXT4_MB_HISTORY
+struct ext4_mb_proc_session {
+ struct ext4_mb_history *history;
+ struct super_block *sb;
+ int start;
+ int max;
+};
+
+static void *ext4_mb_history_skip_empty(struct ext4_mb_proc_session *s,
+ struct ext4_mb_history *hs,
+ int first)
+{
+ if (hs == s->history + s->max)
+ hs = s->history;
+ if (!first && hs == s->history + s->start)
+ return NULL;
+ while (hs->orig.fe_len == 0) {
+ hs++;
+ if (hs == s->history + s->max)
+ hs = s->history;
+ if (hs == s->history + s->start)
+ return NULL;
+ }
+ return hs;
+}
+
+static void *ext4_mb_seq_history_start(struct seq_file *seq, loff_t *pos)
+{
+ struct ext4_mb_proc_session *s = seq->private;
+ struct ext4_mb_history *hs;
+ int l = *pos;
+
+ if (l == 0)
+ return SEQ_START_TOKEN;
+ hs = ext4_mb_history_skip_empty(s, s->history + s->start, 1);
+ if (!hs)
+ return NULL;
+ while (--l && (hs = ext4_mb_history_skip_empty(s, ++hs, 0)) != NULL);
+ return hs;
+}
+
+static void *ext4_mb_seq_history_next(struct seq_file *seq, void *v,
+ loff_t *pos)
+{
+ struct ext4_mb_proc_session *s = seq->private;
+ struct ext4_mb_history *hs = v;
+
+ ++*pos;
+ if (v == SEQ_START_TOKEN)
+ return ext4_mb_history_skip_empty(s, s->history + s->start, 1);
+ else
+ return ext4_mb_history_skip_empty(s, ++hs, 0);
+}
+
+static int ext4_mb_seq_history_show(struct seq_file *seq, void *v)
+{
+ char buf[25], buf2[25], buf3[25], *fmt;
+ struct ext4_mb_history *hs = v;
+
+ if (v == SEQ_START_TOKEN) {
+ seq_printf(seq, "%-5s %-8s %-23s %-23s %-23s %-5s "
+ "%-5s %-2s %-5s %-5s %-5s %-6s\n",
+ "pid", "inode", "original", "goal", "result", "found",
+ "grps", "cr", "flags", "merge", "tail", "broken");
+ return 0;
+ }
+
+ if (hs->op == EXT4_MB_HISTORY_ALLOC) {
+ fmt = "%-5u %-8u %-23s %-23s %-23s %-5u %-5u %-2u "
+ "%-5u %-5s %-5u %-6u\n";
+ sprintf(buf2, "%lu/%d/%u@%u", hs->result.fe_group,
+ hs->result.fe_start, hs->result.fe_len,
+ hs->result.fe_logical);
+ sprintf(buf, "%lu/%d/%u@%u", hs->orig.fe_group,
+ hs->orig.fe_start, hs->orig.fe_len,
+ hs->orig.fe_logical);
+ sprintf(buf3, "%lu/%d/%u@%u", hs->goal.fe_group,
+ hs->goal.fe_start, hs->goal.fe_len,
+ hs->goal.fe_logical);
+ seq_printf(seq, fmt, hs->pid, hs->ino, buf, buf3, buf2,
+ hs->found, hs->groups, hs->cr, hs->flags,
+ hs->merged ? "M" : "", hs->tail,
+ hs->buddy ? 1 << hs->buddy : 0);
+ } else if (hs->op == EXT4_MB_HISTORY_PREALLOC) {
+ fmt = "%-5u %-8u %-23s %-23s %-23s\n";
+ sprintf(buf2, "%lu/%d/%u@%u", hs->result.fe_group,
+ hs->result.fe_start, hs->result.fe_len,
+ hs->result.fe_logical);
+ sprintf(buf, "%lu/%d/%u@%u", hs->orig.fe_group,
+ hs->orig.fe_start, hs->orig.fe_len,
+ hs->orig.fe_logical);
+ seq_printf(seq, fmt, hs->pid, hs->ino, buf, "", buf2);
+ } else if (hs->op == EXT4_MB_HISTORY_DISCARD) {
+ sprintf(buf2, "%lu/%d/%u", hs->result.fe_group,
+ hs->result.fe_start, hs->result.fe_len);
+ seq_printf(seq, "%-5u %-8u %-23s discard\n",
+ hs->pid, hs->ino, buf2);
+ } else if (hs->op == EXT4_MB_HISTORY_FREE) {
+ sprintf(buf2, "%lu/%d/%u", hs->result.fe_group,
+ hs->result.fe_start, hs->result.fe_len);
+ seq_printf(seq, "%-5u %-8u %-23s free\n",
+ hs->pid, hs->ino, buf2);
+ }
+ return 0;
+}
+
+static void ext4_mb_seq_history_stop(struct seq_file *seq, void *v)
+{
+}
+
+static struct seq_operations ext4_mb_seq_history_ops = {
+ .start = ext4_mb_seq_history_start,
+ .next = ext4_mb_seq_history_next,
+ .stop = ext4_mb_seq_history_stop,
+ .show = ext4_mb_seq_history_show,
+};
+
+static int ext4_mb_seq_history_open(struct inode *inode, struct file *file)
+{
+ struct super_block *sb = PDE(inode)->data;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_mb_proc_session *s;
+ int rc;
+ int size;
+
+ s = kmalloc(sizeof(*s), GFP_KERNEL);
+ if (s == NULL)
+ return -ENOMEM;
+ s->sb = sb;
+ size = sizeof(struct ext4_mb_history) * sbi->s_mb_history_max;
+ s->history = kmalloc(size, GFP_KERNEL);
+ if (s->history == NULL) {
+ kfree(s);
+ return -ENOMEM;
+ }
+
+ spin_lock(&sbi->s_mb_history_lock);
+ memcpy(s->history, sbi->s_mb_history, size);
+ s->max = sbi->s_mb_history_max;
+ s->start = sbi->s_mb_history_cur % s->max;
+ spin_unlock(&sbi->s_mb_history_lock);
+
+ rc = seq_open(file, &ext4_mb_seq_history_ops);
+ if (rc == 0) {
+ struct seq_file *m = (struct seq_file *)file->private_data;
+ m->private = s;
+ } else {
+ kfree(s->history);
+ kfree(s);
+ }
+ return rc;
+
+}
+
+static int ext4_mb_seq_history_release(struct inode *inode, struct file *file)
+{
+ struct seq_file *seq = (struct seq_file *)file->private_data;
+ struct ext4_mb_proc_session *s = seq->private;
+ kfree(s->history);
+ kfree(s);
+ return seq_release(inode, file);
+}
+
+static ssize_t ext4_mb_seq_history_write(struct file *file,
+ const char __user *buffer,
+ size_t count, loff_t *ppos)
+{
+ struct seq_file *seq = (struct seq_file *)file->private_data;
+ struct ext4_mb_proc_session *s = seq->private;
+ struct super_block *sb = s->sb;
+ char str[32];
+ int value;
+
+ if (count >= sizeof(str)) {
+ printk(KERN_ERR "EXT4-fs: %s string too long, max %u bytes\n",
+ "mb_history", (int)sizeof(str));
+ return -EOVERFLOW;
+ }
+
+ if (copy_from_user(str, buffer, count))
+ return -EFAULT;
+ /* user data need not be NUL-terminated; count < sizeof(str) here */
+ str[count] = '\0';
+
+ value = simple_strtol(str, NULL, 0);
+ if (value < 0)
+ return -ERANGE;
+ EXT4_SB(sb)->s_mb_history_filter = value;
+
+ return count;
+}
+
+static struct file_operations ext4_mb_seq_history_fops = {
+ .owner = THIS_MODULE,
+ .open = ext4_mb_seq_history_open,
+ .read = seq_read,
+ .write = ext4_mb_seq_history_write,
+ .llseek = seq_lseek,
+ .release = ext4_mb_seq_history_release,
+};
+
+static void *ext4_mb_seq_groups_start(struct seq_file *seq, loff_t *pos)
+{
+ struct super_block *sb = seq->private;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ ext4_group_t group;
+
+ if (*pos < 0 || *pos >= sbi->s_groups_count)
+ return NULL;
+
+ group = *pos + 1;
+ return (void *) group;
+}
+
+static void *ext4_mb_seq_groups_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ struct super_block *sb = seq->private;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ ext4_group_t group;
+
+ ++*pos;
+ if (*pos < 0 || *pos >= sbi->s_groups_count)
+ return NULL;
+ group = *pos + 1;
+ return (void *) group;
+}
+
+static int ext4_mb_seq_groups_show(struct seq_file *seq, void *v)
+{
+ struct super_block *sb = seq->private;
+ long group = (long) v;
+ int i;
+ int err;
+ struct ext4_buddy e4b;
+ struct sg {
+ struct ext4_group_info info;
+ unsigned short counters[16];
+ } sg;
+
+ group--;
+ if (group == 0)
+ seq_printf(seq, "#%-5s: %-5s %-5s %-5s "
+ "[ %-5s %-5s %-5s %-5s %-5s %-5s %-5s "
+ "%-5s %-5s %-5s %-5s %-5s %-5s %-5s ]\n",
+ "group", "free", "frags", "first",
+ "2^0", "2^1", "2^2", "2^3", "2^4", "2^5", "2^6",
+ "2^7", "2^8", "2^9", "2^10", "2^11", "2^12", "2^13");
+
+ i = (sb->s_blocksize_bits + 2) * sizeof(sg.info.bb_counters[0]) +
+ sizeof(struct ext4_group_info);
+ err = ext4_mb_load_buddy(sb, group, &e4b);
+ if (err) {
+ seq_printf(seq, "#%-5lu: I/O error\n", group);
+ return 0;
+ }
+ ext4_lock_group(sb, group);
+ memcpy(&sg, ext4_get_group_info(sb, group), i);
+ ext4_unlock_group(sb, group);
+ ext4_mb_release_desc(&e4b);
+
+ seq_printf(seq, "#%-5lu: %-5u %-5u %-5u [", group, sg.info.bb_free,
+ sg.info.bb_fragments, sg.info.bb_first_free);
+ for (i = 0; i <= 13; i++)
+ seq_printf(seq, " %-5u", i <= sb->s_blocksize_bits + 1 ?
+ sg.info.bb_counters[i] : 0);
+ seq_printf(seq, " ]\n");
+
+ return 0;
+}
+
+static void ext4_mb_seq_groups_stop(struct seq_file *seq, void *v)
+{
+}
+
+static struct seq_operations ext4_mb_seq_groups_ops = {
+ .start = ext4_mb_seq_groups_start,
+ .next = ext4_mb_seq_groups_next,
+ .stop = ext4_mb_seq_groups_stop,
+ .show = ext4_mb_seq_groups_show,
+};
+
+static int ext4_mb_seq_groups_open(struct inode *inode, struct file *file)
+{
+ struct super_block *sb = PDE(inode)->data;
+ int rc;
+
+ rc = seq_open(file, &ext4_mb_seq_groups_ops);
+ if (rc == 0) {
+ struct seq_file *m = (struct seq_file *)file->private_data;
+ m->private = sb;
+ }
+ return rc;
+
+}
+
+static struct file_operations ext4_mb_seq_groups_fops = {
+ .owner = THIS_MODULE,
+ .open = ext4_mb_seq_groups_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+static void ext4_mb_history_release(struct super_block *sb)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ remove_proc_entry("mb_groups", sbi->s_mb_proc);
+ remove_proc_entry("mb_history", sbi->s_mb_proc);
+
+ kfree(sbi->s_mb_history);
+}
+
+static void ext4_mb_history_init(struct super_block *sb)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ int i;
+
+ if (sbi->s_mb_proc != NULL) {
+ struct proc_dir_entry *p;
+ p = create_proc_entry("mb_history", S_IRUGO, sbi->s_mb_proc);
+ if (p) {
+ p->proc_fops = &ext4_mb_seq_history_fops;
+ p->data = sb;
+ }
+ p = create_proc_entry("mb_groups", S_IRUGO, sbi->s_mb_proc);
+ if (p) {
+ p->proc_fops = &ext4_mb_seq_groups_fops;
+ p->data = sb;
+ }
+ }
+
+ sbi->s_mb_history_max = 1000;
+ sbi->s_mb_history_cur = 0;
+ spin_lock_init(&sbi->s_mb_history_lock);
+ i = sbi->s_mb_history_max * sizeof(struct ext4_mb_history);
+ sbi->s_mb_history = kmalloc(i, GFP_KERNEL);
+ if (likely(sbi->s_mb_history != NULL))
+ memset(sbi->s_mb_history, 0, i);
+ /* if we can't allocate history, then we simply won't use it */
+}
+
+static void ext4_mb_store_history(struct ext4_allocation_context *ac)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ struct ext4_mb_history h;
+
+ if (unlikely(sbi->s_mb_history == NULL))
+ return;
+
+ if (!(ac->ac_op & sbi->s_mb_history_filter))
+ return;
+
+ h.op = ac->ac_op;
+ h.pid = current->pid;
+ h.ino = ac->ac_inode ? ac->ac_inode->i_ino : 0;
+ h.orig = ac->ac_o_ex;
+ h.result = ac->ac_b_ex;
+ h.flags = ac->ac_flags;
+ h.found = ac->ac_found;
+ h.groups = ac->ac_groups_scanned;
+ h.cr = ac->ac_criteria;
+ h.tail = ac->ac_tail;
+ h.buddy = ac->ac_buddy;
+ h.merged = 0;
+ if (ac->ac_op == EXT4_MB_HISTORY_ALLOC) {
+ if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start &&
+ ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group)
+ h.merged = 1;
+ h.goal = ac->ac_g_ex;
+ h.result = ac->ac_f_ex;
+ }
+
+ spin_lock(&sbi->s_mb_history_lock);
+ memcpy(sbi->s_mb_history + sbi->s_mb_history_cur, &h, sizeof(h));
+ if (++sbi->s_mb_history_cur >= sbi->s_mb_history_max)
+ sbi->s_mb_history_cur = 0;
+ spin_unlock(&sbi->s_mb_history_lock);
+}
+
+#else
+#define ext4_mb_history_release(sb)
+#define ext4_mb_history_init(sb)
+#endif
+
+static int ext4_mb_init_backend(struct super_block *sb)
+{
+ ext4_group_t i;
+ int j, len, metalen;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ int num_meta_group_infos =
+ (sbi->s_groups_count + EXT4_DESC_PER_BLOCK(sb) - 1) >>
+ EXT4_DESC_PER_BLOCK_BITS(sb);
+ struct ext4_group_info **meta_group_info;
+
+ /* An 8TB filesystem with 64-bit pointers requires a 4096 byte
+ * kmalloc. A 128kb malloc should suffice for a 256TB filesystem.
+ * So a two level scheme suffices for now. */
+ sbi->s_group_info = kmalloc(sizeof(*sbi->s_group_info) *
+ num_meta_group_infos, GFP_KERNEL);
+ if (sbi->s_group_info == NULL) {
+ printk(KERN_ERR "EXT4-fs: can't allocate buddy meta group\n");
+ return -ENOMEM;
+ }
+ sbi->s_buddy_cache = new_inode(sb);
+ if (sbi->s_buddy_cache == NULL) {
+ printk(KERN_ERR "EXT4-fs: can't get new inode\n");
+ goto err_freesgi;
+ }
+ EXT4_I(sbi->s_buddy_cache)->i_disksize = 0;
+
+ metalen = sizeof(*meta_group_info) << EXT4_DESC_PER_BLOCK_BITS(sb);
+ for (i = 0; i < num_meta_group_infos; i++) {
+ if ((i + 1) == num_meta_group_infos)
+ metalen = sizeof(*meta_group_info) *
+ (sbi->s_groups_count -
+ (i << EXT4_DESC_PER_BLOCK_BITS(sb)));
+ meta_group_info = kmalloc(metalen, GFP_KERNEL);
+ if (meta_group_info == NULL) {
+ printk(KERN_ERR "EXT4-fs: can't allocate mem for a "
+ "buddy group\n");
+ goto err_freemeta;
+ }
+ sbi->s_group_info[i] = meta_group_info;
+ }
+
+ /*
+ * calculate the needed size. if you change the bb_counters size,
+ * don't forget to update ext4_mb_generate_buddy()
+ */
+ len = sizeof(struct ext4_group_info);
+ len += sizeof(unsigned short) * (sb->s_blocksize_bits + 2);
+ for (i = 0; i < sbi->s_groups_count; i++) {
+ struct ext4_group_desc *desc;
+
+ meta_group_info =
+ sbi->s_group_info[i >> EXT4_DESC_PER_BLOCK_BITS(sb)];
+ j = i & (EXT4_DESC_PER_BLOCK(sb) - 1);
+
+ meta_group_info[j] = kzalloc(len, GFP_KERNEL);
+ if (meta_group_info[j] == NULL) {
+ printk(KERN_ERR "EXT4-fs: can't allocate buddy mem\n");
+ i--;
+ goto err_freebuddy;
+ }
+ desc = ext4_get_group_desc(sb, i, NULL);
+ if (desc == NULL) {
+ printk(KERN_ERR
+ "EXT4-fs: can't read descriptor %lu\n", i);
+ goto err_freebuddy;
+ }
+ memset(meta_group_info[j], 0, len);
+ set_bit(EXT4_GROUP_INFO_NEED_INIT_BIT,
+ &(meta_group_info[j]->bb_state));
+
+ /*
+ * initialize bb_free to be able to skip
+ * empty groups without initialization
+ */
+ if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
+ meta_group_info[j]->bb_free =
+ ext4_free_blocks_after_init(sb, i, desc);
+ } else {
+ meta_group_info[j]->bb_free =
+ le16_to_cpu(desc->bg_free_blocks_count);
+ }
+
+ INIT_LIST_HEAD(&meta_group_info[j]->bb_prealloc_list);
+
+#ifdef DOUBLE_CHECK
+ {
+ struct buffer_head *bh;
+ meta_group_info[j]->bb_bitmap =
+ kmalloc(sb->s_blocksize, GFP_KERNEL);
+ BUG_ON(meta_group_info[j]->bb_bitmap == NULL);
+ bh = read_block_bitmap(sb, i);
+ BUG_ON(bh == NULL);
+ memcpy(meta_group_info[j]->bb_bitmap, bh->b_data,
+ sb->s_blocksize);
+ brelse(bh);
+ }
+#endif
+
+ }
+
+ return 0;
+
+err_freebuddy:
+ while (i >= 0) {
+ kfree(ext4_get_group_info(sb, i));
+ i--;
+ }
+ i = num_meta_group_infos;
+err_freemeta:
+ while (--i >= 0)
+ kfree(sbi->s_group_info[i]);
+ iput(sbi->s_buddy_cache);
+err_freesgi:
+ kfree(sbi->s_group_info);
+ return -ENOMEM;
+}
+
+int ext4_mb_init(struct super_block *sb, int needs_recovery)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ unsigned i;
+ unsigned offset;
+ unsigned max;
+
+ if (!test_opt(sb, MBALLOC))
+ return 0;
+
+ i = (sb->s_blocksize_bits + 2) * sizeof(unsigned short);
+
+ sbi->s_mb_offsets = kmalloc(i, GFP_KERNEL);
+ if (sbi->s_mb_offsets == NULL) {
+ clear_opt(sbi->s_mount_opt, MBALLOC);
+ return -ENOMEM;
+ }
+ sbi->s_mb_maxs = kmalloc(i, GFP_KERNEL);
+ if (sbi->s_mb_maxs == NULL) {
+ clear_opt(sbi->s_mount_opt, MBALLOC);
+ kfree(sbi->s_mb_offsets);
+ return -ENOMEM;
+ }
+
+ /* order 0 is regular bitmap */
+ sbi->s_mb_maxs[0] = sb->s_blocksize << 3;
+ sbi->s_mb_offsets[0] = 0;
+
+ i = 1;
+ offset = 0;
+ max = sb->s_blocksize << 2;
+ do {
+ sbi->s_mb_offsets[i] = offset;
+ sbi->s_mb_maxs[i] = max;
+ offset += 1 << (sb->s_blocksize_bits - i);
+ max = max >> 1;
+ i++;
+ } while (i <= sb->s_blocksize_bits + 1);
+
+ /* init file for buddy data */
+ i = ext4_mb_init_backend(sb);
+ if (i) {
+ clear_opt(sbi->s_mount_opt, MBALLOC);
+ kfree(sbi->s_mb_offsets);
+ kfree(sbi->s_mb_maxs);
+ return i;
+ }
+
+ spin_lock_init(&sbi->s_md_lock);
+ INIT_LIST_HEAD(&sbi->s_active_transaction);
+ INIT_LIST_HEAD(&sbi->s_closed_transaction);
+ INIT_LIST_HEAD(&sbi->s_committed_transaction);
+ spin_lock_init(&sbi->s_bal_lock);
+
+ sbi->s_mb_max_to_scan = MB_DEFAULT_MAX_TO_SCAN;
+ sbi->s_mb_min_to_scan = MB_DEFAULT_MIN_TO_SCAN;
+ sbi->s_mb_stats = MB_DEFAULT_STATS;
+ sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
+ sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
+ sbi->s_mb_history_filter = EXT4_MB_HISTORY_DEFAULT;
+ sbi->s_mb_group_prealloc = MB_DEFAULT_GROUP_PREALLOC;
+
+ i = sizeof(struct ext4_locality_group) * NR_CPUS;
+ sbi->s_locality_groups = kmalloc(i, GFP_NOFS);
+ if (sbi->s_locality_groups == NULL) {
+ clear_opt(sbi->s_mount_opt, MBALLOC);
+ kfree(sbi->s_mb_offsets);
+ kfree(sbi->s_mb_maxs);
+ return -ENOMEM;
+ }
+ for (i = 0; i < NR_CPUS; i++) {
+ struct ext4_locality_group *lg;
+ lg = &sbi->s_locality_groups[i];
+ sema_init(&lg->lg_sem, 1);
+ INIT_LIST_HEAD(&lg->lg_prealloc_list);
+ spin_lock_init(&lg->lg_prealloc_lock);
+ }
+
+ ext4_mb_init_per_dev_proc(sb);
+ ext4_mb_history_init(sb);
+
+ printk(KERN_INFO "EXT4-fs: mballoc enabled\n");
+ return 0;
+}
+
+static void ext4_mb_cleanup_pa(struct ext4_group_info *grp)
+{
+ struct ext4_prealloc_space *pa;
+ struct list_head *cur, *tmp;
+ int count = 0;
+
+ list_for_each_safe(cur, tmp, &grp->bb_prealloc_list) {
+ pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
+ list_del_rcu(&pa->pa_group_list);
+ count++;
+ kfree(pa);
+ }
+ if (count)
+ mb_debug("mballoc: %u PAs left\n", count);
+
+}
+
+int ext4_mb_release(struct super_block *sb)
+{
+ ext4_group_t i;
+ int num_meta_group_infos;
+ struct ext4_group_info *grinfo;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ if (!test_opt(sb, MBALLOC))
+ return 0;
+
+ /* release freed, non-committed blocks */
+ spin_lock(&sbi->s_md_lock);
+ list_splice_init(&sbi->s_closed_transaction,
+ &sbi->s_committed_transaction);
+ list_splice_init(&sbi->s_active_transaction,
+ &sbi->s_committed_transaction);
+ spin_unlock(&sbi->s_md_lock);
+ ext4_mb_free_committed_blocks(sb);
+
+ if (sbi->s_group_info) {
+ for (i = 0; i < sbi->s_groups_count; i++) {
+ grinfo = ext4_get_group_info(sb, i);
+#ifdef DOUBLE_CHECK
+ kfree(grinfo->bb_bitmap);
+#endif
+ ext4_mb_cleanup_pa(grinfo);
+ kfree(grinfo);
+ }
+ num_meta_group_infos = (sbi->s_groups_count +
+ EXT4_DESC_PER_BLOCK(sb) - 1) >>
+ EXT4_DESC_PER_BLOCK_BITS(sb);
+ for (i = 0; i < num_meta_group_infos; i++)
+ kfree(sbi->s_group_info[i]);
+ kfree(sbi->s_group_info);
+ }
+ kfree(sbi->s_mb_offsets);
+ kfree(sbi->s_mb_maxs);
+ if (sbi->s_buddy_cache)
+ iput(sbi->s_buddy_cache);
+ if (sbi->s_mb_stats) {
+ printk(KERN_INFO
+ "EXT4-fs: mballoc: %u blocks %u reqs (%u success)\n",
+ atomic_read(&sbi->s_bal_allocated),
+ atomic_read(&sbi->s_bal_reqs),
+ atomic_read(&sbi->s_bal_success));
+ printk(KERN_INFO
+ "EXT4-fs: mballoc: %u extents scanned, %u goal hits, "
+ "%u 2^N hits, %u breaks, %u lost\n",
+ atomic_read(&sbi->s_bal_ex_scanned),
+ atomic_read(&sbi->s_bal_goals),
+ atomic_read(&sbi->s_bal_2orders),
+ atomic_read(&sbi->s_bal_breaks),
+ atomic_read(&sbi->s_mb_lost_chunks));
+ printk(KERN_INFO
+ "EXT4-fs: mballoc: %lu generated and it took %Lu\n",
+ sbi->s_mb_buddies_generated,
+ sbi->s_mb_generation_time);
+ printk(KERN_INFO
+ "EXT4-fs: mballoc: %u preallocated, %u discarded\n",
+ atomic_read(&sbi->s_mb_preallocated),
+ atomic_read(&sbi->s_mb_discarded));
+ }
+
+ kfree(sbi->s_locality_groups);
+
+ ext4_mb_history_release(sb);
+ ext4_mb_destroy_per_dev_proc(sb);
+
+ return 0;
+}
+
+static void ext4_mb_free_committed_blocks(struct super_block *sb)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ int err;
+ int i;
+ int count = 0;
+ int count2 = 0;
+ struct ext4_free_metadata *md;
+ struct ext4_buddy e4b;
+
+ if (list_empty(&sbi->s_committed_transaction))
+ return;
+
+ /* there are committed blocks still to be freed */
+ do {
+ /* get next array of blocks */
+ md = NULL;
+ spin_lock(&sbi->s_md_lock);
+ if (!list_empty(&sbi->s_committed_transaction)) {
+ md = list_entry(sbi->s_committed_transaction.next,
+ struct ext4_free_metadata, list);
+ list_del(&md->list);
+ }
+ spin_unlock(&sbi->s_md_lock);
+
+ if (md == NULL)
+ break;
+
+ mb_debug("gonna free %u blocks in group %lu (0x%p):",
+ md->num, md->group, md);
+
+ err = ext4_mb_load_buddy(sb, md->group, &e4b);
+ /* we expect to find existing buddy because it's pinned */
+ BUG_ON(err != 0);
+
+ /* there are blocks to put in buddy to make them really free */
+ count += md->num;
+ count2++;
+ ext4_lock_group(sb, md->group);
+ for (i = 0; i < md->num; i++) {
+ mb_debug(" %u", md->blocks[i]);
+ err = mb_free_blocks(NULL, &e4b, md->blocks[i], 1);
+ BUG_ON(err != 0);
+ }
+ mb_debug("\n");
+ ext4_unlock_group(sb, md->group);
+
+ /* balance refcounts from ext4_mb_free_metadata() */
+ page_cache_release(e4b.bd_buddy_page);
+ page_cache_release(e4b.bd_bitmap_page);
+
+ kfree(md);
+ ext4_mb_release_desc(&e4b);
+
+ } while (md);
+
+ mb_debug("freed %u blocks in %u structures\n", count, count2);
+}
+
+#define EXT4_ROOT "ext4"
+#define EXT4_MB_STATS_NAME "stats"
+#define EXT4_MB_MAX_TO_SCAN_NAME "max_to_scan"
+#define EXT4_MB_MIN_TO_SCAN_NAME "min_to_scan"
+#define EXT4_MB_ORDER2_REQ "order2_req"
+#define EXT4_MB_STREAM_REQ "stream_req"
+#define EXT4_MB_GROUP_PREALLOC "group_prealloc"
+
+
+
+#define MB_PROC_VALUE_READ(name) \
+static int ext4_mb_read_##name(char *page, char **start, \
+ off_t off, int count, int *eof, void *data) \
+{ \
+ struct ext4_sb_info *sbi = data; \
+ int len; \
+ *eof = 1; \
+ if (off != 0) \
+ return 0; \
+ len = sprintf(page, "%ld\n", sbi->s_mb_##name); \
+ *start = page; \
+ return len; \
+}
+
+#define MB_PROC_VALUE_WRITE(name) \
+static int ext4_mb_write_##name(struct file *file, \
+ const char __user *buf, unsigned long cnt, void *data) \
+{ \
+ struct ext4_sb_info *sbi = data; \
+ char str[32]; \
+ long value; \
+ if (cnt >= sizeof(str)) \
+ return -EINVAL; \
+ if (copy_from_user(str, buf, cnt)) \
+ return -EFAULT; \
+ str[cnt] = '\0'; \
+ value = simple_strtol(str, NULL, 0); \
+ if (value <= 0) \
+ return -ERANGE; \
+ sbi->s_mb_##name = value; \
+ return cnt; \
+}
+
+MB_PROC_VALUE_READ(stats);
+MB_PROC_VALUE_WRITE(stats);
+MB_PROC_VALUE_READ(max_to_scan);
+MB_PROC_VALUE_WRITE(max_to_scan);
+MB_PROC_VALUE_READ(min_to_scan);
+MB_PROC_VALUE_WRITE(min_to_scan);
+MB_PROC_VALUE_READ(order2_reqs);
+MB_PROC_VALUE_WRITE(order2_reqs);
+MB_PROC_VALUE_READ(stream_request);
+MB_PROC_VALUE_WRITE(stream_request);
+MB_PROC_VALUE_READ(group_prealloc);
+MB_PROC_VALUE_WRITE(group_prealloc);
+
+#define MB_PROC_HANDLER(name, var) \
+do { \
+ proc = create_proc_entry(name, mode, sbi->s_mb_proc); \
+ if (proc == NULL) { \
+ printk(KERN_ERR "EXT4-fs: can't create %s\n", name); \
+ goto err_out; \
+ } \
+ proc->data = sbi; \
+ proc->read_proc = ext4_mb_read_##var ; \
+ proc->write_proc = ext4_mb_write_##var; \
+} while (0)
+
+static int ext4_mb_init_per_dev_proc(struct super_block *sb)
+{
+ mode_t mode = S_IFREG | S_IRUGO | S_IWUSR;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct proc_dir_entry *proc;
+ char devname[64];
+
+ snprintf(devname, sizeof(devname) - 1, "%s",
+ bdevname(sb->s_bdev, devname));
+ sbi->s_mb_proc = proc_mkdir(devname, proc_root_ext4);
+
+ MB_PROC_HANDLER(EXT4_MB_STATS_NAME, stats);
+ MB_PROC_HANDLER(EXT4_MB_MAX_TO_SCAN_NAME, max_to_scan);
+ MB_PROC_HANDLER(EXT4_MB_MIN_TO_SCAN_NAME, min_to_scan);
+ MB_PROC_HANDLER(EXT4_MB_ORDER2_REQ, order2_reqs);
+ MB_PROC_HANDLER(EXT4_MB_STREAM_REQ, stream_request);
+ MB_PROC_HANDLER(EXT4_MB_GROUP_PREALLOC, group_prealloc);
+
+ return 0;
+
+err_out:
+ printk(KERN_ERR "EXT4-fs: Unable to create %s\n", devname);
+ remove_proc_entry(EXT4_MB_GROUP_PREALLOC, sbi->s_mb_proc);
+ remove_proc_entry(EXT4_MB_STREAM_REQ, sbi->s_mb_proc);
+ remove_proc_entry(EXT4_MB_ORDER2_REQ, sbi->s_mb_proc);
+ remove_proc_entry(EXT4_MB_MIN_TO_SCAN_NAME, sbi->s_mb_proc);
+ remove_proc_entry(EXT4_MB_MAX_TO_SCAN_NAME, sbi->s_mb_proc);
+ remove_proc_entry(EXT4_MB_STATS_NAME, sbi->s_mb_proc);
+ remove_proc_entry(devname, proc_root_ext4);
+ sbi->s_mb_proc = NULL;
+
+ return -ENOMEM;
+}
+
+static int ext4_mb_destroy_per_dev_proc(struct super_block *sb)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ char devname[64];
+
+ if (sbi->s_mb_proc == NULL)
+ return -EINVAL;
+
+ snprintf(devname, sizeof(devname) - 1, "%s",
+ bdevname(sb->s_bdev, devname));
+ remove_proc_entry(EXT4_MB_GROUP_PREALLOC, sbi->s_mb_proc);
+ remove_proc_entry(EXT4_MB_STREAM_REQ, sbi->s_mb_proc);
+ remove_proc_entry(EXT4_MB_ORDER2_REQ, sbi->s_mb_proc);
+ remove_proc_entry(EXT4_MB_MIN_TO_SCAN_NAME, sbi->s_mb_proc);
+ remove_proc_entry(EXT4_MB_MAX_TO_SCAN_NAME, sbi->s_mb_proc);
+ remove_proc_entry(EXT4_MB_STATS_NAME, sbi->s_mb_proc);
+ remove_proc_entry(devname, proc_root_ext4);
+
+ return 0;
+}
+
+int __init init_ext4_mballoc(void)
+{
+ ext4_pspace_cachep =
+ kmem_cache_create("ext4_prealloc_space",
+ sizeof(struct ext4_prealloc_space),
+ 0, SLAB_RECLAIM_ACCOUNT, NULL);
+ if (ext4_pspace_cachep == NULL)
+ return -ENOMEM;
+
+#ifdef CONFIG_PROC_FS
+ proc_root_ext4 = proc_mkdir(EXT4_ROOT, proc_root_fs);
+ if (proc_root_ext4 == NULL)
+ printk(KERN_ERR "EXT4-fs: Unable to create %s\n", EXT4_ROOT);
+#endif
+
+ return 0;
+}
+
+void exit_ext4_mballoc(void)
+{
+ /* XXX: synchronize_rcu(); */
+ kmem_cache_destroy(ext4_pspace_cachep);
+#ifdef CONFIG_PROC_FS
+ remove_proc_entry(EXT4_ROOT, proc_root_fs);
+#endif
+}
+
+
+/*
+ * Check quota and mark chosen space (ac->ac_b_ex) non-free in bitmaps
+ * Returns 0 on success or an error code
+ */
+static int ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
+ handle_t *handle)
+{
+ struct buffer_head *bitmap_bh = NULL;
+ struct ext4_super_block *es;
+ struct ext4_group_desc *gdp;
+ struct buffer_head *gdp_bh;
+ struct ext4_sb_info *sbi;
+ struct super_block *sb;
+ ext4_fsblk_t block;
+ int err;
+
+ BUG_ON(ac->ac_status != AC_STATUS_FOUND);
+ BUG_ON(ac->ac_b_ex.fe_len <= 0);
+
+ sb = ac->ac_sb;
+ sbi = EXT4_SB(sb);
+ es = sbi->s_es;
+
+ err = -EIO;
+ bitmap_bh = read_block_bitmap(sb, ac->ac_b_ex.fe_group);
+ if (!bitmap_bh)
+ goto out_err;
+
+ err = ext4_journal_get_write_access(handle, bitmap_bh);
+ if (err)
+ goto out_err;
+
+ err = -EIO;
+ gdp = ext4_get_group_desc(sb, ac->ac_b_ex.fe_group, &gdp_bh);
+ if (!gdp)
+ goto out_err;
+
+ /* gdp is only valid from here on, so log after the lookup */
+ ext4_debug("using block group %lu(%d)\n", ac->ac_b_ex.fe_group,
+ le16_to_cpu(gdp->bg_free_blocks_count));
+
+ err = ext4_journal_get_write_access(handle, gdp_bh);
+ if (err)
+ goto out_err;
+
+ block = ac->ac_b_ex.fe_group * EXT4_BLOCKS_PER_GROUP(sb)
+ + ac->ac_b_ex.fe_start
+ + le32_to_cpu(es->s_first_data_block);
+
+ if (block == ext4_block_bitmap(sb, gdp) ||
+ block == ext4_inode_bitmap(sb, gdp) ||
+ in_range(block, ext4_inode_table(sb, gdp),
+ EXT4_SB(sb)->s_itb_per_group)) {
+
+ ext4_error(sb, __FUNCTION__,
+ "Allocating block in system zone - block = %llu",
+ block);
+ }
+#ifdef AGGRESSIVE_CHECK
+ {
+ int i;
+ for (i = 0; i < ac->ac_b_ex.fe_len; i++) {
+ BUG_ON(mb_test_bit(ac->ac_b_ex.fe_start + i,
+ bitmap_bh->b_data));
+ }
+ }
+#endif
+ mb_set_bits(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group), bitmap_bh->b_data,
+ ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len);
+
+ spin_lock(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group));
+ if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
+ gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
+ gdp->bg_free_blocks_count =
+ cpu_to_le16(ext4_free_blocks_after_init(sb,
+ ac->ac_b_ex.fe_group,
+ gdp));
+ }
+ gdp->bg_free_blocks_count =
+ cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)
+ - ac->ac_b_ex.fe_len);
+ gdp->bg_checksum = ext4_group_desc_csum(sbi, ac->ac_b_ex.fe_group, gdp);
+ spin_unlock(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group));
+ percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
+
+ err = ext4_journal_dirty_metadata(handle, bitmap_bh);
+ if (err)
+ goto out_err;
+ err = ext4_journal_dirty_metadata(handle, gdp_bh);
+
+out_err:
+ sb->s_dirt = 1;
+ brelse(bitmap_bh);
+ return err;
+}
+
+/*
+ * here we normalize a request for a locality group
+ * Group requests are normalized to the s_stripe size if it was set via a
+ * mount option. If not, we use s_mb_group_prealloc, which can be configured
+ * via /proc/fs/ext4/<partition>/group_prealloc
+ *
+ * XXX: should we try to preallocate more than the group has now?
+ */
+static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
+{
+ struct super_block *sb = ac->ac_sb;
+ struct ext4_locality_group *lg = ac->ac_lg;
+
+ BUG_ON(lg == NULL);
+ if (EXT4_SB(sb)->s_stripe)
+ ac->ac_g_ex.fe_len = EXT4_SB(sb)->s_stripe;
+ else
+ ac->ac_g_ex.fe_len = EXT4_SB(sb)->s_mb_group_prealloc;
+ mb_debug("#%u: goal %lu blocks for locality group\n",
+ current->pid, ac->ac_g_ex.fe_len);
+}
+
+/*
+ * Normalization means making request better in terms of
+ * size and alignment
+ */
+static void ext4_mb_normalize_request(struct ext4_allocation_context *ac,
+ struct ext4_allocation_request *ar)
+{
+ int bsbits, max;
+ ext4_lblk_t end;
+ struct list_head *cur;
+ loff_t size, orig_size, start_off;
+ ext4_lblk_t start, orig_start;
+ struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
+
+ /* normalize only data requests; metadata requests
+ do not need preallocation */
+ if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
+ return;
+
+ /* sometimes the caller may want exact blocks */
+ if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
+ return;
+
+ /* caller may indicate that preallocation isn't
+ * required (it's a tail, for example) */
+ if (ac->ac_flags & EXT4_MB_HINT_NOPREALLOC)
+ return;
+
+ if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC) {
+ ext4_mb_normalize_group_request(ac);
+ return;
+ }
+
+ bsbits = ac->ac_sb->s_blocksize_bits;
+
+ /* first, let's learn actual file size
+ * given current request is allocated */
+ size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
+ size = size << bsbits;
+ if (size < i_size_read(ac->ac_inode))
+ size = i_size_read(ac->ac_inode);
+
+ /* max available blocks in a free group */
+ max = EXT4_BLOCKS_PER_GROUP(ac->ac_sb) - 1 - 1 -
+ EXT4_SB(ac->ac_sb)->s_itb_per_group;
+
+#define NRL_CHECK_SIZE(req, size, max, bits) \
+ ((req) <= (size) || (max) <= ((size) >> (bits)))
+
+ /* first, try to predict filesize */
+ /* XXX: should this table be tunable? */
+ start_off = 0;
+ if (size <= 16 * 1024) {
+ size = 16 * 1024;
+ } else if (size <= 32 * 1024) {
+ size = 32 * 1024;
+ } else if (size <= 64 * 1024) {
+ size = 64 * 1024;
+ } else if (size <= 128 * 1024) {
+ size = 128 * 1024;
+ } else if (size <= 256 * 1024) {
+ size = 256 * 1024;
+ } else if (size <= 512 * 1024) {
+ size = 512 * 1024;
+ } else if (size <= 1024 * 1024) {
+ size = 1024 * 1024;
+ } else if (NRL_CHECK_SIZE(size, 4 * 1024 * 1024, max, bsbits)) {
+ start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
+ (20 - bsbits)) << 20;
+ size = 1024 * 1024;
+ } else if (NRL_CHECK_SIZE(size, 8 * 1024 * 1024, max, bsbits)) {
+ start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
+ (22 - bsbits)) << 22;
+ size = 4 * 1024 * 1024;
+ } else if (NRL_CHECK_SIZE(ac->ac_o_ex.fe_len,
+ (8<<20)>>bsbits, max, bsbits)) {
+ start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
+ (23 - bsbits)) << 23;
+ size = 8 * 1024 * 1024;
+ } else {
+ start_off = (loff_t)ac->ac_o_ex.fe_logical << bsbits;
+ size = ac->ac_o_ex.fe_len << bsbits;
+ }
+ orig_size = size = size >> bsbits;
+ orig_start = start = start_off >> bsbits;
+
+ /* don't cover already allocated blocks in selected range */
+ if (ar->pleft && start <= ar->lleft) {
+ size -= ar->lleft + 1 - start;
+ start = ar->lleft + 1;
+ }
+ if (ar->pright && start + size - 1 >= ar->lright)
+ size -= start + size - ar->lright;
+
+ end = start + size;
+
+ /* check we don't cross already preallocated blocks */
+ rcu_read_lock();
+ list_for_each_rcu(cur, &ei->i_prealloc_list) {
+ struct ext4_prealloc_space *pa;
+ unsigned long pa_end;
+
+ pa = list_entry(cur, struct ext4_prealloc_space, pa_inode_list);
+
+ if (pa->pa_deleted)
+ continue;
+ spin_lock(&pa->pa_lock);
+ if (pa->pa_deleted) {
+ spin_unlock(&pa->pa_lock);
+ continue;
+ }
+
+ pa_end = pa->pa_lstart + pa->pa_len;
+
+ /* PA must not overlap original request */
+ BUG_ON(!(ac->ac_o_ex.fe_logical >= pa_end ||
+ ac->ac_o_ex.fe_logical < pa->pa_lstart));
+
+ /* skip PAs that the normalized request doesn't overlap */
+ if (pa->pa_lstart >= end) {
+ spin_unlock(&pa->pa_lock);
+ continue;
+ }
+ if (pa_end <= start) {
+ spin_unlock(&pa->pa_lock);
+ continue;
+ }
+ BUG_ON(pa->pa_lstart <= start && pa_end >= end);
+
+ if (pa_end <= ac->ac_o_ex.fe_logical) {
+ BUG_ON(pa_end < start);
+ start = pa_end;
+ }
+
+ if (pa->pa_lstart > ac->ac_o_ex.fe_logical) {
+ BUG_ON(pa->pa_lstart > end);
+ end = pa->pa_lstart;
+ }
+ spin_unlock(&pa->pa_lock);
+ }
+ rcu_read_unlock();
+ size = end - start;
+
+ /* XXX: extra loop to check we really don't overlap preallocations */
+ rcu_read_lock();
+ list_for_each_rcu(cur, &ei->i_prealloc_list) {
+ struct ext4_prealloc_space *pa;
+ unsigned long pa_end;
+ pa = list_entry(cur, struct ext4_prealloc_space, pa_inode_list);
+ spin_lock(&pa->pa_lock);
+ if (pa->pa_deleted == 0) {
+ pa_end = pa->pa_lstart + pa->pa_len;
+ BUG_ON(!(start >= pa_end || end <= pa->pa_lstart));
+ }
+ spin_unlock(&pa->pa_lock);
+ }
+ rcu_read_unlock();
+
+ if (start + size <= ac->ac_o_ex.fe_logical &&
+ start > ac->ac_o_ex.fe_logical) {
+ printk(KERN_ERR "start %lu, size %lu, fe_logical %lu\n",
+ (unsigned long) start, (unsigned long) size,
+ (unsigned long) ac->ac_o_ex.fe_logical);
+ }
+ BUG_ON(start + size <= ac->ac_o_ex.fe_logical &&
+ start > ac->ac_o_ex.fe_logical);
+ BUG_ON(size <= 0 || size >= EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
+
+ /* now prepare goal request */
+
+ /* XXX: is it better to align blocks WRT logical
+ * placement or satisfy a big request as is */
+ ac->ac_g_ex.fe_logical = start;
+ ac->ac_g_ex.fe_len = size;
+
+ /* define goal start in order to merge */
+ if (ar->pright && (ar->lright == (start + size))) {
+ /* merge to the right */
+ ext4_get_group_no_and_offset(ac->ac_sb, ar->pright - size,
+ &ac->ac_f_ex.fe_group,
+ &ac->ac_f_ex.fe_start);
+ ac->ac_flags |= EXT4_MB_HINT_TRY_GOAL;
+ }
+ if (ar->pleft && (ar->lleft + 1 == start)) {
+ /* merge to the left */
+ ext4_get_group_no_and_offset(ac->ac_sb, ar->pleft + 1,
+ &ac->ac_f_ex.fe_group,
+ &ac->ac_f_ex.fe_start);
+ ac->ac_flags |= EXT4_MB_HINT_TRY_GOAL;
+ }
+
+ mb_debug("goal: %u(was %u) blocks at %u\n", (unsigned) size,
+ (unsigned) orig_size, (unsigned) start);
+}
+
+static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+
+ if (sbi->s_mb_stats && ac->ac_g_ex.fe_len > 1) {
+ atomic_inc(&sbi->s_bal_reqs);
+ atomic_add(ac->ac_b_ex.fe_len, &sbi->s_bal_allocated);
+ if (ac->ac_o_ex.fe_len >= ac->ac_g_ex.fe_len)
+ atomic_inc(&sbi->s_bal_success);
+ atomic_add(ac->ac_found, &sbi->s_bal_ex_scanned);
+ if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start &&
+ ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group)
+ atomic_inc(&sbi->s_bal_goals);
+ if (ac->ac_found > sbi->s_mb_max_to_scan)
+ atomic_inc(&sbi->s_bal_breaks);
+ }
+
+ ext4_mb_store_history(ac);
+}
+
+/*
+ * use blocks preallocated to inode
+ */
+static void ext4_mb_use_inode_pa(struct ext4_allocation_context *ac,
+ struct ext4_prealloc_space *pa)
+{
+ ext4_fsblk_t start;
+ ext4_fsblk_t end;
+ int len;
+
+ /* found preallocated blocks, use them */
+ start = pa->pa_pstart + (ac->ac_o_ex.fe_logical - pa->pa_lstart);
+ end = min(pa->pa_pstart + pa->pa_len, start + ac->ac_o_ex.fe_len);
+ len = end - start;
+ ext4_get_group_no_and_offset(ac->ac_sb, start, &ac->ac_b_ex.fe_group,
+ &ac->ac_b_ex.fe_start);
+ ac->ac_b_ex.fe_len = len;
+ ac->ac_status = AC_STATUS_FOUND;
+ ac->ac_pa = pa;
+
+ BUG_ON(start < pa->pa_pstart);
+ BUG_ON(start + len > pa->pa_pstart + pa->pa_len);
+ BUG_ON(pa->pa_free < len);
+ pa->pa_free -= len;
+
+ mb_debug("use %llu/%d from inode pa %p\n", start, len, pa);
+}
+
+/*
+ * use blocks preallocated to locality group
+ */
+static void ext4_mb_use_group_pa(struct ext4_allocation_context *ac,
+ struct ext4_prealloc_space *pa)
+{
+ unsigned len = ac->ac_o_ex.fe_len;
+
+ ext4_get_group_no_and_offset(ac->ac_sb, pa->pa_pstart,
+ &ac->ac_b_ex.fe_group,
+ &ac->ac_b_ex.fe_start);
+ ac->ac_b_ex.fe_len = len;
+ ac->ac_status = AC_STATUS_FOUND;
+ ac->ac_pa = pa;
+
+ /* we don't correct pa_pstart or pa_len here to avoid a
+ * possible race when the group is being loaded concurrently;
+ * instead we correct pa later, after blocks are marked
+ * in on-disk bitmap -- see ext4_mb_release_context() */
+ /*
+ * FIXME!! but the other CPUs can look at this particular
+ * pa and think that it has enough free blocks if we
+ * don't update pa_free here, right?
+ */
+ mb_debug("use %u/%u from group pa %p\n", pa->pa_lstart-len, len, pa);
+}
+
+/*
+ * search goal blocks in preallocated space
+ */
+static int ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
+{
+ struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
+ struct ext4_locality_group *lg;
+ struct ext4_prealloc_space *pa;
+ struct list_head *cur;
+
+ /* only data can be preallocated */
+ if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
+ return 0;
+
+ /* first, try per-file preallocation */
+ rcu_read_lock();
+ list_for_each_rcu(cur, &ei->i_prealloc_list) {
+ pa = list_entry(cur, struct ext4_prealloc_space, pa_inode_list);
+
+ /* none of the fields in this condition change,
+ * so we can skip locking for them */
+ if (ac->ac_o_ex.fe_logical < pa->pa_lstart ||
+ ac->ac_o_ex.fe_logical >= pa->pa_lstart + pa->pa_len)
+ continue;
+
+ /* found preallocated blocks, use them */
+ spin_lock(&pa->pa_lock);
+ if (pa->pa_deleted == 0 && pa->pa_free) {
+ atomic_inc(&pa->pa_count);
+ ext4_mb_use_inode_pa(ac, pa);
+ spin_unlock(&pa->pa_lock);
+ ac->ac_criteria = 10;
+ rcu_read_unlock();
+ return 1;
+ }
+ spin_unlock(&pa->pa_lock);
+ }
+ rcu_read_unlock();
+
+ /* can we use group allocation? */
+ if (!(ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC))
+ return 0;
+
+ /* inode may have no locality group for some reason */
+ lg = ac->ac_lg;
+ if (lg == NULL)
+ return 0;
+
+ rcu_read_lock();
+ list_for_each_rcu(cur, &lg->lg_prealloc_list) {
+ pa = list_entry(cur, struct ext4_prealloc_space, pa_inode_list);
+ spin_lock(&pa->pa_lock);
+ if (pa->pa_deleted == 0 && pa->pa_free >= ac->ac_o_ex.fe_len) {
+ atomic_inc(&pa->pa_count);
+ ext4_mb_use_group_pa(ac, pa);
+ spin_unlock(&pa->pa_lock);
+ ac->ac_criteria = 20;
+ rcu_read_unlock();
+ return 1;
+ }
+ spin_unlock(&pa->pa_lock);
+ }
+ rcu_read_unlock();
+
+ return 0;
+}
+
+/*
+ * the function goes through all preallocations in this group and marks them
+ * used in the in-core bitmap. buddy must be generated from this bitmap
+ */
+static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
+ ext4_group_t group)
+{
+ struct ext4_group_info *grp = ext4_get_group_info(sb, group);
+ struct ext4_prealloc_space *pa;
+ struct list_head *cur;
+ ext4_group_t groupnr;
+ ext4_grpblk_t start;
+ int preallocated = 0;
+ int count = 0;
+ int len;
+
+ /* all forms of preallocation discard first load the group,
+ * so the only competing code is preallocation use.
+ * we don't need any locking here
+ * notice we do NOT ignore preallocations with pa_deleted
+ * otherwise we could leave used blocks available for
+ * allocation in buddy when concurrent ext4_mb_put_pa()
+ * is dropping preallocation
+ */
+ list_for_each_rcu(cur, &grp->bb_prealloc_list) {
+ pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
+ spin_lock(&pa->pa_lock);
+ ext4_get_group_no_and_offset(sb, pa->pa_pstart,
+ &groupnr, &start);
+ len = pa->pa_len;
+ spin_unlock(&pa->pa_lock);
+ if (unlikely(len == 0))
+ continue;
+ BUG_ON(groupnr != group);
+ mb_set_bits(sb_bgl_lock(EXT4_SB(sb), group),
+ bitmap, start, len);
+ preallocated += len;
+ count++;
+ }
+ mb_debug("preallocated %u for group %lu\n", preallocated, group);
+}
+
+static void ext4_mb_pa_callback(struct rcu_head *head)
+{
+ struct ext4_prealloc_space *pa;
+ pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu);
+ kmem_cache_free(ext4_pspace_cachep, pa);
+}
+#define mb_call_rcu(__pa) call_rcu(&(__pa)->u.pa_rcu, ext4_mb_pa_callback)
+
+/*
+ * drops a reference to the preallocated space descriptor;
+ * if this was the last reference and the space is consumed, frees it
+ */
+static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
+ struct super_block *sb, struct ext4_prealloc_space *pa)
+{
+ unsigned long grp;
+
+ if (!atomic_dec_and_test(&pa->pa_count) || pa->pa_free != 0)
+ return;
+
+ /* in this short window concurrent discard can set pa_deleted */
+ spin_lock(&pa->pa_lock);
+ if (pa->pa_deleted == 1) {
+ spin_unlock(&pa->pa_lock);
+ return;
+ }
+
+ pa->pa_deleted = 1;
+ spin_unlock(&pa->pa_lock);
+
+ /* -1 is to protect from crossing allocation group */
+ ext4_get_group_no_and_offset(sb, pa->pa_pstart - 1, &grp, NULL);
+
+ /*
+ * possible race:
+ *
+ * P1 (buddy init) P2 (regular allocation)
+ * find block B in PA
+ * copy on-disk bitmap to buddy
+ * mark B in on-disk bitmap
+ * drop PA from group
+ * mark all PAs in buddy
+ *
+ * thus, P1 initializes buddy with B available. to prevent this
+ * we make "copy" and "mark all PAs" atomic and serialize "drop PA"
+ * against that pair
+ */
+ ext4_lock_group(sb, grp);
+ list_del_rcu(&pa->pa_group_list);
+ ext4_unlock_group(sb, grp);
+
+ spin_lock(pa->pa_obj_lock);
+ list_del_rcu(&pa->pa_inode_list);
+ spin_unlock(pa->pa_obj_lock);
+
+ mb_call_rcu(pa);
+}
+
+/*
+ * creates new preallocated space for given inode
+ */
+static int ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
+{
+ struct super_block *sb = ac->ac_sb;
+ struct ext4_prealloc_space *pa;
+ struct ext4_group_info *grp;
+ struct ext4_inode_info *ei;
+
+ /* preallocate only when found space is larger than requested */
+ BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len);
+ BUG_ON(ac->ac_status != AC_STATUS_FOUND);
+ BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
+
+ pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
+ if (pa == NULL)
+ return -ENOMEM;
+
+ if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len) {
+ int winl;
+ int wins;
+ int win;
+ int offs;
+
+ /* we can't allocate as much as normalizer wants.
+ * so, found space must get proper lstart
+ * to cover original request */
+ BUG_ON(ac->ac_g_ex.fe_logical > ac->ac_o_ex.fe_logical);
+ BUG_ON(ac->ac_g_ex.fe_len < ac->ac_o_ex.fe_len);
+
+ /* we're limited by original request in that
+ * logical block must be covered any way
+ * winl is window we can move our chunk within */
+ winl = ac->ac_o_ex.fe_logical - ac->ac_g_ex.fe_logical;
+
+ /* also, we should cover whole original request */
+ wins = ac->ac_b_ex.fe_len - ac->ac_o_ex.fe_len;
+
+ /* the smallest one defines real window */
+ win = min(winl, wins);
+
+ offs = ac->ac_o_ex.fe_logical % ac->ac_b_ex.fe_len;
+ if (offs && offs < win)
+ win = offs;
+
+ ac->ac_b_ex.fe_logical = ac->ac_o_ex.fe_logical - win;
+ BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
+ BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
+ }
+
+ /* preallocation can change ac_b_ex, thus we store actually
+ * allocated blocks for history */
+ ac->ac_f_ex = ac->ac_b_ex;
+
+ pa->pa_lstart = ac->ac_b_ex.fe_logical;
+ pa->pa_pstart = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
+ pa->pa_len = ac->ac_b_ex.fe_len;
+ pa->pa_free = pa->pa_len;
+ atomic_set(&pa->pa_count, 1);
+ spin_lock_init(&pa->pa_lock);
+ pa->pa_deleted = 0;
+ pa->pa_linear = 0;
+
+ mb_debug("new inode pa %p: %llu/%u for %u\n", pa,
+ pa->pa_pstart, pa->pa_len, pa->pa_lstart);
+
+ ext4_mb_use_inode_pa(ac, pa);
+ atomic_add(pa->pa_free, &EXT4_SB(sb)->s_mb_preallocated);
+
+ ei = EXT4_I(ac->ac_inode);
+ grp = ext4_get_group_info(sb, ac->ac_b_ex.fe_group);
+
+ pa->pa_obj_lock = &ei->i_prealloc_lock;
+ pa->pa_inode = ac->ac_inode;
+
+ ext4_lock_group(sb, ac->ac_b_ex.fe_group);
+ list_add_rcu(&pa->pa_group_list, &grp->bb_prealloc_list);
+ ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
+
+ spin_lock(pa->pa_obj_lock);
+ list_add_rcu(&pa->pa_inode_list, &ei->i_prealloc_list);
+ spin_unlock(pa->pa_obj_lock);
+
+ return 0;
+}
+
+/*
+ * creates new preallocated space for the locality group the inode belongs to
+ */
+static int ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
+{
+ struct super_block *sb = ac->ac_sb;
+ struct ext4_locality_group *lg;
+ struct ext4_prealloc_space *pa;
+ struct ext4_group_info *grp;
+
+ /* preallocate only when found space is larger than requested */
+ BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len);
+ BUG_ON(ac->ac_status != AC_STATUS_FOUND);
+ BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
+
+ BUG_ON(ext4_pspace_cachep == NULL);
+ pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
+ if (pa == NULL)
+ return -ENOMEM;
+
+ /* preallocation can change ac_b_ex, thus we store actually
+ * allocated blocks for history */
+ ac->ac_f_ex = ac->ac_b_ex;
+
+ pa->pa_pstart = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
+ pa->pa_lstart = pa->pa_pstart;
+ pa->pa_len = ac->ac_b_ex.fe_len;
+ pa->pa_free = pa->pa_len;
+ atomic_set(&pa->pa_count, 1);
+ spin_lock_init(&pa->pa_lock);
+ pa->pa_deleted = 0;
+ pa->pa_linear = 1;
+
+ mb_debug("new group pa %p: %llu/%u for %u\n", pa,
+ pa->pa_pstart, pa->pa_len, pa->pa_lstart);
+
+ ext4_mb_use_group_pa(ac, pa);
+ atomic_add(pa->pa_free, &EXT4_SB(sb)->s_mb_preallocated);
+
+ grp = ext4_get_group_info(sb, ac->ac_b_ex.fe_group);
+ lg = ac->ac_lg;
+ BUG_ON(lg == NULL);
+
+ pa->pa_obj_lock = &lg->lg_prealloc_lock;
+ pa->pa_inode = NULL;
+
+ ext4_lock_group(sb, ac->ac_b_ex.fe_group);
+ list_add_rcu(&pa->pa_group_list, &grp->bb_prealloc_list);
+ ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
+
+ spin_lock(pa->pa_obj_lock);
+ list_add_tail_rcu(&pa->pa_inode_list, &lg->lg_prealloc_list);
+ spin_unlock(pa->pa_obj_lock);
+
+ return 0;
+}
+
+static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
+{
+ int err;
+
+ if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC)
+ err = ext4_mb_new_group_pa(ac);
+ else
+ err = ext4_mb_new_inode_pa(ac);
+ return err;
+}
+
+/*
+ * finds all unused blocks in on-disk bitmap, frees them in
+ * in-core bitmap and buddy.
+ * @pa must be unlinked from inode and group lists, so that
+ * nobody else can find/use it.
+ * the caller MUST hold group/inode locks.
+ * TODO: optimize the case when there are no in-core structures yet
+ */
+static int ext4_mb_release_inode_pa(struct ext4_buddy *e4b,
+ struct buffer_head *bitmap_bh,
+ struct ext4_prealloc_space *pa)
+{
+ struct ext4_allocation_context ac;
+ struct super_block *sb = e4b->bd_sb;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ unsigned long end;
+ unsigned long next;
+ ext4_group_t group;
+ ext4_grpblk_t bit;
+ sector_t start;
+ int err = 0;
+ int free = 0;
+
+ BUG_ON(pa->pa_deleted == 0);
+ ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit);
+ BUG_ON(group != e4b->bd_group && pa->pa_len != 0);
+ end = bit + pa->pa_len;
+
+ ac.ac_sb = sb;
+ ac.ac_inode = pa->pa_inode;
+ ac.ac_op = EXT4_MB_HISTORY_DISCARD;
+
+ while (bit < end) {
+ bit = ext4_find_next_zero_bit(bitmap_bh->b_data, end, bit);
+ if (bit >= end)
+ break;
+ next = ext4_find_next_bit(bitmap_bh->b_data, end, bit);
+ if (next > end)
+ next = end;
+ start = group * EXT4_BLOCKS_PER_GROUP(sb) + bit +
+ le32_to_cpu(sbi->s_es->s_first_data_block);
+ mb_debug(" free preallocated %u/%u in group %u\n",
+ (unsigned) start, (unsigned) next - bit,
+ (unsigned) group);
+ free += next - bit;
+
+ ac.ac_b_ex.fe_group = group;
+ ac.ac_b_ex.fe_start = bit;
+ ac.ac_b_ex.fe_len = next - bit;
+ ac.ac_b_ex.fe_logical = 0;
+ ext4_mb_store_history(&ac);
+
+ mb_free_blocks(pa->pa_inode, e4b, bit, next - bit);
+ bit = next + 1;
+ }
+ if (free != pa->pa_free) {
+ printk(KERN_ERR "pa %p: logic %lu, phys. %lu, len %lu\n",
+ pa, (unsigned long) pa->pa_lstart,
+ (unsigned long) pa->pa_pstart,
+ (unsigned long) pa->pa_len);
+ printk(KERN_ERR "free %u, pa_free %u\n", free, pa->pa_free);
+ }
+ BUG_ON(free != pa->pa_free);
+ atomic_add(free, &sbi->s_mb_discarded);
+
+ return err;
+}
+
+static int ext4_mb_release_group_pa(struct ext4_buddy *e4b,
+ struct ext4_prealloc_space *pa)
+{
+ struct ext4_allocation_context ac;
+ struct super_block *sb = e4b->bd_sb;
+ ext4_group_t group;
+ ext4_grpblk_t bit;
+
+ ac.ac_op = EXT4_MB_HISTORY_DISCARD;
+
+ BUG_ON(pa->pa_deleted == 0);
+ ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit);
+ BUG_ON(group != e4b->bd_group && pa->pa_len != 0);
+ mb_free_blocks(pa->pa_inode, e4b, bit, pa->pa_len);
+ atomic_add(pa->pa_len, &EXT4_SB(sb)->s_mb_discarded);
+
+ ac.ac_sb = sb;
+ ac.ac_inode = NULL;
+ ac.ac_b_ex.fe_group = group;
+ ac.ac_b_ex.fe_start = bit;
+ ac.ac_b_ex.fe_len = pa->pa_len;
+ ac.ac_b_ex.fe_logical = 0;
+ ext4_mb_store_history(&ac);
+
+ return 0;
+}
+
+/*
+ * releases all preallocations in given group
+ *
+ * first, we need to decide discard policy:
+ * - when do we discard
+ * 1) ENOSPC
+ * - how many do we discard
+ * 1) how many requested
+ */
+static int ext4_mb_discard_group_preallocations(struct super_block *sb,
+ ext4_group_t group, int needed)
+{
+ struct ext4_group_info *grp = ext4_get_group_info(sb, group);
+ struct buffer_head *bitmap_bh = NULL;
+ struct ext4_prealloc_space *pa, *tmp;
+ struct list_head list;
+ struct ext4_buddy e4b;
+ int err;
+ int busy = 0;
+ int free = 0;
+
+ mb_debug("discard preallocation for group %lu\n", group);
+
+ if (list_empty(&grp->bb_prealloc_list))
+ return 0;
+
+ bitmap_bh = read_block_bitmap(sb, group);
+ if (bitmap_bh == NULL) {
+ /* error handling here; e4b is not loaded yet,
+ * so there is nothing to release */
+ BUG_ON(bitmap_bh == NULL);
+ }
+
+ err = ext4_mb_load_buddy(sb, group, &e4b);
+ BUG_ON(err != 0); /* error handling here */
+
+ if (needed == 0)
+ needed = EXT4_BLOCKS_PER_GROUP(sb) + 1;
+
+ grp = ext4_get_group_info(sb, group);
+ INIT_LIST_HEAD(&list);
+
+repeat:
+ ext4_lock_group(sb, group);
+ list_for_each_entry_safe(pa, tmp,
+ &grp->bb_prealloc_list, pa_group_list) {
+ spin_lock(&pa->pa_lock);
+ if (atomic_read(&pa->pa_count)) {
+ spin_unlock(&pa->pa_lock);
+ busy = 1;
+ continue;
+ }
+ if (pa->pa_deleted) {
+ spin_unlock(&pa->pa_lock);
+ continue;
+ }
+
+ /* seems this one can be freed ... */
+ pa->pa_deleted = 1;
+
+ /* we can trust pa_free ... */
+ free += pa->pa_free;
+
+ spin_unlock(&pa->pa_lock);
+
+ list_del_rcu(&pa->pa_group_list);
+ list_add(&pa->u.pa_tmp_list, &list);
+ }
+
+ /* if we still need more blocks and some PAs were used, try again */
+ if (free < needed && busy) {
+ busy = 0;
+ ext4_unlock_group(sb, group);
+ /*
+ * Yield the CPU here so that we don't get soft lockup
+ * in non preempt case.
+ */
+ yield();
+ goto repeat;
+ }
+
+ /* found anything to free? */
+ if (list_empty(&list)) {
+ BUG_ON(free != 0);
+ goto out;
+ }
+
+ /* now free all selected PAs */
+ list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
+
+ /* remove from object (inode or locality group) */
+ spin_lock(pa->pa_obj_lock);
+ list_del_rcu(&pa->pa_inode_list);
+ spin_unlock(pa->pa_obj_lock);
+
+ if (pa->pa_linear)
+ ext4_mb_release_group_pa(&e4b, pa);
+ else
+ ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
+
+ list_del(&pa->u.pa_tmp_list);
+ mb_call_rcu(pa);
+ }
+
+out:
+ ext4_unlock_group(sb, group);
+ ext4_mb_release_desc(&e4b);
+ brelse(bitmap_bh);
+ return free;
+}
+
+/*
+ * releases all non-used preallocated blocks for given inode
+ *
+ * It's important to discard preallocations under i_data_sem
+ * We don't want another block to be served from the prealloc
+ * space when we are discarding the inode prealloc space.
+ *
+ * FIXME!! Make sure it is valid at all the call sites
+ */
+void ext4_mb_discard_inode_preallocations(struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct super_block *sb = inode->i_sb;
+ struct buffer_head *bitmap_bh = NULL;
+ struct ext4_prealloc_space *pa, *tmp;
+ ext4_group_t group = 0;
+ struct list_head list;
+ struct ext4_buddy e4b;
+ int err;
+
+ if (!test_opt(sb, MBALLOC) || !S_ISREG(inode->i_mode)) {
+ /*BUG_ON(!list_empty(&ei->i_prealloc_list));*/
+ return;
+ }
+
+ mb_debug("discard preallocation for inode %lu\n", inode->i_ino);
+
+ INIT_LIST_HEAD(&list);
+
+repeat:
+ /* first, collect all pa's in the inode */
+ spin_lock(&ei->i_prealloc_lock);
+ while (!list_empty(&ei->i_prealloc_list)) {
+ pa = list_entry(ei->i_prealloc_list.next,
+ struct ext4_prealloc_space, pa_inode_list);
+ BUG_ON(pa->pa_obj_lock != &ei->i_prealloc_lock);
+ spin_lock(&pa->pa_lock);
+ if (atomic_read(&pa->pa_count)) {
+ /* this shouldn't happen often - nobody should
+ * use preallocation while we're discarding it */
+ spin_unlock(&pa->pa_lock);
+ spin_unlock(&ei->i_prealloc_lock);
+ printk(KERN_ERR "uh-oh! used pa while discarding\n");
+ dump_stack();
+ current->state = TASK_UNINTERRUPTIBLE;
+ schedule_timeout(HZ);
+ goto repeat;
+
+ }
+ if (pa->pa_deleted == 0) {
+ pa->pa_deleted = 1;
+ spin_unlock(&pa->pa_lock);
+ list_del_rcu(&pa->pa_inode_list);
+ list_add(&pa->u.pa_tmp_list, &list);
+ continue;
+ }
+
+ /* someone is deleting pa right now */
+ spin_unlock(&pa->pa_lock);
+ spin_unlock(&ei->i_prealloc_lock);
+
+ /* we have to wait here because pa_deleted
+ * doesn't mean pa is already unlinked from
+ * the list. as we might be called from
+ * ->clear_inode() the inode will get freed
+ * and concurrent thread which is unlinking
+ * pa from inode's list may access already
+ * freed memory, bad-bad-bad */
+
+ /* XXX: if this happens too often, we can
+ * add a flag to force wait only in case
+ * of ->clear_inode(), but not in case of
+ * regular truncate */
+ current->state = TASK_UNINTERRUPTIBLE;
+ schedule_timeout(HZ);
+ goto repeat;
+ }
+ spin_unlock(&ei->i_prealloc_lock);
+
+ list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
+ BUG_ON(pa->pa_linear != 0);
+ ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, NULL);
+
+ err = ext4_mb_load_buddy(sb, group, &e4b);
+ BUG_ON(err != 0); /* error handling here */
+
+ bitmap_bh = read_block_bitmap(sb, group);
+ if (bitmap_bh == NULL) {
+ /* error handling here */
+ ext4_mb_release_desc(&e4b);
+ BUG_ON(bitmap_bh == NULL);
+ }
+
+ ext4_lock_group(sb, group);
+ list_del_rcu(&pa->pa_group_list);
+ ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
+ ext4_unlock_group(sb, group);
+
+ ext4_mb_release_desc(&e4b);
+ brelse(bitmap_bh);
+
+ list_del(&pa->u.pa_tmp_list);
+ mb_call_rcu(pa);
+ }
+}
+
+/*
+ * finds all preallocated spaces and returns blocks being freed to them
+ * if preallocated space becomes full (no block is used from the space)
+ * then the function frees space in buddy
+ * XXX: at the moment, truncate (which is the only way to free blocks)
+ * discards all preallocations
+ */
+static void ext4_mb_return_to_preallocation(struct inode *inode,
+ struct ext4_buddy *e4b,
+ sector_t block, int count)
+{
+ BUG_ON(!list_empty(&EXT4_I(inode)->i_prealloc_list));
+}
+#ifdef MB_DEBUG
+static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
+{
+ struct super_block *sb = ac->ac_sb;
+ ext4_group_t i;
+
+ printk(KERN_ERR "EXT4-fs: Can't allocate:"
+ " Allocation context details:\n");
+ printk(KERN_ERR "EXT4-fs: status %d flags %d\n",
+ ac->ac_status, ac->ac_flags);
+ printk(KERN_ERR "EXT4-fs: orig %lu/%lu/%lu@%lu, goal %lu/%lu/%lu@%lu, "
+ "best %lu/%lu/%lu@%lu cr %d\n",
+ (unsigned long)ac->ac_o_ex.fe_group,
+ (unsigned long)ac->ac_o_ex.fe_start,
+ (unsigned long)ac->ac_o_ex.fe_len,
+ (unsigned long)ac->ac_o_ex.fe_logical,
+ (unsigned long)ac->ac_g_ex.fe_group,
+ (unsigned long)ac->ac_g_ex.fe_start,
+ (unsigned long)ac->ac_g_ex.fe_len,
+ (unsigned long)ac->ac_g_ex.fe_logical,
+ (unsigned long)ac->ac_b_ex.fe_group,
+ (unsigned long)ac->ac_b_ex.fe_start,
+ (unsigned long)ac->ac_b_ex.fe_len,
+ (unsigned long)ac->ac_b_ex.fe_logical,
+ (int)ac->ac_criteria);
+ printk(KERN_ERR "EXT4-fs: %lu scanned, %d found\n", ac->ac_ex_scanned,
+ ac->ac_found);
+ printk(KERN_ERR "EXT4-fs: groups: \n");
+ for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
+ struct ext4_group_info *grp = ext4_get_group_info(sb, i);
+ struct ext4_prealloc_space *pa;
+ ext4_grpblk_t start;
+ struct list_head *cur;
+ list_for_each_rcu(cur, &grp->bb_prealloc_list) {
+ pa = list_entry(cur, struct ext4_prealloc_space,
+ pa_group_list);
+ spin_lock(&pa->pa_lock);
+ ext4_get_group_no_and_offset(sb, pa->pa_pstart,
+ NULL, &start);
+ spin_unlock(&pa->pa_lock);
+ printk(KERN_ERR "PA:%lu:%d:%u \n", i,
+ start, pa->pa_len);
+ }
+
+ if (grp->bb_free == 0)
+ continue;
+ printk(KERN_ERR "%lu: %d/%d \n",
+ i, grp->bb_free, grp->bb_fragments);
+ }
+ printk(KERN_ERR "\n");
+}
+#else
+#define ext4_mb_show_ac(x)
+#endif
+
+/*
+ * We use locality group preallocation for small files. The size of the
+ * file is determined by the current size or the resulting size after
+ * allocation, whichever is larger
+ *
+ * One can tune this size via /proc/fs/ext4/<partition>/stream_req
+ */
+static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ int bsbits = ac->ac_sb->s_blocksize_bits;
+ loff_t size, isize;
+
+ if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
+ return;
+
+ size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
+ isize = i_size_read(ac->ac_inode) >> bsbits;
+ if (size < isize)
+ size = isize;
+
+ /* don't use group allocation for large files */
+ if (size >= sbi->s_mb_stream_request)
+ return;
+
+ if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
+ return;
+
+ BUG_ON(ac->ac_lg != NULL);
+ ac->ac_lg = &sbi->s_locality_groups[get_cpu()];
+ put_cpu();
+
+ /* we're going to use group allocation */
+ ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;
+
+ /* serialize all allocations in the group */
+ down(&ac->ac_lg->lg_sem);
+}
+
+static int ext4_mb_initialize_context(struct ext4_allocation_context *ac,
+ struct ext4_allocation_request *ar)
+{
+ struct super_block *sb = ar->inode->i_sb;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_super_block *es = sbi->s_es;
+ ext4_group_t group;
+ unsigned long len;
+ unsigned long goal;
+ ext4_grpblk_t block;
+
+ /* we can't allocate > group size */
+ len = ar->len;
+
+ /* just a dirty hack to filter too big requests */
+ if (len >= EXT4_BLOCKS_PER_GROUP(sb) - 10)
+ len = EXT4_BLOCKS_PER_GROUP(sb) - 10;
+
+ /* start searching from the goal */
+ goal = ar->goal;
+ if (goal < le32_to_cpu(es->s_first_data_block) ||
+ goal >= ext4_blocks_count(es))
+ goal = le32_to_cpu(es->s_first_data_block);
+ ext4_get_group_no_and_offset(sb, goal, &group, &block);
+
+ /* set up allocation goals */
+ ac->ac_b_ex.fe_logical = ar->logical;
+ ac->ac_b_ex.fe_group = 0;
+ ac->ac_b_ex.fe_start = 0;
+ ac->ac_b_ex.fe_len = 0;
+ ac->ac_status = AC_STATUS_CONTINUE;
+ ac->ac_groups_scanned = 0;
+ ac->ac_ex_scanned = 0;
+ ac->ac_found = 0;
+ ac->ac_sb = sb;
+ ac->ac_inode = ar->inode;
+ ac->ac_o_ex.fe_logical = ar->logical;
+ ac->ac_o_ex.fe_group = group;
+ ac->ac_o_ex.fe_start = block;
+ ac->ac_o_ex.fe_len = len;
+ ac->ac_g_ex.fe_logical = ar->logical;
+ ac->ac_g_ex.fe_group = group;
+ ac->ac_g_ex.fe_start = block;
+ ac->ac_g_ex.fe_len = len;
+ ac->ac_f_ex.fe_len = 0;
+ ac->ac_flags = ar->flags;
+ ac->ac_2order = 0;
+ ac->ac_criteria = 0;
+ ac->ac_pa = NULL;
+ ac->ac_bitmap_page = NULL;
+ ac->ac_buddy_page = NULL;
+ ac->ac_lg = NULL;
+
+ /* we have to define the context: whether we'll work with a file
+ * or a locality group. this is a policy, actually */
+ ext4_mb_group_or_file(ac);
+
+ mb_debug("init ac: %u blocks @ %u, goal %u, flags %x, 2^%d, "
+ "left: %u/%u, right %u/%u to %swritable\n",
+ (unsigned) ar->len, (unsigned) ar->logical,
+ (unsigned) ar->goal, ac->ac_flags, ac->ac_2order,
+ (unsigned) ar->lleft, (unsigned) ar->pleft,
+ (unsigned) ar->lright, (unsigned) ar->pright,
+ atomic_read(&ar->inode->i_writecount) ? "" : "non-");
+ return 0;
+
+}
+
+/*
+ * release all resources we used in allocation
+ */
+static int ext4_mb_release_context(struct ext4_allocation_context *ac)
+{
+ if (ac->ac_pa) {
+ if (ac->ac_pa->pa_linear) {
+ /* see comment in ext4_mb_use_group_pa() */
+ spin_lock(&ac->ac_pa->pa_lock);
+ ac->ac_pa->pa_pstart += ac->ac_b_ex.fe_len;
+ ac->ac_pa->pa_lstart += ac->ac_b_ex.fe_len;
+ ac->ac_pa->pa_free -= ac->ac_b_ex.fe_len;
+ ac->ac_pa->pa_len -= ac->ac_b_ex.fe_len;
+ spin_unlock(&ac->ac_pa->pa_lock);
+ }
+ ext4_mb_put_pa(ac, ac->ac_sb, ac->ac_pa);
+ }
+ if (ac->ac_bitmap_page)
+ page_cache_release(ac->ac_bitmap_page);
+ if (ac->ac_buddy_page)
+ page_cache_release(ac->ac_buddy_page);
+ if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC)
+ up(&ac->ac_lg->lg_sem);
+ ext4_mb_collect_stats(ac);
+ return 0;
+}
+
+static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
+{
+ ext4_group_t i;
+ int ret;
+ int freed = 0;
+
+ for (i = 0; i < EXT4_SB(sb)->s_groups_count && needed > 0; i++) {
+ ret = ext4_mb_discard_group_preallocations(sb, i, needed);
+ freed += ret;
+ needed -= ret;
+ }
+
+ return freed;
+}
+
+/*
+ * Main entry point into mballoc to allocate blocks
+ * it tries to use preallocation first, then falls back
+ * to usual allocation
+ */
+ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
+ struct ext4_allocation_request *ar, int *errp)
+{
+ struct ext4_allocation_context ac;
+ struct ext4_sb_info *sbi;
+ struct super_block *sb;
+ ext4_fsblk_t block = 0;
+ int freed;
+ int inquota;
+
+ sb = ar->inode->i_sb;
+ sbi = EXT4_SB(sb);
+
+ if (!test_opt(sb, MBALLOC)) {
+ block = ext4_new_blocks_old(handle, ar->inode, ar->goal,
+ &(ar->len), errp);
+ return block;
+ }
+
+ while (ar->len && DQUOT_ALLOC_BLOCK(ar->inode, ar->len)) {
+ ar->flags |= EXT4_MB_HINT_NOPREALLOC;
+ ar->len--;
+ }
+ if (ar->len == 0) {
+ *errp = -EDQUOT;
+ return 0;
+ }
+ inquota = ar->len;
+
+ ext4_mb_poll_new_transaction(sb, handle);
+
+ *errp = ext4_mb_initialize_context(&ac, ar);
+ if (*errp) {
+ ar->len = 0;
+ goto out;
+ }
+
+ ac.ac_op = EXT4_MB_HISTORY_PREALLOC;
+ if (!ext4_mb_use_preallocated(&ac)) {
+
+ ac.ac_op = EXT4_MB_HISTORY_ALLOC;
+ ext4_mb_normalize_request(&ac, ar);
+
+repeat:
+ /* allocate space in core */
+ ext4_mb_regular_allocator(&ac);
+
+ /* as we've just preallocated more space than
+ * user requested originally, we store allocated
+ * space in a special descriptor */
+ if (ac.ac_status == AC_STATUS_FOUND &&
+ ac.ac_o_ex.fe_len < ac.ac_b_ex.fe_len)
+ ext4_mb_new_preallocation(&ac);
+ }
+
+ if (likely(ac.ac_status == AC_STATUS_FOUND)) {
+ ext4_mb_mark_diskspace_used(&ac, handle);
+ *errp = 0;
+ block = ext4_grp_offs_to_block(sb, &ac.ac_b_ex);
+ ar->len = ac.ac_b_ex.fe_len;
+ } else {
+ freed = ext4_mb_discard_preallocations(sb, ac.ac_o_ex.fe_len);
+ if (freed)
+ goto repeat;
+ *errp = -ENOSPC;
+ ac.ac_b_ex.fe_len = 0;
+ ar->len = 0;
+ ext4_mb_show_ac(&ac);
+ }
+
+ ext4_mb_release_context(&ac);
+
+out:
+ if (ar->len < inquota)
+ DQUOT_FREE_BLOCK(ar->inode, inquota - ar->len);
+
+ return block;
+}
+static void ext4_mb_poll_new_transaction(struct super_block *sb,
+ handle_t *handle)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ if (sbi->s_last_transaction == handle->h_transaction->t_tid)
+ return;
+
+ /* new transaction! time to close last one and free blocks for
+ * committed transaction. we know that only one transaction can be
+ * active, so the previous transaction may still be being logged, and
+ * we know that the transaction before the previous one has already
+ * been logged. this means that now we may free blocks freed in all
+ * transactions before the previous one. hope I'm clear enough ... */
+
+ spin_lock(&sbi->s_md_lock);
+ if (sbi->s_last_transaction != handle->h_transaction->t_tid) {
+ mb_debug("new transaction %lu, old %lu\n",
+ (unsigned long) handle->h_transaction->t_tid,
+ (unsigned long) sbi->s_last_transaction);
+ list_splice_init(&sbi->s_closed_transaction,
+ &sbi->s_committed_transaction);
+ list_splice_init(&sbi->s_active_transaction,
+ &sbi->s_closed_transaction);
+ sbi->s_last_transaction = handle->h_transaction->t_tid;
+ }
+ spin_unlock(&sbi->s_md_lock);
+
+ ext4_mb_free_committed_blocks(sb);
+}
+
+static int ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
+ ext4_group_t group, ext4_grpblk_t block, int count)
+{
+ struct ext4_group_info *db = e4b->bd_info;
+ struct super_block *sb = e4b->bd_sb;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_free_metadata *md;
+ int i;
+
+ BUG_ON(e4b->bd_bitmap_page == NULL);
+ BUG_ON(e4b->bd_buddy_page == NULL);
+
+ ext4_lock_group(sb, group);
+ for (i = 0; i < count; i++) {
+ md = db->bb_md_cur;
+ if (md && db->bb_tid != handle->h_transaction->t_tid) {
+ db->bb_md_cur = NULL;
+ md = NULL;
+ }
+
+ if (md == NULL) {
+ ext4_unlock_group(sb, group);
+ md = kmalloc(sizeof(*md), GFP_NOFS);
+ if (md == NULL)
+ return -ENOMEM;
+ md->num = 0;
+ md->group = group;
+
+ ext4_lock_group(sb, group);
+ if (db->bb_md_cur == NULL) {
+ spin_lock(&sbi->s_md_lock);
+ list_add(&md->list, &sbi->s_active_transaction);
+ spin_unlock(&sbi->s_md_lock);
+ /* protect buddy cache from being freed,
+ * otherwise we'll refresh it from
+ * on-disk bitmap and lose not-yet-available
+ * blocks */
+ page_cache_get(e4b->bd_buddy_page);
+ page_cache_get(e4b->bd_bitmap_page);
+ db->bb_md_cur = md;
+ db->bb_tid = handle->h_transaction->t_tid;
+ mb_debug("new md 0x%p for group %lu\n",
+ md, md->group);
+ } else {
+ kfree(md);
+ md = db->bb_md_cur;
+ }
+ }
+
+ BUG_ON(md->num >= EXT4_BB_MAX_BLOCKS);
+ md->blocks[md->num] = block + i;
+ md->num++;
+ if (md->num == EXT4_BB_MAX_BLOCKS) {
+ /* no more space, put full container on a sb's list */
+ db->bb_md_cur = NULL;
+ }
+ }
+ ext4_unlock_group(sb, group);
+ return 0;
+}
+
+/*
+ * Main entry point into mballoc to free blocks
+ */
+void ext4_mb_free_blocks(handle_t *handle, struct inode *inode,
+ unsigned long block, unsigned long count,
+ int metadata, unsigned long *freed)
+{
+ struct buffer_head *bitmap_bh = NULL;
+ struct super_block *sb = inode->i_sb;
+ struct ext4_allocation_context ac;
+ struct ext4_group_desc *gdp;
+ struct ext4_super_block *es;
+ unsigned long overflow;
+ ext4_grpblk_t bit;
+ struct buffer_head *gd_bh;
+ ext4_group_t block_group;
+ struct ext4_sb_info *sbi;
+ struct ext4_buddy e4b;
+ int err = 0;
+ int ret;
+
+ *freed = 0;
+
+ ext4_mb_poll_new_transaction(sb, handle);
+
+ sbi = EXT4_SB(sb);
+ es = EXT4_SB(sb)->s_es;
+ if (block < le32_to_cpu(es->s_first_data_block) ||
+ block + count < block ||
+ block + count > ext4_blocks_count(es)) {
+ ext4_error(sb, __FUNCTION__,
+ "Freeing blocks not in datazone - "
+ "block = %lu, count = %lu", block, count);
+ goto error_return;
+ }
+
+ ext4_debug("freeing block %lu\n", block);
+
+ ac.ac_op = EXT4_MB_HISTORY_FREE;
+ ac.ac_inode = inode;
+ ac.ac_sb = sb;
+
+do_more:
+ overflow = 0;
+ ext4_get_group_no_and_offset(sb, block, &block_group, &bit);
+
+ /*
+ * Check to see if we are freeing blocks across a group
+ * boundary.
+ */
+ if (bit + count > EXT4_BLOCKS_PER_GROUP(sb)) {
+ overflow = bit + count - EXT4_BLOCKS_PER_GROUP(sb);
+ count -= overflow;
+ }
+ brelse(bitmap_bh);
+ bitmap_bh = read_block_bitmap(sb, block_group);
+ if (!bitmap_bh)
+ goto error_return;
+ gdp = ext4_get_group_desc(sb, block_group, &gd_bh);
+ if (!gdp)
+ goto error_return;
+
+ if (in_range(ext4_block_bitmap(sb, gdp), block, count) ||
+ in_range(ext4_inode_bitmap(sb, gdp), block, count) ||
+ in_range(block, ext4_inode_table(sb, gdp),
+ EXT4_SB(sb)->s_itb_per_group) ||
+ in_range(block + count - 1, ext4_inode_table(sb, gdp),
+ EXT4_SB(sb)->s_itb_per_group)) {
+
+ ext4_error(sb, __FUNCTION__,
+ "Freeing blocks in system zone - "
+ "Block = %lu, count = %lu", block, count);
+ }
+
+ BUFFER_TRACE(bitmap_bh, "getting write access");
+ err = ext4_journal_get_write_access(handle, bitmap_bh);
+ if (err)
+ goto error_return;
+
+ /*
+ * We are about to modify some metadata. Call the journal APIs
+ * to unshare ->b_data if a currently-committing transaction is
+ * using it
+ */
+ BUFFER_TRACE(gd_bh, "get_write_access");
+ err = ext4_journal_get_write_access(handle, gd_bh);
+ if (err)
+ goto error_return;
+
+ err = ext4_mb_load_buddy(sb, block_group, &e4b);
+ if (err)
+ goto error_return;
+
+#ifdef AGGRESSIVE_CHECK
+ {
+ int i;
+ for (i = 0; i < count; i++)
+ BUG_ON(!mb_test_bit(bit + i, bitmap_bh->b_data));
+ }
+#endif
+ mb_clear_bits(sb_bgl_lock(sbi, block_group), bitmap_bh->b_data,
+ bit, count);
+
+ /* We dirtied the bitmap block */
+ BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
+ err = ext4_journal_dirty_metadata(handle, bitmap_bh);
+
+ ac.ac_b_ex.fe_group = block_group;
+ ac.ac_b_ex.fe_start = bit;
+ ac.ac_b_ex.fe_len = count;
+ ext4_mb_store_history(&ac);
+
+ if (metadata) {
+ /* blocks being freed are metadata. these blocks shouldn't
+ * be used until this transaction is committed */
+ ext4_mb_free_metadata(handle, &e4b, block_group, bit, count);
+ } else {
+ ext4_lock_group(sb, block_group);
+ err = mb_free_blocks(inode, &e4b, bit, count);
+ ext4_mb_return_to_preallocation(inode, &e4b, block, count);
+ ext4_unlock_group(sb, block_group);
+ BUG_ON(err != 0);
+ }
+
+ spin_lock(sb_bgl_lock(sbi, block_group));
+ gdp->bg_free_blocks_count =
+ cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) + count);
+ gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
+ spin_unlock(sb_bgl_lock(sbi, block_group));
+ percpu_counter_add(&sbi->s_freeblocks_counter, count);
+
+ ext4_mb_release_desc(&e4b);
+
+ *freed += count;
+
+ /* And the group descriptor block */
+ BUFFER_TRACE(gd_bh, "dirtied group descriptor block");
+ ret = ext4_journal_dirty_metadata(handle, gd_bh);
+ if (!err)
+ err = ret;
+
+ if (overflow && !err) {
+ block += count;
+ count = overflow;
+ goto do_more;
+ }
+ sb->s_dirt = 1;
+error_return:
+ brelse(bitmap_bh);
+ ext4_std_error(sb, err);
+ return;
+}
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 7203d3d..6b40f55 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -253,11 +253,11 @@ static int free_dind_blocks(handle_t *handle,
for (i = 0; i < max_entries; i++) {
if (tmp_idata[i]) {
ext4_free_blocks(handle, inode,
- le32_to_cpu(tmp_idata[i]), 1);
+ le32_to_cpu(tmp_idata[i]), 1, 1);
}
}
brelse(bh);
- ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1);
+ ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1, 1);

return 0;

@@ -289,7 +289,7 @@ static int free_tind_blocks(handle_t *handle,
}
}
brelse(bh);
- ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1);
+ ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1, 1);

return 0;

@@ -304,7 +304,7 @@ static int free_ind_block(handle_t *handle, struct inode *inode)
if (ei->i_data[EXT4_IND_BLOCK]) {

ext4_free_blocks(handle, inode,
- le32_to_cpu(ei->i_data[EXT4_IND_BLOCK]), 1);
+ le32_to_cpu(ei->i_data[EXT4_IND_BLOCK]), 1, 1);

}

@@ -402,7 +402,7 @@ static int free_ext_idx(handle_t *handle, struct inode *inode,
if (eh->eh_depth == 0) {

brelse(bh);
- ext4_free_blocks(handle, inode, block, 1);
+ ext4_free_blocks(handle, inode, block, 1, 1);

} else {

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 64fc7f1..136d095 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -503,6 +503,7 @@ static void ext4_put_super (struct super_block * sb)
struct ext4_super_block *es = sbi->s_es;
int i;

+ ext4_mb_release(sb);
ext4_ext_release(sb);
ext4_xattr_put_super(sb);
jbd2_journal_destroy(sbi->s_journal);
@@ -569,6 +570,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
ei->i_block_alloc_info = NULL;
ei->vfs_inode.i_version = 1;
memset(&ei->i_cached_extent, 0, sizeof(struct ext4_ext_cache));
+ INIT_LIST_HEAD(&ei->i_prealloc_list);
+ spin_lock_init(&ei->i_prealloc_lock);
return &ei->vfs_inode;
}

@@ -881,6 +884,7 @@ enum {
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
Opt_grpquota, Opt_extents, Opt_noextents, Opt_i_version,
+ Opt_mballoc, Opt_nomballoc, Opt_stripe,
};

static match_table_t tokens = {
@@ -935,6 +939,9 @@ static match_table_t tokens = {
{Opt_extents, "extents"},
{Opt_noextents, "noextents"},
{Opt_i_version, "i_version"},
+ {Opt_mballoc, "mballoc"},
+ {Opt_nomballoc, "nomballoc"},
+ {Opt_stripe, "stripe=%u"},
{Opt_err, NULL},
{Opt_resize, "resize"},
};
@@ -1284,6 +1291,19 @@ clear_qf_name:
set_opt(sbi->s_mount_opt, I_VERSION);
sb->s_flags |= MS_I_VERSION;
break;
+ case Opt_mballoc:
+ set_opt(sbi->s_mount_opt, MBALLOC);
+ break;
+ case Opt_nomballoc:
+ clear_opt(sbi->s_mount_opt, MBALLOC);
+ break;
+ case Opt_stripe:
+ if (match_int(&args[0], &option))
+ return 0;
+ if (option < 0)
+ return 0;
+ sbi->s_stripe = option;
+ break;
default:
printk (KERN_ERR
"EXT4-fs: Unrecognized mount option \"%s\" "
@@ -1742,6 +1762,33 @@ static ext4_fsblk_t descriptor_loc(struct super_block *sb,
return (has_super + ext4_group_first_block_no(sb, bg));
}

+/**
+ * ext4_get_stripe_size: Get the stripe size.
+ * @sbi: In memory super block info
+ *
+ * If we have specified it via mount option, then
+ * use the mount option value. If the value specified at mount time is
+ * greater than the blocks per group use the super block value.
+ * If the super block value is greater than blocks per group return 0.
+ * The allocator needs it to be less than blocks per group.
+ *
+ */
+static unsigned long ext4_get_stripe_size(struct ext4_sb_info *sbi)
+{
+ unsigned long stride = le16_to_cpu(sbi->s_es->s_raid_stride);
+ unsigned long stripe_width =
+ le32_to_cpu(sbi->s_es->s_raid_stripe_width);
+
+ if (sbi->s_stripe && sbi->s_stripe <= sbi->s_blocks_per_group) {
+ return sbi->s_stripe;
+ } else if (stripe_width <= sbi->s_blocks_per_group) {
+ return stripe_width;
+ } else if (stride <= sbi->s_blocks_per_group) {
+ return stride;
+ }
+
+ return 0;
+}

static int ext4_fill_super (struct super_block *sb, void *data, int silent)
__releases(kernel_sem)
@@ -2091,6 +2138,8 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)
sbi->s_rsv_window_head.rsv_goal_size = 0;
ext4_rsv_window_add(sb, &sbi->s_rsv_window_head);

+ sbi->s_stripe = ext4_get_stripe_size(sbi);
+
/*
* set up enough so that it can read an inode
*/
@@ -2250,6 +2299,7 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)
"writeback");

ext4_ext_init(sb);
+ ext4_mb_init(sb, needs_recovery);

lock_kernel();
return 0;
@@ -3232,9 +3282,15 @@ static struct file_system_type ext4dev_fs_type = {

static int __init init_ext4_fs(void)
{
- int err = init_ext4_xattr();
+ int err;
+
+ err = init_ext4_mballoc();
if (err)
return err;
+
+ err = init_ext4_xattr();
+ if (err)
+ goto out2;
err = init_inodecache();
if (err)
goto out1;
@@ -3246,6 +3302,8 @@ out:
destroy_inodecache();
out1:
exit_ext4_xattr();
+out2:
+ exit_ext4_mballoc();
return err;
}

@@ -3254,6 +3312,7 @@ static void __exit exit_ext4_fs(void)
unregister_filesystem(&ext4dev_fs_type);
destroy_inodecache();
exit_ext4_xattr();
+ exit_ext4_mballoc();
}

MODULE_AUTHOR("Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, Theodore Ts'o and others");
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 8638730..d796213 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -480,7 +480,7 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
ea_bdebug(bh, "refcount now=0; freeing");
if (ce)
mb_cache_entry_free(ce);
- ext4_free_blocks(handle, inode, bh->b_blocknr, 1);
+ ext4_free_blocks(handle, inode, bh->b_blocknr, 1, 1);
get_bh(bh);
ext4_forget(handle, 1, inode, bh, bh->b_blocknr);
} else {
@@ -821,7 +821,7 @@ inserted:
new_bh = sb_getblk(sb, block);
if (!new_bh) {
getblk_failed:
- ext4_free_blocks(handle, inode, block, 1);
+ ext4_free_blocks(handle, inode, block, 1, 1);
error = -EIO;
goto cleanup;
}
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index d0b7ca9..1852313 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -20,6 +20,8 @@
#include <linux/blkdev.h>
#include <linux/magic.h>

+#include <linux/ext4_fs_i.h>
+
/*
* The second extended filesystem constants/structures
*/
@@ -51,6 +53,50 @@
#define ext4_debug(f, a...) do {} while (0)
#endif

+#define EXT4_MULTIBLOCK_ALLOCATOR 1
+
+/* prefer goal again. length */
+#define EXT4_MB_HINT_MERGE 1
+/* blocks already reserved */
+#define EXT4_MB_HINT_RESERVED 2
+/* metadata is being allocated */
+#define EXT4_MB_HINT_METADATA 4
+/* first blocks in the file */
+#define EXT4_MB_HINT_FIRST 8
+/* search for the best chunk */
+#define EXT4_MB_HINT_BEST 16
+/* data is being allocated */
+#define EXT4_MB_HINT_DATA 32
+/* don't preallocate (for tails) */
+#define EXT4_MB_HINT_NOPREALLOC 64
+/* allocate for locality group */
+#define EXT4_MB_HINT_GROUP_ALLOC 128
+/* allocate goal blocks or none */
+#define EXT4_MB_HINT_GOAL_ONLY 256
+/* goal is meaningful */
+#define EXT4_MB_HINT_TRY_GOAL 512
+
+struct ext4_allocation_request {
+ /* target inode for block we're allocating */
+ struct inode *inode;
+ /* logical block in target inode */
+ ext4_lblk_t logical;
+ /* phys. target (a hint) */
+ ext4_fsblk_t goal;
+ /* the closest logical allocated block to the left */
+ ext4_lblk_t lleft;
+ /* phys. block for ^^^ */
+ ext4_fsblk_t pleft;
+ /* the closest logical allocated block to the right */
+ ext4_lblk_t lright;
+ /* phys. block for ^^^ */
+ ext4_fsblk_t pright;
+ /* how many blocks we want to allocate */
+ unsigned long len;
+ /* flags. see above EXT4_MB_HINT_* */
+ unsigned long flags;
+};
+
/*
* Special inodes numbers
*/
@@ -474,6 +520,7 @@ do { \
#define EXT4_MOUNT_JOURNAL_CHECKSUM 0x800000 /* Journal checksums */
#define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x1000000 /* Journal Async Commit */
#define EXT4_MOUNT_I_VERSION 0x2000000 /* i_version support */
+#define EXT4_MOUNT_MBALLOC 0x4000000 /* Buddy allocation support */
/* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
#define clear_opt(o, opt) o &= ~EXT4_MOUNT_##opt
@@ -912,7 +959,7 @@ extern ext4_fsblk_t ext4_new_blocks (handle_t *handle, struct inode *inode,
extern ext4_fsblk_t ext4_new_blocks_old(handle_t *handle, struct inode *inode,
ext4_fsblk_t goal, unsigned long *count, int *errp);
extern void ext4_free_blocks (handle_t *handle, struct inode *inode,
- ext4_fsblk_t block, unsigned long count);
+ ext4_fsblk_t block, unsigned long count, int metadata);
extern void ext4_free_blocks_sb (handle_t *handle, struct super_block *sb,
ext4_fsblk_t block, unsigned long count,
unsigned long *pdquot_freed_blocks);
@@ -950,6 +997,20 @@ extern unsigned long ext4_count_dirs (struct super_block *);
extern void ext4_check_inodes_bitmap (struct super_block *);
extern unsigned long ext4_count_free (struct buffer_head *, unsigned);

+/* mballoc.c */
+extern long ext4_mb_stats;
+extern long ext4_mb_max_to_scan;
+extern int ext4_mb_init(struct super_block *, int);
+extern int ext4_mb_release(struct super_block *);
+extern ext4_fsblk_t ext4_mb_new_blocks(handle_t *,
+ struct ext4_allocation_request *, int *);
+extern int ext4_mb_reserve_blocks(struct super_block *, int);
+extern void ext4_mb_discard_inode_preallocations(struct inode *);
+extern int __init init_ext4_mballoc(void);
+extern void exit_ext4_mballoc(void);
+extern void ext4_mb_free_blocks(handle_t *, struct inode *,
+ unsigned long, unsigned long, int, unsigned long *);
+

/* inode.c */
int ext4_forget(handle_t *handle, int is_metadata, struct inode *inode,
@@ -1080,6 +1141,19 @@ static inline void ext4_isize_set(struct ext4_inode *raw_inode, loff_t i_size)
raw_inode->i_size_high = cpu_to_le32(i_size >> 32);
}

+static inline
+struct ext4_group_info *ext4_get_group_info(struct super_block *sb,
+ ext4_group_t group)
+{
+ struct ext4_group_info ***grp_info;
+ long indexv, indexh;
+ grp_info = EXT4_SB(sb)->s_group_info;
+ indexv = group >> (EXT4_DESC_PER_BLOCK_BITS(sb));
+ indexh = group & ((EXT4_DESC_PER_BLOCK(sb)) - 1);
+ return grp_info[indexv][indexh];
+}
+
+
#define ext4_std_error(sb, errno) \
do { \
if ((errno)) \
diff --git a/include/linux/ext4_fs_i.h b/include/linux/ext4_fs_i.h
index 4377d24..d5508d3 100644
--- a/include/linux/ext4_fs_i.h
+++ b/include/linux/ext4_fs_i.h
@@ -158,6 +158,10 @@ struct ext4_inode_info {
* struct timespec i_{a,c,m}time in the generic inode.
*/
struct timespec i_crtime;
+
+ /* mballoc */
+ struct list_head i_prealloc_list;
+ spinlock_t i_prealloc_lock;
};

#endif /* _LINUX_EXT4_FS_I */
diff --git a/include/linux/ext4_fs_sb.h b/include/linux/ext4_fs_sb.h
index 38a47ec..abaae2c 100644
--- a/include/linux/ext4_fs_sb.h
+++ b/include/linux/ext4_fs_sb.h
@@ -91,6 +91,58 @@ struct ext4_sb_info {
unsigned long s_ext_blocks;
unsigned long s_ext_extents;
#endif
+
+ /* for buddy allocator */
+ struct ext4_group_info ***s_group_info;
+ struct inode *s_buddy_cache;
+ long s_blocks_reserved;
+ spinlock_t s_reserve_lock;
+ struct list_head s_active_transaction;
+ struct list_head s_closed_transaction;
+ struct list_head s_committed_transaction;
+ spinlock_t s_md_lock;
+ tid_t s_last_transaction;
+ unsigned short *s_mb_offsets, *s_mb_maxs;
+
+ /* tunables */
+ unsigned long s_stripe;
+ unsigned long s_mb_stream_request;
+ unsigned long s_mb_max_to_scan;
+ unsigned long s_mb_min_to_scan;
+ unsigned long s_mb_stats;
+ unsigned long s_mb_order2_reqs;
+ unsigned long s_mb_group_prealloc;
+ /* where last allocation was done - for stream allocation */
+ unsigned long s_mb_last_group;
+ unsigned long s_mb_last_start;
+
+ /* history to debug policy */
+ struct ext4_mb_history *s_mb_history;
+ int s_mb_history_cur;
+ int s_mb_history_max;
+ int s_mb_history_num;
+ struct proc_dir_entry *s_mb_proc;
+ spinlock_t s_mb_history_lock;
+ int s_mb_history_filter;
+
+ /* stats for buddy allocator */
+ spinlock_t s_mb_pa_lock;
+ atomic_t s_bal_reqs; /* number of reqs with len > 1 */
+ atomic_t s_bal_success; /* we found long enough chunks */
+ atomic_t s_bal_allocated; /* in blocks */
+ atomic_t s_bal_ex_scanned; /* total extents scanned */
+ atomic_t s_bal_goals; /* goal hits */
+ atomic_t s_bal_breaks; /* too long searches */
+ atomic_t s_bal_2orders; /* 2^order hits */
+ spinlock_t s_bal_lock;
+ unsigned long s_mb_buddies_generated;
+ unsigned long long s_mb_generation_time;
+ atomic_t s_mb_lost_chunks;
+ atomic_t s_mb_preallocated;
+ atomic_t s_mb_discarded;
+
+ /* locality groups */
+ struct ext4_locality_group *s_locality_groups;
};

#endif /* _LINUX_EXT4_FS_SB */
--
1.5.4.rc3.31.g1271-dirty
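
The do_more loop in ext4_mb_free_blocks() above trims a free request at the
block group boundary and retries with the remainder. The arithmetic can be
sketched in user space as below. This is only an illustration: the
sketch_* names are hypothetical, and computing the group and offset by
dividing (block - first_data_block) by blocks-per-group is an assumption
about what ext4_get_group_no_and_offset() does.

```c
/* One per-group piece of a free request (hypothetical type). */
struct sketch_chunk {
	unsigned long group;	/* block group number */
	unsigned long bit;	/* offset of first block within the group */
	unsigned long count;	/* number of blocks freed in this group */
};

/*
 * Split [block, block + count) into per-group chunks, the way the
 * do_more loop trims "count" by "overflow" and then loops again.
 * Returns the number of chunks written to out[].
 */
static int sketch_split_free(unsigned long block, unsigned long count,
			     unsigned long first_data_block,
			     unsigned long blocks_per_group,
			     struct sketch_chunk *out, int max_out)
{
	int n = 0;

	while (count > 0 && n < max_out) {
		unsigned long rel = block - first_data_block;
		unsigned long bit = rel % blocks_per_group;
		unsigned long chunk = count;

		/* Request crosses the group boundary: keep only this group's part. */
		if (bit + chunk > blocks_per_group)
			chunk = blocks_per_group - bit;

		out[n].group = rel / blocks_per_group;
		out[n].bit = bit;
		out[n].count = chunk;
		n++;

		/* Continue with the overflow in the next group. */
		block += chunk;
		count -= chunk;
	}
	return n;
}
```

For example, with 32768 blocks per group, freeing 20 blocks starting at
block 32760 yields one chunk of 8 blocks in group 0 and one of 12 blocks in
group 1.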

2008-01-22 03:05:52

by Theodore Ts'o

Subject: [PATCH 47/49] jbd2: Mark jbd2 slabs as SLAB_TEMPORARY

From: Mingming Cao <[email protected]>

This patch marks slab allocations by jbd2 as short-lived in support of
Mel Gorman's "Group short-lived and reclaimable kernel allocations"
patch. (Ported from similar changes made to fs/jbd/journal.c and
fs/jbd/revoke.c in Mel's patch.)

Cc: Mel Gorman <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/jbd2/journal.c | 4 ++--
fs/jbd2/revoke.c | 6 ++++--
2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index f8b0f8c..8301e8d 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1975,7 +1975,7 @@ static int journal_init_jbd2_journal_head_cache(void)
jbd2_journal_head_cache = kmem_cache_create("jbd2_journal_head",
sizeof(struct journal_head),
0, /* offset */
- 0, /* flags */
+ SLAB_TEMPORARY, /* flags */
NULL); /* ctor */
retval = 0;
if (jbd2_journal_head_cache == 0) {
@@ -2271,7 +2271,7 @@ static int __init journal_init_handle_cache(void)
jbd2_handle_cache = kmem_cache_create("jbd2_journal_handle",
sizeof(handle_t),
0, /* offset */
- 0, /* flags */
+ SLAB_TEMPORARY, /* flags */
NULL); /* ctor */
if (jbd2_handle_cache == NULL) {
printk(KERN_EMERG "JBD: failed to create handle cache\n");
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index 3595fd4..df36f42 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -171,13 +171,15 @@ int __init jbd2_journal_init_revoke_caches(void)
{
jbd2_revoke_record_cache = kmem_cache_create("jbd2_revoke_record",
sizeof(struct jbd2_revoke_record_s),
- 0, SLAB_HWCACHE_ALIGN, NULL);
+ 0,
+ SLAB_HWCACHE_ALIGN|SLAB_TEMPORARY,
+ NULL);
if (jbd2_revoke_record_cache == 0)
return -ENOMEM;

jbd2_revoke_table_cache = kmem_cache_create("jbd2_revoke_table",
sizeof(struct jbd2_revoke_table_s),
- 0, 0, NULL);
+ 0, SLAB_TEMPORARY, NULL);
if (jbd2_revoke_table_cache == 0) {
kmem_cache_destroy(jbd2_revoke_record_cache);
jbd2_revoke_record_cache = NULL;
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:06:16

by Theodore Ts'o

Subject: [PATCH 25/49] jbd2: Remove printk from J_ASSERT to preserve registers during BUG

From: Chris Snook <[email protected]>

Signed-off-by: Chris Snook <[email protected]>
Cc: "Stephen C. Tweedie" <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
include/linux/jbd2.h | 16 +---------------
1 files changed, 1 insertions(+), 15 deletions(-)

diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 06ef114..d5f7cff 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -256,17 +256,7 @@ typedef struct journal_superblock_s
#include <linux/fs.h>
#include <linux/sched.h>

-#define JBD2_ASSERTIONS
-#ifdef JBD2_ASSERTIONS
-#define J_ASSERT(assert) \
-do { \
- if (!(assert)) { \
- printk (KERN_EMERG \
- "Assertion failure in %s() at %s:%d: \"%s\"\n", \
- __FUNCTION__, __FILE__, __LINE__, # assert); \
- BUG(); \
- } \
-} while (0)
+#define J_ASSERT(assert) BUG_ON(!(assert))

#if defined(CONFIG_BUFFER_DEBUG)
void buffer_assertion_failure(struct buffer_head *bh);
@@ -282,10 +272,6 @@ void buffer_assertion_failure(struct buffer_head *bh);
#define J_ASSERT_JH(jh, expr) J_ASSERT(expr)
#endif

-#else
-#define J_ASSERT(assert) do { } while (0)
-#endif /* JBD2_ASSERTIONS */
-
#if defined(JBD2_PARANOID_IOFAIL)
#define J_EXPECT(expr, why...) J_ASSERT(expr)
#define J_EXPECT_BH(bh, expr, why...) J_ASSERT_BH(bh, expr)
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:06:39

by Theodore Ts'o

Subject: [PATCH 04/49] ext4 extents: remove unneeded casts

From: Eric Sandeen <[email protected]>

There are many casts in extents.c which are not needed,
as the variables are already the type of the cast, or
are being promoted for no particular reason in printk's.
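
As a small user-space illustration of the point (not kernel code): when the
argument already has the type the conversion specifier expects, the cast and
the wider specifier can both go. The typedef below assumes a logical block
number is a plain unsigned int, and sketch_format_block() is a hypothetical
helper.

```c
#include <stdio.h>

/* Assumption: logical block numbers are 32-bit unsigned, like ext4_lblk_t. */
typedef unsigned int sketch_lblk_t;

/*
 * Format the debug message without a cast: "%u" already matches an
 * unsigned int argument, so "(unsigned long)block" with "%lu" adds
 * nothing. Returns what snprintf() returns.
 */
static int sketch_format_block(char *buf, size_t len, sketch_lblk_t block)
{
	return snprintf(buf, len, "binsearch for %u: ", block);
}
```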

Signed-off-by: Eric Sandeen <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---
fs/ext4/extents.c | 49 ++++++++++++++++++++++---------------------------
1 files changed, 22 insertions(+), 27 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 19d8059..6853722 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -374,7 +374,7 @@ ext4_ext_binsearch_idx(struct inode *inode,
struct ext4_extent_idx *r, *l, *m;


- ext_debug("binsearch for %lu(idx): ", (unsigned long)block);
+ ext_debug("binsearch for %u(idx): ", block);

l = EXT_FIRST_INDEX(eh) + 1;
r = EXT_LAST_INDEX(eh);
@@ -440,7 +440,7 @@ ext4_ext_binsearch(struct inode *inode,
return;
}

- ext_debug("binsearch for %lu: ", (unsigned long)block);
+ ext_debug("binsearch for %u: ", block);

l = EXT_FIRST_EXTENT(eh) + 1;
r = EXT_LAST_EXTENT(eh);
@@ -766,7 +766,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
while (k--) {
oldblock = newblock;
newblock = ablocks[--a];
- bh = sb_getblk(inode->i_sb, (ext4_fsblk_t)newblock);
+ bh = sb_getblk(inode->i_sb, newblock);
if (!bh) {
err = -EIO;
goto cleanup;
@@ -786,9 +786,8 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
fidx->ei_block = border;
ext4_idx_store_pblock(fidx, oldblock);

- ext_debug("int.index at %d (block %llu): %lu -> %llu\n", i,
- newblock, (unsigned long) le32_to_cpu(border),
- oldblock);
+ ext_debug("int.index at %d (block %llu): %u -> %llu\n",
+ i, newblock, le32_to_cpu(border), oldblock);
/* copy indexes */
m = 0;
path[i].p_idx++;
@@ -1476,10 +1475,10 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
} else if (block < le32_to_cpu(ex->ee_block)) {
lblock = block;
len = le32_to_cpu(ex->ee_block) - block;
- ext_debug("cache gap(before): %lu [%lu:%lu]",
- (unsigned long) block,
- (unsigned long) le32_to_cpu(ex->ee_block),
- (unsigned long) ext4_ext_get_actual_len(ex));
+ ext_debug("cache gap(before): %u [%u:%u]",
+ block,
+ le32_to_cpu(ex->ee_block),
+ ext4_ext_get_actual_len(ex));
} else if (block >= le32_to_cpu(ex->ee_block)
+ ext4_ext_get_actual_len(ex)) {
ext4_lblk_t next;
@@ -1487,10 +1486,10 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
+ ext4_ext_get_actual_len(ex);

next = ext4_ext_next_allocated_block(path);
- ext_debug("cache gap(after): [%lu:%lu] %lu",
- (unsigned long) le32_to_cpu(ex->ee_block),
- (unsigned long) ext4_ext_get_actual_len(ex),
- (unsigned long) block);
+ ext_debug("cache gap(after): [%u:%u] %u",
+ le32_to_cpu(ex->ee_block),
+ ext4_ext_get_actual_len(ex),
+ block);
BUG_ON(next == lblock);
len = next - lblock;
} else {
@@ -1498,7 +1497,7 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
BUG();
}

- ext_debug(" -> %lu:%lu\n", (unsigned long) lblock, len);
+ ext_debug(" -> %u:%lu\n", lblock, len);
ext4_ext_put_in_cache(inode, lblock, len, 0, EXT4_EXT_CACHE_GAP);
}

@@ -1520,11 +1519,9 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block,
ex->ee_block = cpu_to_le32(cex->ec_block);
ext4_ext_store_pblock(ex, cex->ec_start);
ex->ee_len = cpu_to_le16(cex->ec_len);
- ext_debug("%lu cached by %lu:%lu:%llu\n",
- (unsigned long) block,
- (unsigned long) cex->ec_block,
- (unsigned long) cex->ec_len,
- cex->ec_start);
+ ext_debug("%u cached by %u:%u:%llu\n",
+ block,
+ cex->ec_block, cex->ec_len, cex->ec_start);
return cex->ec_type;
}

@@ -2145,9 +2142,8 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
unsigned long allocated = 0;

__clear_bit(BH_New, &bh_result->b_state);
- ext_debug("blocks %lu/%lu requested for inode %u\n",
- (unsigned long) iblock, max_blocks,
- (unsigned) inode->i_ino);
+ ext_debug("blocks %u/%lu requested for inode %u\n",
+ iblock, max_blocks, inode->i_ino);
mutex_lock(&EXT4_I(inode)->truncate_mutex);

/* check in cache */
@@ -2210,7 +2206,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
newblock = iblock - ee_block + ee_start;
/* number of remaining blocks in the extent */
allocated = ee_len - (iblock - ee_block);
- ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
+ ext_debug("%u fit into %lu:%d -> %llu\n", iblock,
ee_block, ee_len, newblock);

/* Do not put uninitialized extent in the cache */
@@ -2470,9 +2466,8 @@ retry:
if (!ret) {
ext4_error(inode->i_sb, "ext4_fallocate",
"ext4_ext_get_blocks returned 0! inode#%lu"
- ", block=%lu, max_blocks=%lu",
- inode->i_ino, (unsigned long)block,
- (unsigned long)max_blocks);
+ ", block=%u, max_blocks=%lu",
+ inode->i_ino, block, max_blocks);
ret = -EIO;
ext4_mark_inode_dirty(handle, inode);
ret2 = ext4_journal_stop(handle);
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:07:30

by Theodore Ts'o

Subject: [PATCH 09/49] ext4: Rename i_file_acl to i_file_acl_lo

From: Aneesh Kumar K.V <[email protected]>

Rename i_file_acl to i_file_acl_lo. This helps
in finding bugs where we use i_file_acl instead
of the combined i_file_acl_lo and i_file_acl_high.
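
A rough sketch of the combination this refers to, in user-space C. The
struct and function names are hypothetical, the 16-bit width of the high
part is taken from the cpu_to_le16() call in the diff below, and
endianness conversion is left out for brevity.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two on-disk fields (already CPU-endian). */
struct sketch_raw_inode {
	uint32_t i_file_acl_lo;		/* low 32 bits of the ACL block number */
	uint16_t i_file_acl_high;	/* high 16 bits of the ACL block number */
};

/* Build the full block number from both halves. */
static uint64_t sketch_file_acl(const struct sketch_raw_inode *raw)
{
	return (uint64_t)raw->i_file_acl_lo |
	       ((uint64_t)raw->i_file_acl_high << 32);
}

/* Split a block number back into the two on-disk halves. */
static void sketch_set_file_acl(struct sketch_raw_inode *raw, uint64_t blk)
{
	raw->i_file_acl_lo = (uint32_t)blk;
	raw->i_file_acl_high = (uint16_t)(blk >> 32);
}
```

Code that reads only the low field silently drops the upper bits, which is
the class of bug the rename is meant to make visible at compile time.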

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/inode.c | 4 ++--
include/linux/ext4_fs.h | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 76ceba2..7bcec18 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2718,7 +2718,7 @@ void ext4_read_inode(struct inode * inode)
}
inode->i_blocks = le32_to_cpu(raw_inode->i_blocks);
ei->i_flags = le32_to_cpu(raw_inode->i_flags);
- ei->i_file_acl = le32_to_cpu(raw_inode->i_file_acl);
+ ei->i_file_acl = le32_to_cpu(raw_inode->i_file_acl_lo);
if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
cpu_to_le32(EXT4_OS_HURD))
ei->i_file_acl |=
@@ -2866,7 +2866,7 @@ static int ext4_do_update_inode(handle_t *handle,
cpu_to_le32(EXT4_OS_HURD))
raw_inode->i_file_acl_high =
cpu_to_le16(ei->i_file_acl >> 32);
- raw_inode->i_file_acl = cpu_to_le32(ei->i_file_acl);
+ raw_inode->i_file_acl_lo = cpu_to_le32(ei->i_file_acl);
if (!S_ISREG(inode->i_mode)) {
raw_inode->i_dir_acl = cpu_to_le32(ei->i_dir_acl);
} else {
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index 1a27433..6894f36 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -297,7 +297,7 @@ struct ext4_inode {
} osd1; /* OS dependent 1 */
__le32 i_block[EXT4_N_BLOCKS];/* Pointers to blocks */
__le32 i_generation; /* File version (for NFS) */
- __le32 i_file_acl; /* File ACL */
+ __le32 i_file_acl_lo; /* File ACL */
__le32 i_dir_acl; /* Directory ACL */
__le32 i_obso_faddr; /* Obsoleted fragment address */
union {
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:07:56

by Theodore Ts'o

Subject: [PATCH 28/49] ext4: remove unused code from ext4_find_entry()

From: Mariusz Kozlowski <[email protected]>

The unused code found in ext3_find_entry() is also present (and still
unused) in the ext4_find_entry() code. This patch removes it.

Signed-off-by: Mariusz Kozlowski <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/namei.c | 4 ----
1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index fb673b1..67b6d8a 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -861,14 +861,10 @@ static struct buffer_head * ext4_find_entry (struct dentry *dentry,
int i, err;
struct inode *dir = dentry->d_parent->d_inode;
int namelen;
- const u8 *name;
- unsigned blocksize;

*res_dir = NULL;
sb = dir->i_sb;
- blocksize = sb->s_blocksize;
namelen = dentry->d_name.len;
- name = dentry->d_name.name;
if (namelen > EXT4_NAME_LEN)
return NULL;
if (is_dx(dir)) {
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:08:23

by Theodore Ts'o

Subject: [PATCH 45/49] ext4: Use the ext4_ext_actual_len() helper function

From: Aneesh Kumar K.V <[email protected]>

ext4 uses the high bit of the extent length to encode whether the extent
is initialized or not. The helper function ext4_ext_get_actual_len should
be used to get the actual length of the extent.
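
To illustrate why the raw field must not be used directly, here is a
simplified user-space sketch. It treats bit 15 of the 16-bit length as a
pure flag; the sketch_* names are hypothetical and the real helper's exact
boundary handling is not shown in this mail, so treat the masking as an
assumption.

```c
#include <stdint.h>

/* Assumption: bit 15 of the stored length is the "uninitialized" flag. */
#define SKETCH_UNINIT_FLAG	0x8000U

/* Length in blocks with the flag bit stripped (CPU-endian input). */
static unsigned int sketch_actual_len(uint16_t ee_len)
{
	return ee_len & ~SKETCH_UNINIT_FLAG & 0xFFFFU;
}

/* Nonzero if the stored length marks the extent as uninitialized. */
static int sketch_is_uninitialized(uint16_t ee_len)
{
	return (ee_len & SKETCH_UNINIT_FLAG) != 0;
}

/* Last logical block covered by an extent starting at ee_block. */
static uint32_t sketch_last_block(uint32_t ee_block, uint16_t ee_len)
{
	return ee_block + sketch_actual_len(ee_len) - 1;
}
```

With an uninitialized 8-block extent, using the raw field in place of
sketch_actual_len() would place the end of the extent 32768 blocks too far,
which is the kind of miscalculation that can trip a range BUG_ON().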

This addresses the kernel bug documented here:
http://bugzilla.kernel.org/show_bug.cgi?id=9732

kernel BUG at fs/ext4/extents.c:1056!
....
Call Trace:
[<ffffffff88366073>] :ext4dev:ext4_ext_get_blocks+0x5ba/0x8c1
[<ffffffff81053c91>] lock_release_holdtime+0x27/0x49
[<ffffffff812748f6>] _spin_unlock+0x17/0x20
[<ffffffff883400a6>] :jbd2:start_this_handle+0x4e0/0x4fe
[<ffffffff88366564>] :ext4dev:ext4_fallocate+0x175/0x39a
[<ffffffff81053c91>] lock_release_holdtime+0x27/0x49
[<ffffffff81056480>] __lock_acquire+0x4e7/0xc4d
[<ffffffff81053c91>] lock_release_holdtime+0x27/0x49
[<ffffffff810a8de7>] sys_fallocate+0xe4/0x10d
[<ffffffff8100c043>] tracesys+0xd5/0xda

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/extents.c | 24 +++++++++++++-----------
1 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 13e3e8c..b6b9ec7 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1029,7 +1029,7 @@ ext4_ext_search_left(struct inode *inode, struct ext4_ext_path *path,
{
struct ext4_extent_idx *ix;
struct ext4_extent *ex;
- int depth;
+ int depth, ee_len;

BUG_ON(path == NULL);
depth = path->p_depth;
@@ -1043,6 +1043,7 @@ ext4_ext_search_left(struct inode *inode, struct ext4_ext_path *path,
* first one in the file */

ex = path[depth].p_ext;
+ ee_len = ext4_ext_get_actual_len(ex);
if (*logical < le32_to_cpu(ex->ee_block)) {
BUG_ON(EXT_FIRST_EXTENT(path[depth].p_hdr) != ex);
while (--depth >= 0) {
@@ -1052,10 +1053,10 @@ ext4_ext_search_left(struct inode *inode, struct ext4_ext_path *path,
return 0;
}

- BUG_ON(*logical < le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len));
+ BUG_ON(*logical < (le32_to_cpu(ex->ee_block) + ee_len));

- *logical = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1;
- *phys = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - 1;
+ *logical = le32_to_cpu(ex->ee_block) + ee_len - 1;
+ *phys = ext_pblock(ex) + ee_len - 1;
return 0;
}

@@ -1075,7 +1076,7 @@ ext4_ext_search_right(struct inode *inode, struct ext4_ext_path *path,
struct ext4_extent_idx *ix;
struct ext4_extent *ex;
ext4_fsblk_t block;
- int depth;
+ int depth, ee_len;

BUG_ON(path == NULL);
depth = path->p_depth;
@@ -1089,6 +1090,7 @@ ext4_ext_search_right(struct inode *inode, struct ext4_ext_path *path,
* first one in the file */

ex = path[depth].p_ext;
+ ee_len = ext4_ext_get_actual_len(ex);
if (*logical < le32_to_cpu(ex->ee_block)) {
BUG_ON(EXT_FIRST_EXTENT(path[depth].p_hdr) != ex);
while (--depth >= 0) {
@@ -1100,7 +1102,7 @@ ext4_ext_search_right(struct inode *inode, struct ext4_ext_path *path,
return 0;
}

- BUG_ON(*logical < le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len));
+ BUG_ON(*logical < (le32_to_cpu(ex->ee_block) + ee_len));

if (ex != EXT_LAST_EXTENT(path[depth].p_hdr)) {
/* next allocated block in this leaf */
@@ -1316,7 +1318,7 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
if (ext1_ee_len + ext2_ee_len > max_len)
return 0;
#ifdef AGGRESSIVE_TEST
- if (le16_to_cpu(ex1->ee_len) >= 4)
+ if (ext1_ee_len >= 4)
return 0;
#endif

@@ -2313,7 +2315,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
- le32_to_cpu(newex.ee_block)
+ ext_pblock(&newex);
/* number of remaining blocks in the extent */
- allocated = le16_to_cpu(newex.ee_len) -
+ allocated = ext4_ext_get_actual_len(&newex) -
(iblock - le32_to_cpu(newex.ee_block));
goto out;
} else {
@@ -2429,7 +2431,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
newex.ee_len = cpu_to_le16(max_blocks);
err = ext4_ext_check_overlap(inode, &newex, path);
if (err)
- allocated = le16_to_cpu(newex.ee_len);
+ allocated = ext4_ext_get_actual_len(&newex);
else
allocated = max_blocks;

@@ -2461,7 +2463,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
* but otherwise we'd need to call it every free() */
ext4_mb_discard_inode_preallocations(inode);
ext4_free_blocks(handle, inode, ext_pblock(&newex),
- le16_to_cpu(newex.ee_len), 0);
+ ext4_ext_get_actual_len(&newex), 0);
goto out2;
}

@@ -2470,7 +2472,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,

/* previous routine could use block we allocated */
newblock = ext_pblock(&newex);
- allocated = le16_to_cpu(newex.ee_len);
+ allocated = ext4_ext_get_actual_len(&newex);
outnew:
__set_bit(BH_New, &bh_result->b_state);

--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:08:41

by Theodore Ts'o

Subject: [PATCH 42/49] ext4: Enable the multiblock allocator by default

From: Aneesh Kumar K.V <[email protected]>

Enable the multiblock allocator by default.

Fix ext4_show_options() so that, if it is not enabled, the nomballoc
option is included in /proc/mounts.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Eric Sandeen <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/super.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 136d095..91a11ec 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -736,6 +736,8 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
seq_puts(seq, ",nobh");
if (!test_opt(sb, EXTENTS))
seq_puts(seq, ",noextents");
+ if (!test_opt(sb, MBALLOC))
+ seq_puts(seq, ",nomballoc");
if (test_opt(sb, I_VERSION))
seq_puts(seq, ",i_version");

@@ -1902,6 +1904,11 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)
* User -o noextents to turn it off
*/
set_opt(sbi->s_mount_opt, EXTENTS);
+ /*
+ * turn on mballoc feature by default in ext4 filesystem
+ * User -o nomballoc to turn it off
+ */
+ set_opt(sbi->s_mount_opt, MBALLOC);

if (!parse_options ((char *) data, sb, &journal_inum, &journal_devnum,
NULL, 0))
--
1.5.4.rc3.31.g1271-dirty
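
To illustrate the behaviour the patch above aims for: mballoc starts out
enabled, -o nomballoc clears it, and only the non-default state is
reported. This is a small userspace sketch, not kernel code; the OPT_*
bit values, the function names and the strstr()-based option matching are
assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical bit values; the kernel uses its own EXT4_MOUNT_* constants. */
#define OPT_EXTENTS 0x1u
#define OPT_MBALLOC 0x2u

/* Start from the defaults, then let "no..." options turn features off. */
unsigned int parse_options_model(const char *options)
{
	unsigned int opts = OPT_EXTENTS | OPT_MBALLOC;

	if (strstr(options, "noextents"))
		opts &= ~OPT_EXTENTS;
	if (strstr(options, "nomballoc"))
		opts &= ~OPT_MBALLOC;
	return opts;
}

/* Report only the options that differ from the enabled-by-default state. */
void show_options_model(unsigned int opts, char *out, size_t len)
{
	snprintf(out, len, "%s%s",
		 (opts & OPT_EXTENTS) ? "" : ",noextents",
		 (opts & OPT_MBALLOC) ? "" : ",nomballoc");
}
```

With no options given, nothing is printed for either feature; passing
"nomballoc" makes ",nomballoc" show up, which mirrors what the patch does
for /proc/mounts.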

2008-01-22 03:08:58

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 27/49] ext4: Check for the correct error return from

From: Aneesh Kumar K.V <[email protected]>

ext4_ext_get_blocks returns negative values on error, so testing only
for zero misses real failures. We should check for <= 0.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/extents.c | 10 +++++-----
1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 754c0d3..8593e59 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2462,12 +2462,12 @@ retry:
ret = ext4_ext_get_blocks(handle, inode, block,
max_blocks, &map_bh,
EXT4_CREATE_UNINITIALIZED_EXT, 0);
- WARN_ON(!ret);
- if (!ret) {
+ WARN_ON(ret <= 0);
+ if (ret <= 0) {
ext4_error(inode->i_sb, "ext4_fallocate",
- "ext4_ext_get_blocks returned 0! inode#%lu"
- ", block=%u, max_blocks=%lu",
- inode->i_ino, block, max_blocks);
+ "ext4_ext_get_blocks returned error: "
+ "inode#%lu, block=%u, max_blocks=%lu",
+ inode->i_ino, block, max_blocks);
ret = -EIO;
ext4_mark_inode_dirty(handle, inode);
ret2 = ext4_journal_stop(handle);
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:09:32

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 19/49] ext4: Return after ext4_error in case of failures

From: Aneesh Kumar K.V <[email protected]>

This fixes some instances where we were continuing after calling
ext4_error. ext4_error calls panic only if the errors=panic mount option
is set, so we need to make sure we return correctly after the ext4_error
call.

Reported by: Adrian Bunk <[email protected]>

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/balloc.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 9568a57..ff3428e 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -587,11 +587,13 @@ do_more:
in_range(ext4_inode_bitmap(sb, desc), block, count) ||
in_range(block, ext4_inode_table(sb, desc), sbi->s_itb_per_group) ||
in_range(block + count - 1, ext4_inode_table(sb, desc),
- sbi->s_itb_per_group))
+ sbi->s_itb_per_group)) {
ext4_error (sb, "ext4_free_blocks",
"Freeing blocks in system zones - "
"Block = %llu, count = %lu",
block, count);
+ goto error_return;
+ }

/*
* We are about to start releasing blocks in the bitmap,
@@ -1690,11 +1692,13 @@ allocated:
in_range(ret_block, ext4_inode_table(sb, gdp),
EXT4_SB(sb)->s_itb_per_group) ||
in_range(ret_block + num - 1, ext4_inode_table(sb, gdp),
- EXT4_SB(sb)->s_itb_per_group))
+ EXT4_SB(sb)->s_itb_per_group)) {
ext4_error(sb, "ext4_new_block",
"Allocating block in system zone - "
"blocks from %llu, length %lu",
ret_block, num);
+ goto out;
+ }

performed_allocation = 1;

--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:09:47

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 16/49] ext2: Fix the max file size for ext2 file system.

From: Aneesh Kumar K.V <[email protected]>

The max file size for the ext2 file system is currently calculated
with a hardcoded 4K block size. This patch fixes it to be
calculated from the actual block size.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext2/super.c | 32 ++++++++++++++++++++++++++++----
1 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 154e25f..6abaf75 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -680,11 +680,31 @@ static int ext2_check_descriptors (struct super_block * sb)
static loff_t ext2_max_size(int bits)
{
loff_t res = EXT2_NDIR_BLOCKS;
- /* This constant is calculated to be the largest file size for a
- * dense, 4k-blocksize file such that the total number of
+ int meta_blocks;
+ loff_t upper_limit;
+
+ /* This is calculated to be the largest file size for a
+ * dense, file such that the total number of
* sectors in the file, including data and all indirect blocks,
- * does not exceed 2^32. */
- const loff_t upper_limit = 0x1ff7fffd000LL;
+ * does not exceed 2^32 -1
+ * __u32 i_blocks representing the total number of
+ * 512 bytes blocks of the file
+ */
+ upper_limit = (1LL << 32) - 1;
+
+ /* total blocks in file system block size */
+ upper_limit >>= (bits - 9);
+
+
+ /* indirect blocks */
+ meta_blocks = 1;
+ /* double indirect blocks */
+ meta_blocks += 1 + (1LL << (bits-2));
+ /* triple indirect blocks */
+ meta_blocks += 1 + (1LL << (bits-2)) + (1LL << (2*(bits-2)));
+
+ upper_limit -= meta_blocks;
+ upper_limit <<= bits;

res += 1LL << (bits-2);
res += 1LL << (2*(bits-2));
@@ -692,6 +712,10 @@ static loff_t ext2_max_size(int bits)
res <<= bits;
if (res > upper_limit)
res = upper_limit;
+
+ if (res > MAX_LFS_FILESIZE)
+ res = MAX_LFS_FILESIZE;
+
return res;
}

--
1.5.4.rc3.31.g1271-dirty
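
For readers who want to check the numbers, the limit computed by the
patched ext2_max_size() can be reproduced in userspace. This is only a
sketch under stated assumptions: the function name is made up,
NDIR_BLOCKS_MODEL stands in for EXT2_NDIR_BLOCKS (12), INT64_MAX stands in
for MAX_LFS_FILESIZE on a 64-bit kernel, and the third "res +=" line is
taken from the unchanged context that the hunk does not show.

```c
#include <stdint.h>

#define NDIR_BLOCKS_MODEL 12          /* direct block pointers per inode */
#define MAX_FILESIZE_MODEL INT64_MAX  /* assumption: 64-bit MAX_LFS_FILESIZE */

/* bits is log2 of the file system block size, e.g. 10 for 1K, 12 for 4K. */
int64_t max_size_model(int bits)
{
	int64_t res = NDIR_BLOCKS_MODEL;
	int64_t meta_blocks;
	int64_t upper_limit;

	/* i_blocks is a __u32 count of 512-byte sectors */
	upper_limit = (1LL << 32) - 1;
	/* convert sectors to file system blocks */
	upper_limit >>= (bits - 9);

	/* indirect, double indirect and triple indirect metadata blocks */
	meta_blocks = 1;
	meta_blocks += 1 + (1LL << (bits - 2));
	meta_blocks += 1 + (1LL << (bits - 2)) + (1LL << (2 * (bits - 2)));

	upper_limit -= meta_blocks;
	upper_limit <<= bits;

	/* blocks addressable through the indirect tree */
	res += 1LL << (bits - 2);
	res += 1LL << (2 * (bits - 2));
	res += 1LL << (3 * (bits - 2));   /* assumed from unshown context */
	res <<= bits;

	if (res > upper_limit)
		res = upper_limit;
	if (res > MAX_FILESIZE_MODEL)
		res = MAX_FILESIZE_MODEL;
	return res;
}
```

With 1K blocks the indirect tree is the limiting factor; with 4K blocks
the 32-bit sector count in i_blocks is what caps the size.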

2008-01-22 03:10:10

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 17/49] ext3: Fix the max file size for ext3 file system.

From: Aneesh Kumar K.V <[email protected]>

The max file size for the ext3 file system is currently calculated
with a hardcoded 4K block size. This patch fixes it to be
calculated from the actual block size.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext3/super.c | 32 ++++++++++++++++++++++++++++----
1 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index cb14de1..f3675cc 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1436,11 +1436,31 @@ static void ext3_orphan_cleanup (struct super_block * sb,
static loff_t ext3_max_size(int bits)
{
loff_t res = EXT3_NDIR_BLOCKS;
- /* This constant is calculated to be the largest file size for a
- * dense, 4k-blocksize file such that the total number of
+ int meta_blocks;
+ loff_t upper_limit;
+
+ /* This is calculated to be the largest file size for a
+ * dense, file such that the total number of
* sectors in the file, including data and all indirect blocks,
- * does not exceed 2^32. */
- const loff_t upper_limit = 0x1ff7fffd000LL;
+ * does not exceed 2^32 -1
+ * __u32 i_blocks representing the total number of
+ * 512 bytes blocks of the file
+ */
+ upper_limit = (1LL << 32) - 1;
+
+ /* total blocks in file system block size */
+ upper_limit >>= (bits - 9);
+
+
+ /* indirect blocks */
+ meta_blocks = 1;
+ /* double indirect blocks */
+ meta_blocks += 1 + (1LL << (bits-2));
+ /* triple indirect blocks */
+ meta_blocks += 1 + (1LL << (bits-2)) + (1LL << (2*(bits-2)));
+
+ upper_limit -= meta_blocks;
+ upper_limit <<= bits;

res += 1LL << (bits-2);
res += 1LL << (2*(bits-2));
@@ -1448,6 +1468,10 @@ static loff_t ext3_max_size(int bits)
res <<= bits;
if (res > upper_limit)
res = upper_limit;
+
+ if (res > MAX_LFS_FILESIZE)
+ res = MAX_LFS_FILESIZE;
+
return res;
}

--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:10:40

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 23/49] Add buffer head related helper functions

From: Aneesh Kumar K.V <[email protected]>

Add the buffer head related helper functions bh_uptodate_or_lock and
bh_submit_read, which can be used by file systems.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/buffer.c | 41 +++++++++++++++++++++++++++++++++++++++++
include/linux/buffer_head.h | 2 ++
2 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..7593ff3 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3213,6 +3213,47 @@ static int buffer_cpu_notify(struct notifier_block *self,
return NOTIFY_OK;
}

+/**
+ * bh_uptodate_or_lock: Test whether the buffer is uptodate
+ * @bh: struct buffer_head
+ *
+ * Return true if the buffer is up-to-date and false,
+ * with the buffer locked, if not.
+ */
+int bh_uptodate_or_lock(struct buffer_head *bh)
+{
+ if (!buffer_uptodate(bh)) {
+ lock_buffer(bh);
+ if (!buffer_uptodate(bh))
+ return 0;
+ unlock_buffer(bh);
+ }
+ return 1;
+}
+EXPORT_SYMBOL(bh_uptodate_or_lock);
+/**
+ * bh_submit_read: Submit a locked buffer for reading
+ * @bh: struct buffer_head
+ *
+ * Returns zero on success and -EIO on error.
+ */
+int bh_submit_read(struct buffer_head *bh)
+{
+ if (!buffer_locked(bh))
+ lock_buffer(bh);
+
+ if (buffer_uptodate(bh))
+ return 0;
+
+ get_bh(bh);
+ bh->b_end_io = end_buffer_read_sync;
+ submit_bh(READ, bh);
+ wait_on_buffer(bh);
+ if (buffer_uptodate(bh))
+ return 0;
+ return -EIO;
+}
+EXPORT_SYMBOL(bh_submit_read);
void __init buffer_init(void)
{
int nrpages;
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index da0d83f..e98801f 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -192,6 +192,8 @@ int sync_dirty_buffer(struct buffer_head *bh);
int submit_bh(int, struct buffer_head *);
void write_boundary_block(struct block_device *bdev,
sector_t bblock, unsigned blocksize);
+int bh_uptodate_or_lock(struct buffer_head *bh);
+int bh_submit_read(struct buffer_head *bh);

extern int buffer_heads_over_limit;

--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:10:55

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 31/49] ext4: Take read lock during overwrite case.

From: Aneesh Kumar K.V <[email protected]>

When we are overwriting a file and not actually allocating new file system
blocks we need to take only the read lock on i_data_sem.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/inode.c | 32 ++++++++++++++++++++++++--------
1 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 596b3ab..ee0bc3a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -901,11 +901,31 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
int create, int extend_disksize)
{
int retval;
- if (create) {
- down_write((&EXT4_I(inode)->i_data_sem));
+ /*
+ * Try to see if we can get the block without requesting
+ * for new file system block.
+ */
+ down_read((&EXT4_I(inode)->i_data_sem));
+ if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) {
+ retval = ext4_ext_get_blocks(handle, inode, block, max_blocks,
+ bh, 0, 0);
} else {
- down_read((&EXT4_I(inode)->i_data_sem));
+ retval = ext4_get_blocks_handle(handle,
+ inode, block, max_blocks, bh, 0, 0);
}
+ up_read((&EXT4_I(inode)->i_data_sem));
+ if (!create || (retval > 0))
+ return retval;
+
+ /*
+ * We need to allocate new blocks which will result
+ * in i_data update
+ */
+ down_write((&EXT4_I(inode)->i_data_sem));
+ /*
+ * We need to check for EXT4 here because migrate
+ * could have changed the inode type in between
+ */
if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) {
retval = ext4_ext_get_blocks(handle, inode, block, max_blocks,
bh, create, extend_disksize);
@@ -913,11 +933,7 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
retval = ext4_get_blocks_handle(handle, inode, block,
max_blocks, bh, create, extend_disksize);
}
- if (create) {
- up_write((&EXT4_I(inode)->i_data_sem));
- } else {
- up_read((&EXT4_I(inode)->i_data_sem));
- }
+ up_write((&EXT4_I(inode)->i_data_sem));
return retval;
}
static int ext4_get_block(struct inode *inode, sector_t iblock,
--
1.5.4.rc3.31.g1271-dirty
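
The locking order introduced by the patch above (look up under the read
lock first, take the write lock only when an allocation is really needed)
can be shown with a single-threaded model. Everything here is
hypothetical: the struct, the function name and the counters exist only
to record which lock mode the real code would take, and no actual
locking is performed.

```c
#include <stddef.h>

/* Toy inode: counts lock acquisitions instead of taking real locks. */
struct model_inode {
	int read_locks;    /* times i_data_sem would be taken for reading */
	int write_locks;   /* times i_data_sem would be taken for writing */
	long mapped[8];    /* mapped[i] != 0: logical block i is allocated */
	long next_free;    /* next physical block number to hand out */
};

/* Returns the physical block for blk, or 0 if unmapped and !create. */
long get_block_model(struct model_inode *in, size_t blk, int create)
{
	long phys;

	/* down_read(): a pure lookup never changes i_data */
	in->read_locks++;
	phys = in->mapped[blk];
	/* up_read() */
	if (!create || phys > 0)
		return phys;

	/* down_write(): allocation updates i_data */
	in->write_locks++;
	/* look again, the mapping may have changed between the two locks */
	phys = in->mapped[blk];
	if (phys == 0) {
		phys = in->next_free++;
		in->mapped[blk] = phys;
	}
	/* up_write() */
	return phys;
}
```

An overwrite of an already mapped block returns after the read-locked
lookup, so concurrent overwrites no longer serialize on the write lock.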

2008-01-22 03:11:24

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 14/49] ext4: export iov_shorten from kernel for ext4's use

From: Eric Sandeen <[email protected]>

Export iov_shorten() from kernel so that ext4 can
truncate too-large writes to bitmapped files.

Signed-off-by: Eric Sandeen <[email protected]>
---
fs/read_write.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index ea1f94c..dfaee3f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -450,6 +450,7 @@ unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to)
}
return seg;
}
+EXPORT_SYMBOL(iov_shorten);

ssize_t do_sync_readv_writev(struct file *filp, const struct iovec *iov,
unsigned long nr_segs, size_t len, loff_t *ppos, iov_fn_t fn)
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:11:40

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 21/49] ext4: fix oops on corrupted ext4 mount

From: Eric Sandeen <[email protected]>

When mounting an ext4 filesystem with corrupted s_first_data_block, things
can go very wrong and oops.

Because blocks_count in ext4_fill_super is a u64, and we must use do_div,
the calculation of db_count is done differently than on ext3. If
first_data_block is corrupted such that it is larger than ext4_blocks_count,
for example, then the intermediate blocks_count value may go negative,
but sign-extend to a very large value:

blocks_count = (ext4_blocks_count(es) -
le32_to_cpu(es->s_first_data_block) +
EXT4_BLOCKS_PER_GROUP(sb) - 1);

This is then assigned to s_groups_count which is an unsigned long:

sbi->s_groups_count = blocks_count;

This may result in a value of 0xFFFFFFFF which is then used to compute
db_count:

db_count = (sbi->s_groups_count + EXT4_DESC_PER_BLOCK(sb) - 1) /
EXT4_DESC_PER_BLOCK(sb);

and in this case db_count will wind up as 0 because the addition overflows
32 bits. This in turn causes the kmalloc for group_desc to be of 0 size:

sbi->s_group_desc = kmalloc(db_count * sizeof (struct buffer_head *),
GFP_KERNEL);

and eventually in ext4_check_descriptors, dereferencing
sbi->s_group_desc[desc_block] will result in a NULL pointer dereference.

The simplest test seems to be to sanity check s_first_data_block,
EXT4_BLOCKS_PER_GROUP, and ext4_blocks_count values to be sure
their combination won't result in a bad intermediate value for
blocks_count. We could just check for db_count == 0, but
catching it at the root cause seems like it provides more info.

Signed-off-by: Eric Sandeen <[email protected]>
Reviewed-by: Mingming Cao <[email protected]>
---
fs/ext4/super.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 1484a08..32e3ecb 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1997,6 +1997,17 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)

if (EXT4_BLOCKS_PER_GROUP(sb) == 0)
goto cantfind_ext4;
+
+ /* ensure blocks_count calculation below doesn't sign-extend */
+ if (ext4_blocks_count(es) + EXT4_BLOCKS_PER_GROUP(sb) <
+ le32_to_cpu(es->s_first_data_block) + 1) {
+ printk(KERN_WARNING "EXT4-fs: bad geometry: block count %llu, "
+ "first data block %u, blocks per group %lu\n",
+ ext4_blocks_count(es),
+ le32_to_cpu(es->s_first_data_block),
+ EXT4_BLOCKS_PER_GROUP(sb));
+ goto failed_mount;
+ }
blocks_count = (ext4_blocks_count(es) -
le32_to_cpu(es->s_first_data_block) +
EXT4_BLOCKS_PER_GROUP(sb) - 1);
--
1.5.4.rc3.31.g1271-dirty
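
The arithmetic described in the changelog above can be replayed in
userspace. This is a sketch, not the kernel code: the function names are
made up, uint32_t models an unsigned long s_groups_count on a 32-bit
machine, plain division stands in for do_div, and the right-hand side of
the sanity check is widened to 64 bits for simplicity.

```c
#include <stdint.h>

/* db_count as computed without the new sanity check. */
uint32_t db_count_unchecked(uint64_t blocks_total, uint32_t first_data_block,
			    uint32_t blocks_per_group, uint32_t desc_per_block)
{
	/* wraps to a huge u64 if first_data_block is too large */
	uint64_t blocks_count = blocks_total - first_data_block +
				blocks_per_group - 1;
	uint32_t groups;

	blocks_count /= blocks_per_group;
	groups = (uint32_t)blocks_count;     /* truncated to 32 bits */

	/* this addition can overflow 32 bits and yield 0 */
	return (groups + desc_per_block - 1) / desc_per_block;
}

/* Model of the check added by the patch: non-zero means geometry is OK. */
int geometry_is_sane(uint64_t blocks_total, uint32_t first_data_block,
		     uint32_t blocks_per_group)
{
	return !(blocks_total + blocks_per_group <
		 (uint64_t)first_data_block + 1);
}
```

For example, a block count of 100, 32768 blocks per group and a corrupted
first data block of 32868 give an intermediate value of -1, a group count
of 0xFFFFFFFF and finally a db_count of 0. The new check rejects exactly
that combination before the division is attempted.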

2008-01-22 03:11:58

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 26/49] jbd2: Fix assertion failure in fs/jbd2/checkpoint.c

From: Jan Kara <[email protected]>

Before we start committing a transaction, we call
__journal_clean_checkpoint_list() to clean up the transaction's
written-back buffers.

If this call happens to remove all of them (and there were already some
buffers), __journal_remove_checkpoint() will decide to free the transaction
because it isn't (yet) a committing transaction and soon we fail some
assertion - the transaction really isn't ready to be freed :).

We change the check in __journal_remove_checkpoint() to free only a
transaction in T_FINISHED state. The locking there is subtle though (as
everywhere in JBD ;(). We use j_list_lock to protect the check and a
subsequent call to __journal_drop_transaction(), and do the same at the
end of journal_commit_transaction(), which is the only place where a
transaction can get to T_FINISHED state.

Probably I'm too paranoid here and such locking is not really necessary -
checkpoint lists are processed only from log_do_checkpoint() where a
transaction must be already committed to be processed or from
__journal_clean_checkpoint_list() where kjournald itself calls it and thus
transaction cannot change state either. Better be safe if something
changes in future...

Signed-off-by: Jan Kara <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/jbd2/checkpoint.c | 12 ++++++------
fs/jbd2/commit.c | 8 ++++----
include/linux/jbd2.h | 2 ++
3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 3fccde7..7e958c8 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -602,15 +602,15 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)

/*
* There is one special case to worry about: if we have just pulled the
- * buffer off a committing transaction's forget list, then even if the
- * checkpoint list is empty, the transaction obviously cannot be
- * dropped!
+ * buffer off a running or committing transaction's checkpoint list,
+ * then even if the checkpoint list is empty, the transaction obviously
+ * cannot be dropped!
*
- * The locking here around j_committing_transaction is a bit sleazy.
+ * The locking here around t_state is a bit sleazy.
* See the comment at the end of jbd2_journal_commit_transaction().
*/
- if (transaction == journal->j_committing_transaction) {
- JBUFFER_TRACE(jh, "belongs to committing transaction");
+ if (transaction->t_state != T_FINISHED) {
+ JBUFFER_TRACE(jh, "belongs to running/committing transaction");
goto out;
}

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 6986f33..39b5cee 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -867,10 +867,10 @@ restart_loop:
}
spin_unlock(&journal->j_list_lock);
/*
- * This is a bit sleazy. We borrow j_list_lock to protect
- * journal->j_committing_transaction in __jbd2_journal_remove_checkpoint.
- * Really, __jbd2_journal_remove_checkpoint should be using j_state_lock but
- * it's a bit hassle to hold that across __jbd2_journal_remove_checkpoint
+ * This is a bit sleazy. We use j_list_lock to protect transition
+ * of a transaction into T_FINISHED state and calling
+ * __jbd2_journal_drop_transaction(). Otherwise we could race with
+ * other checkpointing code processing the transaction...
*/
spin_lock(&journal->j_state_lock);
spin_lock(&journal->j_list_lock);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index d5f7cff..d861ffd 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -442,6 +442,8 @@ struct transaction_s
/*
* Transaction's current state
* [no locking - only kjournald2 alters this]
+ * [j_list_lock] guards transition of a transaction into T_FINISHED
+ * state and subsequent call of __jbd2_journal_drop_transaction()
* FIXME: needs barriers
* KLUDGE: [use j_state_lock]
*/
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:12:27

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 07/49] ext4: Introduce ext4_update_*_feature

From: Aneesh Kumar K.V <[email protected]>

Introduce ext4_update_*_feature and use them instead
of open-coding the superblock feature flag updates.


Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/ialloc.c | 11 +++-----
fs/ext4/super.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/ext4_fs.h | 6 ++++
3 files changed, 70 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 7b5cfa6..00b152b 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -748,13 +748,10 @@ got:
if (test_opt(sb, EXTENTS)) {
EXT4_I(inode)->i_flags |= EXT4_EXTENTS_FL;
ext4_ext_tree_init(handle, inode);
- if (!EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS)) {
- err = ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh);
- if (err) goto fail;
- EXT4_SET_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS);
- BUFFER_TRACE(EXT4_SB(sb)->s_sbh, "call ext4_journal_dirty_metadata");
- err = ext4_journal_dirty_metadata(handle, EXT4_SB(sb)->s_sbh);
- }
+ err = ext4_update_incompat_feature(handle, sb,
+ EXT4_FEATURE_INCOMPAT_EXTENTS);
+ if (err)
+ goto fail;
}

ext4_debug("allocating inode %lu\n", inode->i_ino);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index df8842b..4d7f33f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -373,6 +373,66 @@ void ext4_update_dynamic_rev(struct super_block *sb)
*/
}

+int ext4_update_compat_feature(handle_t *handle,
+ struct super_block *sb, __u32 compat)
+{
+ int err = 0;
+ if (!EXT4_HAS_COMPAT_FEATURE(sb, compat)) {
+ err = ext4_journal_get_write_access(handle,
+ EXT4_SB(sb)->s_sbh);
+ if (err)
+ return err;
+ EXT4_SET_COMPAT_FEATURE(sb, compat);
+ sb->s_dirt = 1;
+ handle->h_sync = 1;
+ BUFFER_TRACE(EXT4_SB(sb)->s_sbh,
+ "call ext4_journal_dirty_metadata");
+ err = ext4_journal_dirty_metadata(handle,
+ EXT4_SB(sb)->s_sbh);
+ }
+ return err;
+}
+
+int ext4_update_rocompat_feature(handle_t *handle,
+ struct super_block *sb, __u32 rocompat)
+{
+ int err = 0;
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, rocompat)) {
+ err = ext4_journal_get_write_access(handle,
+ EXT4_SB(sb)->s_sbh);
+ if (err)
+ return err;
+ EXT4_SET_RO_COMPAT_FEATURE(sb, rocompat);
+ sb->s_dirt = 1;
+ handle->h_sync = 1;
+ BUFFER_TRACE(EXT4_SB(sb)->s_sbh,
+ "call ext4_journal_dirty_metadata");
+ err = ext4_journal_dirty_metadata(handle,
+ EXT4_SB(sb)->s_sbh);
+ }
+ return err;
+}
+
+int ext4_update_incompat_feature(handle_t *handle,
+ struct super_block *sb, __u32 incompat)
+{
+ int err = 0;
+ if (!EXT4_HAS_INCOMPAT_FEATURE(sb, incompat)) {
+ err = ext4_journal_get_write_access(handle,
+ EXT4_SB(sb)->s_sbh);
+ if (err)
+ return err;
+ EXT4_SET_INCOMPAT_FEATURE(sb, incompat);
+ sb->s_dirt = 1;
+ handle->h_sync = 1;
+ BUFFER_TRACE(EXT4_SB(sb)->s_sbh,
+ "call ext4_journal_dirty_metadata");
+ err = ext4_journal_dirty_metadata(handle,
+ EXT4_SB(sb)->s_sbh);
+ }
+ return err;
+}
+
/*
* Open the external journal device
*/
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index e1103c2..429dbfc 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -989,6 +989,12 @@ extern void ext4_abort (struct super_block *, const char *, const char *, ...)
extern void ext4_warning (struct super_block *, const char *, const char *, ...)
__attribute__ ((format (printf, 3, 4)));
extern void ext4_update_dynamic_rev (struct super_block *sb);
+extern int ext4_update_compat_feature(handle_t *handle, struct super_block *sb,
+ __u32 compat);
+extern int ext4_update_rocompat_feature(handle_t *handle,
+ struct super_block *sb, __u32 rocompat);
+extern int ext4_update_incompat_feature(handle_t *handle,
+ struct super_block *sb, __u32 incompat);
extern ext4_fsblk_t ext4_block_bitmap(struct super_block *sb,
struct ext4_group_desc *bg);
extern ext4_fsblk_t ext4_inode_bitmap(struct super_block *sb,
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:12:45

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 08/49] ext4: Fix sparse warnings.

From: Aneesh Kumar K.V <[email protected]>

Fix sparse warnings related to static functions
and local variables.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/extents.c | 6 +++---
fs/ext4/inode.c | 18 +++++++++++-------
fs/ext4/super.c | 3 +++
include/linux/ext4_fs.h | 2 ++
4 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 6853722..754c0d3 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1088,7 +1088,7 @@ static ext4_lblk_t ext4_ext_next_leaf_block(struct inode *inode,
* then we have to correct all indexes above.
* TODO: do we need to correct tree in all cases?
*/
-int ext4_ext_correct_indexes(handle_t *handle, struct inode *inode,
+static int ext4_ext_correct_indexes(handle_t *handle, struct inode *inode,
struct ext4_ext_path *path)
{
struct ext4_extent_header *eh;
@@ -1535,7 +1535,7 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block,
* It's used in truncate case only, thus all requests are for
* last index in the block only.
*/
-int ext4_ext_rm_idx(handle_t *handle, struct inode *inode,
+static int ext4_ext_rm_idx(handle_t *handle, struct inode *inode,
struct ext4_ext_path *path)
{
struct buffer_head *bh;
@@ -1806,7 +1806,7 @@ ext4_ext_more_to_rm(struct ext4_ext_path *path)
return 1;
}

-int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start)
+static int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start)
{
struct super_block *sb = inode->i_sb;
int depth = ext_depth(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1ee19c9..76ceba2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2052,11 +2052,11 @@ static void ext4_clear_blocks(handle_t *handle, struct inode *inode,
for (p = first; p < last; p++) {
u32 nr = le32_to_cpu(*p);
if (nr) {
- struct buffer_head *bh;
+ struct buffer_head *tbh;

*p = 0;
- bh = sb_find_get_block(inode->i_sb, nr);
- ext4_forget(handle, 0, inode, bh, nr);
+ tbh = sb_find_get_block(inode->i_sb, nr);
+ ext4_forget(handle, 0, inode, tbh, nr);
}
}

@@ -2324,8 +2324,10 @@ void ext4_truncate(struct inode *inode)
return;
}

- if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
- return ext4_ext_truncate(inode, page);
+ if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) {
+ ext4_ext_truncate(inode, page);
+ return;
+ }

handle = start_transaction(inode);
if (IS_ERR(handle)) {
@@ -3163,8 +3165,10 @@ ext4_reserve_inode_write(handle_t *handle, struct inode *inode,
* Expand an inode by new_extra_isize bytes.
* Returns 0 on success or negative error number on failure.
*/
-int ext4_expand_extra_isize(struct inode *inode, unsigned int new_extra_isize,
- struct ext4_iloc iloc, handle_t *handle)
+static int ext4_expand_extra_isize(struct inode *inode,
+ unsigned int new_extra_isize,
+ struct ext4_iloc iloc,
+ handle_t *handle)
{
struct ext4_inode *raw_inode;
struct ext4_xattr_ibody_header *header;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4d7f33f..7be27db 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1644,6 +1644,9 @@ static ext4_fsblk_t descriptor_loc(struct super_block *sb,


static int ext4_fill_super (struct super_block *sb, void *data, int silent)
+ __releases(kernel_sem)
+ __acquires(kernel_sem)
+
{
struct buffer_head * bh;
struct ext4_super_block *es = NULL;
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index 429dbfc..1a27433 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -893,6 +893,8 @@ extern ext4_fsblk_t ext4_new_block (handle_t *handle, struct inode *inode,
ext4_fsblk_t goal, int *errp);
extern ext4_fsblk_t ext4_new_blocks (handle_t *handle, struct inode *inode,
ext4_fsblk_t goal, unsigned long *count, int *errp);
+extern ext4_fsblk_t ext4_new_blocks_old(handle_t *handle, struct inode *inode,
+ ext4_fsblk_t goal, unsigned long *count, int *errp);
extern void ext4_free_blocks (handle_t *handle, struct inode *inode,
ext4_fsblk_t block, unsigned long count);
extern void ext4_free_blocks_sb (handle_t *handle, struct super_block *sb,
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:13:04

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 12/49] ext4: Support large files

From: Aneesh Kumar K.V <[email protected]>

This patch converts ext4_inode i_blocks to represent total
blocks occupied by the inode in file system block size.
Earlier the variable used to represent this in 512 byte
block size. This actually limited the total size of the file.

The feature is enabled transparently when we write an inode
whose i_blocks cannot be represented as 512 byte units in a
48 bit variable.

Such inodes are marked with the inode flag EXT4_HUGE_FILE_FL.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/inode.c | 32 +++++++++++++++++++++++++-------
fs/ext4/super.c | 9 ++++++---
include/linux/ext4_fs.h | 3 ++-
3 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bb89fe7..9cf8572 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2671,14 +2671,20 @@ static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
struct ext4_inode_info *ei)
{
blkcnt_t i_blocks ;
- struct super_block *sb = ei->vfs_inode.i_sb;
+ struct inode *inode = &(ei->vfs_inode);
+ struct super_block *sb = inode->i_sb;

if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
EXT4_FEATURE_RO_COMPAT_HUGE_FILE)) {
/* we are using combined 48 bit field */
i_blocks = ((u64)le16_to_cpu(raw_inode->i_blocks_high)) << 32 |
le32_to_cpu(raw_inode->i_blocks_lo);
- return i_blocks;
+ if (ei->i_flags & EXT4_HUGE_FILE_FL) {
+ /* i_blocks represent file system block size */
+ return i_blocks << (inode->i_blkbits - 9);
+ } else {
+ return i_blocks;
+ }
} else {
return le32_to_cpu(raw_inode->i_blocks_lo);
}
@@ -2829,8 +2835,9 @@ static int ext4_inode_blocks_set(handle_t *handle,
* i_blocks can be represnted in a 32 bit variable
* as multiple of 512 bytes
*/
- raw_inode->i_blocks_lo = cpu_to_le32((u32)i_blocks);
+ raw_inode->i_blocks_lo = cpu_to_le32(i_blocks);
raw_inode->i_blocks_high = 0;
+ ei->i_flags &= ~EXT4_HUGE_FILE_FL;
} else if (i_blocks <= 0xffffffffffffULL) {
/*
* i_blocks can be represented in a 48 bit variable
@@ -2841,12 +2848,23 @@ static int ext4_inode_blocks_set(handle_t *handle,
if (err)
goto err_out;
/* i_block is stored in the split 48 bit fields */
- raw_inode->i_blocks_lo = cpu_to_le32((u32)i_blocks);
+ raw_inode->i_blocks_lo = cpu_to_le32(i_blocks);
raw_inode->i_blocks_high = cpu_to_le16(i_blocks >> 32);
+ ei->i_flags &= ~EXT4_HUGE_FILE_FL;
} else {
- ext4_error(sb, __FUNCTION__,
- "Wrong inode i_blocks count %llu\n",
- (unsigned long long)inode->i_blocks);
+ /*
+ * i_blocks should be represented in a 48 bit variable
+ * as multiple of file system block size
+ */
+ err = ext4_update_rocompat_feature(handle, sb,
+ EXT4_FEATURE_RO_COMPAT_HUGE_FILE);
+ if (err)
+ goto err_out;
+ ei->i_flags |= EXT4_HUGE_FILE_FL;
+ /* i_block is stored in file system block size */
+ i_blocks = i_blocks >> (inode->i_blkbits - 9);
+ raw_inode->i_blocks_lo = cpu_to_le32(i_blocks);
+ raw_inode->i_blocks_high = cpu_to_le16(i_blocks >> 32);
}
err_out:
return err;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 2b9dc96..64067de 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1631,11 +1631,14 @@ static loff_t ext4_max_size(int bits)
upper_limit >>= (bits - 9);

} else {
- /* We use 48 bit ext4_inode i_blocks */
+ /*
+ * We use 48 bit ext4_inode i_blocks
+ * With EXT4_HUGE_FILE_FL set the i_blocks
+ * represent total number of blocks in
+ * file system block size
+ */
upper_limit = (1LL << 48) - 1;

- /* total blocks in file system block size */
- upper_limit >>= (bits - 9);
}

/* indirect blocks */
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index be25eca..6ae91f4 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -178,8 +178,9 @@ struct ext4_group_desc
#define EXT4_NOTAIL_FL 0x00008000 /* file tail should not be merged */
#define EXT4_DIRSYNC_FL 0x00010000 /* dirsync behaviour (directories only) */
#define EXT4_TOPDIR_FL 0x00020000 /* Top of directory hierarchies*/
-#define EXT4_RESERVED_FL 0x80000000 /* reserved for ext4 lib */
+#define EXT4_HUGE_FILE_FL 0x00040000 /* Set to each huge file */
#define EXT4_EXTENTS_FL 0x00080000 /* Inode uses extents */
+#define EXT4_RESERVED_FL 0x80000000 /* reserved for ext4 lib */

#define EXT4_FL_USER_VISIBLE 0x000BDFFF /* User visible flags */
#define EXT4_FL_USER_MODIFIABLE 0x000380FF /* User modifiable flags */
--
1.5.4.rc3.31.g1271-dirty
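
For readers following the i_blocks encoding in the patch above, here is a minimal userspace sketch of the idea: a count that fits in 48 bits is stored in 512-byte units with the huge-file flag cleared, and a larger count is stored in file system block units with the flag set. All sketch_* names are hypothetical, byte-order conversion is left out, and the read side is an assumption about how the value would be decoded rather than code taken from the patch.

```c
#include <stdint.h>

#define SKETCH_HUGE_FILE_FL 0x00040000u /* same value as EXT4_HUGE_FILE_FL */

/* Hypothetical stand-in for the relevant on-disk inode fields. */
struct sketch_raw_inode {
	uint32_t blocks_lo;   /* low 32 bits of the block count */
	uint16_t blocks_high; /* high 16 bits of the block count */
	uint32_t flags;
};

/* Store a count given in 512-byte sectors; blkbits is log2(block size). */
static void sketch_blocks_set(struct sketch_raw_inode *raw,
			      uint64_t sectors, unsigned int blkbits)
{
	if (sectors <= 0xffffffffffffULL) {
		/* fits in 48 bits as a multiple of 512 bytes */
		raw->flags &= ~SKETCH_HUGE_FILE_FL;
	} else {
		/* too large: switch to file system block units */
		raw->flags |= SKETCH_HUGE_FILE_FL;
		sectors >>= (blkbits - 9);
	}
	raw->blocks_lo = (uint32_t)sectors;
	raw->blocks_high = (uint16_t)(sectors >> 32);
}

/* Assumed decoder: scale back to 512-byte sectors when the flag is set. */
static uint64_t sketch_blocks_get(const struct sketch_raw_inode *raw,
				  unsigned int blkbits)
{
	uint64_t v = ((uint64_t)raw->blocks_high << 32) | raw->blocks_lo;

	if (raw->flags & SKETCH_HUGE_FILE_FL)
		v <<= (blkbits - 9);
	return v;
}
```

The sketch does not check that the shifted value still fits in 48 bits; a real implementation would need to handle that case.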

2008-01-22 03:13:32

by Theodore Ts'o

Subject: [PATCH 37/49] ext4: Fix ext4_show_options to show the correct mount options.

From: Aneesh Kumar K.V <[email protected]>

ext4_show_options should only show a mount option when it was not
already enabled through the superblock's default values. Check the
defaults before printing each option.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/super.c | 26 +++++++++++++++-----------
1 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index aa22acd..64fc7f1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -665,18 +665,20 @@ static inline void ext4_show_quota_options(struct seq_file *seq, struct super_bl
*/
static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
{
+ int def_errors;
+ unsigned long def_mount_opts;
struct super_block *sb = vfs->mnt_sb;
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_super_block *es = sbi->s_es;
- unsigned long def_mount_opts;

def_mount_opts = le32_to_cpu(es->s_default_mount_opts);
+ def_errors = le16_to_cpu(es->s_errors);

if (sbi->s_sb_block != 1)
seq_printf(seq, ",sb=%llu", sbi->s_sb_block);
if (test_opt(sb, MINIX_DF))
seq_puts(seq, ",minixdf");
- if (test_opt(sb, GRPID))
+ if (test_opt(sb, GRPID) && !(def_mount_opts & EXT4_DEFM_BSDGROUPS))
seq_puts(seq, ",grpid");
if (!test_opt(sb, GRPID) && (def_mount_opts & EXT4_DEFM_BSDGROUPS))
seq_puts(seq, ",nogrpid");
@@ -689,25 +691,24 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
seq_printf(seq, ",resgid=%u", sbi->s_resgid);
}
if (test_opt(sb, ERRORS_RO)) {
- int def_errors = le16_to_cpu(es->s_errors);
-
if (def_errors == EXT4_ERRORS_PANIC ||
def_errors == EXT4_ERRORS_CONTINUE) {
seq_puts(seq, ",errors=remount-ro");
}
}
- if (test_opt(sb, ERRORS_CONT))
+ if (test_opt(sb, ERRORS_CONT) && def_errors != EXT4_ERRORS_CONTINUE)
seq_puts(seq, ",errors=continue");
- if (test_opt(sb, ERRORS_PANIC))
+ if (test_opt(sb, ERRORS_PANIC) && def_errors != EXT4_ERRORS_PANIC)
seq_puts(seq, ",errors=panic");
- if (test_opt(sb, NO_UID32))
+ if (test_opt(sb, NO_UID32) && !(def_mount_opts & EXT4_DEFM_UID16))
seq_puts(seq, ",nouid32");
- if (test_opt(sb, DEBUG))
+ if (test_opt(sb, DEBUG) && !(def_mount_opts & EXT4_DEFM_DEBUG))
seq_puts(seq, ",debug");
if (test_opt(sb, OLDALLOC))
seq_puts(seq, ",oldalloc");
#ifdef CONFIG_EXT4DEV_FS_XATTR
- if (test_opt(sb, XATTR_USER))
+ if (test_opt(sb, XATTR_USER) &&
+ !(def_mount_opts & EXT4_DEFM_XATTR_USER))
seq_puts(seq, ",user_xattr");
if (!test_opt(sb, XATTR_USER) &&
(def_mount_opts & EXT4_DEFM_XATTR_USER)) {
@@ -715,7 +716,7 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
}
#endif
#ifdef CONFIG_EXT4DEV_FS_POSIX_ACL
- if (test_opt(sb, POSIX_ACL))
+ if (test_opt(sb, POSIX_ACL) && !(def_mount_opts & EXT4_DEFM_ACL))
seq_puts(seq, ",acl");
if (!test_opt(sb, POSIX_ACL) && (def_mount_opts & EXT4_DEFM_ACL))
seq_puts(seq, ",noacl");
@@ -735,6 +736,10 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
if (test_opt(sb, I_VERSION))
seq_puts(seq, ",i_version");

+ /*
+ * The journal mode can be enabled in different ways,
+ * so just print the value even if we didn't specify it
+ */
if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
seq_puts(seq, ",data=journal");
else if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA)
@@ -743,7 +748,6 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
seq_puts(seq, ",data=writeback");

ext4_show_quota_options(seq, sb);
-
return 0;
}

--
1.5.4.rc3.31.g1271-dirty
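
The rule the patch above applies to each option can be sketched in isolation: print "opt" only when the option is active but not a default, and print "noopt" when a default has been turned off. This is a rough userspace illustration; the SKETCH_OPT_* bits and sketch_* functions are made up for the example and are not the kernel's seq_file interface.

```c
#include <string.h>

/* Hypothetical option bits, shared by the active set and the defaults. */
#define SKETCH_OPT_DEBUG 0x1u
#define SKETCH_OPT_ACL   0x2u

/* Append s to out without overflowing a buffer of len bytes. */
static void sketch_append(char *out, size_t len, const char *s)
{
	strncat(out, s, len - strlen(out) - 1);
}

/* Build the option string; out must have room for at least one byte. */
static void sketch_show_options(unsigned int active, unsigned int defaults,
				char *out, size_t len)
{
	out[0] = '\0';

	/* active but not a default: show it */
	if ((active & SKETCH_OPT_DEBUG) && !(defaults & SKETCH_OPT_DEBUG))
		sketch_append(out, len, ",debug");
	if ((active & SKETCH_OPT_ACL) && !(defaults & SKETCH_OPT_ACL))
		sketch_append(out, len, ",acl");

	/* a default that was switched off: show the negated form */
	if (!(active & SKETCH_OPT_ACL) && (defaults & SKETCH_OPT_ACL))
		sketch_append(out, len, ",noacl");
}
```

With this rule, an option that is on only because the superblock defaults enable it produces no output at all.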

2008-01-22 03:13:51

by Theodore Ts'o

Subject: [PATCH 24/49] ext4: add block bitmap validation

From: Aneesh Kumar K.V <[email protected]>

When a new block bitmap is read from disk in read_block_bitmap(),
there are a few bits that should ALWAYS be set: in particular, the
bits for the blocks that hold the block bitmap, the inode bitmap and
the inode table. Validate the block bitmap against these blocks.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/balloc.c | 99 ++++++++++++++++++++++++++++++++++++++++++++----------
1 files changed, 81 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index ff3428e..a9140ea 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -189,13 +189,65 @@ struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb,
return desc;
}

+static int ext4_valid_block_bitmap(struct super_block *sb,
+ struct ext4_group_desc *desc,
+ unsigned int block_group,
+ struct buffer_head *bh)
+{
+ ext4_grpblk_t offset;
+ ext4_grpblk_t next_zero_bit;
+ ext4_fsblk_t bitmap_blk;
+ ext4_fsblk_t group_first_block;
+
+ if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
+ /* with FLEX_BG, the inode/block bitmaps and itable
+ * blocks may not be in the group at all
+ * so the bitmap validation will be skipped for those groups
+ * or it has to also read the block group where the bitmaps
+ * are located to verify they are set.
+ */
+ return 1;
+ }
+ group_first_block = ext4_group_first_block_no(sb, block_group);
+
+ /* check whether block bitmap block number is set */
+ bitmap_blk = ext4_block_bitmap(sb, desc);
+ offset = bitmap_blk - group_first_block;
+ if (!ext4_test_bit(offset, bh->b_data))
+ /* bad block bitmap */
+ goto err_out;
+
+ /* check whether the inode bitmap block number is set */
+ bitmap_blk = ext4_inode_bitmap(sb, desc);
+ offset = bitmap_blk - group_first_block;
+ if (!ext4_test_bit(offset, bh->b_data))
+ /* bad block bitmap */
+ goto err_out;
+
+ /* check whether the inode table block number is set */
+ bitmap_blk = ext4_inode_table(sb, desc);
+ offset = bitmap_blk - group_first_block;
+ next_zero_bit = ext4_find_next_zero_bit(bh->b_data,
+ offset + EXT4_SB(sb)->s_itb_per_group,
+ offset);
+ if (next_zero_bit >= offset + EXT4_SB(sb)->s_itb_per_group)
+ /* good bitmap for inode tables */
+ return 1;
+
+err_out:
+ ext4_error(sb, __FUNCTION__,
+ "Invalid block bitmap - "
+ "block_group = %d, block = %llu",
+ block_group, bitmap_blk);
+ return 0;
+}
/**
* read_block_bitmap()
* @sb: super block
* @block_group: given block group
*
- * Read the bitmap for a given block_group, reading into the specified
- * slot in the superblock's bitmap cache.
+ * Read the bitmap for a given block_group, and validate that the
+ * bits for the block bitmap, inode bitmap and inode table are set in it
*
* Return buffer_head on success or NULL in case of failure.
*/
@@ -210,25 +262,36 @@ read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
if (!desc)
return NULL;
bitmap_blk = ext4_block_bitmap(sb, desc);
+ bh = sb_getblk(sb, bitmap_blk);
+ if (unlikely(!bh)) {
+ ext4_error(sb, __FUNCTION__,
+ "Cannot read block bitmap - "
+ "block_group = %d, block_bitmap = %llu",
+ (int)block_group, (unsigned long long)bitmap_blk);
+ return NULL;
+ }
+ if (bh_uptodate_or_lock(bh))
+ return bh;
+
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
- bh = sb_getblk(sb, bitmap_blk);
- if (!buffer_uptodate(bh)) {
- lock_buffer(bh);
- if (!buffer_uptodate(bh)) {
- ext4_init_block_bitmap(sb, bh, block_group,
- desc);
- set_buffer_uptodate(bh);
- }
- unlock_buffer(bh);
- }
- } else {
- bh = sb_bread(sb, bitmap_blk);
+ ext4_init_block_bitmap(sb, bh, block_group, desc);
+ set_buffer_uptodate(bh);
+ unlock_buffer(bh);
+ return bh;
}
- if (!bh)
- ext4_error (sb, __FUNCTION__,
+ if (bh_submit_read(bh) < 0) {
+ brelse(bh);
+ ext4_error(sb, __FUNCTION__,
"Cannot read block bitmap - "
- "block_group = %lu, block_bitmap = %llu",
- block_group, bitmap_blk);
+ "block_group = %d, block_bitmap = %llu",
+ (int)block_group, (unsigned long long)bitmap_blk);
+ return NULL;
+ }
+ if (!ext4_valid_block_bitmap(sb, desc, block_group, bh)) {
+ brelse(bh);
+ return NULL;
+ }
+
return bh;
}
/*
--
1.5.4.rc3.31.g1271-dirty
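
The validation in the patch above boils down to three checks on the in-memory bitmap. Below is a simplified userspace sketch, assuming a byte array with little-endian bit order and offsets already made relative to the first block of the group. The sketch_* names are hypothetical, and the inode table range is tested with a plain loop instead of ext4_find_next_zero_bit().

```c
#include <stdint.h>

/* Test bit number 'bit' in a little-endian bit array. */
static int sketch_test_bit(const uint8_t *map, unsigned int bit)
{
	return (map[bit / 8] >> (bit % 8)) & 1;
}

/*
 * Return 1 if the bits for the block bitmap block, the inode bitmap
 * block and every inode table block are set, 0 otherwise.
 * All offsets are relative to the first block of the group.
 */
static int sketch_valid_block_bitmap(const uint8_t *map,
				     unsigned int block_bitmap_off,
				     unsigned int inode_bitmap_off,
				     unsigned int itable_off,
				     unsigned int itable_len)
{
	unsigned int i;

	if (!sketch_test_bit(map, block_bitmap_off))
		return 0;
	if (!sketch_test_bit(map, inode_bitmap_off))
		return 0;
	for (i = 0; i < itable_len; i++)
		if (!sketch_test_bit(map, itable_off + i))
			return 0;
	return 1;
}
```

As in the patch, such a check only makes sense when the metadata blocks live inside the group being validated, which is why the FLEX_BG case is skipped there.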

2008-01-22 03:14:17

by Theodore Ts'o

Subject: [PATCH 46/49] jbd2: add lockdep support

From: Mingming Cao <[email protected]>

Ported from similar patch for the jbd layer.

Signed-off-by: Mingming Cao <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/jbd2/transaction.c | 11 +++++++++++
include/linux/jbd2.h | 4 ++++
2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index f30802a..70b3199 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -241,6 +241,8 @@ out:
return ret;
}

+static struct lock_class_key jbd2_handle_key;
+
/* Allocate a new handle. This should probably be in a slab... */
static handle_t *new_handle(int nblocks)
{
@@ -251,6 +253,9 @@ static handle_t *new_handle(int nblocks)
handle->h_buffer_credits = nblocks;
handle->h_ref = 1;

+ lockdep_init_map(&handle->h_lockdep_map, "jbd2_handle",
+ &jbd2_handle_key, 0);
+
return handle;
}

@@ -293,7 +298,11 @@ handle_t *jbd2_journal_start(journal_t *journal, int nblocks)
jbd2_free_handle(handle);
current->journal_info = NULL;
handle = ERR_PTR(err);
+ goto out;
}
+
+ lock_acquire(&handle->h_lockdep_map, 0, 0, 0, 2, _THIS_IP_);
+out:
return handle;
}

@@ -1419,6 +1428,8 @@ int jbd2_journal_stop(handle_t *handle)
spin_unlock(&journal->j_state_lock);
}

+ lock_release(&handle->h_lockdep_map, 1, _THIS_IP_);
+
jbd2_free_handle(handle);
return err;
}
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index a2645c2..f982d38 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -418,6 +418,10 @@ struct handle_s
unsigned int h_sync: 1; /* sync-on-close */
unsigned int h_jdata: 1; /* force data journaling */
unsigned int h_aborted: 1; /* fatal error on handle */
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map h_lockdep_map;
+#endif
};


--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:14:41

by Theodore Ts'o

Subject: [PATCH 10/49] ext4: Rename i_dir_acl to i_size_high

From: Aneesh Kumar K.V <[email protected]>

Rename ext4_inode.i_dir_acl to i_size_high
drop ext4_inode_info.i_dir_acl as it is not used
Rename ext4_inode.i_size to ext4_inode.i_size_lo
Add helper function for accessing the ext4_inode combined i_size.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/ialloc.c | 1 -
fs/ext4/inode.c | 55 ++++++++++++++++++---------------------------
include/linux/ext4_fs.h | 15 +++++++++--
include/linux/ext4_fs_i.h | 1 -
4 files changed, 34 insertions(+), 38 deletions(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 00b152b..17b5df1 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -709,7 +709,6 @@ got:
if (!S_ISDIR(mode))
ei->i_flags &= ~EXT4_DIRSYNC_FL;
ei->i_file_acl = 0;
- ei->i_dir_acl = 0;
ei->i_dtime = 0;
ei->i_block_alloc_info = NULL;
ei->i_block_group = group;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7bcec18..e663455 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2694,7 +2694,6 @@ void ext4_read_inode(struct inode * inode)
inode->i_gid |= le16_to_cpu(raw_inode->i_gid_high) << 16;
}
inode->i_nlink = le16_to_cpu(raw_inode->i_links_count);
- inode->i_size = le32_to_cpu(raw_inode->i_size);

ei->i_state = 0;
ei->i_dir_start_lookup = 0;
@@ -2720,15 +2719,11 @@ void ext4_read_inode(struct inode * inode)
ei->i_flags = le32_to_cpu(raw_inode->i_flags);
ei->i_file_acl = le32_to_cpu(raw_inode->i_file_acl_lo);
if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
- cpu_to_le32(EXT4_OS_HURD))
+ cpu_to_le32(EXT4_OS_HURD)) {
ei->i_file_acl |=
((__u64)le16_to_cpu(raw_inode->i_file_acl_high)) << 32;
- if (!S_ISREG(inode->i_mode)) {
- ei->i_dir_acl = le32_to_cpu(raw_inode->i_dir_acl);
- } else {
- inode->i_size |=
- ((__u64)le32_to_cpu(raw_inode->i_size_high)) << 32;
}
+ inode->i_size = ext4_isize(raw_inode);
ei->i_disksize = inode->i_size;
inode->i_generation = le32_to_cpu(raw_inode->i_generation);
ei->i_block_group = iloc.block_group;
@@ -2852,7 +2847,6 @@ static int ext4_do_update_inode(handle_t *handle,
raw_inode->i_gid_high = 0;
}
raw_inode->i_links_count = cpu_to_le16(inode->i_nlink);
- raw_inode->i_size = cpu_to_le32(ei->i_disksize);

EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
@@ -2867,32 +2861,27 @@ static int ext4_do_update_inode(handle_t *handle,
raw_inode->i_file_acl_high =
cpu_to_le16(ei->i_file_acl >> 32);
raw_inode->i_file_acl_lo = cpu_to_le32(ei->i_file_acl);
- if (!S_ISREG(inode->i_mode)) {
- raw_inode->i_dir_acl = cpu_to_le32(ei->i_dir_acl);
- } else {
- raw_inode->i_size_high =
- cpu_to_le32(ei->i_disksize >> 32);
- if (ei->i_disksize > 0x7fffffffULL) {
- struct super_block *sb = inode->i_sb;
- if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
- EXT4_FEATURE_RO_COMPAT_LARGE_FILE) ||
- EXT4_SB(sb)->s_es->s_rev_level ==
- cpu_to_le32(EXT4_GOOD_OLD_REV)) {
- /* If this is the first large file
- * created, add a flag to the superblock.
- */
- err = ext4_journal_get_write_access(handle,
- EXT4_SB(sb)->s_sbh);
- if (err)
- goto out_brelse;
- ext4_update_dynamic_rev(sb);
- EXT4_SET_RO_COMPAT_FEATURE(sb,
+ ext4_isize_set(raw_inode, ei->i_disksize);
+ if (ei->i_disksize > 0x7fffffffULL) {
+ struct super_block *sb = inode->i_sb;
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_LARGE_FILE) ||
+ EXT4_SB(sb)->s_es->s_rev_level ==
+ cpu_to_le32(EXT4_GOOD_OLD_REV)) {
+ /* If this is the first large file
+ * created, add a flag to the superblock.
+ */
+ err = ext4_journal_get_write_access(handle,
+ EXT4_SB(sb)->s_sbh);
+ if (err)
+ goto out_brelse;
+ ext4_update_dynamic_rev(sb);
+ EXT4_SET_RO_COMPAT_FEATURE(sb,
EXT4_FEATURE_RO_COMPAT_LARGE_FILE);
- sb->s_dirt = 1;
- handle->h_sync = 1;
- err = ext4_journal_dirty_metadata(handle,
- EXT4_SB(sb)->s_sbh);
- }
+ sb->s_dirt = 1;
+ handle->h_sync = 1;
+ err = ext4_journal_dirty_metadata(handle,
+ EXT4_SB(sb)->s_sbh);
}
}
raw_inode->i_generation = cpu_to_le32(inode->i_generation);
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index 6894f36..a8f3fae 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -275,7 +275,7 @@ struct ext4_mount_options {
struct ext4_inode {
__le16 i_mode; /* File mode */
__le16 i_uid; /* Low 16 bits of Owner Uid */
- __le32 i_size; /* Size in bytes */
+ __le32 i_size_lo; /* Size in bytes */
__le32 i_atime; /* Access time */
__le32 i_ctime; /* Inode Change time */
__le32 i_mtime; /* Modification time */
@@ -298,7 +298,7 @@ struct ext4_inode {
__le32 i_block[EXT4_N_BLOCKS];/* Pointers to blocks */
__le32 i_generation; /* File version (for NFS) */
__le32 i_file_acl_lo; /* File ACL */
- __le32 i_dir_acl; /* Directory ACL */
+ __le32 i_size_high;
__le32 i_obso_faddr; /* Obsoleted fragment address */
union {
struct {
@@ -330,7 +330,6 @@ struct ext4_inode {
__le32 i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
};

-#define i_size_high i_dir_acl

#define EXT4_EPOCH_BITS 2
#define EXT4_EPOCH_MASK ((1 << EXT4_EPOCH_BITS) - 1)
@@ -1049,7 +1048,17 @@ static inline void ext4_r_blocks_count_set(struct ext4_super_block *es,
es->s_r_blocks_count_hi = cpu_to_le32(blk >> 32);
}

+static inline loff_t ext4_isize(struct ext4_inode *raw_inode)
+{
+ return ((loff_t)le32_to_cpu(raw_inode->i_size_high) << 32) |
+ le32_to_cpu(raw_inode->i_size_lo);
+}

+static inline void ext4_isize_set(struct ext4_inode *raw_inode, loff_t i_size)
+{
+ raw_inode->i_size_lo = cpu_to_le32(i_size);
+ raw_inode->i_size_high = cpu_to_le32(i_size >> 32);
+}

#define ext4_std_error(sb, errno) \
do { \
diff --git a/include/linux/ext4_fs_i.h b/include/linux/ext4_fs_i.h
index 2b4e370..f1cd493 100644
--- a/include/linux/ext4_fs_i.h
+++ b/include/linux/ext4_fs_i.h
@@ -85,7 +85,6 @@ struct ext4_inode_info {
__le32 i_data[15]; /* unconverted */
__u32 i_flags;
ext4_fsblk_t i_file_acl;
- __u32 i_dir_acl;
__u32 i_dtime;

/*
--
1.5.4.rc3.31.g1271-dirty
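
The two helpers added by the patch above just split and rejoin a 64-bit size across two 32-bit fields. A userspace sketch of the same arithmetic follows, with hypothetical sketch_* names and host byte order in place of the le32 conversions the kernel code performs.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two on-disk size fields. */
struct sketch_raw_inode {
	uint32_t size_lo;   /* low 32 bits of the file size */
	uint32_t size_high; /* high 32 bits of the file size */
};

/* Combine the two halves into one 64-bit size. */
static int64_t sketch_isize(const struct sketch_raw_inode *raw)
{
	return (int64_t)(((uint64_t)raw->size_high << 32) | raw->size_lo);
}

/* Split a 64-bit size into the two halves. */
static void sketch_isize_set(struct sketch_raw_inode *raw, int64_t size)
{
	raw->size_lo = (uint32_t)size;
	raw->size_high = (uint32_t)((uint64_t)size >> 32);
}
```

Because the high half is now always read and written, the sketch has no regular-file versus directory special case, which mirrors the simplification in ext4_read_inode() and ext4_do_update_inode().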

2008-01-22 03:15:00

by Theodore Ts'o

Subject: [PATCH 29/49] ext4: Make ext4_get_blocks_wrap take the truncate_mutex early.

From: Aneesh Kumar K.V <[email protected]>

When migrating an inode from ext3 to ext4 format, we need to make sure
that the test for the inode type and the walk of the inode data happen
under the same lock. To make this happen, take truncate_mutex early,
before checking i_flags.

This should actually enable us to remove verify_chain().

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/extents.c | 9 ++++--
fs/ext4/inode.c | 69 +++++-----------------------------------------
include/linux/ext4_fs.h | 2 +
3 files changed, 16 insertions(+), 64 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 8593e59..ec5019f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2129,6 +2129,10 @@ out:
return err ? err : allocated;
}

+/*
+ * Need to be called with
+ * mutex_lock(&EXT4_I(inode)->truncate_mutex);
+ */
int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ext4_lblk_t iblock,
unsigned long max_blocks, struct buffer_head *bh_result,
@@ -2144,7 +2148,6 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
__clear_bit(BH_New, &bh_result->b_state);
ext_debug("blocks %u/%lu requested for inode %u\n",
iblock, max_blocks, inode->i_ino);
- mutex_lock(&EXT4_I(inode)->truncate_mutex);

/* check in cache */
goal = ext4_ext_in_cache(inode, iblock, &newex);
@@ -2318,8 +2321,6 @@ out2:
ext4_ext_drop_refs(path);
kfree(path);
}
- mutex_unlock(&EXT4_I(inode)->truncate_mutex);
-
return err ? err : allocated;
}

@@ -2449,6 +2450,7 @@ long ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
* modify 1 super block, 1 block bitmap and 1 group descriptor.
*/
credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
+ mutex_lock(&EXT4_I(inode)->truncate_mutex);
retry:
while (ret >= 0 && ret < max_blocks) {
block = block + ret;
@@ -2505,6 +2507,7 @@ retry:
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;

+ mutex_unlock(&EXT4_I(inode)->truncate_mutex);
/*
* Time to update the file size.
* Update only when preallocation was requested beyond the file size.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index eaace13..71c7ad0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -243,13 +243,6 @@ static inline void add_chain(Indirect *p, struct buffer_head *bh, __le32 *v)
p->bh = bh;
}

-static int verify_chain(Indirect *from, Indirect *to)
-{
- while (from <= to && from->key == *from->p)
- from++;
- return (from > to);
-}
-
/**
* ext4_block_to_path - parse the block number into array of offsets
* @inode: inode in question (we are only interested in its superblock)
@@ -348,10 +341,11 @@ static int ext4_block_to_path(struct inode *inode,
* (pointer to last triple returned, *@err == 0)
* or when it gets an IO error reading an indirect block
* (ditto, *@err == -EIO)
- * or when it notices that chain had been changed while it was reading
- * (ditto, *@err == -EAGAIN)
* or when it reads all @depth-1 indirect blocks successfully and finds
* the whole chain, all way to the data (returns %NULL, *err == 0).
+ *
+ * Need to be called with
+ * mutex_lock(&EXT4_I(inode)->truncate_mutex)
*/
static Indirect *ext4_get_branch(struct inode *inode, int depth,
ext4_lblk_t *offsets,
@@ -370,9 +364,6 @@ static Indirect *ext4_get_branch(struct inode *inode, int depth,
bh = sb_bread(sb, le32_to_cpu(p->key));
if (!bh)
goto failure;
- /* Reader: pointers */
- if (!verify_chain(chain, p))
- goto changed;
add_chain(++p, bh, (__le32*)bh->b_data + *++offsets);
/* Reader: end */
if (!p->key)
@@ -380,10 +371,6 @@ static Indirect *ext4_get_branch(struct inode *inode, int depth,
}
return NULL;

-changed:
- brelse(bh);
- *err = -EAGAIN;
- goto no_block;
failure:
*err = -EIO;
no_block:
@@ -787,6 +774,10 @@ err_out:
* return > 0, # of blocks mapped or allocated.
* return = 0, if plain lookup failed.
* return < 0, error case.
+ *
+ *
+ * Need to be called with
+ * mutex_lock(&EXT4_I(inode)->truncate_mutex)
*/
int ext4_get_blocks_handle(handle_t *handle, struct inode *inode,
ext4_lblk_t iblock, unsigned long maxblocks,
@@ -825,18 +816,6 @@ int ext4_get_blocks_handle(handle_t *handle, struct inode *inode,
while (count < maxblocks && count <= blocks_to_boundary) {
ext4_fsblk_t blk;

- if (!verify_chain(chain, partial)) {
- /*
- * Indirect block might be removed by
- * truncate while we were reading it.
- * Handling of that case: forget what we've
- * got now. Flag the err as EAGAIN, so it
- * will reread.
- */
- err = -EAGAIN;
- count = 0;
- break;
- }
blk = le32_to_cpu(*(chain[depth-1].p + count));

if (blk == first_block + count)
@@ -844,44 +823,13 @@ int ext4_get_blocks_handle(handle_t *handle, struct inode *inode,
else
break;
}
- if (err != -EAGAIN)
- goto got_it;
+ goto got_it;
}

/* Next simple case - plain lookup or failed read of indirect block */
if (!create || err == -EIO)
goto cleanup;

- mutex_lock(&ei->truncate_mutex);
-
- /*
- * If the indirect block is missing while we are reading
- * the chain(ext4_get_branch() returns -EAGAIN err), or
- * if the chain has been changed after we grab the semaphore,
- * (either because another process truncated this branch, or
- * another get_block allocated this branch) re-grab the chain to see if
- * the request block has been allocated or not.
- *
- * Since we already block the truncate/other get_block
- * at this point, we will have the current copy of the chain when we
- * splice the branch into the tree.
- */
- if (err == -EAGAIN || !verify_chain(chain, partial)) {
- while (partial > chain) {
- brelse(partial->bh);
- partial--;
- }
- partial = ext4_get_branch(inode, depth, offsets, chain, &err);
- if (!partial) {
- count++;
- mutex_unlock(&ei->truncate_mutex);
- if (err)
- goto cleanup;
- clear_buffer_new(bh_result);
- goto got_it;
- }
- }
-
/*
* Okay, we need to do block allocation. Lazily initialize the block
* allocation info here if necessary
@@ -923,7 +871,6 @@ int ext4_get_blocks_handle(handle_t *handle, struct inode *inode,
*/
if (!err && extend_disksize && inode->i_size > ei->i_disksize)
ei->i_disksize = inode->i_size;
- mutex_unlock(&ei->truncate_mutex);
if (err)
goto cleanup;

diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index 55a376e..583049c 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -1113,6 +1113,7 @@ ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
int create, int extend_disksize)
{
int retval;
+ mutex_lock(&EXT4_I(inode)->truncate_mutex);
if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) {
retval = ext4_ext_get_blocks(handle, inode,
(ext4_lblk_t)block, max_blocks,
@@ -1122,6 +1123,7 @@ ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
(ext4_lblk_t)block, max_blocks,
bh, create, extend_disksize);
}
+ mutex_unlock(&EXT4_I(inode)->truncate_mutex);
return retval;
}

--
1.5.4.rc3.31.g1271-dirty
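
The locking change in the patch above can be pictured with a toy model: the wrapper takes the lock before it looks at the inode format, so the format test and the block walk happen under one critical section, and the callees simply expect the lock to be held. Everything below is hypothetical userspace code; the "lock" is a plain flag standing in for truncate_mutex, not a real mutex.

```c
/* Toy inode: a flag stands in for truncate_mutex. */
struct sketch_inode {
	int lock_held;    /* 1 while the pretend mutex is held */
	int uses_extents; /* stand-in for EXT4_EXTENTS_FL */
};

static void sketch_lock(struct sketch_inode *inode)
{
	inode->lock_held = 1;
}

static void sketch_unlock(struct sketch_inode *inode)
{
	inode->lock_held = 0;
}

/* Callees: must be entered with the lock held, else report -1. */
static int sketch_ext_get_blocks(struct sketch_inode *inode)
{
	return inode->lock_held ? 1 : -1;
}

static int sketch_ind_get_blocks(struct sketch_inode *inode)
{
	return inode->lock_held ? 2 : -1;
}

/* Wrapper: lock first, then test the format and dispatch. */
static int sketch_get_blocks_wrap(struct sketch_inode *inode)
{
	int ret;

	sketch_lock(inode);
	if (inode->uses_extents)
		ret = sketch_ext_get_blocks(inode);
	else
		ret = sketch_ind_get_blocks(inode);
	sketch_unlock(inode);
	return ret;
}
```

If the format flag were tested before taking the lock, a concurrent migration could flip it between the test and the walk; moving the lock into the wrapper closes that window.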

2008-01-22 03:15:38

by Theodore Ts'o

Subject: [PATCH 11/49] ext4: Add support for 48 bit inode i_blocks.

From: Aneesh Kumar K.V <[email protected]>

Use the __le16 l_i_reserved1 field of the linux2 struct of ext4_inode
to represent the higher 16 bits for i_blocks. With this change max_file
size becomes (2**48 - 1) * 512 bytes.

We add a RO_COMPAT feature to the super block to indicate that inodes
have i_blocks represented as a split 48 bits. A super block with this
feature set cannot be mounted read-write on a kernel with CONFIG_LSF
disabled.

Super block flag EXT4_FEATURE_RO_COMPAT_HUGE_FILE

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/inode.c | 58 ++++++++++++++++++++++++++++++++++++++++++-
fs/ext4/super.c | 62 ++++++++++++++++++++++++++++++++++++++++++----
include/linux/ext4_fs.h | 10 +++++--
3 files changed, 119 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e663455..bb89fe7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2667,6 +2667,22 @@ void ext4_get_inode_flags(struct ext4_inode_info *ei)
if (flags & S_DIRSYNC)
ei->i_flags |= EXT4_DIRSYNC_FL;
}
+static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
+ struct ext4_inode_info *ei)
+{
+ blkcnt_t i_blocks ;
+ struct super_block *sb = ei->vfs_inode.i_sb;
+
+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_HUGE_FILE)) {
+ /* we are using combined 48 bit field */
+ i_blocks = ((u64)le16_to_cpu(raw_inode->i_blocks_high)) << 32 |
+ le32_to_cpu(raw_inode->i_blocks_lo);
+ return i_blocks;
+ } else {
+ return le32_to_cpu(raw_inode->i_blocks_lo);
+ }
+}

void ext4_read_inode(struct inode * inode)
{
@@ -2715,8 +2731,8 @@ void ext4_read_inode(struct inode * inode)
* recovery code: that's fine, we're about to complete
* the process of deleting those. */
}
- inode->i_blocks = le32_to_cpu(raw_inode->i_blocks);
ei->i_flags = le32_to_cpu(raw_inode->i_flags);
+ inode->i_blocks = ext4_inode_blocks(raw_inode, ei);
ei->i_file_acl = le32_to_cpu(raw_inode->i_file_acl_lo);
if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
cpu_to_le32(EXT4_OS_HURD)) {
@@ -2799,6 +2815,43 @@ bad_inode:
return;
}

+static int ext4_inode_blocks_set(handle_t *handle,
+ struct ext4_inode *raw_inode,
+ struct ext4_inode_info *ei)
+{
+ struct inode *inode = &(ei->vfs_inode);
+ u64 i_blocks = inode->i_blocks;
+ struct super_block *sb = inode->i_sb;
+ int err = 0;
+
+ if (i_blocks <= ~0U) {
+ /*
+ * i_blocks can be represented in a 32 bit variable
+ * as multiple of 512 bytes
+ */
+ raw_inode->i_blocks_lo = cpu_to_le32((u32)i_blocks);
+ raw_inode->i_blocks_high = 0;
+ } else if (i_blocks <= 0xffffffffffffULL) {
+ /*
+ * i_blocks can be represented in a 48 bit variable
+ * as multiple of 512 bytes
+ */
+ err = ext4_update_rocompat_feature(handle, sb,
+ EXT4_FEATURE_RO_COMPAT_HUGE_FILE);
+ if (err)
+ goto err_out;
+ /* i_block is stored in the split 48 bit fields */
+ raw_inode->i_blocks_lo = cpu_to_le32((u32)i_blocks);
+ raw_inode->i_blocks_high = cpu_to_le16(i_blocks >> 32);
+ } else {
+ ext4_error(sb, __FUNCTION__,
+ "Wrong inode i_blocks count %llu\n",
+ (unsigned long long)inode->i_blocks);
+ }
+err_out:
+ return err;
+}
+
/*
* Post the struct inode info into an on-disk inode location in the
* buffer-cache. This gobbles the caller's reference to the
@@ -2853,7 +2906,8 @@ static int ext4_do_update_inode(handle_t *handle,
EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
EXT4_EINODE_SET_XTIME(i_crtime, ei, raw_inode);

- raw_inode->i_blocks = cpu_to_le32(inode->i_blocks);
+ if (ext4_inode_blocks_set(handle, raw_inode, ei))
+ goto out_brelse;
raw_inode->i_dtime = cpu_to_le32(ei->i_dtime);
raw_inode->i_flags = cpu_to_le32(ei->i_flags);
if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7be27db..2b9dc96 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1603,17 +1603,50 @@ static void ext4_orphan_cleanup (struct super_block * sb,

/*
* Maximal file size. There is a direct, and {,double-,triple-}indirect
- * block limit, and also a limit of (2^32 - 1) 512-byte sectors in i_blocks.
- * We need to be 1 filesystem block less than the 2^32 sector limit.
+ * block limit, and also a limit of (2^48 - 1) 512-byte sectors in i_blocks.
+ * We need to be 1 filesystem block less than the 2^48 sector limit.
*/
static loff_t ext4_max_size(int bits)
{
loff_t res = EXT4_NDIR_BLOCKS;
- /* This constant is calculated to be the largest file size for a
- * dense, 4k-blocksize file such that the total number of
+ int meta_blocks;
+ loff_t upper_limit;
+ /* This is calculated to be the largest file size for a
+ * dense file such that the total number of
* sectors in the file, including data and all indirect blocks,
- * does not exceed 2^32. */
- const loff_t upper_limit = 0x1ff7fffd000LL;
+ * does not exceed 2^48 - 1.
+ * __u32 i_blocks_lo and __u16 i_blocks_high represent the
+ * total number of 512-byte blocks of the file
+ */
+
+ if (sizeof(blkcnt_t) < sizeof(u64)) {
+ /*
+ * CONFIG_LSF is not enabled implies the inode
+ * i_block represent total blocks in 512 bytes
+ * 32 == size of vfs inode i_blocks * 8
+ */
+ upper_limit = (1LL << 32) - 1;
+
+ /* total blocks in file system block size */
+ upper_limit >>= (bits - 9);
+
+ } else {
+ /* We use 48 bit ext4_inode i_blocks */
+ upper_limit = (1LL << 48) - 1;
+
+ /* total blocks in file system block size */
+ upper_limit >>= (bits - 9);
+ }
+
+ /* indirect blocks */
+ meta_blocks = 1;
+ /* double indirect blocks */
+ meta_blocks += 1 + (1LL << (bits-2));
+ /* triple indirect blocks */
+ meta_blocks += 1 + (1LL << (bits-2)) + (1LL << (2*(bits-2)));
+
+ upper_limit -= meta_blocks;
+ upper_limit <<= bits;

res += 1LL << (bits-2);
res += 1LL << (2*(bits-2));
@@ -1621,6 +1654,10 @@ static loff_t ext4_max_size(int bits)
res <<= bits;
if (res > upper_limit)
res = upper_limit;
+
+ if (res > MAX_LFS_FILESIZE)
+ res = MAX_LFS_FILESIZE;
+
return res;
}

@@ -1789,6 +1826,19 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)
sb->s_id, le32_to_cpu(features));
goto failed_mount;
}
+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_HUGE_FILE)) {
+ /*
+ * Large file size enabled file system can only be
+ * mounted if the kernel is built with CONFIG_LSF
+ */
+ if (sizeof(root->i_blocks) < sizeof(u64) &&
+ !(sb->s_flags & MS_RDONLY)) {
+ printk(KERN_ERR "EXT4-fs: %s: Filesystem with huge "
+ "files cannot be mounted read-write "
+ "without CONFIG_LSF.\n", sb->s_id);
+ goto failed_mount;
+ }
+ }
blocksize = BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);

if (blocksize < EXT4_MIN_BLOCK_SIZE ||
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index a8f3fae..be25eca 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -282,7 +282,7 @@ struct ext4_inode {
__le32 i_dtime; /* Deletion Time */
__le16 i_gid; /* Low 16 bits of Group Id */
__le16 i_links_count; /* Links count */
- __le32 i_blocks; /* Blocks count */
+ __le32 i_blocks_lo; /* Blocks count */
__le32 i_flags; /* File flags */
union {
struct {
@@ -302,7 +302,7 @@ struct ext4_inode {
__le32 i_obso_faddr; /* Obsoleted fragment address */
union {
struct {
- __le16 l_i_reserved1; /* Obsoleted fragment number/size which are removed in ext4 */
+ __le16 l_i_blocks_high; /* was l_i_reserved1 */
__le16 l_i_file_acl_high;
__le16 l_i_uid_high; /* these 2 fields */
__le16 l_i_gid_high; /* were reserved2[0] */
@@ -404,6 +404,7 @@ do { \
#if defined(__KERNEL__) || defined(__linux__)
#define i_reserved1 osd1.linux1.l_i_reserved1
#define i_file_acl_high osd2.linux2.l_i_file_acl_high
+#define i_blocks_high osd2.linux2.l_i_blocks_high
#define i_uid_low i_uid
#define i_gid_low i_gid
#define i_uid_high osd2.linux2.l_i_uid_high
@@ -670,6 +671,7 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
#define EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001
#define EXT4_FEATURE_RO_COMPAT_LARGE_FILE 0x0002
#define EXT4_FEATURE_RO_COMPAT_BTREE_DIR 0x0004
+#define EXT4_FEATURE_RO_COMPAT_HUGE_FILE 0x0008
#define EXT4_FEATURE_RO_COMPAT_GDT_CSUM 0x0010
#define EXT4_FEATURE_RO_COMPAT_DIR_NLINK 0x0020
#define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE 0x0040
@@ -681,6 +683,7 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
#define EXT4_FEATURE_INCOMPAT_META_BG 0x0010
#define EXT4_FEATURE_INCOMPAT_EXTENTS 0x0040 /* extents support */
#define EXT4_FEATURE_INCOMPAT_64BIT 0x0080
+#define EXT4_FEATURE_INCOMPAT_MMP 0x0100
#define EXT4_FEATURE_INCOMPAT_FLEX_BG 0x0200

#define EXT4_FEATURE_COMPAT_SUPP EXT2_FEATURE_COMPAT_EXT_ATTR
@@ -695,7 +698,8 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
EXT4_FEATURE_RO_COMPAT_GDT_CSUM| \
EXT4_FEATURE_RO_COMPAT_DIR_NLINK | \
EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE | \
- EXT4_FEATURE_RO_COMPAT_BTREE_DIR)
+ EXT4_FEATURE_RO_COMPAT_BTREE_DIR |\
+ EXT4_FEATURE_RO_COMPAT_HUGE_FILE)

/*
* Default values for user and/or group using reserved blocks
--
1.5.4.rc3.31.g1271-dirty
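
The arithmetic in ext4_max_size() from the patch above can be tried out in userspace. The sketch below follows the same steps: take the sector limit, convert it to file system blocks, subtract the indirect metadata blocks, and compare against what the indirect tree can address. Some assumptions: sketch_max_size and its blkcnt_is_64bit parameter are hypothetical, the parameter replaces the sizeof(blkcnt_t) test, 12 is used for EXT4_NDIR_BLOCKS, the triple-indirect term that falls between the diff hunks is filled in from the geometry, and the MAX_LFS_FILESIZE clamp is left out.

```c
#include <stdint.h>

/* bits is log2 of the file system block size, e.g. 12 for 4 KiB. */
static int64_t sketch_max_size(int bits, int blkcnt_is_64bit)
{
	int64_t res = 12; /* direct blocks (EXT4_NDIR_BLOCKS) */
	int64_t upper_limit;
	int64_t meta_blocks;

	/* limit on the number of 512-byte sectors in i_blocks */
	upper_limit = blkcnt_is_64bit ? (1LL << 48) - 1 : (1LL << 32) - 1;

	/* convert to file system blocks */
	upper_limit >>= (bits - 9);

	/* indirect block */
	meta_blocks = 1;
	/* double indirect blocks */
	meta_blocks += 1 + (1LL << (bits - 2));
	/* triple indirect blocks */
	meta_blocks += 1 + (1LL << (bits - 2)) + (1LL << (2 * (bits - 2)));

	upper_limit -= meta_blocks;
	upper_limit <<= bits;

	/* blocks addressable through the indirect tree */
	res += 1LL << (bits - 2);
	res += 1LL << (2 * (bits - 2));
	res += 1LL << (3 * (bits - 2));
	res <<= bits;

	if (res > upper_limit)
		res = upper_limit;
	return res;
}
```

For 4 KiB blocks, the 32-bit sector count is the binding limit, while with a 48-bit count the indirect-tree geometry becomes the limit instead.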

2008-01-22 03:15:56

by Theodore Ts'o

Subject: [PATCH 30/49] ext4: Convert truncate_mutex to read write semaphore.

From: Aneesh Kumar K.V <[email protected]>

We are currently taking the truncate_mutex for every read. This hurts
performance on machines with a large number of CPUs. Convert the lock to
a read-write semaphore and take only the read lock when we are reading
the file.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/balloc.c | 2 +-
fs/ext4/extents.c | 13 +++++++------
fs/ext4/file.c | 4 ++--
fs/ext4/inode.c | 39 ++++++++++++++++++++++++++++++++-------
fs/ext4/ioctl.c | 4 ++--
fs/ext4/super.c | 2 +-
include/linux/ext4_fs.h | 25 ++++---------------------
include/linux/ext4_fs_i.h | 6 +++---
8 files changed, 52 insertions(+), 43 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index a9140ea..925e063 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -526,7 +526,7 @@ static inline int rsv_is_empty(struct ext4_reserve_window *rsv)
* when setting the reservation window size through ioctl before the file
* is open for write (needs block allocation).
*
- * Needs truncate_mutex protection prior to call this function.
+ * Needs down_write(i_data_sem) protection prior to call this function.
*/
void ext4_init_block_alloc_info(struct inode *inode)
{
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index ec5019f..03d1bbb 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1565,7 +1565,7 @@ static int ext4_ext_rm_idx(handle_t *handle, struct inode *inode,
* This routine returns max. credits that the extent tree can consume.
* It should be OK for low-performance paths like ->writepage()
* To allow many writing processes to fit into a single transaction,
- * the caller should calculate credits under truncate_mutex and
+ * the caller should calculate credits under i_data_sem and
* pass the actual path.
*/
int ext4_ext_calc_credits_for_insert(struct inode *inode,
@@ -2131,7 +2131,8 @@ out:

/*
* Need to be called with
- * mutex_lock(&EXT4_I(inode)->truncate_mutex);
+ * down_read(&EXT4_I(inode)->i_data_sem) if not allocating file system block
+ * (ie, create is zero). Otherwise down_write(&EXT4_I(inode)->i_data_sem)
*/
int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ext4_lblk_t iblock,
@@ -2350,7 +2351,7 @@ void ext4_ext_truncate(struct inode * inode, struct page *page)
if (page)
ext4_block_truncate_page(handle, page, mapping, inode->i_size);

- mutex_lock(&EXT4_I(inode)->truncate_mutex);
+ down_write(&EXT4_I(inode)->i_data_sem);
ext4_ext_invalidate_cache(inode);

/*
@@ -2386,7 +2387,7 @@ out_stop:
if (inode->i_nlink)
ext4_orphan_del(handle, inode);

- mutex_unlock(&EXT4_I(inode)->truncate_mutex);
+ up_write(&EXT4_I(inode)->i_data_sem);
ext4_journal_stop(handle);
}

@@ -2450,7 +2451,7 @@ long ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
* modify 1 super block, 1 block bitmap and 1 group descriptor.
*/
credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
- mutex_lock(&EXT4_I(inode)->truncate_mutex)
+ down_write((&EXT4_I(inode)->i_data_sem));
retry:
while (ret >= 0 && ret < max_blocks) {
block = block + ret;
@@ -2507,7 +2508,7 @@ retry:
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;

- mutex_unlock(&EXT4_I(inode)->truncate_mutex)
+ up_write((&EXT4_I(inode)->i_data_sem));
/*
* Time to update the file size.
* Update only when preallocation was requested beyond the file size.
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index a6b2aa1..ac35ec5 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -37,9 +37,9 @@ static int ext4_release_file (struct inode * inode, struct file * filp)
if ((filp->f_mode & FMODE_WRITE) &&
(atomic_read(&inode->i_writecount) == 1))
{
- mutex_lock(&EXT4_I(inode)->truncate_mutex);
+ down_write(&EXT4_I(inode)->i_data_sem);
ext4_discard_reservation(inode);
- mutex_unlock(&EXT4_I(inode)->truncate_mutex);
+ up_write(&EXT4_I(inode)->i_data_sem);
}
if (is_dx(inode) && filp->private_data)
ext4_htree_free_dir_info(filp->private_data);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 71c7ad0..596b3ab 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -308,7 +308,7 @@ static int ext4_block_to_path(struct inode *inode,
final = ptrs;
} else {
ext4_warning(inode->i_sb, "ext4_block_to_path",
- "block %u > max",
+ "block %lu > max",
i_block + direct_blocks +
indirect_blocks + double_blocks);
}
@@ -345,7 +345,7 @@ static int ext4_block_to_path(struct inode *inode,
* the whole chain, all way to the data (returns %NULL, *err == 0).
*
* Need to be called with
- * mutex_lock(&EXT4_I(inode)->truncate_mutex)
+ * down_read(&EXT4_I(inode)->i_data_sem)
*/
static Indirect *ext4_get_branch(struct inode *inode, int depth,
ext4_lblk_t *offsets,
@@ -777,7 +777,8 @@ err_out:
*
*
* Need to be called with
- * mutex_lock(&EXT4_I(inode)->truncate_mutex)
+ * down_read(&EXT4_I(inode)->i_data_sem) if not allocating file system block
+ * (ie, create is zero). Otherwise down_write(&EXT4_I(inode)->i_data_sem)
*/
int ext4_get_blocks_handle(handle_t *handle, struct inode *inode,
ext4_lblk_t iblock, unsigned long maxblocks,
@@ -865,7 +866,7 @@ int ext4_get_blocks_handle(handle_t *handle, struct inode *inode,
err = ext4_splice_branch(handle, inode, iblock,
partial, indirect_blks, count);
/*
- * i_disksize growing is protected by truncate_mutex. Don't forget to
+ * i_disksize growing is protected by i_data_sem. Don't forget to
* protect it if you're about to implement concurrent
* ext4_get_block() -bzzz
*/
@@ -895,6 +896,30 @@ out:

#define DIO_CREDITS (EXT4_RESERVE_TRANS_BLOCKS + 32)

+int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
+ unsigned long max_blocks, struct buffer_head *bh,
+ int create, int extend_disksize)
+{
+ int retval;
+ if (create) {
+ down_write((&EXT4_I(inode)->i_data_sem));
+ } else {
+ down_read((&EXT4_I(inode)->i_data_sem));
+ }
+ if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) {
+ retval = ext4_ext_get_blocks(handle, inode, block, max_blocks,
+ bh, create, extend_disksize);
+ } else {
+ retval = ext4_get_blocks_handle(handle, inode, block,
+ max_blocks, bh, create, extend_disksize);
+ }
+ if (create) {
+ up_write((&EXT4_I(inode)->i_data_sem));
+ } else {
+ up_read((&EXT4_I(inode)->i_data_sem));
+ }
+ return retval;
+}
static int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
@@ -1399,7 +1424,7 @@ static int jbd2_journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
* ext4_file_write() -> generic_file_write() -> __alloc_pages() -> ...
*
* Same applies to ext4_get_block(). We will deadlock on various things like
- * lock_journal and i_truncate_mutex.
+ * lock_journal and i_data_sem
*
* Setting PF_MEMALLOC here doesn't work - too many internal memory
* allocations fail.
@@ -2325,7 +2350,7 @@ void ext4_truncate(struct inode *inode)
* From here we block out all ext4_get_block() callers who want to
* modify the block allocation tree.
*/
- mutex_lock(&ei->truncate_mutex);
+ down_write(&ei->i_data_sem);

if (n == 1) { /* direct blocks */
ext4_free_data(handle, inode, NULL, i_data+offsets[0],
@@ -2389,7 +2414,7 @@ do_indirects:

ext4_discard_reservation(inode);

- mutex_unlock(&ei->truncate_mutex);
+ up_write(&ei->i_data_sem);
inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
ext4_mark_inode_dirty(handle, inode);

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index e7f894b..c0e5b8c 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -199,7 +199,7 @@ flags_err:
* need to allocate reservation structure for this inode
* before set the window size
*/
- mutex_lock(&ei->truncate_mutex);
+ down_write(&ei->i_data_sem);
if (!ei->i_block_alloc_info)
ext4_init_block_alloc_info(inode);

@@ -207,7 +207,7 @@ flags_err:
struct ext4_reserve_window_node *rsv = &ei->i_block_alloc_info->rsv_window_node;
rsv->rsv_goal_size = rsv_window_size;
}
- mutex_unlock(&ei->truncate_mutex);
+ up_write(&ei->i_data_sem);
return 0;
}
case EXT4_IOC_GROUP_EXTEND: {
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index effd375..c730544 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -593,7 +593,7 @@ static void init_once(struct kmem_cache *cachep, void *foo)
#ifdef CONFIG_EXT4DEV_FS_XATTR
init_rwsem(&ei->xattr_sem);
#endif
- mutex_init(&ei->truncate_mutex);
+ init_rwsem(&ei->i_data_sem);
inode_init_once(&ei->vfs_inode);
}

diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index 583049c..300cc5a 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -1107,27 +1107,10 @@ extern void ext4_ext_init(struct super_block *);
extern void ext4_ext_release(struct super_block *);
extern long ext4_fallocate(struct inode *inode, int mode, loff_t offset,
loff_t len);
-static inline int
-ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
- unsigned long max_blocks, struct buffer_head *bh,
- int create, int extend_disksize)
-{
- int retval;
- mutex_lock(&EXT4_I(inode)->truncate_mutex);
- if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) {
- retval = ext4_ext_get_blocks(handle, inode,
- (ext4_lblk_t)block, max_blocks,
- bh, create, extend_disksize);
- } else {
- retval = ext4_get_blocks_handle(handle, inode,
- (ext4_lblk_t)block, max_blocks,
- bh, create, extend_disksize);
- }
- mutex_unlock(&EXT4_I(inode)->truncate_mutex);
- return retval;
-}
-
-
+extern int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode,
+ sector_t block, unsigned long max_blocks,
+ struct buffer_head *bh, int create,
+ int extend_disksize);
#endif /* __KERNEL__ */

#endif /* _LINUX_EXT4_FS_H */
diff --git a/include/linux/ext4_fs_i.h b/include/linux/ext4_fs_i.h
index f1cd493..4377d24 100644
--- a/include/linux/ext4_fs_i.h
+++ b/include/linux/ext4_fs_i.h
@@ -139,16 +139,16 @@ struct ext4_inode_info {
__u16 i_extra_isize;

/*
- * truncate_mutex is for serialising ext4_truncate() against
+ * i_data_sem is for serialising ext4_truncate() against
* ext4_getblock(). In the 2.4 ext2 design, great chunks of inode's
* data tree are chopped off during truncate. We can't do that in
* ext4 because whenever we perform intermediate commits during
* truncate, the inode and all the metadata blocks *must* be in a
* consistent state which allows truncation of the orphans to restart
* during recovery. Hence we must fix the get_block-vs-truncate race
- * by other means, so we have truncate_mutex.
+ * by other means, so we have i_data_sem.
*/
- struct mutex truncate_mutex;
+ struct rw_semaphore i_data_sem;
struct inode vfs_inode;

unsigned long i_ext_generation;
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:16:27

by Theodore Ts'o

Subject: [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl

From: Aneesh Kumar K.V <[email protected]>

This patch adds an ioctl for migrating an ext3 indirect-block-mapped
inode to an ext4 extent-mapped inode.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/Makefile | 2 +-
fs/ext4/ioctl.c | 3 +
fs/ext4/migrate.c | 634 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/ext4_fs.h | 4 +
4 files changed, 642 insertions(+), 1 deletions(-)
create mode 100644 fs/ext4/migrate.c
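
A minimal sketch of how userspace might invoke the new ioctl. The request
number _IO('f', 7) is the one this patch defines in include/linux/ext4_fs.h;
the helper name migrate_to_extents() is hypothetical, and opening the file
O_RDONLY is an assumption, since the patch does not say which open mode
it expects.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Request number as defined by this patch. */
#ifndef EXT4_IOC_MIGRATE
#define EXT4_IOC_MIGRATE _IO('f', 7)
#endif

/*
 * Hypothetical helper: ask the kernel to convert `path` to extent
 * format. Returns 0 on success, or -1 with errno set.
 */
static int migrate_to_extents(const char *path)
{
	int fd, ret, saved;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, EXT4_IOC_MIGRATE);
	saved = errno;		/* close() may overwrite errno */
	close(fd);
	errno = saved;
	return ret;
}
```

Going by ext4_ext_migrate() in this patch, the call is expected to fail with
EINVAL when the filesystem is mounted with noextents or when the inode
already has EXT4_EXTENTS_FL set.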

diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index ae6e7e5..d5fd80b 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_EXT4DEV_FS) += ext4dev.o

ext4dev-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o \
- ext4_jbd2.o
+ ext4_jbd2.o migrate.o

ext4dev-$(CONFIG_EXT4DEV_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext4dev-$(CONFIG_EXT4DEV_FS_POSIX_ACL) += acl.o
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index c0e5b8c..2ed7c37 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -254,6 +254,9 @@ flags_err:
return err;
}

+ case EXT4_IOC_MIGRATE:
+ return ext4_ext_migrate(inode, filp, cmd, arg);
+
default:
return -ENOTTY;
}
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
new file mode 100644
index 0000000..7203d3d
--- /dev/null
+++ b/fs/ext4/migrate.c
@@ -0,0 +1,634 @@
+/*
+ * Copyright IBM Corporation, 2007
+ * Author Aneesh Kumar K.V <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/ext4_jbd2.h>
+#include <linux/ext4_fs_extents.h>
+
+struct list_blocks_struct {
+ ext4_lblk_t first_block, last_block;
+ ext4_fsblk_t first_pblock, last_pblock;
+};
+
+/* will go away */
+static void ext4_ext_store_pblock(struct ext4_extent *ex, ext4_fsblk_t pb)
+{
+ ex->ee_start_lo = cpu_to_le32((unsigned long) (pb & 0xffffffff));
+ ex->ee_start_hi = cpu_to_le16((unsigned long) ((pb >> 31) >> 1)
+ & 0xffff);
+}
+
+static int finish_range(handle_t *handle, struct inode *inode,
+ struct list_blocks_struct *lb)
+
+{
+ int retval = 0, needed;
+ struct ext4_extent newext;
+ struct ext4_ext_path *path;
+ if (lb->first_pblock == 0)
+ return 0;
+
+ /* Add the extent to temp inode*/
+ newext.ee_block = cpu_to_le32(lb->first_block);
+ newext.ee_len = cpu_to_le16(lb->last_block - lb->first_block + 1);
+ ext4_ext_store_pblock(&newext, lb->first_pblock);
+ path = ext4_ext_find_extent(inode, lb->first_block, NULL);
+
+ if (IS_ERR(path)) {
+ retval = PTR_ERR(path);
+ goto err_out;
+ }
+
+ /*
+ * Calculate the credits needed to insert this extent.
+ * Since we are doing this in a loop we may accumulate extra
+ * credits. But below we try not to accumulate too many
+ * of them by restarting the journal.
+ */
+ needed = ext4_ext_calc_credits_for_insert(inode, path);
+
+ /*
+ * Make sure the credits we accumulated are not too high
+ */
+
+ if (needed && handle->h_buffer_credits >= EXT4_RESERVE_TRANS_BLOCKS) {
+
+ retval = ext4_journal_restart(handle, needed);
+ if (retval)
+ goto err_out;
+
+ }
+
+ if (needed) {
+ retval = ext4_journal_extend(handle, needed);
+ if (retval != 0) {
+ /*
+ * If we are not able to extend the journal, restart it
+ */
+ retval = ext4_journal_restart(handle, needed);
+ if (retval)
+ goto err_out;
+ }
+ }
+
+ retval = ext4_ext_insert_extent(handle, inode, path, &newext);
+
+err_out:
+ lb->first_pblock = 0;
+ return retval;
+}
+static int update_extent_range(handle_t *handle, struct inode *inode,
+ ext4_fsblk_t pblock, ext4_lblk_t blk_num,
+ struct list_blocks_struct *lb)
+{
+ int retval;
+
+ /*
+ * See if we can add on to the existing range (if it exists)
+ */
+ if (lb->first_pblock &&
+ (lb->last_pblock+1 == pblock) &&
+ (lb->last_block+1 == blk_num)) {
+ lb->last_pblock = pblock;
+ lb->last_block = blk_num;
+ return 0;
+ }
+ /*
+ * Start a new range.
+ */
+ retval = finish_range(handle, inode, lb);
+ lb->first_pblock = lb->last_pblock = pblock;
+ lb->first_block = lb->last_block = blk_num;
+
+ return retval;
+
+}
+
+static int update_ind_extent_range(handle_t *handle, struct inode *inode,
+ ext4_fsblk_t pblock, ext4_lblk_t *blk_nump,
+ struct list_blocks_struct *lb)
+{
+ struct buffer_head *bh;
+ __le32 *i_data;
+ int i, retval = 0;
+ ext4_lblk_t blk_count = *blk_nump;
+ unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
+
+ if (!pblock) {
+ /* Only update the file block number */
+ *blk_nump += max_entries;
+ return 0;
+ }
+
+ bh = sb_bread(inode->i_sb, pblock);
+ if (!bh)
+ return -EIO;
+
+ i_data = (__le32 *)bh->b_data;
+
+ for (i = 0; i < max_entries; i++, blk_count++) {
+ if (i_data[i]) {
+ retval = update_extent_range(handle, inode,
+ le32_to_cpu(i_data[i]),
+ blk_count, lb);
+ if (retval)
+ break;
+ }
+ }
+
+ /* Update the file block number */
+ *blk_nump = blk_count;
+ brelse(bh);
+ return retval;
+
+}
+static int update_dind_extent_range(handle_t *handle, struct inode *inode,
+ ext4_fsblk_t pblock, ext4_lblk_t *blk_nump,
+ struct list_blocks_struct *lb)
+{
+ struct buffer_head *bh;
+ __le32 *i_data;
+ int i, retval = 0;
+ ext4_lblk_t blk_count = *blk_nump;
+ unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
+
+ if (!pblock) {
+ /* Only update the file block number */
+ *blk_nump += max_entries * max_entries;
+ return 0;
+ }
+
+ bh = sb_bread(inode->i_sb, pblock);
+ if (!bh)
+ return -EIO;
+
+ i_data = (__le32 *)bh->b_data;
+
+ for (i = 0; i < max_entries; i++) {
+ if (i_data[i]) {
+ retval = update_ind_extent_range(handle, inode,
+ le32_to_cpu(i_data[i]),
+ &blk_count, lb);
+ if (retval)
+ break;
+ } else {
+ /* Only update the file block number */
+ blk_count += max_entries;
+ }
+ }
+
+ /* Update the file block number */
+ *blk_nump = blk_count;
+ brelse(bh);
+ return retval;
+
+}
+static int update_tind_extent_range(handle_t *handle, struct inode *inode,
+ ext4_fsblk_t pblock, ext4_lblk_t *blk_nump,
+ struct list_blocks_struct *lb)
+{
+ struct buffer_head *bh;
+ __le32 *i_data;
+ int i, retval = 0;
+ ext4_lblk_t blk_count = *blk_nump;
+ unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
+
+ if (!pblock) {
+ /* Only update the file block number */
+ *blk_nump += max_entries * max_entries * max_entries;
+ return 0;
+ }
+
+ bh = sb_bread(inode->i_sb, pblock);
+ if (!bh)
+ return -EIO;
+
+ i_data = (__le32 *)bh->b_data;
+
+ for (i = 0; i < max_entries; i++) {
+ if (i_data[i]) {
+ retval = update_dind_extent_range(handle, inode,
+ le32_to_cpu(i_data[i]),
+ &blk_count, lb);
+ if (retval)
+ break;
+ } else {
+ /* Only update the file block number */
+ blk_count += max_entries * max_entries;
+ }
+ }
+
+ /* Update the file block number */
+ *blk_nump = blk_count;
+ brelse(bh);
+ return retval;
+
+}
+
+
+static int free_dind_blocks(handle_t *handle,
+ struct inode *inode, __le32 i_data)
+{
+ int i;
+ __le32 *tmp_idata;
+ struct buffer_head *bh;
+ unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
+
+ bh = sb_bread(inode->i_sb, le32_to_cpu(i_data));
+ if (!bh)
+ return -EIO;
+
+ tmp_idata = (__le32 *)bh->b_data;
+ for (i = 0; i < max_entries; i++) {
+ if (tmp_idata[i]) {
+ ext4_free_blocks(handle, inode,
+ le32_to_cpu(tmp_idata[i]), 1);
+ }
+ }
+ brelse(bh);
+ ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1);
+
+ return 0;
+
+
+}
+
+static int free_tind_blocks(handle_t *handle,
+ struct inode *inode, __le32 i_data)
+{
+ int i, retval = 0;
+ __le32 *tmp_idata;
+ struct buffer_head *bh;
+ unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
+
+ bh = sb_bread(inode->i_sb, le32_to_cpu(i_data));
+ if (!bh)
+ return -EIO;
+
+ tmp_idata = (__le32 *)bh->b_data;
+
+ for (i = 0; i < max_entries; i++) {
+ if (tmp_idata[i]) {
+ retval = free_dind_blocks(handle,
+ inode, tmp_idata[i]);
+ if (retval) {
+ brelse(bh);
+ return retval;
+ }
+ }
+ }
+ brelse(bh);
+ ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1);
+
+ return 0;
+
+
+}
+
+static int free_ind_block(handle_t *handle, struct inode *inode)
+{
+ int retval;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+
+ if (ei->i_data[EXT4_IND_BLOCK]) {
+
+ ext4_free_blocks(handle, inode,
+ le32_to_cpu(ei->i_data[EXT4_IND_BLOCK]), 1);
+
+ }
+
+ if (ei->i_data[EXT4_DIND_BLOCK]) {
+ retval = free_dind_blocks(handle, inode,
+ ei->i_data[EXT4_DIND_BLOCK]);
+ if (retval)
+ return retval;
+ }
+
+ if (ei->i_data[EXT4_TIND_BLOCK]) {
+ retval = free_tind_blocks(handle, inode,
+ ei->i_data[EXT4_TIND_BLOCK]);
+ if (retval)
+ return retval;
+ }
+
+
+ return 0;
+}
+static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
+ struct inode *tmp_inode, int retval)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_inode_info *tmp_ei = EXT4_I(tmp_inode);
+
+
+ retval = free_ind_block(handle, inode);
+ if (retval)
+ goto err_out;
+
+ /*
+ * One credit accounted for writing the
+ * i_data field of the original inode
+ */
+ retval = ext4_journal_extend(handle, 1);
+ if (retval != 0) {
+ retval = ext4_journal_restart(handle, 1);
+ if (retval)
+ goto err_out;
+ }
+
+ /*
+ * We have the extent map built with the tmp inode.
+ * Now copy the i_data across
+ */
+ ei->i_flags |= EXT4_EXTENTS_FL;
+ memcpy(ei->i_data, tmp_ei->i_data, sizeof(ei->i_data));
+
+ /*
+ * Update i_blocks with the new blocks that got
+ * allocated while adding extents for extent index
+ * blocks.
+ *
+ * While converting to extents we need not
+ * update the original inode i_blocks for extent blocks
+ * via quota APIs. The quota update happened via tmp_inode already.
+ */
+ spin_lock(&inode->i_lock);
+ inode->i_blocks += tmp_inode->i_blocks;
+ spin_unlock(&inode->i_lock);
+
+ ext4_mark_inode_dirty(handle, inode);
+
+err_out:
+
+ return retval;
+}
+
+/* Will go away */
+static ext4_fsblk_t idx_pblock(struct ext4_extent_idx *ix)
+{
+ ext4_fsblk_t block;
+
+ block = le32_to_cpu(ix->ei_leaf_lo);
+ block |= ((ext4_fsblk_t) le16_to_cpu(ix->ei_leaf_hi) << 31) << 1;
+ return block;
+}
+
+static int free_ext_idx(handle_t *handle, struct inode *inode,
+ struct ext4_extent_idx *ix)
+{
+ int i, retval = 0;
+ ext4_fsblk_t block;
+ struct buffer_head *bh;
+ struct ext4_extent_header *eh;
+
+
+ block = idx_pblock(ix);
+ bh = sb_bread(inode->i_sb, block);
+ if (!bh)
+ return -EIO;
+
+ eh = (struct ext4_extent_header *)bh->b_data;
+ if (eh->eh_depth == 0) {
+
+ brelse(bh);
+ ext4_free_blocks(handle, inode, block, 1);
+
+ } else {
+
+ ix = EXT_FIRST_INDEX(eh);
+ for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
+ retval = free_ext_idx(handle, inode, ix);
+ if (retval)
+ return retval;
+ }
+
+ }
+
+ return retval;
+
+}
+/*
+ * Free the extent meta data blocks only
+ */
+static int free_ext_block(handle_t *handle, struct inode *inode)
+{
+ int i, retval = 0;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_extent_header *eh = (struct ext4_extent_header *)ei->i_data;
+ struct ext4_extent_idx *ix;
+ if (eh->eh_depth == 0) {
+ /*
+ * No extra blocks allocated for extent meta data
+ */
+ return 0;
+ }
+ ix = EXT_FIRST_INDEX(eh);
+ for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
+ retval = free_ext_idx(handle, inode, ix);
+ if (retval)
+ return retval;
+ }
+
+ return retval;
+
+}
+int ext4_ext_migrate(struct inode *inode, struct file *filp,
+ unsigned int cmd, unsigned long arg)
+{
+ handle_t *handle;
+ int retval = 0, i;
+ __le32 *i_data;
+ ext4_lblk_t blk_count = 0;
+ struct ext4_inode_info *ei;
+ struct inode *tmp_inode = NULL;
+ struct list_blocks_struct lb;
+ unsigned long max_entries;
+
+
+ if (!test_opt(inode->i_sb, EXTENTS)) {
+ /*
+ * if mounted with noextents
+ * we don't allow the migrate
+ */
+ return -EINVAL;
+ }
+
+ if ((EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+ return -EINVAL;
+
+ down_write(&EXT4_I(inode)->i_data_sem);
+
+
+ handle = ext4_journal_start(inode,
+ EXT4_DATA_TRANS_BLOCKS(inode->i_sb) +
+ EXT4_INDEX_EXTRA_TRANS_BLOCKS + 3 +
+ 2 * EXT4_QUOTA_INIT_BLOCKS(inode->i_sb)
+ + 1);
+ if (IS_ERR(handle)) {
+ retval = PTR_ERR(handle);
+ goto err_out;
+ }
+
+ tmp_inode = ext4_new_inode(handle,
+ inode->i_sb->s_root->d_inode,
+ S_IFREG);
+
+ if (IS_ERR(tmp_inode)) {
+ retval = -ENOMEM;
+ ext4_journal_stop(handle);
+ tmp_inode = NULL;
+ goto err_out;
+ }
+
+ i_size_write(tmp_inode, i_size_read(inode));
+ /*
+ * We don't want the inode to be reclaimed
+ * if we got interrupted in between. We have
+ * this tmp inode carrying reference to the
+ * data blocks of the original file. We set
+ * the i_nlink to zero at the last stage after
+ * switching the original file to extent format
+ */
+ tmp_inode->i_nlink = 1;
+
+ ext4_ext_tree_init(handle, tmp_inode);
+ ext4_orphan_add(handle, tmp_inode);
+ ext4_journal_stop(handle);
+
+ ei = EXT4_I(inode);
+ i_data = ei->i_data;
+ memset(&lb, 0, sizeof(lb));
+
+ /* 32 bit block address 4 bytes */
+ max_entries = inode->i_sb->s_blocksize >> 2;
+
+ /*
+ * start with one credit accounted for
+ * superblock modification.
+ *
+ * For the tmp_inode we have already committed the
+ * transaction that created the inode. Later, as and
+ * when we add extents, we extend the journal
+ */
+ handle = ext4_journal_start(inode, 1);
+ for (i = 0; i < EXT4_NDIR_BLOCKS; i++, blk_count++) {
+
+ if (i_data[i]) {
+ retval = update_extent_range(handle, tmp_inode,
+ le32_to_cpu(i_data[i]),
+ blk_count, &lb);
+ if (retval)
+ goto err_out;
+ }
+ }
+
+ if (i_data[EXT4_IND_BLOCK]) {
+ retval = update_ind_extent_range(handle, tmp_inode,
+ le32_to_cpu(i_data[EXT4_IND_BLOCK]),
+ &blk_count, &lb);
+ if (retval)
+ goto err_out;
+ } else {
+ blk_count += max_entries;
+ }
+
+ if (i_data[EXT4_DIND_BLOCK]) {
+ retval = update_dind_extent_range(handle, tmp_inode,
+ le32_to_cpu(i_data[EXT4_DIND_BLOCK]),
+ &blk_count, &lb);
+ if (retval)
+ goto err_out;
+ } else {
+ blk_count += max_entries * max_entries;
+ }
+
+
+ if (i_data[EXT4_TIND_BLOCK]) {
+ retval = update_tind_extent_range(handle, tmp_inode,
+ le32_to_cpu(i_data[EXT4_TIND_BLOCK]),
+ &blk_count, &lb);
+ if (retval)
+ goto err_out;
+ }
+
+ /*
+ * Build the last extent
+ */
+ retval = finish_range(handle, tmp_inode, &lb);
+
+err_out:
+ /*
+ * We are either freeing extent information or indirect
+ * blocks. During this we touch superblock, group descriptor
+ * and block bitmap. Later we mark the tmp_inode dirty
+ * via ext4_ext_tree_init. So allocate a credit of 4
+ * We may update quota (user and group).
+ *
+ * FIXME!! we may be touching bitmaps in different block groups.
+ */
+
+ if (ext4_journal_extend(handle,
+ 4 + 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb)) != 0) {
+
+ ext4_journal_restart(handle,
+ 4 + 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
+ }
+
+ if (retval) {
+ /*
+ * Failure case delete the extent information with the
+ * tmp_inode
+ */
+ free_ext_block(handle, tmp_inode);
+
+ } else {
+
+ retval = ext4_ext_swap_inode_data(handle, inode,
+ tmp_inode, retval);
+ }
+
+ /*
+ * Mark the tmp_inode as of size zero
+ */
+ i_size_write(tmp_inode, 0);
+
+ /*
+ * set the i_blocks count to zero
+ * so that the ext4_delete_inode does the
+ * right job
+ *
+ * We don't need to take the i_lock because
+ * the inode is not visible to user space.
+ */
+ tmp_inode->i_blocks = 0;
+
+ /* Reset the extent details */
+ ext4_ext_tree_init(handle, tmp_inode);
+
+ /*
+ * Set the i_nlink to zero so that
+ * generic_drop_inode really deletes the
+ * inode
+ */
+ tmp_inode->i_nlink = 0;
+
+ ext4_journal_stop(handle);
+
+ up_write(&EXT4_I(inode)->i_data_sem);
+
+ if (tmp_inode)
+ iput(tmp_inode);
+
+ return retval;
+}
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index b609294..213974f 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -243,6 +243,7 @@ struct ext4_new_group_data {
#endif
#define EXT4_IOC_GETRSVSZ _IOR('f', 5, long)
#define EXT4_IOC_SETRSVSZ _IOW('f', 6, long)
+#define EXT4_IOC_MIGRATE _IO('f', 7)

/*
* ioctl commands in 32 bit emulation
@@ -983,6 +984,9 @@ extern int ext4_ioctl (struct inode *, struct file *, unsigned int,
unsigned long);
extern long ext4_compat_ioctl (struct file *, unsigned int, unsigned long);

+/* migrate.c */
+extern int ext4_ext_migrate(struct inode *, struct file *, unsigned int,
+ unsigned long);
/* namei.c */
extern int ext4_orphan_add(handle_t *, struct inode *);
extern int ext4_orphan_del(handle_t *, struct inode *);
--
1.5.4.rc3.31.g1271-dirty
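
The core of migrate.c above is the range coalescing done by
update_extent_range() and finish_range(): a block joins the current range
only when it is contiguous both logically and physically, otherwise the
range is emitted as an extent and a new one starts. A userspace sketch of
that idea follows; struct toy_extent and toy_build_extents() are
hypothetical names, the flat array replaces the indirect-block walk, and
the journalling and on-disk extent length limit are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record: `len` blocks mapping lblk -> pblk. */
struct toy_extent {
	uint32_t lblk;
	uint64_t pblk;
	uint32_t len;
};

/*
 * Walk a flat logical->physical map (0 means hole) and merge runs that
 * are contiguous both logically and physically.
 * Returns the number of extents written to `out`.
 */
static size_t toy_build_extents(const uint64_t *map, size_t nblocks,
				struct toy_extent *out, size_t max_out)
{
	size_t n = 0;

	for (size_t i = 0; i < nblocks; i++) {
		if (map[i] == 0)
			continue;	/* hole: only the logical number advances */
		if (n > 0 &&
		    out[n - 1].lblk + out[n - 1].len == i &&
		    out[n - 1].pblk + out[n - 1].len == map[i]) {
			out[n - 1].len++;	/* extend the current range */
			continue;
		}
		assert(n < max_out);
		out[n].lblk = (uint32_t)i;	/* start a new range */
		out[n].pblk = map[i];
		out[n].len = 1;
		n++;
	}
	return n;
}
```

For the map {10, 11, 12, 0, 13, 14, 50} this yields three extents: logical
blocks 0-2 at physical 10, logical 4-5 at physical 13, and logical 6 at
physical 50. The hole at logical block 3 breaks the first run even though
physical block 13 follows 12.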

2008-01-22 03:16:49

by Theodore Ts'o

Subject: [PATCH 13/49] ext4: different maxbytes functions for bitmap & extent files

From: Eric Sandeen <[email protected]>

Use two different maxbytes functions for bitmapped and extent-based
files.

Signed-off-by: Eric Sandeen <[email protected]>
---
fs/ext4/super.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 42 insertions(+), 3 deletions(-)
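
To make the extent-file limit concrete, here is a userspace rendering of
the arithmetic in the new ext4_max_size(). It is a sketch under stated
assumptions: toy_ext4_max_size() is a hypothetical name, the small_blkcnt
flag replaces the sizeof(blkcnt_t) < sizeof(u64) test, and INT64_MAX stands
in for MAX_LFS_FILESIZE (its value on 64-bit kernels).

```c
#include <assert.h>
#include <stdint.h>

static int64_t toy_ext4_max_size(int blkbits, int small_blkcnt)
{
	int64_t upper_limit = INT64_MAX;	/* assumed MAX_LFS_FILESIZE */
	int64_t res;

	assert(blkbits >= 10 && blkbits <= 16);
	if (small_blkcnt) {
		/* 32-bit count of 512-byte sectors in the vfs inode */
		upper_limit = (1LL << 32) - 1;
		upper_limit >>= (blkbits - 9);	/* sectors -> fs blocks */
		upper_limit <<= blkbits;	/* fs blocks -> bytes */
	}

	/* 32-bit logical block number in the extent, ee_block */
	res = 1LL << 32;
	res <<= blkbits;
	res -= 1;

	return res > upper_limit ? upper_limit : res;
}
```

With 4K blocks (blkbits = 12) the extent container gives 2^44 - 1 bytes,
one byte short of 16 TiB. When the vfs i_blocks is only 32 bits wide the
result drops to 2^41 - 4096 bytes, just under 2 TiB.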

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 64067de..c79e46b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1600,19 +1600,58 @@ static void ext4_orphan_cleanup (struct super_block * sb,
#endif
sb->s_flags = s_flags; /* Restore MS_RDONLY status */
}
+/*
+ * Maximal extent format file size.
+ * Resulting logical blkno at s_maxbytes must fit in our on-disk
+ * extent format containers, within a sector_t, and within i_blocks
+ * in the vfs. ext4 inode has 48 bits of i_block in fsblock units,
+ * so that won't be a limiting factor.
+ *
+ * Note, this does *not* consider any metadata overhead for vfs i_blocks.
+ */
+static loff_t ext4_max_size(int blkbits)
+{
+ loff_t res;
+ loff_t upper_limit = MAX_LFS_FILESIZE;
+
+ /* small i_blocks in vfs inode? */
+ if (sizeof(blkcnt_t) < sizeof(u64)) {
+ /*
+ * If CONFIG_LSF is not enabled, the vfs inode's
+ * i_blocks counts total blocks in 512-byte units;
+ * 32 == size of vfs inode i_blocks * 8
+ */
+ upper_limit = (1LL << 32) - 1;
+
+ /* total blocks in file system block size */
+ upper_limit >>= (blkbits - 9);
+ upper_limit <<= blkbits;
+ }
+
+ /* 32-bit extent-start container, ee_block */
+ res = 1LL << 32;
+ res <<= blkbits;
+ res -= 1;
+
+ /* Sanity check against vm- & vfs- imposed limits */
+ if (res > upper_limit)
+ res = upper_limit;
+
+ return res;
+}

/*
- * Maximal file size. There is a direct, and {,double-,triple-}indirect
+ * Maximal bitmap file size. There is a direct, and {,double-,triple-}indirect
* block limit, and also a limit of (2^48 - 1) 512-byte sectors in i_blocks.
* We need to be 1 filesystem block less than the 2^48 sector limit.
*/
-static loff_t ext4_max_size(int bits)
+static loff_t ext4_max_bitmap_size(int bits)
{
loff_t res = EXT4_NDIR_BLOCKS;
int meta_blocks;
loff_t upper_limit;
/* This is calculated to be the largest file size for a
- * dense, file such that the total number of
+ * dense, bitmapped file such that the total number of
* sectors in the file, including data and all indirect blocks,
* does not exceed 2^48 -1
* __u32 i_blocks_lo and _u16 i_blocks_high representing the
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:17:16

by Theodore Ts'o

Subject: [PATCH 05/49] ext4: add ext4_group_t, and change all group variables to this type.

From: Avantika Mathur <[email protected]>

In many places, block group variables are of type int, which limits the
maximum number of block groups to 2^31. With a 4K block size each block
group can have up to 2^15 blocks, so the maximum filesystem size is
limited to 2^31 * (2^15 * 2^12) = 2^58 bytes, or 256 PB.

This patch introduces a new type, ext4_group_t, of type unsigned long, to
represent block group numbers in ext4. All occurrences of block group
variables are converted to type ext4_group_t.

Signed-off-by: Avantika Mathur <[email protected]>
---
fs/ext4/balloc.c | 69 +++++++++++++++++++++-----------------------
fs/ext4/group.h | 8 +++--
fs/ext4/ialloc.c | 46 +++++++++++++++--------------
fs/ext4/inode.c | 5 ++-
fs/ext4/resize.c | 12 ++++----
fs/ext4/super.c | 20 ++++++-------
include/linux/ext4_fs.h | 11 ++++---
include/linux/ext4_fs_i.h | 5 ++-
include/linux/ext4_fs_sb.h | 2 +-
9 files changed, 91 insertions(+), 87 deletions(-)
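
The size limit quoted in the commit message can be checked with a few
lines of C. The helper name toy_max_fs_bytes() is hypothetical. The 2^15
blocks per group follow from a single 4K bitmap block holding 4096 * 8
bits, one bit per block.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes addressable with 2^group_bits block groups, where each group
 * is described by one bitmap block (one bit per block in the group).
 */
static uint64_t toy_max_fs_bytes(unsigned group_bits, uint64_t block_size)
{
	uint64_t blocks_per_group = block_size * 8;

	assert(group_bits <= 32);
	return (UINT64_C(1) << group_bits) * blocks_per_group * block_size;
}
```

toy_max_fs_bytes(31, 4096) evaluates to 2^58 bytes; shifting that right by
50 gives 256, matching the 256 PB figure for a signed int group number.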

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 71ee95e..9568a57 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -29,7 +29,7 @@
* Calculate the block group number and offset, given a block number
*/
void ext4_get_group_no_and_offset(struct super_block *sb, ext4_fsblk_t blocknr,
- unsigned long *blockgrpp, ext4_grpblk_t *offsetp)
+ ext4_group_t *blockgrpp, ext4_grpblk_t *offsetp)
{
struct ext4_super_block *es = EXT4_SB(sb)->s_es;
ext4_grpblk_t offset;
@@ -46,7 +46,7 @@ void ext4_get_group_no_and_offset(struct super_block *sb, ext4_fsblk_t blocknr,
/* Initializes an uninitialized block bitmap if given, and returns the
* number of blocks free in the group. */
unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
- int block_group, struct ext4_group_desc *gdp)
+ ext4_group_t block_group, struct ext4_group_desc *gdp)
{
unsigned long start;
int bit, bit_max;
@@ -60,7 +60,7 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
* essentially implementing a per-group read-only flag. */
if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
ext4_error(sb, __FUNCTION__,
- "Checksum bad for group %u\n", block_group);
+ "Checksum bad for group %lu\n", block_group);
gdp->bg_free_blocks_count = 0;
gdp->bg_free_inodes_count = 0;
gdp->bg_itable_unused = 0;
@@ -153,7 +153,7 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
* group descriptor
*/
struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb,
- unsigned int block_group,
+ ext4_group_t block_group,
struct buffer_head ** bh)
{
unsigned long group_desc;
@@ -164,7 +164,7 @@ struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb,
if (block_group >= sbi->s_groups_count) {
ext4_error (sb, "ext4_get_group_desc",
"block_group >= groups_count - "
- "block_group = %d, groups_count = %lu",
+ "block_group = %lu, groups_count = %lu",
block_group, sbi->s_groups_count);

return NULL;
@@ -176,7 +176,7 @@ struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb,
if (!sbi->s_group_desc[group_desc]) {
ext4_error (sb, "ext4_get_group_desc",
"Group descriptor not loaded - "
- "block_group = %d, group_desc = %lu, desc = %lu",
+ "block_group = %lu, group_desc = %lu, desc = %lu",
block_group, group_desc, offset);
return NULL;
}
@@ -200,7 +200,7 @@ struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb,
* Return buffer_head on success or NULL in case of failure.
*/
struct buffer_head *
-read_block_bitmap(struct super_block *sb, unsigned int block_group)
+read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
{
struct ext4_group_desc * desc;
struct buffer_head * bh = NULL;
@@ -227,7 +227,7 @@ read_block_bitmap(struct super_block *sb, unsigned int block_group)
if (!bh)
ext4_error (sb, __FUNCTION__,
"Cannot read block bitmap - "
- "block_group = %d, block_bitmap = %llu",
+ "block_group = %lu, block_bitmap = %llu",
block_group, bitmap_blk);
return bh;
}
@@ -320,7 +320,7 @@ restart:
*/
static int
goal_in_my_reservation(struct ext4_reserve_window *rsv, ext4_grpblk_t grp_goal,
- unsigned int group, struct super_block * sb)
+ ext4_group_t group, struct super_block *sb)
{
ext4_fsblk_t group_first_block, group_last_block;

@@ -540,7 +540,7 @@ void ext4_free_blocks_sb(handle_t *handle, struct super_block *sb,
{
struct buffer_head *bitmap_bh = NULL;
struct buffer_head *gd_bh;
- unsigned long block_group;
+ ext4_group_t block_group;
ext4_grpblk_t bit;
unsigned long i;
unsigned long overflow;
@@ -920,9 +920,10 @@ claim_block(spinlock_t *lock, ext4_grpblk_t block, struct buffer_head *bh)
* ext4_journal_release_buffer(), else we'll run out of credits.
*/
static ext4_grpblk_t
-ext4_try_to_allocate(struct super_block *sb, handle_t *handle, int group,
- struct buffer_head *bitmap_bh, ext4_grpblk_t grp_goal,
- unsigned long *count, struct ext4_reserve_window *my_rsv)
+ext4_try_to_allocate(struct super_block *sb, handle_t *handle,
+ ext4_group_t group, struct buffer_head *bitmap_bh,
+ ext4_grpblk_t grp_goal, unsigned long *count,
+ struct ext4_reserve_window *my_rsv)
{
ext4_fsblk_t group_first_block;
ext4_grpblk_t start, end;
@@ -1156,7 +1157,7 @@ static int find_next_reservable_window(
*/
static int alloc_new_reservation(struct ext4_reserve_window_node *my_rsv,
ext4_grpblk_t grp_goal, struct super_block *sb,
- unsigned int group, struct buffer_head *bitmap_bh)
+ ext4_group_t group, struct buffer_head *bitmap_bh)
{
struct ext4_reserve_window_node *search_head;
ext4_fsblk_t group_first_block, group_end_block, start_block;
@@ -1354,7 +1355,7 @@ static void try_to_extend_reservation(struct ext4_reserve_window_node *my_rsv,
*/
static ext4_grpblk_t
ext4_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
- unsigned int group, struct buffer_head *bitmap_bh,
+ ext4_group_t group, struct buffer_head *bitmap_bh,
ext4_grpblk_t grp_goal,
struct ext4_reserve_window_node * my_rsv,
unsigned long *count, int *errp)
@@ -1528,12 +1529,12 @@ ext4_fsblk_t ext4_new_blocks(handle_t *handle, struct inode *inode,
{
struct buffer_head *bitmap_bh = NULL;
struct buffer_head *gdp_bh;
- unsigned long group_no;
- int goal_group;
+ ext4_group_t group_no;
+ ext4_group_t goal_group;
ext4_grpblk_t grp_target_blk; /* blockgroup relative goal block */
ext4_grpblk_t grp_alloc_blk; /* blockgroup-relative allocated block*/
ext4_fsblk_t ret_block; /* filesyetem-wide allocated block */
- int bgi; /* blockgroup iteration index */
+ ext4_group_t bgi; /* blockgroup iteration index */
int fatal = 0, err;
int performed_allocation = 0;
ext4_grpblk_t free_blocks; /* number of free blocks in a group */
@@ -1544,10 +1545,7 @@ ext4_fsblk_t ext4_new_blocks(handle_t *handle, struct inode *inode,
struct ext4_reserve_window_node *my_rsv = NULL;
struct ext4_block_alloc_info *block_i;
unsigned short windowsz = 0;
-#ifdef EXT4FS_DEBUG
- static int goal_hits, goal_attempts;
-#endif
- unsigned long ngroups;
+ ext4_group_t ngroups;
unsigned long num = *count;

*errp = -ENOSPC;
@@ -1743,9 +1741,6 @@ allocated:
* list of some description. We don't know in advance whether
* the caller wants to use it as metadata or data.
*/
- ext4_debug("allocating block %lu. Goal hits %d of %d.\n",
- ret_block, goal_hits, goal_attempts);
-
spin_lock(sb_bgl_lock(sbi, group_no));
if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
@@ -1804,8 +1799,8 @@ ext4_fsblk_t ext4_count_free_blocks(struct super_block *sb)
{
ext4_fsblk_t desc_count;
struct ext4_group_desc *gdp;
- int i;
- unsigned long ngroups = EXT4_SB(sb)->s_groups_count;
+ ext4_group_t i;
+ ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
#ifdef EXT4FS_DEBUG
struct ext4_super_block *es;
ext4_fsblk_t bitmap_count;
@@ -1829,7 +1824,7 @@ ext4_fsblk_t ext4_count_free_blocks(struct super_block *sb)
continue;

x = ext4_count_free(bitmap_bh, sb->s_blocksize);
- printk("group %d: stored = %d, counted = %lu\n",
+ printk(KERN_DEBUG "group %lu: stored = %d, counted = %lu\n",
i, le16_to_cpu(gdp->bg_free_blocks_count), x);
bitmap_count += x;
}
@@ -1853,7 +1848,7 @@ ext4_fsblk_t ext4_count_free_blocks(struct super_block *sb)
#endif
}

-static inline int test_root(int a, int b)
+static inline int test_root(ext4_group_t a, int b)
{
int num = b;

@@ -1862,7 +1857,7 @@ static inline int test_root(int a, int b)
return num == a;
}

-static int ext4_group_sparse(int group)
+static int ext4_group_sparse(ext4_group_t group)
{
if (group <= 1)
return 1;
@@ -1880,7 +1875,7 @@ static int ext4_group_sparse(int group)
* Return the number of blocks used by the superblock (primary or backup)
* in this group. Currently this will be only 0 or 1.
*/
-int ext4_bg_has_super(struct super_block *sb, int group)
+int ext4_bg_has_super(struct super_block *sb, ext4_group_t group)
{
if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER) &&
@@ -1889,18 +1884,20 @@ int ext4_bg_has_super(struct super_block *sb, int group)
return 1;
}

-static unsigned long ext4_bg_num_gdb_meta(struct super_block *sb, int group)
+static unsigned long ext4_bg_num_gdb_meta(struct super_block *sb,
+ ext4_group_t group)
{
unsigned long metagroup = group / EXT4_DESC_PER_BLOCK(sb);
- unsigned long first = metagroup * EXT4_DESC_PER_BLOCK(sb);
- unsigned long last = first + EXT4_DESC_PER_BLOCK(sb) - 1;
+ ext4_group_t first = metagroup * EXT4_DESC_PER_BLOCK(sb);
+ ext4_group_t last = first + EXT4_DESC_PER_BLOCK(sb) - 1;

if (group == first || group == first + 1 || group == last)
return 1;
return 0;
}

-static unsigned long ext4_bg_num_gdb_nometa(struct super_block *sb, int group)
+static unsigned long ext4_bg_num_gdb_nometa(struct super_block *sb,
+ ext4_group_t group)
{
if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER) &&
@@ -1918,7 +1915,7 @@ static unsigned long ext4_bg_num_gdb_nometa(struct super_block *sb, int group)
* (primary or backup) in this group. In the future there may be a
* different number of descriptor blocks in each group.
*/
-unsigned long ext4_bg_num_gdb(struct super_block *sb, int group)
+unsigned long ext4_bg_num_gdb(struct super_block *sb, ext4_group_t group)
{
unsigned long first_meta_bg =
le32_to_cpu(EXT4_SB(sb)->s_es->s_first_meta_bg);
diff --git a/fs/ext4/group.h b/fs/ext4/group.h
index 1577910..7eb0604 100644
--- a/fs/ext4/group.h
+++ b/fs/ext4/group.h
@@ -14,14 +14,16 @@ extern __le16 ext4_group_desc_csum(struct ext4_sb_info *sbi, __u32 group,
extern int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 group,
struct ext4_group_desc *gdp);
struct buffer_head *read_block_bitmap(struct super_block *sb,
- unsigned int block_group);
+ ext4_group_t block_group);
extern unsigned ext4_init_block_bitmap(struct super_block *sb,
- struct buffer_head *bh, int group,
+ struct buffer_head *bh,
+ ext4_group_t group,
struct ext4_group_desc *desc);
#define ext4_free_blocks_after_init(sb, group, desc) \
ext4_init_block_bitmap(sb, NULL, group, desc)
extern unsigned ext4_init_inode_bitmap(struct super_block *sb,
- struct buffer_head *bh, int group,
+ struct buffer_head *bh,
+ ext4_group_t group,
struct ext4_group_desc *desc);
extern void mark_bitmap_end(int start_bit, int end_bit, char *bitmap);
#endif /* _LINUX_EXT4_GROUP_H */
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index c61f37f..64dea86 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -64,8 +64,8 @@ void mark_bitmap_end(int start_bit, int end_bit, char *bitmap)
}

/* Initializes an uninitialized inode bitmap */
-unsigned ext4_init_inode_bitmap(struct super_block *sb,
- struct buffer_head *bh, int block_group,
+unsigned ext4_init_inode_bitmap(struct super_block *sb, struct buffer_head *bh,
+ ext4_group_t block_group,
struct ext4_group_desc *gdp)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -75,7 +75,7 @@ unsigned ext4_init_inode_bitmap(struct super_block *sb,
/* If checksum is bad mark all blocks and inodes use to prevent
* allocation, essentially implementing a per-group read-only flag. */
if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
- ext4_error(sb, __FUNCTION__, "Checksum bad for group %u\n",
+ ext4_error(sb, __FUNCTION__, "Checksum bad for group %lu\n",
block_group);
gdp->bg_free_blocks_count = 0;
gdp->bg_free_inodes_count = 0;
@@ -98,7 +98,7 @@ unsigned ext4_init_inode_bitmap(struct super_block *sb,
* Return buffer_head of bitmap on success or NULL.
*/
static struct buffer_head *
-read_inode_bitmap(struct super_block * sb, unsigned long block_group)
+read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
{
struct ext4_group_desc *desc;
struct buffer_head *bh = NULL;
@@ -152,7 +152,7 @@ void ext4_free_inode (handle_t *handle, struct inode * inode)
unsigned long ino;
struct buffer_head *bitmap_bh = NULL;
struct buffer_head *bh2;
- unsigned long block_group;
+ ext4_group_t block_group;
unsigned long bit;
struct ext4_group_desc * gdp;
struct ext4_super_block * es;
@@ -260,12 +260,12 @@ error_return:
* For other inodes, search forward from the parent directory\'s block
* group to find a free inode.
*/
-static int find_group_dir(struct super_block *sb, struct inode *parent)
+static ext4_group_t find_group_dir(struct super_block *sb, struct inode *parent)
{
- int ngroups = EXT4_SB(sb)->s_groups_count;
+ ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
unsigned int freei, avefreei;
struct ext4_group_desc *desc, *best_desc = NULL;
- int group, best_group = -1;
+ ext4_group_t group, best_group = -1;

freei = percpu_counter_read_positive(&EXT4_SB(sb)->s_freeinodes_counter);
avefreei = freei / ngroups;
@@ -314,12 +314,13 @@ static int find_group_dir(struct super_block *sb, struct inode *parent)
#define INODE_COST 64
#define BLOCK_COST 256

-static int find_group_orlov(struct super_block *sb, struct inode *parent)
+static ext4_group_t find_group_orlov(struct super_block *sb,
+ struct inode *parent)
{
- int parent_group = EXT4_I(parent)->i_block_group;
+ ext4_group_t parent_group = EXT4_I(parent)->i_block_group;
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_super_block *es = sbi->s_es;
- int ngroups = sbi->s_groups_count;
+ ext4_group_t ngroups = sbi->s_groups_count;
int inodes_per_group = EXT4_INODES_PER_GROUP(sb);
unsigned int freei, avefreei;
ext4_fsblk_t freeb, avefreeb;
@@ -327,7 +328,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent)
unsigned int ndirs;
int max_debt, max_dirs, min_inodes;
ext4_grpblk_t min_blocks;
- int group = -1, i;
+ ext4_group_t group = -1, i;
struct ext4_group_desc *desc;

freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
@@ -340,7 +341,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent)
if ((parent == sb->s_root->d_inode) ||
(EXT4_I(parent)->i_flags & EXT4_TOPDIR_FL)) {
int best_ndir = inodes_per_group;
- int best_group = -1;
+ ext4_group_t best_group = -1;

get_random_bytes(&group, sizeof(group));
parent_group = (unsigned)group % ngroups;
@@ -415,12 +416,13 @@ fallback:
return -1;
}

-static int find_group_other(struct super_block *sb, struct inode *parent)
+static ext4_group_t find_group_other(struct super_block *sb,
+ struct inode *parent)
{
- int parent_group = EXT4_I(parent)->i_block_group;
- int ngroups = EXT4_SB(sb)->s_groups_count;
+ ext4_group_t parent_group = EXT4_I(parent)->i_block_group;
+ ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
struct ext4_group_desc *desc;
- int group, i;
+ ext4_group_t group, i;

/*
* Try to place the inode in its parent directory
@@ -487,7 +489,7 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode * dir, int mode)
struct super_block *sb;
struct buffer_head *bitmap_bh = NULL;
struct buffer_head *bh2;
- int group;
+ ext4_group_t group;
unsigned long ino = 0;
struct inode * inode;
struct ext4_group_desc * gdp = NULL;
@@ -583,7 +585,7 @@ got:
ino > EXT4_INODES_PER_GROUP(sb)) {
ext4_error(sb, __FUNCTION__,
"reserved inode or inode > inodes count - "
- "block_group = %d, inode=%lu", group,
+ "block_group = %lu, inode=%lu", group,
ino + group * EXT4_INODES_PER_GROUP(sb));
err = -EIO;
goto fail;
@@ -777,7 +779,7 @@ fail_drop:
struct inode *ext4_orphan_get(struct super_block *sb, unsigned long ino)
{
unsigned long max_ino = le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count);
- unsigned long block_group;
+ ext4_group_t block_group;
int bit;
struct buffer_head *bitmap_bh = NULL;
struct inode *inode = NULL;
@@ -833,7 +835,7 @@ unsigned long ext4_count_free_inodes (struct super_block * sb)
{
unsigned long desc_count;
struct ext4_group_desc *gdp;
- int i;
+ ext4_group_t i;
#ifdef EXT4FS_DEBUG
struct ext4_super_block *es;
unsigned long bitmap_count, x;
@@ -879,7 +881,7 @@ unsigned long ext4_count_free_inodes (struct super_block * sb)
unsigned long ext4_count_dirs (struct super_block * sb)
{
unsigned long count = 0;
- int i;
+ ext4_group_t i;

for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
struct ext4_group_desc *gdp = ext4_get_group_desc (sb, i, NULL);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 488f829..1ee19c9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2464,7 +2464,8 @@ out_stop:
static ext4_fsblk_t ext4_get_inode_block(struct super_block *sb,
unsigned long ino, struct ext4_iloc *iloc)
{
- unsigned long desc, group_desc, block_group;
+ unsigned long desc, group_desc;
+ ext4_group_t block_group;
unsigned long offset;
ext4_fsblk_t block;
struct buffer_head *bh;
@@ -2551,7 +2552,7 @@ static int __ext4_get_inode_loc(struct inode *inode,
struct ext4_group_desc *desc;
int inodes_per_buffer;
int inode_offset, i;
- int block_group;
+ ext4_group_t block_group;
int start;

block_group = (inode->i_ino - 1) /
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index bd8a52b..7090c2d 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -28,7 +28,7 @@ static int verify_group_input(struct super_block *sb,
struct ext4_super_block *es = sbi->s_es;
ext4_fsblk_t start = ext4_blocks_count(es);
ext4_fsblk_t end = start + input->blocks_count;
- unsigned group = input->group;
+ ext4_group_t group = input->group;
ext4_fsblk_t itend = input->inode_table + sbi->s_itb_per_group;
unsigned overhead = ext4_bg_has_super(sb, group) ?
(1 + ext4_bg_num_gdb(sb, group) +
@@ -357,7 +357,7 @@ static int verify_reserved_gdb(struct super_block *sb,
struct buffer_head *primary)
{
const ext4_fsblk_t blk = primary->b_blocknr;
- const unsigned long end = EXT4_SB(sb)->s_groups_count;
+ const ext4_group_t end = EXT4_SB(sb)->s_groups_count;
unsigned three = 1;
unsigned five = 5;
unsigned seven = 7;
@@ -656,12 +656,12 @@ static void update_backups(struct super_block *sb,
int blk_off, char *data, int size)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);
- const unsigned long last = sbi->s_groups_count;
+ const ext4_group_t last = sbi->s_groups_count;
const int bpg = EXT4_BLOCKS_PER_GROUP(sb);
unsigned three = 1;
unsigned five = 5;
unsigned seven = 7;
- unsigned group;
+ ext4_group_t group;
int rest = sb->s_blocksize - size;
handle_t *handle;
int err = 0, err2;
@@ -716,7 +716,7 @@ static void update_backups(struct super_block *sb,
exit_err:
if (err) {
ext4_warning(sb, __FUNCTION__,
- "can't update backup for group %d (err %d), "
+ "can't update backup for group %lu (err %d), "
"forcing fsck on next reboot", group, err);
sbi->s_mount_state &= ~EXT4_VALID_FS;
sbi->s_es->s_state &= cpu_to_le16(~EXT4_VALID_FS);
@@ -952,7 +952,7 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
ext4_fsblk_t n_blocks_count)
{
ext4_fsblk_t o_blocks_count;
- unsigned long o_groups_count;
+ ext4_group_t o_groups_count;
ext4_grpblk_t last;
ext4_grpblk_t add;
struct buffer_head * bh;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6302b03..df8842b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1364,7 +1364,7 @@ static int ext4_check_descriptors (struct super_block * sb)
struct ext4_group_desc * gdp = NULL;
int desc_block = 0;
int flexbg_flag = 0;
- int i;
+ ext4_group_t i;

if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG))
flexbg_flag = 1;
@@ -1386,7 +1386,7 @@ static int ext4_check_descriptors (struct super_block * sb)
if (block_bitmap < first_block || block_bitmap > last_block)
{
ext4_error (sb, "ext4_check_descriptors",
- "Block bitmap for group %d"
+ "Block bitmap for group %lu"
" not in group (block %llu)!",
i, block_bitmap);
return 0;
@@ -1395,7 +1395,7 @@ static int ext4_check_descriptors (struct super_block * sb)
if (inode_bitmap < first_block || inode_bitmap > last_block)
{
ext4_error (sb, "ext4_check_descriptors",
- "Inode bitmap for group %d"
+ "Inode bitmap for group %lu"
" not in group (block %llu)!",
i, inode_bitmap);
return 0;
@@ -1405,17 +1405,16 @@ static int ext4_check_descriptors (struct super_block * sb)
inode_table + sbi->s_itb_per_group - 1 > last_block)
{
ext4_error (sb, "ext4_check_descriptors",
- "Inode table for group %d"
+ "Inode table for group %lu"
" not in group (block %llu)!",
i, inode_table);
return 0;
}
if (!ext4_group_desc_csum_verify(sbi, i, gdp)) {
ext4_error(sb, __FUNCTION__,
- "Checksum for group %d failed (%u!=%u)\n", i,
- le16_to_cpu(ext4_group_desc_csum(sbi, i,
- gdp)),
- le16_to_cpu(gdp->bg_checksum));
+ "Checksum for group %lu failed (%u!=%u)\n",
+ i, le16_to_cpu(ext4_group_desc_csum(sbi, i,
+ gdp)), le16_to_cpu(gdp->bg_checksum));
return 0;
}
if (!flexbg_flag)
@@ -1429,7 +1428,6 @@ static int ext4_check_descriptors (struct super_block * sb)
return 1;
}

-
/* ext4_orphan_cleanup() walks a singly-linked list of inodes (starting at
* the superblock) which were deleted from all directories, but held open by
* a process at the time of a crash. We walk the list and try to delete these
@@ -1570,7 +1568,7 @@ static ext4_fsblk_t descriptor_loc(struct super_block *sb,
ext4_fsblk_t logical_sb_block, int nr)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);
- unsigned long bg, first_meta_bg;
+ ext4_group_t bg, first_meta_bg;
int has_super = 0;

first_meta_bg = le32_to_cpu(sbi->s_es->s_first_meta_bg);
@@ -2678,7 +2676,7 @@ static int ext4_statfs (struct dentry * dentry, struct kstatfs * buf)
if (test_opt(sb, MINIX_DF)) {
sbi->s_overhead_last = 0;
} else if (sbi->s_blocks_last != ext4_blocks_count(es)) {
- unsigned long ngroups = sbi->s_groups_count, i;
+ ext4_group_t ngroups = sbi->s_groups_count, i;
ext4_fsblk_t overhead = 0;
smp_rmb();

diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index 5e2da09..e1103c2 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -830,7 +830,7 @@ struct ext4_iloc
{
struct buffer_head *bh;
unsigned long offset;
- unsigned long block_group;
+ ext4_group_t block_group;
};

static inline struct ext4_inode *ext4_raw_inode(struct ext4_iloc *iloc)
@@ -855,7 +855,7 @@ struct dir_private_info {

/* calculate the first block number of the group */
static inline ext4_fsblk_t
-ext4_group_first_block_no(struct super_block *sb, unsigned long group_no)
+ext4_group_first_block_no(struct super_block *sb, ext4_group_t group_no)
{
return group_no * (ext4_fsblk_t)EXT4_BLOCKS_PER_GROUP(sb) +
le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
@@ -886,8 +886,9 @@ extern unsigned int ext4_block_group(struct super_block *sb,
ext4_fsblk_t blocknr);
extern ext4_grpblk_t ext4_block_group_offset(struct super_block *sb,
ext4_fsblk_t blocknr);
-extern int ext4_bg_has_super(struct super_block *sb, int group);
-extern unsigned long ext4_bg_num_gdb(struct super_block *sb, int group);
+extern int ext4_bg_has_super(struct super_block *sb, ext4_group_t group);
+extern unsigned long ext4_bg_num_gdb(struct super_block *sb,
+ ext4_group_t group);
extern ext4_fsblk_t ext4_new_block (handle_t *handle, struct inode *inode,
ext4_fsblk_t goal, int *errp);
extern ext4_fsblk_t ext4_new_blocks (handle_t *handle, struct inode *inode,
@@ -900,7 +901,7 @@ extern void ext4_free_blocks_sb (handle_t *handle, struct super_block *sb,
extern ext4_fsblk_t ext4_count_free_blocks (struct super_block *);
extern void ext4_check_blocks_bitmap (struct super_block *);
extern struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb,
- unsigned int block_group,
+ ext4_group_t block_group,
struct buffer_head ** bh);
extern int ext4_should_retry_alloc(struct super_block *sb, int *retries);
extern void ext4_init_block_alloc_info(struct inode *);
diff --git a/include/linux/ext4_fs_i.h b/include/linux/ext4_fs_i.h
index 6c610b6..2b4e370 100644
--- a/include/linux/ext4_fs_i.h
+++ b/include/linux/ext4_fs_i.h
@@ -30,6 +30,9 @@ typedef unsigned long long ext4_fsblk_t;
/* data type for file logical block number */
typedef __u32 ext4_lblk_t;

+/* data type for block group number */
+typedef unsigned long ext4_group_t;
+
struct ext4_reserve_window {
ext4_fsblk_t _rsv_start; /* First byte reserved */
ext4_fsblk_t _rsv_end; /* Last byte reserved or 0 */
@@ -92,7 +95,7 @@ struct ext4_inode_info {
* place a file's data blocks near its inode block, and new inodes
* near to their parent directory's inode.
*/
- __u32 i_block_group;
+ ext4_group_t i_block_group;
__u32 i_state; /* Dynamic state flags for ext4 */

/* block reservation info */
diff --git a/include/linux/ext4_fs_sb.h b/include/linux/ext4_fs_sb.h
index b40e827..f15821c 100644
--- a/include/linux/ext4_fs_sb.h
+++ b/include/linux/ext4_fs_sb.h
@@ -35,7 +35,7 @@ struct ext4_sb_info {
unsigned long s_itb_per_group; /* Number of inode table blocks per group */
unsigned long s_gdb_count; /* Number of group descriptor blocks */
unsigned long s_desc_per_block; /* Number of group descriptors per block */
- unsigned long s_groups_count; /* Number of groups in the fs */
+ ext4_group_t s_groups_count; /* Number of groups in the fs */
unsigned long s_overhead_last; /* Last calculated overhead */
unsigned long s_blocks_last; /* Last seen block count */
struct buffer_head * s_sbh; /* Buffer containing the super block */
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:17:41

by Theodore Ts'o

Subject: [PATCH 22/49] ext4: Change the default behaviour on error

From: Aneesh Kumar K.V <[email protected]>

By default, the ext4 file system ignored errors and continued. This is
not a good default, as continuing after an error could lead to file
system corruption. Change the default to remount the file system
read-only. Debian and Ubuntu already do this by default in their
fstab.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Eric Sandeen <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---
fs/ext4/super.c | 16 ++++++++--------
1 files changed, 8 insertions(+), 8 deletions(-)
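
The effect of the change to ext4_fill_super() can be sketched as a pair of
small userspace functions. This is an illustration only: policy_old() and
policy_new() are hypothetical names, and the numeric values of the
constants are assumed to match the usual ext2/ext3 s_errors encoding
(1 = continue, 2 = remount read-only, 3 = panic). The interesting case is
a superblock whose s_errors field holds none of these values, for example
zero.

```c
#include <assert.h>
#include <string.h>

/* Assumed on-disk s_errors values; see the note above. */
enum {
	ERRORS_CONTINUE = 1,
	ERRORS_RO       = 2,
	ERRORS_PANIC    = 3
};

/* Before the patch: anything unrecognised falls through to "continue". */
const char *policy_old(unsigned int s_errors)
{
	if (s_errors == ERRORS_PANIC)
		return "panic";
	if (s_errors == ERRORS_RO)
		return "remount-ro";
	return "continue";
}

/* After the patch: anything unrecognised falls through to "remount-ro". */
const char *policy_new(unsigned int s_errors)
{
	if (s_errors == ERRORS_PANIC)
		return "panic";
	if (s_errors == ERRORS_CONTINUE)
		return "continue";
	return "remount-ro";
}
```

An explicit setting in the superblock is honoured either way; only the
fallback differs.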

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 32e3ecb..effd375 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -688,16 +688,16 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
le16_to_cpu(es->s_def_resgid) != EXT4_DEF_RESGID) {
seq_printf(seq, ",resgid=%u", sbi->s_resgid);
}
- if (test_opt(sb, ERRORS_CONT)) {
+ if (test_opt(sb, ERRORS_RO)) {
int def_errors = le16_to_cpu(es->s_errors);

if (def_errors == EXT4_ERRORS_PANIC ||
- def_errors == EXT4_ERRORS_RO) {
- seq_puts(seq, ",errors=continue");
+ def_errors == EXT4_ERRORS_CONTINUE) {
+ seq_puts(seq, ",errors=remount-ro");
}
}
- if (test_opt(sb, ERRORS_RO))
- seq_puts(seq, ",errors=remount-ro");
+ if (test_opt(sb, ERRORS_CONT))
+ seq_puts(seq, ",errors=continue");
if (test_opt(sb, ERRORS_PANIC))
seq_puts(seq, ",errors=panic");
if (test_opt(sb, NO_UID32))
@@ -1819,10 +1819,10 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)

if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_PANIC)
set_opt(sbi->s_mount_opt, ERRORS_PANIC);
- else if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_RO)
- set_opt(sbi->s_mount_opt, ERRORS_RO);
- else
+ else if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_CONTINUE)
set_opt(sbi->s_mount_opt, ERRORS_CONT);
+ else
+ set_opt(sbi->s_mount_opt, ERRORS_RO);

sbi->s_resuid = le16_to_cpu(es->s_def_resuid);
sbi->s_resgid = le16_to_cpu(es->s_def_resgid);
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:18:03

by Theodore Ts'o

Subject: [PATCH 43/49] ext4: Check for return value from sb_set_blocksize

From: Aneesh Kumar K.V <[email protected]>

sb_set_blocksize validates whether the specified block size can be used by
the file system. Make sure we fail mounting the file system if the
specified block size cannot be used.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---
fs/ext4/super.c | 15 +++++----------
1 files changed, 5 insertions(+), 10 deletions(-)
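
For context, the kind of validation that sb_set_blocksize() is relied on
to perform can be sketched in userspace as below. This is an assumption
about the checks involved (power of two, at least 512 bytes, at most one
page, and not smaller than the device sector size), not a copy of the
kernel implementation. The function blocksize_usable() and its parameters
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

static bool is_power_of_two(unsigned int n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Return true if a file system block size looks usable on a device
 * with the given sector size, on a machine with the given page size.
 */
bool blocksize_usable(unsigned int size, unsigned int sector_size,
		      unsigned int page_size)
{
	/* Must be a power of two between 512 bytes and one page. */
	if (size < 512 || size > page_size || !is_power_of_two(size))
		return false;
	/* Must not be smaller than what the device can address. */
	if (size < sector_size)
		return false;
	return true;
}
```

Under these assumptions the old "blocksize < hblock" test is one of
several conditions, which is why checking the return value covers more
cases than the code it replaces.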

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 91a11ec..a91e17e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1809,7 +1809,6 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)
unsigned long def_mount_opts;
struct inode *root;
int blocksize;
- int hblock;
int db_count;
int i;
int needs_recovery;
@@ -1966,20 +1965,16 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)
goto failed_mount;
}

- hblock = bdev_hardsect_size(sb->s_bdev);
if (sb->s_blocksize != blocksize) {
- /*
- * Make sure the blocksize for the filesystem is larger
- * than the hardware sectorsize for the machine.
- */
- if (blocksize < hblock) {
- printk(KERN_ERR "EXT4-fs: blocksize %d too small for "
- "device blocksize %d.\n", blocksize, hblock);
+
+ /* Validate the filesystem blocksize */
+ if (!sb_set_blocksize(sb, blocksize)) {
+ printk(KERN_ERR "EXT4-fs: bad block size %d.\n",
+ blocksize);
goto failed_mount;
}

brelse (bh);
- sb_set_blocksize(sb, blocksize);
logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE;
offset = do_div(logical_sb_block, blocksize);
bh = sb_bread(sb, logical_sb_block);
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:18:39

by Theodore Ts'o

Subject: [PATCH 02/49] ext4: Avoid rec_len overflow with 64KB block size

From: Jan Kara <[email protected]>

With a 64KB block size, a directory entry can be 64KB long, which does not
fit into the 16 bits we have for the entry length. So we store 0xffff
instead and convert the value when it is read from / written to disk. The
patch also converts some places to use ext4_next_entry(), since we are
changing them anyway.

Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---
fs/ext4/dir.c | 12 ++++----
fs/ext4/namei.c | 77 ++++++++++++++++++++++------------------------
include/linux/ext4_fs.h | 20 ++++++++++++
3 files changed, 63 insertions(+), 46 deletions(-)
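
The conversion described above can be sketched as follows. The helper
definitions added to ext4_fs.h are not part of this excerpt, so this is a
userspace approximation under stated assumptions: the names
rec_len_from_disk() and rec_len_to_disk() are shortened stand-ins, and
byte-order conversion (le16_to_cpu / cpu_to_le16 in the kernel) is left
out for brevity.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_REC_LEN 0xffffu	/* on-disk marker meaning "64KB" */

/* Decode an on-disk 16-bit record length into a byte count. */
unsigned int rec_len_from_disk(uint16_t dlen)
{
	if (dlen == MAX_REC_LEN)
		return 1u << 16;
	return dlen;
}

/* Encode a byte count (at most 64KB) into the 16-bit on-disk field. */
uint16_t rec_len_to_disk(unsigned int len)
{
	assert(len <= (1u << 16));
	if (len == (1u << 16))
		return MAX_REC_LEN;
	return (uint16_t)len;
}
```

The marker should be safe to reserve because record lengths are padded to
a multiple of four, so 0xffff never occurs as a genuine length.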

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index f612bef..145a9c0 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -67,7 +67,7 @@ int ext4_check_dir_entry (const char * function, struct inode * dir,
unsigned long offset)
{
const char * error_msg = NULL;
- const int rlen = le16_to_cpu(de->rec_len);
+ const int rlen = ext4_rec_len_from_disk(de->rec_len);

if (rlen < EXT4_DIR_REC_LEN(1))
error_msg = "rec_len is smaller than minimal";
@@ -172,10 +172,10 @@ revalidate:
* least that it is non-zero. A
* failure will be detected in the
* dirent test below. */
- if (le16_to_cpu(de->rec_len) <
- EXT4_DIR_REC_LEN(1))
+ if (ext4_rec_len_from_disk(de->rec_len)
+ < EXT4_DIR_REC_LEN(1))
break;
- i += le16_to_cpu(de->rec_len);
+ i += ext4_rec_len_from_disk(de->rec_len);
}
offset = i;
filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1))
@@ -197,7 +197,7 @@ revalidate:
ret = stored;
goto out;
}
- offset += le16_to_cpu(de->rec_len);
+ offset += ext4_rec_len_from_disk(de->rec_len);
if (le32_to_cpu(de->inode)) {
/* We might block in the next section
* if the data destination is
@@ -219,7 +219,7 @@ revalidate:
goto revalidate;
stored ++;
}
- filp->f_pos += le16_to_cpu(de->rec_len);
+ filp->f_pos += ext4_rec_len_from_disk(de->rec_len);
}
offset = 0;
brelse (bh);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 94ee6f3..d9a3a2f 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -280,7 +280,7 @@ static struct stats dx_show_leaf(struct dx_hash_info *hinfo, struct ext4_dir_ent
space += EXT4_DIR_REC_LEN(de->name_len);
names++;
}
- de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+ de = ext4_next_entry(de);
}
printk("(%i)\n", names);
return (struct stats) { names, space, 1 };
@@ -551,7 +551,8 @@ static int ext4_htree_next_block(struct inode *dir, __u32 hash,
*/
static inline struct ext4_dir_entry_2 *ext4_next_entry(struct ext4_dir_entry_2 *p)
{
- return (struct ext4_dir_entry_2 *)((char*)p + le16_to_cpu(p->rec_len));
+ return (struct ext4_dir_entry_2 *)((char *)p +
+ ext4_rec_len_from_disk(p->rec_len));
}

/*
@@ -720,7 +721,7 @@ static int dx_make_map (struct ext4_dir_entry_2 *de, int size,
cond_resched();
}
/* XXX: do we need to check rec_len == 0 case? -Chris */
- de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+ de = ext4_next_entry(de);
}
return count;
}
@@ -820,7 +821,7 @@ static inline int search_dirblock(struct buffer_head * bh,
return 1;
}
/* prevent looping on a bad block */
- de_len = le16_to_cpu(de->rec_len);
+ de_len = ext4_rec_len_from_disk(de->rec_len);
if (de_len <= 0)
return -1;
offset += de_len;
@@ -1128,7 +1129,7 @@ dx_move_dirents(char *from, char *to, struct dx_map_entry *map, int count)
rec_len = EXT4_DIR_REC_LEN(de->name_len);
memcpy (to, de, rec_len);
((struct ext4_dir_entry_2 *) to)->rec_len =
- cpu_to_le16(rec_len);
+ ext4_rec_len_to_disk(rec_len);
de->inode = 0;
map++;
to += rec_len;
@@ -1147,13 +1148,12 @@ static struct ext4_dir_entry_2* dx_pack_dirents(char *base, int size)

prev = to = de;
while ((char*)de < base + size) {
- next = (struct ext4_dir_entry_2 *) ((char *) de +
- le16_to_cpu(de->rec_len));
+ next = ext4_next_entry(de);
if (de->inode && de->name_len) {
rec_len = EXT4_DIR_REC_LEN(de->name_len);
if (de > to)
memmove(to, de, rec_len);
- to->rec_len = cpu_to_le16(rec_len);
+ to->rec_len = ext4_rec_len_to_disk(rec_len);
prev = to;
to = (struct ext4_dir_entry_2 *) (((char *) to) + rec_len);
}
@@ -1227,8 +1227,8 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
/* Fancy dance to stay within two buffers */
de2 = dx_move_dirents(data1, data2, map + split, count - split);
de = dx_pack_dirents(data1,blocksize);
- de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
- de2->rec_len = cpu_to_le16(data2 + blocksize - (char *) de2);
+ de->rec_len = ext4_rec_len_to_disk(data1 + blocksize - (char *) de);
+ de2->rec_len = ext4_rec_len_to_disk(data2 + blocksize - (char *) de2);
dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data1, blocksize, 1));
dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data2, blocksize, 1));

@@ -1297,7 +1297,7 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
return -EEXIST;
}
nlen = EXT4_DIR_REC_LEN(de->name_len);
- rlen = le16_to_cpu(de->rec_len);
+ rlen = ext4_rec_len_from_disk(de->rec_len);
if ((de->inode? rlen - nlen: rlen) >= reclen)
break;
de = (struct ext4_dir_entry_2 *)((char *)de + rlen);
@@ -1316,11 +1316,11 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,

/* By now the buffer is marked for journaling */
nlen = EXT4_DIR_REC_LEN(de->name_len);
- rlen = le16_to_cpu(de->rec_len);
+ rlen = ext4_rec_len_from_disk(de->rec_len);
if (de->inode) {
struct ext4_dir_entry_2 *de1 = (struct ext4_dir_entry_2 *)((char *)de + nlen);
- de1->rec_len = cpu_to_le16(rlen - nlen);
- de->rec_len = cpu_to_le16(nlen);
+ de1->rec_len = ext4_rec_len_to_disk(rlen - nlen);
+ de->rec_len = ext4_rec_len_to_disk(nlen);
de = de1;
}
de->file_type = EXT4_FT_UNKNOWN;
@@ -1397,17 +1397,18 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,

/* The 0th block becomes the root, move the dirents out */
fde = &root->dotdot;
- de = (struct ext4_dir_entry_2 *)((char *)fde + le16_to_cpu(fde->rec_len));
+ de = (struct ext4_dir_entry_2 *)((char *)fde +
+ ext4_rec_len_from_disk(fde->rec_len));
len = ((char *) root) + blocksize - (char *) de;
memcpy (data1, de, len);
de = (struct ext4_dir_entry_2 *) data1;
top = data1 + len;
- while ((char *)(de2=(void*)de+le16_to_cpu(de->rec_len)) < top)
+ while ((char *)(de2 = ext4_next_entry(de)) < top)
de = de2;
- de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+ de->rec_len = ext4_rec_len_to_disk(data1 + blocksize - (char *) de);
/* Initialize the root; the dot dirents already exist */
de = (struct ext4_dir_entry_2 *) (&root->dotdot);
- de->rec_len = cpu_to_le16(blocksize - EXT4_DIR_REC_LEN(2));
+ de->rec_len = ext4_rec_len_to_disk(blocksize - EXT4_DIR_REC_LEN(2));
memset (&root->info, 0, sizeof(root->info));
root->info.info_length = sizeof(root->info);
root->info.hash_version = EXT4_SB(dir->i_sb)->s_def_hash_version;
@@ -1487,7 +1488,7 @@ static int ext4_add_entry (handle_t *handle, struct dentry *dentry,
return retval;
de = (struct ext4_dir_entry_2 *) bh->b_data;
de->inode = 0;
- de->rec_len = cpu_to_le16(blocksize);
+ de->rec_len = ext4_rec_len_to_disk(blocksize);
return add_dirent_to_buf(handle, dentry, inode, de, bh);
}

@@ -1550,7 +1551,7 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
goto cleanup;
node2 = (struct dx_node *)(bh2->b_data);
entries2 = node2->entries;
- node2->fake.rec_len = cpu_to_le16(sb->s_blocksize);
+ node2->fake.rec_len = ext4_rec_len_to_disk(sb->s_blocksize);
node2->fake.inode = 0;
BUFFER_TRACE(frame->bh, "get_write_access");
err = ext4_journal_get_write_access(handle, frame->bh);
@@ -1648,9 +1649,9 @@ static int ext4_delete_entry (handle_t *handle,
BUFFER_TRACE(bh, "get_write_access");
ext4_journal_get_write_access(handle, bh);
if (pde)
- pde->rec_len =
- cpu_to_le16(le16_to_cpu(pde->rec_len) +
- le16_to_cpu(de->rec_len));
+ pde->rec_len = ext4_rec_len_to_disk(
+ ext4_rec_len_from_disk(pde->rec_len) +
+ ext4_rec_len_from_disk(de->rec_len));
else
de->inode = 0;
dir->i_version++;
@@ -1658,10 +1659,9 @@ static int ext4_delete_entry (handle_t *handle,
ext4_journal_dirty_metadata(handle, bh);
return 0;
}
- i += le16_to_cpu(de->rec_len);
+ i += ext4_rec_len_from_disk(de->rec_len);
pde = de;
- de = (struct ext4_dir_entry_2 *)
- ((char *) de + le16_to_cpu(de->rec_len));
+ de = ext4_next_entry(de);
}
return -ENOENT;
}
@@ -1824,13 +1824,13 @@ retry:
de = (struct ext4_dir_entry_2 *) dir_block->b_data;
de->inode = cpu_to_le32(inode->i_ino);
de->name_len = 1;
- de->rec_len = cpu_to_le16(EXT4_DIR_REC_LEN(de->name_len));
+ de->rec_len = ext4_rec_len_to_disk(EXT4_DIR_REC_LEN(de->name_len));
strcpy (de->name, ".");
ext4_set_de_type(dir->i_sb, de, S_IFDIR);
- de = (struct ext4_dir_entry_2 *)
- ((char *) de + le16_to_cpu(de->rec_len));
+ de = ext4_next_entry(de);
de->inode = cpu_to_le32(dir->i_ino);
- de->rec_len = cpu_to_le16(inode->i_sb->s_blocksize-EXT4_DIR_REC_LEN(1));
+ de->rec_len = ext4_rec_len_to_disk(inode->i_sb->s_blocksize -
+ EXT4_DIR_REC_LEN(1));
de->name_len = 2;
strcpy (de->name, "..");
ext4_set_de_type(dir->i_sb, de, S_IFDIR);
@@ -1882,8 +1882,7 @@ static int empty_dir (struct inode * inode)
return 1;
}
de = (struct ext4_dir_entry_2 *) bh->b_data;
- de1 = (struct ext4_dir_entry_2 *)
- ((char *) de + le16_to_cpu(de->rec_len));
+ de1 = ext4_next_entry(de);
if (le32_to_cpu(de->inode) != inode->i_ino ||
!le32_to_cpu(de1->inode) ||
strcmp (".", de->name) ||
@@ -1894,9 +1893,9 @@ static int empty_dir (struct inode * inode)
brelse (bh);
return 1;
}
- offset = le16_to_cpu(de->rec_len) + le16_to_cpu(de1->rec_len);
- de = (struct ext4_dir_entry_2 *)
- ((char *) de1 + le16_to_cpu(de1->rec_len));
+ offset = ext4_rec_len_from_disk(de->rec_len) +
+ ext4_rec_len_from_disk(de1->rec_len);
+ de = ext4_next_entry(de1);
while (offset < inode->i_size ) {
if (!bh ||
(void *) de >= (void *) (bh->b_data+sb->s_blocksize)) {
@@ -1925,9 +1924,8 @@ static int empty_dir (struct inode * inode)
brelse (bh);
return 0;
}
- offset += le16_to_cpu(de->rec_len);
- de = (struct ext4_dir_entry_2 *)
- ((char *) de + le16_to_cpu(de->rec_len));
+ offset += ext4_rec_len_from_disk(de->rec_len);
+ de = ext4_next_entry(de);
}
brelse (bh);
return 1;
@@ -2282,8 +2280,7 @@ retry:
}

#define PARENT_INO(buffer) \
- ((struct ext4_dir_entry_2 *) ((char *) buffer + \
- le16_to_cpu(((struct ext4_dir_entry_2 *) buffer)->rec_len)))->inode
+ (ext4_next_entry((struct ext4_dir_entry_2 *)(buffer))->inode)

/*
* Anybody can rename anything with this: the permission checks are left to the
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index dfe4487..fb31c1a 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -767,6 +767,26 @@ struct ext4_dir_entry_2 {
#define EXT4_DIR_ROUND (EXT4_DIR_PAD - 1)
#define EXT4_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT4_DIR_ROUND) & \
~EXT4_DIR_ROUND)
+#define EXT4_MAX_REC_LEN ((1<<16)-1)
+
+static inline unsigned ext4_rec_len_from_disk(__le16 dlen)
+{
+ unsigned len = le16_to_cpu(dlen);
+
+ if (len == EXT4_MAX_REC_LEN)
+ return 1 << 16;
+ return len;
+}
+
+static inline __le16 ext4_rec_len_to_disk(unsigned len)
+{
+ if (len == (1 << 16))
+ return cpu_to_le16(EXT4_MAX_REC_LEN);
+ else if (len > (1 << 16))
+ BUG();
+ return cpu_to_le16(len);
+}
+
/*
* Hash Tree Directory indexing
* (c) Daniel Phillips, 2001
--
1.5.4.rc3.31.g1271-dirty
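
The rec_len helpers added to ext4_fs.h in the patch above reserve the
on-disk value 0xffff to mean a 65536-byte record, which a 16-bit field
could not otherwise express. A minimal user-space sketch of that
round trip follows. The demo_* names are hypothetical, values are kept
in host byte order rather than __le16, and the encoder returns an error
code where the kernel helper calls BUG().

```c
#include <stdint.h>

#define DEMO_MAX_REC_LEN ((1u << 16) - 1)

/* Decode: the reserved value 0xffff stands for a 65536-byte record. */
static unsigned demo_rec_len_from_disk(uint16_t dlen)
{
	if (dlen == DEMO_MAX_REC_LEN)
		return 1u << 16;
	return dlen;
}

/*
 * Encode: returns 0 and stores the on-disk value, or -1 if len is too
 * large (the kernel helper calls BUG() in that case instead).
 */
static int demo_rec_len_to_disk(unsigned len, uint16_t *out)
{
	if (len > (1u << 16))
		return -1;
	*out = (len == (1u << 16)) ? DEMO_MAX_REC_LEN : (uint16_t)len;
	return 0;
}
```

Encoding 65536 yields 0xffff on disk and decodes back to 65536; every
smaller length is stored unchanged.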

2008-01-22 03:18:58

by Theodore Ts'o

Subject: [PATCH 18/49] ext4: sync up block group descriptor with e2fsprogs.

From: Coly Li <[email protected]>

This patch extends bg_itable_unused in the ext4 group descriptor
from 16 bits to 32 bits by adding bg_itable_unused_hi to
struct ext4_group_desc. The other _hi fields that e2fsprogs has
already introduced are added as well, for consistency.
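
As a rough illustration of how a 16-bit low field and its new _hi
counterpart combine into one 32-bit count, here is a user-space sketch.
The struct and helper names (demo_group_desc, demo_itable_unused,
demo_set_itable_unused) are hypothetical, and the fields are plain
host-order integers rather than __le16; this patch only adds the
fields and does not itself contain such accessors.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down descriptor: the two halves of one counter. */
struct demo_group_desc {
	uint16_t bg_itable_unused;	/* low 16 bits (existing field) */
	uint16_t bg_itable_unused_hi;	/* high 16 bits (new field) */
};

/* Combine both halves into the full 32-bit count. */
static uint32_t demo_itable_unused(const struct demo_group_desc *gd)
{
	return (uint32_t)gd->bg_itable_unused |
	       ((uint32_t)gd->bg_itable_unused_hi << 16);
}

/* Split a 32-bit count back into its low and high halves. */
static void demo_set_itable_unused(struct demo_group_desc *gd, uint32_t count)
{
	gd->bg_itable_unused = (uint16_t)(count & 0xffff);
	gd->bg_itable_unused_hi = (uint16_t)(count >> 16);
}
```

Storing 70000, for example, leaves 4464 in the low half and 1 in the
high half.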

Signed-off-by: Coly Li <[email protected]>
Cc: Andreas Dilger <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---
include/linux/ext4_fs.h | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index 6ae91f4..55a376e 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -118,6 +118,11 @@ struct ext4_group_desc
__le32 bg_block_bitmap_hi; /* Blocks bitmap block MSB */
__le32 bg_inode_bitmap_hi; /* Inodes bitmap block MSB */
__le32 bg_inode_table_hi; /* Inodes table block MSB */
+ __le16 bg_free_blocks_count_hi;/* Free blocks count MSB */
+ __le16 bg_free_inodes_count_hi;/* Free inodes count MSB */
+ __le16 bg_used_dirs_count_hi; /* Directories count MSB */
+ __le16 bg_itable_unused_hi; /* Unused inodes count MSB */
+ __u32 bg_reserved2[3];
};

#define EXT4_BG_INODE_UNINIT 0x0001 /* Inode table/bitmap not in use */
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:19:26

by Theodore Ts'o

Subject: [PATCH 33/49] ext4: Add the journal checksum feature

From: Girish Shilamkar <[email protected]>

The journal checksum feature adds two new flags:
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM.

The JBD2_FEATURE_COMPAT_CHECKSUM flag indicates that the commit block
contains the checksum for the blocks described by the descriptor blocks.
With checksums in place, writing the commit record no longer needs to be
synchronous: the commit record can be sent to disk without waiting for
the descriptor blocks to be written. This behavior is controlled by the
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT flag. Older kernels and e2fsck must
not attempt to recover a journal written with _ASYNC_COMMIT, hence the
flag is made incompat.

The commit header has been extended to hold the checksum along with the
type of the checksum.
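
The running checksum can be pictured as one CRC32 value that is seeded
with ~0 and folded over each journal block in turn, with the final
value stored in the commit header. The sketch below is user-space code
under stated assumptions: demo_crc32_be is a bitwise stand-in for the
kernel's crc32_be() (MSB-first, polynomial 0x04c11db7, no final
inversion), and demo_checksum_blocks is a hypothetical helper that
treats the blocks as one contiguous buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise MSB-first CRC32; stand-in for the kernel's crc32_be(). */
static uint32_t demo_crc32_be(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--) {
		crc ^= (uint32_t)*p++ << 24;
		for (int i = 0; i < 8; i++)
			crc = (crc & 0x80000000u) ?
				(crc << 1) ^ 0x04c11db7u : crc << 1;
	}
	return crc;
}

/* Fold nblocks blocks of blocksize bytes into one running checksum. */
static uint32_t demo_checksum_blocks(const unsigned char *data,
				     size_t nblocks, size_t blocksize)
{
	uint32_t sum = ~(uint32_t)0;	/* same seed as crc32_sum in the patch */

	for (size_t i = 0; i < nblocks; i++)
		sum = demo_crc32_be(sum, data + i * blocksize, blocksize);
	return sum;
}
```

Because each call continues from the previous result, checksumming the
blocks one at a time gives the same value as checksumming them as a
single buffer, which is what lets recovery recompute the sum while
walking the log.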

During recovery, checksums are verified in the scan pass (PASS_SCAN) to
ensure the sanity and completeness (in the case of _ASYNC_COMMIT) of
every transaction.
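
The decision tree drawn in the JBD2_COMMIT_BLOCK comment of
fs/jbd2/recovery.c can be restated as a small pure function. This is
only a sketch of that comment, not the recovery code itself; the enum
and function names are hypothetical.

```c
enum demo_verdict {
	DEMO_OK,		/* checksum matched */
	DEMO_CORRUPT,		/* treat as journal corruption */
	DEMO_INTERRUPTED	/* assume an interrupted commit */
};

/*
 * checksum_ok:       checksum of transaction n verified
 * async_commit:      journal uses _ASYNC_COMMIT
 * next_commit_found: a commit block for transaction n+1 exists
 */
static enum demo_verdict demo_classify(int checksum_ok, int async_commit,
				       int next_commit_found)
{
	if (checksum_ok)
		return DEMO_OK;
	if (!async_commit)
		return DEMO_CORRUPT;	/* sync commit: a bad sum is corruption */
	/* async commit: look at the following transaction to decide */
	return next_commit_found ? DEMO_CORRUPT : DEMO_INTERRUPTED;
}
```

The last case cannot tell a corrupt nth transaction apart from one that
was merely interrupted, so, as the comment says, it is assumed to be an
interrupted commit.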

Signed-off-by: Andreas Dilger <[email protected]>
Signed-off-by: Girish Shilamkar <[email protected]>
Signed-off-by: Dave Kleikamp <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---
Documentation/filesystems/ext4.txt | 10 ++
fs/Kconfig | 1 +
fs/ext4/super.c | 25 +++++
fs/jbd2/commit.c | 196 +++++++++++++++++++++++++++---------
fs/jbd2/journal.c | 28 +++++
fs/jbd2/recovery.c | 149 ++++++++++++++++++++++++++--
include/linux/ext4_fs.h | 3 +-
include/linux/jbd2.h | 36 ++++++-
8 files changed, 388 insertions(+), 60 deletions(-)

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 6a4adca..4f329af 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -89,6 +89,16 @@ When mounting an ext4 filesystem, the following option are accepted:
extents ext4 will use extents to address file data. The
file system will no longer be mountable by ext3.

+journal_checksum Enable checksumming of the journal transactions.
+ This will allow the recovery code in e2fsck and the
+ kernel to detect corruption in the kernel. It is a
+ compatible change and will be ignored by older kernels.
+
+journal_async_commit Commit block can be written to disk without waiting
+ for descriptor blocks. If enabled older kernels cannot
+ mount the device. This will enable 'journal_checksum'
+ internally.
+
journal=update Update the ext4 file system's journal to the current
format.

diff --git a/fs/Kconfig b/fs/Kconfig
index 487236c..bb0b72c 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -236,6 +236,7 @@ config JBD_DEBUG

config JBD2
tristate
+ select CRC32
help
This is a generic journaling layer for block devices that support
both 32-bit and 64-bit block numbers. It is currently used by
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c730544..f7479d3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -869,6 +869,7 @@ enum {
Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
+ Opt_journal_checksum, Opt_journal_async_commit,
Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
@@ -908,6 +909,8 @@ static match_table_t tokens = {
{Opt_journal_update, "journal=update"},
{Opt_journal_inum, "journal=%u"},
{Opt_journal_dev, "journal_dev=%u"},
+ {Opt_journal_checksum, "journal_checksum"},
+ {Opt_journal_async_commit, "journal_async_commit"},
{Opt_abort, "abort"},
{Opt_data_journal, "data=journal"},
{Opt_data_ordered, "data=ordered"},
@@ -1095,6 +1098,13 @@ static int parse_options (char *options, struct super_block *sb,
return 0;
*journal_devnum = option;
break;
+ case Opt_journal_checksum:
+ set_opt(sbi->s_mount_opt, JOURNAL_CHECKSUM);
+ break;
+ case Opt_journal_async_commit:
+ set_opt(sbi->s_mount_opt, JOURNAL_ASYNC_COMMIT);
+ set_opt(sbi->s_mount_opt, JOURNAL_CHECKSUM);
+ break;
case Opt_noload:
set_opt (sbi->s_mount_opt, NOLOAD);
break;
@@ -2114,6 +2124,21 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)
goto failed_mount4;
}

+ if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
+ jbd2_journal_set_features(sbi->s_journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM, 0,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
+ } else if (test_opt(sb, JOURNAL_CHECKSUM)) {
+ jbd2_journal_set_features(sbi->s_journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0);
+ jbd2_journal_clear_features(sbi->s_journal, 0, 0,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
+ } else {
+ jbd2_journal_clear_features(sbi->s_journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM, 0,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
+ }
+
/* We have now updated the journal if required, so we can
* validate the data journaling mode. */
switch (test_opt(sb, DATA_FLAGS)) {
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 8749a86..2107820 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -21,6 +21,7 @@
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/jiffies.h>
+#include <linux/crc32.h>

/*
* Default IO end handler for temporary BJ_IO buffer_heads.
@@ -93,19 +94,23 @@ static int inverted_lock(journal_t *journal, struct buffer_head *bh)
return 1;
}

-/* Done it all: now write the commit record. We should have
+/*
+ * Done it all: now submit the commit record. We should have
* cleaned up our previous buffers by now, so if we are in abort
* mode we can now just skip the rest of the journal write
* entirely.
*
* Returns 1 if the journal needs to be aborted or 0 on success
*/
-static int journal_write_commit_record(journal_t *journal,
- transaction_t *commit_transaction)
+static int journal_submit_commit_record(journal_t *journal,
+ transaction_t *commit_transaction,
+ struct buffer_head **cbh,
+ __u32 crc32_sum)
{
struct journal_head *descriptor;
+ struct commit_header *tmp;
struct buffer_head *bh;
- int i, ret;
+ int ret;
int barrier_done = 0;

if (is_journal_aborted(journal))
@@ -117,21 +122,33 @@ static int journal_write_commit_record(journal_t *journal,

bh = jh2bh(descriptor);

- /* AKPM: buglet - add `i' to tmp! */
- for (i = 0; i < bh->b_size; i += 512) {
- journal_header_t *tmp = (journal_header_t*)bh->b_data;
- tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
- tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
- tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid);
+ tmp = (struct commit_header *)bh->b_data;
+ tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
+ tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
+ tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid);
+
+ if (JBD2_HAS_COMPAT_FEATURE(journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM)) {
+ tmp->h_chksum_type = JBD2_CRC32_CHKSUM;
+ tmp->h_chksum_size = JBD2_CRC32_CHKSUM_SIZE;
+ tmp->h_chksum[0] = cpu_to_be32(crc32_sum);
}

- JBUFFER_TRACE(descriptor, "write commit block");
+ JBUFFER_TRACE(descriptor, "submit commit block");
+ lock_buffer(bh);
+
set_buffer_dirty(bh);
- if (journal->j_flags & JBD2_BARRIER) {
+ set_buffer_uptodate(bh);
+ bh->b_end_io = journal_end_buffer_io_sync;
+
+ if (journal->j_flags & JBD2_BARRIER &&
+ !JBD2_HAS_INCOMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
set_buffer_ordered(bh);
barrier_done = 1;
}
- ret = sync_dirty_buffer(bh);
+ ret = submit_bh(WRITE, bh);
+
/* is it possible for another commit to fail at roughly
* the same time as this one? If so, we don't want to
* trust the barrier flag in the super, but instead want
@@ -152,14 +169,72 @@ static int journal_write_commit_record(journal_t *journal,
clear_buffer_ordered(bh);
set_buffer_uptodate(bh);
set_buffer_dirty(bh);
- ret = sync_dirty_buffer(bh);
+ ret = submit_bh(WRITE, bh);
}
- put_bh(bh); /* One for getblk() */
- jbd2_journal_put_journal_head(descriptor);
+ *cbh = bh;
+ return ret;
+}
+
+/*
+ * This function, along with journal_submit_commit_record,
+ * allows the commit record to be written asynchronously.
+ */
+static int journal_wait_on_commit_record(struct buffer_head *bh)
+{
+ int ret = 0;
+
+ clear_buffer_dirty(bh);
+ wait_on_buffer(bh);

- return (ret == -EIO);
+ if (unlikely(!buffer_uptodate(bh)))
+ ret = -EIO;
+ put_bh(bh); /* One for getblk() */
+ jbd2_journal_put_journal_head(bh2jh(bh));
+
+ return ret;
}

+/*
+ * Wait for all submitted IO to complete.
+ */
+static int journal_wait_on_locked_list(journal_t *journal,
+ transaction_t *commit_transaction)
+{
+ int ret = 0;
+ struct journal_head *jh;
+
+ while (commit_transaction->t_locked_list) {
+ struct buffer_head *bh;
+
+ jh = commit_transaction->t_locked_list->b_tprev;
+ bh = jh2bh(jh);
+ get_bh(bh);
+ if (buffer_locked(bh)) {
+ spin_unlock(&journal->j_list_lock);
+ wait_on_buffer(bh);
+ if (unlikely(!buffer_uptodate(bh)))
+ ret = -EIO;
+ spin_lock(&journal->j_list_lock);
+ }
+ if (!inverted_lock(journal, bh)) {
+ put_bh(bh);
+ spin_lock(&journal->j_list_lock);
+ continue;
+ }
+ if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
+ __jbd2_journal_unfile_buffer(jh);
+ jbd_unlock_bh_state(bh);
+ jbd2_journal_remove_journal_head(bh);
+ put_bh(bh);
+ } else {
+ jbd_unlock_bh_state(bh);
+ }
+ put_bh(bh);
+ cond_resched_lock(&journal->j_list_lock);
+ }
+ return ret;
+ }
+
static void journal_do_submit_data(struct buffer_head **wbuf, int bufs)
{
int i;
@@ -275,6 +350,20 @@ write_out_data:
journal_do_submit_data(wbuf, bufs);
}

+static inline __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh)
+{
+ struct page *page = bh->b_page;
+ char *addr;
+ __u32 checksum;
+
+ addr = kmap_atomic(page, KM_USER0);
+ checksum = crc32_be(crc32_sum,
+ (void *)(addr + offset_in_page(bh->b_data)), bh->b_size);
+ kunmap_atomic(addr, KM_USER0);
+
+ return checksum;
+}
+
static inline void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
unsigned long long block)
{
@@ -307,6 +396,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
int tag_flag;
int i;
int tag_bytes = journal_tag_bytes(journal);
+ struct buffer_head *cbh = NULL; /* For transactional checksums */
+ __u32 crc32_sum = ~0;

/*
* First job: lock down the current transaction and wait for
@@ -451,38 +542,15 @@ void jbd2_journal_commit_transaction(journal_t *journal)
journal_submit_data_buffers(journal, commit_transaction);

/*
- * Wait for all previously submitted IO to complete.
+ * Wait for all previously submitted IO to complete if commit
+ * record is to be written synchronously.
*/
spin_lock(&journal->j_list_lock);
- while (commit_transaction->t_locked_list) {
- struct buffer_head *bh;
+ if (!JBD2_HAS_INCOMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT))
+ err = journal_wait_on_locked_list(journal,
+ commit_transaction);

- jh = commit_transaction->t_locked_list->b_tprev;
- bh = jh2bh(jh);
- get_bh(bh);
- if (buffer_locked(bh)) {
- spin_unlock(&journal->j_list_lock);
- wait_on_buffer(bh);
- if (unlikely(!buffer_uptodate(bh)))
- err = -EIO;
- spin_lock(&journal->j_list_lock);
- }
- if (!inverted_lock(journal, bh)) {
- put_bh(bh);
- spin_lock(&journal->j_list_lock);
- continue;
- }
- if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
- __jbd2_journal_unfile_buffer(jh);
- jbd_unlock_bh_state(bh);
- jbd2_journal_remove_journal_head(bh);
- put_bh(bh);
- } else {
- jbd_unlock_bh_state(bh);
- }
- put_bh(bh);
- cond_resched_lock(&journal->j_list_lock);
- }
spin_unlock(&journal->j_list_lock);

if (err)
@@ -656,6 +724,15 @@ void jbd2_journal_commit_transaction(journal_t *journal)
start_journal_io:
for (i = 0; i < bufs; i++) {
struct buffer_head *bh = wbuf[i];
+ /*
+ * Compute checksum.
+ */
+ if (JBD2_HAS_COMPAT_FEATURE(journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM)) {
+ crc32_sum =
+ jbd2_checksum_data(crc32_sum, bh);
+ }
+
lock_buffer(bh);
clear_buffer_dirty(bh);
set_buffer_uptodate(bh);
@@ -672,6 +749,23 @@ start_journal_io:
}
}

+ /* Done it all: now write the commit record asynchronously. */
+
+ if (JBD2_HAS_INCOMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
+ err = journal_submit_commit_record(journal, commit_transaction,
+ &cbh, crc32_sum);
+ if (err)
+ __jbd2_journal_abort_hard(journal);
+
+ spin_lock(&journal->j_list_lock);
+ err = journal_wait_on_locked_list(journal,
+ commit_transaction);
+ spin_unlock(&journal->j_list_lock);
+ if (err)
+ __jbd2_journal_abort_hard(journal);
+ }
+
/* Lo and behold: we have just managed to send a transaction to
the log. Before we can commit it, wait for the IO so far to
complete. Control buffers being written are on the
@@ -771,8 +865,14 @@ wait_for_iobuf:

jbd_debug(3, "JBD: commit phase 6\n");

- if (journal_write_commit_record(journal, commit_transaction))
- err = -EIO;
+ if (!JBD2_HAS_INCOMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
+ err = journal_submit_commit_record(journal, commit_transaction,
+ &cbh, crc32_sum);
+ if (err)
+ __jbd2_journal_abort_hard(journal);
+ }
+ err = journal_wait_on_commit_record(cbh);

if (err)
jbd2_journal_abort(journal, err);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 3667c91..f8b0f8c 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1578,6 +1578,34 @@ int jbd2_journal_set_features (journal_t *journal, unsigned long compat,
return 1;
}

+/*
+ * jbd2_journal_clear_features () - Clear a given journal feature in the
+ * superblock
+ * @journal: Journal to act on.
+ * @compat: bitmask of compatible features
+ * @ro: bitmask of features that force read-only mount
+ * @incompat: bitmask of incompatible features
+ *
+ * Clear a given journal feature as present on the
+ * superblock. Returns true if the requested features could be reset.
+ */
+int jbd2_journal_clear_features(journal_t *journal, unsigned long compat,
+ unsigned long ro, unsigned long incompat)
+{
+ journal_superblock_t *sb;
+
+ jbd_debug(1, "Clear features 0x%lx/0x%lx/0x%lx\n",
+ compat, ro, incompat);
+
+ sb = journal->j_superblock;
+
+ sb->s_feature_compat &= ~cpu_to_be32(compat);
+ sb->s_feature_ro_compat &= ~cpu_to_be32(ro);
+ sb->s_feature_incompat &= ~cpu_to_be32(incompat);
+
+ return 1;
+}
+EXPORT_SYMBOL(jbd2_journal_clear_features);

/**
* int jbd2_journal_update_format () - Update on-disk journal structure.
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index d0ce627..90f8c30 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -21,6 +21,7 @@
#include <linux/jbd2.h>
#include <linux/errno.h>
#include <linux/slab.h>
+#include <linux/crc32.h>
#endif

/*
@@ -316,6 +317,37 @@ static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag
return block;
}

+/*
+ * calc_chksums calculates the checksums for the blocks described in the
+ * descriptor block.
+ */
+static int calc_chksums(journal_t *journal, struct buffer_head *bh,
+ unsigned long *next_log_block, __u32 *crc32_sum)
+{
+ int i, num_blks, err;
+ unsigned io_block;
+ struct buffer_head *obh;
+
+ num_blks = count_tags(journal, bh);
+ /* Calculate checksum of the descriptor block. */
+ *crc32_sum = crc32_be(*crc32_sum, (void *)bh->b_data, bh->b_size);
+
+ for (i = 0; i < num_blks; i++) {
+ io_block = (*next_log_block)++;
+ wrap(journal, *next_log_block);
+ err = jread(&obh, journal, io_block);
+ if (err) {
+ printk(KERN_ERR "JBD: IO error %d recovering block "
+ "%u in log\n", err, io_block);
+ return 1;
+ } else {
+ *crc32_sum = crc32_be(*crc32_sum, (void *)obh->b_data,
+ obh->b_size);
+ }
+ }
+ return 0;
+}
+
static int do_one_pass(journal_t *journal,
struct recovery_info *info, enum passtype pass)
{
@@ -328,6 +360,7 @@ static int do_one_pass(journal_t *journal,
unsigned int sequence;
int blocktype;
int tag_bytes = journal_tag_bytes(journal);
+ __u32 crc32_sum = ~0; /* Transactional Checksums */

/* Precompute the maximum metadata descriptors in a descriptor block */
int MAX_BLOCKS_PER_DESC;
@@ -419,9 +452,23 @@ static int do_one_pass(journal_t *journal,
switch(blocktype) {
case JBD2_DESCRIPTOR_BLOCK:
/* If it is a valid descriptor block, replay it
- * in pass REPLAY; otherwise, just skip over the
- * blocks it describes. */
+ * in pass REPLAY; if journal_checksums enabled, then
+ * calculate checksums in PASS_SCAN, otherwise,
+ * just skip over the blocks it describes. */
if (pass != PASS_REPLAY) {
+ if (pass == PASS_SCAN &&
+ JBD2_HAS_COMPAT_FEATURE(journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM) &&
+ !info->end_transaction) {
+ if (calc_chksums(journal, bh,
+ &next_log_block,
+ &crc32_sum)) {
+ brelse(bh);
+ break;
+ }
+ brelse(bh);
+ continue;
+ }
next_log_block += count_tags(journal, bh);
wrap(journal, next_log_block);
brelse(bh);
@@ -516,9 +563,96 @@ static int do_one_pass(journal_t *journal,
continue;

case JBD2_COMMIT_BLOCK:
- /* Found an expected commit block: not much to
- * do other than move on to the next sequence
+ /* How to differentiate between interrupted commit
+ * and journal corruption ?
+ *
+ * {nth transaction}
+ * Checksum Verification Failed
+ * |
+ * ____________________
+ * | |
+ * async_commit sync_commit
+ * | |
+ * | GO TO NEXT "Journal Corruption"
+ * | TRANSACTION
+ * |
+ * {(n+1)th transaction}
+ * |
+ * _______|______________
+ * | |
+ * Commit block found Commit block not found
+ * | |
+ * "Journal Corruption" |
+ * _____________|_________
+ * | |
+ * nth trans corrupt OR nth trans
+ * and (n+1)th interrupted interrupted
+ * before commit block
+ * could reach the disk.
+ * (Cannot find the difference in above
+ * mentioned conditions. Hence assume
+ * "Interrupted Commit".)
+ */
+
+ /* Found an expected commit block: if checksums
+ * are present verify them in PASS_SCAN; else not
+ * much to do other than move on to the next sequence
* number. */
+ if (pass == PASS_SCAN &&
+ JBD2_HAS_COMPAT_FEATURE(journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM)) {
+ int chksum_err, chksum_seen;
+ struct commit_header *cbh =
+ (struct commit_header *)bh->b_data;
+ unsigned found_chksum =
+ be32_to_cpu(cbh->h_chksum[0]);
+
+ chksum_err = chksum_seen = 0;
+
+ if (info->end_transaction) {
+ printk(KERN_ERR "JBD: Transaction %u "
+ "found to be corrupt.\n",
+ next_commit_ID - 1);
+ brelse(bh);
+ break;
+ }
+
+ if (crc32_sum == found_chksum &&
+ cbh->h_chksum_type == JBD2_CRC32_CHKSUM &&
+ cbh->h_chksum_size ==
+ JBD2_CRC32_CHKSUM_SIZE)
+ chksum_seen = 1;
+ else if (!(cbh->h_chksum_type == 0 &&
+ cbh->h_chksum_size == 0 &&
+ found_chksum == 0 &&
+ !chksum_seen))
+ /*
+ * If fs is mounted using an old kernel and then
+ * kernel with journal_chksum is used then we
+ * get a situation where the journal flag has
+ * checksum flag set but checksums are not
+ * present, i.e. chksum = 0, in the individual
+ * commit blocks.
+ * Hence to avoid checksum failures, in this
+ * situation, this extra check is added.
+ */
+ chksum_err = 1;
+
+ if (chksum_err) {
+ info->end_transaction = next_commit_ID;
+
+ if (!JBD2_HAS_INCOMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)){
+ printk(KERN_ERR
+ "JBD: Transaction %u "
+ "found to be corrupt.\n",
+ next_commit_ID);
+ brelse(bh);
+ break;
+ }
+ }
+ crc32_sum = ~0;
+ }
brelse(bh);
next_commit_ID++;
continue;
@@ -554,9 +688,10 @@ static int do_one_pass(journal_t *journal,
* transaction marks the end of the valid log.
*/

- if (pass == PASS_SCAN)
- info->end_transaction = next_commit_ID;
- else {
+ if (pass == PASS_SCAN) {
+ if (!info->end_transaction)
+ info->end_transaction = next_commit_ID;
+ } else {
/* It's really bad news if different passes end up at
* different places (but possible due to IO errors). */
if (info->end_transaction != next_commit_ID) {
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index 300cc5a..cd406db 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -467,7 +467,8 @@ do { \
#define EXT4_MOUNT_USRQUOTA 0x100000 /* "old" user quota */
#define EXT4_MOUNT_GRPQUOTA 0x200000 /* "old" group quota */
#define EXT4_MOUNT_EXTENTS 0x400000 /* Extents support */
-
+#define EXT4_MOUNT_JOURNAL_CHECKSUM 0x800000 /* Journal checksums */
+#define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x1000000 /* Journal Async Commit */
/* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
#define clear_opt(o, opt) o &= ~EXT4_MOUNT_##opt
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 6856400..a2645c2 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -149,6 +149,28 @@ typedef struct journal_header_s
__be32 h_sequence;
} journal_header_t;

+/*
+ * Checksum types.
+ */
+#define JBD2_CRC32_CHKSUM 1
+#define JBD2_MD5_CHKSUM 2
+#define JBD2_SHA1_CHKSUM 3
+
+#define JBD2_CRC32_CHKSUM_SIZE 4
+
+#define JBD2_CHECKSUM_BYTES (32 / sizeof(u32))
+/*
+ * Commit block header for storing transactional checksums:
+ */
+struct commit_header {
+ __be32 h_magic;
+ __be32 h_blocktype;
+ __be32 h_sequence;
+ unsigned char h_chksum_type;
+ unsigned char h_chksum_size;
+ unsigned char h_padding[2];
+ __be32 h_chksum[JBD2_CHECKSUM_BYTES];
+};

/*
* The block tag: used to describe a single buffer in the journal.
@@ -242,14 +264,18 @@ typedef struct journal_superblock_s
((j)->j_format_version >= 2 && \
((j)->j_superblock->s_feature_incompat & cpu_to_be32((mask))))

-#define JBD2_FEATURE_INCOMPAT_REVOKE 0x00000001
-#define JBD2_FEATURE_INCOMPAT_64BIT 0x00000002
+#define JBD2_FEATURE_COMPAT_CHECKSUM 0x00000001
+
+#define JBD2_FEATURE_INCOMPAT_REVOKE 0x00000001
+#define JBD2_FEATURE_INCOMPAT_64BIT 0x00000002
+#define JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT 0x00000004

/* Features known to this kernel version: */
-#define JBD2_KNOWN_COMPAT_FEATURES 0
+#define JBD2_KNOWN_COMPAT_FEATURES JBD2_FEATURE_COMPAT_CHECKSUM
#define JBD2_KNOWN_ROCOMPAT_FEATURES 0
#define JBD2_KNOWN_INCOMPAT_FEATURES (JBD2_FEATURE_INCOMPAT_REVOKE | \
- JBD2_FEATURE_INCOMPAT_64BIT)
+ JBD2_FEATURE_INCOMPAT_64BIT | \
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)

#ifdef __KERNEL__

@@ -997,6 +1023,8 @@ extern int jbd2_journal_check_available_features
(journal_t *, unsigned long, unsigned long, unsigned long);
extern int jbd2_journal_set_features
(journal_t *, unsigned long, unsigned long, unsigned long);
+extern int jbd2_journal_clear_features
+ (journal_t *, unsigned long, unsigned long, unsigned long);
extern int jbd2_journal_create (journal_t *);
extern int jbd2_journal_load (journal_t *journal);
extern void jbd2_journal_destroy (journal_t *);
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:19:51

by Theodore Ts'o

Subject: [PATCH 38/49] ext4: fix up EXT4FS_DEBUG builds

From: Eric Sandeen <[email protected]>

Builds with EXT4FS_DEBUG defined (to enable ext4_debug()) fail
without these changes. Clean up some format warnings too.
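
Most of the format fixes come down to printing a 64-bit block number
with %llu (or %llx) instead of %lu. A small user-space sketch of the
pattern follows; demo_fsblk_t and demo_format_goal are hypothetical
names, with the typedef assumed to mirror a 64-bit block number type.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumption: a 64-bit block number, printed with %llu. */
typedef unsigned long long demo_fsblk_t;

/* Format a goal block the way the fixed ext4_debug() call does. */
static int demo_format_goal(char *buf, size_t len, demo_fsblk_t goal)
{
	return snprintf(buf, len, "goal=%llu.", goal);
}
```

With %lu, a value above 2^32 would be mis-printed on 32-bit builds and
trigger a format warning; %llu matches the argument width everywhere.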

Signed-off-by: Eric Sandeen <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---
fs/ext4/balloc.c | 6 +++---
fs/ext4/ialloc.c | 2 +-
fs/ext4/resize.c | 16 ++++++++--------
3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 925e063..54d3da7 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -1630,7 +1630,7 @@ ext4_fsblk_t ext4_new_blocks(handle_t *handle, struct inode *inode,

sbi = EXT4_SB(sb);
es = EXT4_SB(sb)->s_es;
- ext4_debug("goal=%lu.\n", goal);
+ ext4_debug("goal=%llu.\n", goal);
/*
* Allocate a block from reservation only when
* filesystem is mounted with reservation(default,-o reservation), and
@@ -1740,7 +1740,7 @@ retry_alloc:

allocated:

- ext4_debug("using block group %d(%d)\n",
+ ext4_debug("using block group %lu(%d)\n",
group_no, gdp->bg_free_blocks_count);

BUFFER_TRACE(gdp_bh, "get_write_access");
@@ -1898,7 +1898,7 @@ ext4_fsblk_t ext4_count_free_blocks(struct super_block *sb)
brelse(bitmap_bh);
printk("ext4_count_free_blocks: stored = %llu"
", computed = %llu, %llu\n",
- EXT4_FREE_BLOCKS_COUNT(es),
+ ext4_free_blocks_count(es),
desc_count, bitmap_count);
return bitmap_count;
#else
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 17b5df1..575b521 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -857,7 +857,7 @@ unsigned long ext4_count_free_inodes (struct super_block * sb)
continue;

x = ext4_count_free(bitmap_bh, EXT4_INODES_PER_GROUP(sb) / 8);
- printk("group %d: stored = %d, counted = %lu\n",
+ printk(KERN_DEBUG "group %lu: stored = %d, counted = %lu\n",
i, le16_to_cpu(gdp->bg_free_inodes_count), x);
bitmap_count += x;
}
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 7090c2d..4fbba60 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -206,7 +206,7 @@ static int setup_new_group_blocks(struct super_block *sb,
}

if (ext4_bg_has_super(sb, input->group)) {
- ext4_debug("mark backup superblock %#04lx (+0)\n", start);
+ ext4_debug("mark backup superblock %#04llx (+0)\n", start);
ext4_set_bit(0, bh->b_data);
}

@@ -215,7 +215,7 @@ static int setup_new_group_blocks(struct super_block *sb,
i < gdblocks; i++, block++, bit++) {
struct buffer_head *gdb;

- ext4_debug("update backup group %#04lx (+%d)\n", block, bit);
+ ext4_debug("update backup group %#04llx (+%d)\n", block, bit);

if ((err = extend_or_restart_transaction(handle, 1, bh)))
goto exit_bh;
@@ -243,7 +243,7 @@ static int setup_new_group_blocks(struct super_block *sb,
i < reserved_gdb; i++, block++, bit++) {
struct buffer_head *gdb;

- ext4_debug("clear reserved block %#04lx (+%d)\n", block, bit);
+ ext4_debug("clear reserved block %#04llx (+%d)\n", block, bit);

if ((err = extend_or_restart_transaction(handle, 1, bh)))
goto exit_bh;
@@ -256,10 +256,10 @@ static int setup_new_group_blocks(struct super_block *sb,
ext4_set_bit(bit, bh->b_data);
brelse(gdb);
}
- ext4_debug("mark block bitmap %#04x (+%ld)\n", input->block_bitmap,
+ ext4_debug("mark block bitmap %#04llx (+%llu)\n", input->block_bitmap,
input->block_bitmap - start);
ext4_set_bit(input->block_bitmap - start, bh->b_data);
- ext4_debug("mark inode bitmap %#04x (+%ld)\n", input->inode_bitmap,
+ ext4_debug("mark inode bitmap %#04llx (+%llu)\n", input->inode_bitmap,
input->inode_bitmap - start);
ext4_set_bit(input->inode_bitmap - start, bh->b_data);

@@ -268,7 +268,7 @@ static int setup_new_group_blocks(struct super_block *sb,
i < sbi->s_itb_per_group; i++, bit++, block++) {
struct buffer_head *it;

- ext4_debug("clear inode block %#04lx (+%d)\n", block, bit);
+ ext4_debug("clear inode block %#04llx (+%d)\n", block, bit);

if ((err = extend_or_restart_transaction(handle, 1, bh)))
goto exit_bh;
@@ -291,7 +291,7 @@ static int setup_new_group_blocks(struct super_block *sb,
brelse(bh);

/* Mark unused entries in inode bitmap used */
- ext4_debug("clear inode bitmap %#04x (+%ld)\n",
+ ext4_debug("clear inode bitmap %#04llx (+%llu)\n",
input->inode_bitmap, input->inode_bitmap - start);
if (IS_ERR(bh = bclean(handle, sb, input->inode_bitmap))) {
err = PTR_ERR(bh);
@@ -1054,7 +1054,7 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
ext4_journal_dirty_metadata(handle, EXT4_SB(sb)->s_sbh);
sb->s_dirt = 1;
unlock_super(sb);
- ext4_debug("freeing blocks %lu through %llu\n", o_blocks_count,
+ ext4_debug("freeing blocks %llu through %llu\n", o_blocks_count,
o_blocks_count + add);
ext4_free_blocks_sb(handle, sb, o_blocks_count, add, &freed_blocks);
ext4_debug("freed blocks %llu through %llu\n", o_blocks_count,
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:20:26

by Theodore Ts'o

Subject: [PATCH 34/49] vfs: Add 64 bit i_version support

From: Jean Noel Cordenner <[email protected]>

The i_version field of the inode is changed to be a 64-bit counter that
is set on every inode creation and incremented every time the
inode data is modified (similarly to the "ctime" time-stamp).
The aim is to fulfill an NFSv4 requirement from RFC 3530.
This first part concerns the VFS: it converts the 32-bit i_version in
the generic inode to 64 bits, adds a flag in the super block to
check whether the feature is enabled, and increments i_version in
the VFS.

Signed-off-by: Mingming Cao <[email protected]>
Signed-off-by: Jean Noel Cordenner <[email protected]>
Signed-off-by: Kalpak Shah <[email protected]>
---
fs/afs/dir.c | 9 +++++----
fs/afs/inode.c | 3 ++-
fs/inode.c | 22 ++++++++++++++++++++++
include/linux/fs.h | 5 ++++-
4 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 33fe39a..0cc3597 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -546,11 +546,11 @@ static struct dentry *afs_lookup(struct inode *dir, struct dentry *dentry,
dentry->d_op = &afs_fs_dentry_operations;

d_add(dentry, inode);
- _leave(" = 0 { vn=%u u=%u } -> { ino=%lu v=%lu }",
+ _leave(" = 0 { vn=%u u=%u } -> { ino=%lu v=%llu }",
fid.vnode,
fid.unique,
dentry->d_inode->i_ino,
- dentry->d_inode->i_version);
+ (unsigned long long)dentry->d_inode->i_version);

return NULL;
}
@@ -630,9 +630,10 @@ static int afs_d_revalidate(struct dentry *dentry, struct nameidata *nd)
* been deleted and replaced, and the original vnode ID has
* been reused */
if (fid.unique != vnode->fid.unique) {
- _debug("%s: file deleted (uq %u -> %u I:%lu)",
+ _debug("%s: file deleted (uq %u -> %u I:%llu)",
dentry->d_name.name, fid.unique,
- vnode->fid.unique, dentry->d_inode->i_version);
+ vnode->fid.unique,
+ (unsigned long long)dentry->d_inode->i_version);
spin_lock(&vnode->lock);
set_bit(AFS_VNODE_DELETED, &vnode->flags);
spin_unlock(&vnode->lock);
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index d196840..84750c8 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -301,7 +301,8 @@ int afs_getattr(struct vfsmount *mnt, struct dentry *dentry,

inode = dentry->d_inode;

- _enter("{ ino=%lu v=%lu }", inode->i_ino, inode->i_version);
+ _enter("{ ino=%lu v=%llu }", inode->i_ino,
+ (unsigned long long)inode->i_version);

generic_fillattr(inode, stat);
return 0;
diff --git a/fs/inode.c b/fs/inode.c
index ed35383..b48324a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1243,6 +1243,23 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
EXPORT_SYMBOL(touch_atime);

/**
+ * inode_inc_iversion - increments i_version
+ * @inode: inode that need to be updated
+ *
+ * Every time the inode is modified, the i_version field
+ * will be incremented.
+ * The filesystem has to be mounted with i_version flag
+ *
+ */
+
+void inode_inc_iversion(struct inode *inode)
+{
+ spin_lock(&inode->i_lock);
+ inode->i_version++;
+ spin_unlock(&inode->i_lock);
+}
+
+/**
* file_update_time - update mtime and ctime time
* @file: file accessed
*
@@ -1276,6 +1293,11 @@ void file_update_time(struct file *file)
sync_it = 1;
}

+ if (IS_I_VERSION(inode)) {
+ inode_inc_iversion(inode);
+ sync_it = 1;
+ }
+
if (sync_it)
mark_inode_dirty_sync(inode);
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3ec4a4..94cf5d8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -124,6 +124,7 @@ extern int dir_notify_enable;
#define MS_SHARED (1<<20) /* change to shared */
#define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
+#define MS_I_VERSION (1<<23) /* Update inode I_version field */
#define MS_ACTIVE (1<<30)
#define MS_NOUSER (1<<31)

@@ -173,6 +174,7 @@ extern int dir_notify_enable;
((inode)->i_flags & (S_SYNC|S_DIRSYNC)))
#define IS_MANDLOCK(inode) __IS_FLG(inode, MS_MANDLOCK)
#define IS_NOATIME(inode) __IS_FLG(inode, MS_RDONLY|MS_NOATIME)
+#define IS_I_VERSION(inode) __IS_FLG(inode, MS_I_VERSION)

#define IS_NOQUOTA(inode) ((inode)->i_flags & S_NOQUOTA)
#define IS_APPEND(inode) ((inode)->i_flags & S_APPEND)
@@ -599,7 +601,7 @@ struct inode {
uid_t i_uid;
gid_t i_gid;
dev_t i_rdev;
- unsigned long i_version;
+ u64 i_version;
loff_t i_size;
#ifdef __NEED_I_SIZE_ORDERED
seqcount_t i_size_seqcount;
@@ -1394,6 +1396,7 @@ static inline void inode_dec_link_count(struct inode *inode)
mark_inode_dirty(inode);
}

+extern void inode_inc_iversion(struct inode *inode);
extern void touch_atime(struct vfsmount *mnt, struct dentry *dentry);
static inline void file_accessed(struct file *file)
{
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:20:51

by Theodore Ts'o

Subject: [PATCH 03/49] ext4: Introduce ext4_lblk_t

From: Aneesh Kumar K.V <[email protected]>

This patch adds a new data type ext4_lblk_t to represent
the logical file blocks.

This is the preparatory patch to support large files in ext4.
The follow-up patch will convert the ext4_inode i_blocks to
represent the number of blocks in units of the file system block
size. This change makes it possible to have a block number 2**32 - 1,
which will result in overflow if the block number is represented by
signed long. This patch converts all the block numbers to type
ext4_lblk_t, which is a typedef for __u32.

Also remove dead code ext4_ext_walk_space.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
Signed-off-by: Eric Sandeen <[email protected]>
---
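To make the overflow argument above concrete, the following userspace sketch checks whether a logical block number fits in a 32-bit signed type (what long is on 32-bit architectures) versus an unsigned 32-bit type. demo_lblk_t and the two helpers are hypothetical names for illustration; only the choice of an unsigned 32-bit typedef comes from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t demo_lblk_t;	/* models ext4_lblk_t, a typedef for __u32 */

/* True if blk can be stored in a 32-bit signed integer without overflow. */
static bool demo_fits_signed32(uint64_t blk)
{
	return blk <= (uint64_t)INT32_MAX;
}

/* True if blk can be stored in demo_lblk_t without truncation. */
static bool demo_fits_lblk(uint64_t blk)
{
	return blk <= (uint64_t)UINT32_MAX;
}
```

For the block number 2**32 - 1 mentioned in the commit message, demo_fits_signed32() is false while demo_fits_lblk() is true, which is the reason for moving away from int and long for logical block numbers.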
fs/ext4/dir.c | 2 +-
fs/ext4/extents.c | 218 ++++++++++++---------------------------
fs/ext4/inode.c | 34 ++++---
fs/ext4/namei.c | 54 ++++++-----
fs/ext4/super.c | 4 +-
include/linux/ext4_fs.h | 29 ++++--
include/linux/ext4_fs_extents.h | 19 +---
include/linux/ext4_fs_i.h | 9 +-
8 files changed, 143 insertions(+), 226 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 145a9c0..33888bb 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -124,7 +124,7 @@ static int ext4_readdir(struct file * filp,
offset = filp->f_pos & (sb->s_blocksize - 1);

while (!error && !stored && filp->f_pos < inode->i_size) {
- unsigned long blk = filp->f_pos >> EXT4_BLOCK_SIZE_BITS(sb);
+ ext4_lblk_t blk = filp->f_pos >> EXT4_BLOCK_SIZE_BITS(sb);
struct buffer_head map_bh;
struct buffer_head *bh = NULL;

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 8528774..19d8059 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -144,7 +144,7 @@ static int ext4_ext_dirty(handle_t *handle, struct inode *inode,

static ext4_fsblk_t ext4_ext_find_goal(struct inode *inode,
struct ext4_ext_path *path,
- ext4_fsblk_t block)
+ ext4_lblk_t block)
{
struct ext4_inode_info *ei = EXT4_I(inode);
ext4_fsblk_t bg_start;
@@ -367,13 +367,14 @@ static void ext4_ext_drop_refs(struct ext4_ext_path *path)
* the header must be checked before calling this
*/
static void
-ext4_ext_binsearch_idx(struct inode *inode, struct ext4_ext_path *path, int block)
+ext4_ext_binsearch_idx(struct inode *inode,
+ struct ext4_ext_path *path, ext4_lblk_t block)
{
struct ext4_extent_header *eh = path->p_hdr;
struct ext4_extent_idx *r, *l, *m;


- ext_debug("binsearch for %d(idx): ", block);
+ ext_debug("binsearch for %lu(idx): ", (unsigned long)block);

l = EXT_FIRST_INDEX(eh) + 1;
r = EXT_LAST_INDEX(eh);
@@ -425,7 +426,8 @@ ext4_ext_binsearch_idx(struct inode *inode, struct ext4_ext_path *path, int bloc
* the header must be checked before calling this
*/
static void
-ext4_ext_binsearch(struct inode *inode, struct ext4_ext_path *path, int block)
+ext4_ext_binsearch(struct inode *inode,
+ struct ext4_ext_path *path, ext4_lblk_t block)
{
struct ext4_extent_header *eh = path->p_hdr;
struct ext4_extent *r, *l, *m;
@@ -438,7 +440,7 @@ ext4_ext_binsearch(struct inode *inode, struct ext4_ext_path *path, int block)
return;
}

- ext_debug("binsearch for %d: ", block);
+ ext_debug("binsearch for %lu: ", (unsigned long)block);

l = EXT_FIRST_EXTENT(eh) + 1;
r = EXT_LAST_EXTENT(eh);
@@ -494,7 +496,8 @@ int ext4_ext_tree_init(handle_t *handle, struct inode *inode)
}

struct ext4_ext_path *
-ext4_ext_find_extent(struct inode *inode, int block, struct ext4_ext_path *path)
+ext4_ext_find_extent(struct inode *inode, ext4_lblk_t block,
+ struct ext4_ext_path *path)
{
struct ext4_extent_header *eh;
struct buffer_head *bh;
@@ -979,8 +982,8 @@ repeat:
/* refill path */
ext4_ext_drop_refs(path);
path = ext4_ext_find_extent(inode,
- le32_to_cpu(newext->ee_block),
- path);
+ (ext4_lblk_t)le32_to_cpu(newext->ee_block),
+ path);
if (IS_ERR(path))
err = PTR_ERR(path);
} else {
@@ -992,8 +995,8 @@ repeat:
/* refill path */
ext4_ext_drop_refs(path);
path = ext4_ext_find_extent(inode,
- le32_to_cpu(newext->ee_block),
- path);
+ (ext4_lblk_t)le32_to_cpu(newext->ee_block),
+ path);
if (IS_ERR(path)) {
err = PTR_ERR(path);
goto out;
@@ -1021,7 +1024,7 @@ out:
* allocated block. Thus, index entries have to be consistent
* with leaves.
*/
-static unsigned long
+static ext4_lblk_t
ext4_ext_next_allocated_block(struct ext4_ext_path *path)
{
int depth;
@@ -1054,7 +1057,7 @@ ext4_ext_next_allocated_block(struct ext4_ext_path *path)
* ext4_ext_next_leaf_block:
* returns first allocated block from next leaf or EXT_MAX_BLOCK
*/
-static unsigned ext4_ext_next_leaf_block(struct inode *inode,
+static ext4_lblk_t ext4_ext_next_leaf_block(struct inode *inode,
struct ext4_ext_path *path)
{
int depth;
@@ -1072,7 +1075,8 @@ static unsigned ext4_ext_next_leaf_block(struct inode *inode,
while (depth >= 0) {
if (path[depth].p_idx !=
EXT_LAST_INDEX(path[depth].p_hdr))
- return le32_to_cpu(path[depth].p_idx[1].ei_block);
+ return (ext4_lblk_t)
+ le32_to_cpu(path[depth].p_idx[1].ei_block);
depth--;
}

@@ -1239,7 +1243,7 @@ unsigned int ext4_ext_check_overlap(struct inode *inode,
struct ext4_extent *newext,
struct ext4_ext_path *path)
{
- unsigned long b1, b2;
+ ext4_lblk_t b1, b2;
unsigned int depth, len1;
unsigned int ret = 0;

@@ -1260,7 +1264,7 @@ unsigned int ext4_ext_check_overlap(struct inode *inode,
goto out;
}

- /* check for wrap through zero */
+ /* check for wrap through zero on extent logical start block*/
if (b1 + len1 < b1) {
len1 = EXT_MAX_BLOCK - b1;
newext->ee_len = cpu_to_le16(len1);
@@ -1290,7 +1294,8 @@ int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
struct ext4_extent *ex, *fex;
struct ext4_extent *nearex; /* nearest extent */
struct ext4_ext_path *npath = NULL;
- int depth, len, err, next;
+ int depth, len, err;
+ ext4_lblk_t next;
unsigned uninitialized = 0;

BUG_ON(ext4_ext_get_actual_len(newext) == 0);
@@ -1435,114 +1440,8 @@ cleanup:
return err;
}

-int ext4_ext_walk_space(struct inode *inode, unsigned long block,
- unsigned long num, ext_prepare_callback func,
- void *cbdata)
-{
- struct ext4_ext_path *path = NULL;
- struct ext4_ext_cache cbex;
- struct ext4_extent *ex;
- unsigned long next, start = 0, end = 0;
- unsigned long last = block + num;
- int depth, exists, err = 0;
-
- BUG_ON(func == NULL);
- BUG_ON(inode == NULL);
-
- while (block < last && block != EXT_MAX_BLOCK) {
- num = last - block;
- /* find extent for this block */
- path = ext4_ext_find_extent(inode, block, path);
- if (IS_ERR(path)) {
- err = PTR_ERR(path);
- path = NULL;
- break;
- }
-
- depth = ext_depth(inode);
- BUG_ON(path[depth].p_hdr == NULL);
- ex = path[depth].p_ext;
- next = ext4_ext_next_allocated_block(path);
-
- exists = 0;
- if (!ex) {
- /* there is no extent yet, so try to allocate
- * all requested space */
- start = block;
- end = block + num;
- } else if (le32_to_cpu(ex->ee_block) > block) {
- /* need to allocate space before found extent */
- start = block;
- end = le32_to_cpu(ex->ee_block);
- if (block + num < end)
- end = block + num;
- } else if (block >= le32_to_cpu(ex->ee_block)
- + ext4_ext_get_actual_len(ex)) {
- /* need to allocate space after found extent */
- start = block;
- end = block + num;
- if (end >= next)
- end = next;
- } else if (block >= le32_to_cpu(ex->ee_block)) {
- /*
- * some part of requested space is covered
- * by found extent
- */
- start = block;
- end = le32_to_cpu(ex->ee_block)
- + ext4_ext_get_actual_len(ex);
- if (block + num < end)
- end = block + num;
- exists = 1;
- } else {
- BUG();
- }
- BUG_ON(end <= start);
-
- if (!exists) {
- cbex.ec_block = start;
- cbex.ec_len = end - start;
- cbex.ec_start = 0;
- cbex.ec_type = EXT4_EXT_CACHE_GAP;
- } else {
- cbex.ec_block = le32_to_cpu(ex->ee_block);
- cbex.ec_len = ext4_ext_get_actual_len(ex);
- cbex.ec_start = ext_pblock(ex);
- cbex.ec_type = EXT4_EXT_CACHE_EXTENT;
- }
-
- BUG_ON(cbex.ec_len == 0);
- err = func(inode, path, &cbex, cbdata);
- ext4_ext_drop_refs(path);
-
- if (err < 0)
- break;
- if (err == EXT_REPEAT)
- continue;
- else if (err == EXT_BREAK) {
- err = 0;
- break;
- }
-
- if (ext_depth(inode) != depth) {
- /* depth was changed. we have to realloc path */
- kfree(path);
- path = NULL;
- }
-
- block = cbex.ec_block + cbex.ec_len;
- }
-
- if (path) {
- ext4_ext_drop_refs(path);
- kfree(path);
- }
-
- return err;
-}
-
static void
-ext4_ext_put_in_cache(struct inode *inode, __u32 block,
+ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block,
__u32 len, ext4_fsblk_t start, int type)
{
struct ext4_ext_cache *cex;
@@ -1561,10 +1460,11 @@ ext4_ext_put_in_cache(struct inode *inode, __u32 block,
*/
static void
ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
- unsigned long block)
+ ext4_lblk_t block)
{
int depth = ext_depth(inode);
- unsigned long lblock, len;
+ unsigned long len;
+ ext4_lblk_t lblock;
struct ext4_extent *ex;

ex = path[depth].p_ext;
@@ -1582,15 +1482,17 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
(unsigned long) ext4_ext_get_actual_len(ex));
} else if (block >= le32_to_cpu(ex->ee_block)
+ ext4_ext_get_actual_len(ex)) {
+ ext4_lblk_t next;
lblock = le32_to_cpu(ex->ee_block)
+ ext4_ext_get_actual_len(ex);
- len = ext4_ext_next_allocated_block(path);
+
+ next = ext4_ext_next_allocated_block(path);
ext_debug("cache gap(after): [%lu:%lu] %lu",
(unsigned long) le32_to_cpu(ex->ee_block),
(unsigned long) ext4_ext_get_actual_len(ex),
(unsigned long) block);
- BUG_ON(len == lblock);
- len = len - lblock;
+ BUG_ON(next == lblock);
+ len = next - lblock;
} else {
lblock = len = 0;
BUG();
@@ -1601,7 +1503,7 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
}

static int
-ext4_ext_in_cache(struct inode *inode, unsigned long block,
+ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block,
struct ext4_extent *ex)
{
struct ext4_ext_cache *cex;
@@ -1714,7 +1616,7 @@ int ext4_ext_calc_credits_for_insert(struct inode *inode,

static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
struct ext4_extent *ex,
- unsigned long from, unsigned long to)
+ ext4_lblk_t from, ext4_lblk_t to)
{
struct buffer_head *bh;
unsigned short ee_len = ext4_ext_get_actual_len(ex);
@@ -1738,11 +1640,12 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
if (from >= le32_to_cpu(ex->ee_block)
&& to == le32_to_cpu(ex->ee_block) + ee_len - 1) {
/* tail removal */
- unsigned long num;
+ ext4_lblk_t num;
ext4_fsblk_t start;
+
num = le32_to_cpu(ex->ee_block) + ee_len - from;
start = ext_pblock(ex) + ee_len - num;
- ext_debug("free last %lu blocks starting %llu\n", num, start);
+ ext_debug("free last %u blocks starting %llu\n", num, start);
for (i = 0; i < num; i++) {
bh = sb_find_get_block(inode->i_sb, start + i);
ext4_forget(handle, 0, inode, bh, start + i);
@@ -1750,30 +1653,32 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
ext4_free_blocks(handle, inode, start, num);
} else if (from == le32_to_cpu(ex->ee_block)
&& to <= le32_to_cpu(ex->ee_block) + ee_len - 1) {
- printk("strange request: removal %lu-%lu from %u:%u\n",
+ printk(KERN_INFO "strange request: removal %u-%u from %u:%u\n",
from, to, le32_to_cpu(ex->ee_block), ee_len);
} else {
- printk("strange request: removal(2) %lu-%lu from %u:%u\n",
- from, to, le32_to_cpu(ex->ee_block), ee_len);
+ printk(KERN_INFO "strange request: removal(2) "
+ "%u-%u from %u:%u\n",
+ from, to, le32_to_cpu(ex->ee_block), ee_len);
}
return 0;
}

static int
ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
- struct ext4_ext_path *path, unsigned long start)
+ struct ext4_ext_path *path, ext4_lblk_t start)
{
int err = 0, correct_index = 0;
int depth = ext_depth(inode), credits;
struct ext4_extent_header *eh;
- unsigned a, b, block, num;
- unsigned long ex_ee_block;
+ ext4_lblk_t a, b, block;
+ unsigned num;
+ ext4_lblk_t ex_ee_block;
unsigned short ex_ee_len;
unsigned uninitialized = 0;
struct ext4_extent *ex;

/* the header must be checked already in ext4_ext_remove_space() */
- ext_debug("truncate since %lu in leaf\n", start);
+ ext_debug("truncate since %u in leaf\n", start);
if (!path[depth].p_hdr)
path[depth].p_hdr = ext_block_hdr(path[depth].p_bh);
eh = path[depth].p_hdr;
@@ -1904,7 +1809,7 @@ ext4_ext_more_to_rm(struct ext4_ext_path *path)
return 1;
}

-int ext4_ext_remove_space(struct inode *inode, unsigned long start)
+int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start)
{
struct super_block *sb = inode->i_sb;
int depth = ext_depth(inode);
@@ -1912,7 +1817,7 @@ int ext4_ext_remove_space(struct inode *inode, unsigned long start)
handle_t *handle;
int i = 0, err = 0;

- ext_debug("truncate since %lu\n", start);
+ ext_debug("truncate since %u\n", start);

/* probably first extent we're gonna free will be last in block */
handle = ext4_journal_start(inode, depth + 1);
@@ -2094,17 +1999,19 @@ void ext4_ext_release(struct super_block *sb)
* b> Splits in two extents: Write is happening at either end of the extent
* c> Splits in three extents: Somone is writing in middle of the extent
*/
-int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
- struct ext4_ext_path *path,
- ext4_fsblk_t iblock,
- unsigned long max_blocks)
+static int ext4_ext_convert_to_initialized(handle_t *handle,
+ struct inode *inode,
+ struct ext4_ext_path *path,
+ ext4_lblk_t iblock,
+ unsigned long max_blocks)
{
struct ext4_extent *ex, newex;
struct ext4_extent *ex1 = NULL;
struct ext4_extent *ex2 = NULL;
struct ext4_extent *ex3 = NULL;
struct ext4_extent_header *eh;
- unsigned int allocated, ee_block, ee_len, depth;
+ ext4_lblk_t ee_block;
+ unsigned int allocated, ee_len, depth;
ext4_fsblk_t newblock;
int err = 0;
int ret = 0;
@@ -2226,7 +2133,7 @@ out:
}

int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
- ext4_fsblk_t iblock,
+ ext4_lblk_t iblock,
unsigned long max_blocks, struct buffer_head *bh_result,
int create, int extend_disksize)
{
@@ -2238,8 +2145,9 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
unsigned long allocated = 0;

__clear_bit(BH_New, &bh_result->b_state);
- ext_debug("blocks %d/%lu requested for inode %u\n", (int) iblock,
- max_blocks, (unsigned) inode->i_ino);
+ ext_debug("blocks %lu/%lu requested for inode %u\n",
+ (unsigned long) iblock, max_blocks,
+ (unsigned) inode->i_ino);
mutex_lock(&EXT4_I(inode)->truncate_mutex);

/* check in cache */
@@ -2288,7 +2196,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,

ex = path[depth].p_ext;
if (ex) {
- unsigned long ee_block = le32_to_cpu(ex->ee_block);
+ ext4_lblk_t ee_block = le32_to_cpu(ex->ee_block);
ext4_fsblk_t ee_start = ext_pblock(ex);
unsigned short ee_len;

@@ -2423,7 +2331,7 @@ void ext4_ext_truncate(struct inode * inode, struct page *page)
{
struct address_space *mapping = inode->i_mapping;
struct super_block *sb = inode->i_sb;
- unsigned long last_block;
+ ext4_lblk_t last_block;
handle_t *handle;
int err = 0;

@@ -2516,7 +2424,8 @@ int ext4_ext_writepage_trans_blocks(struct inode *inode, int num)
long ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
{
handle_t *handle;
- ext4_fsblk_t block, max_blocks;
+ ext4_lblk_t block;
+ unsigned long max_blocks;
ext4_fsblk_t nblocks = 0;
int ret = 0;
int ret2 = 0;
@@ -2561,8 +2470,9 @@ retry:
if (!ret) {
ext4_error(inode->i_sb, "ext4_fallocate",
"ext4_ext_get_blocks returned 0! inode#%lu"
- ", block=%llu, max_blocks=%llu",
- inode->i_ino, block, max_blocks);
+ ", block=%lu, max_blocks=%lu",
+ inode->i_ino, (unsigned long)block,
+ (unsigned long)max_blocks);
ret = -EIO;
ext4_mark_inode_dirty(handle, inode);
ret2 = ext4_journal_stop(handle);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5489703..488f829 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -105,7 +105,7 @@ int ext4_forget(handle_t *handle, int is_metadata, struct inode *inode,
*/
static unsigned long blocks_for_truncate(struct inode *inode)
{
- unsigned long needed;
+ ext4_lblk_t needed;

needed = inode->i_blocks >> (inode->i_sb->s_blocksize_bits - 9);

@@ -282,7 +282,8 @@ static int verify_chain(Indirect *from, Indirect *to)
*/

static int ext4_block_to_path(struct inode *inode,
- long i_block, int offsets[4], int *boundary)
+ ext4_lblk_t i_block,
+ ext4_lblk_t offsets[4], int *boundary)
{
int ptrs = EXT4_ADDR_PER_BLOCK(inode->i_sb);
int ptrs_bits = EXT4_ADDR_PER_BLOCK_BITS(inode->i_sb);
@@ -349,7 +350,8 @@ static int ext4_block_to_path(struct inode *inode,
* or when it reads all @depth-1 indirect blocks successfully and finds
* the whole chain, all way to the data (returns %NULL, *err == 0).
*/
-static Indirect *ext4_get_branch(struct inode *inode, int depth, int *offsets,
+static Indirect *ext4_get_branch(struct inode *inode, int depth,
+ ext4_lblk_t *offsets,
Indirect chain[4], int *err)
{
struct super_block *sb = inode->i_sb;
@@ -445,7 +447,7 @@ static ext4_fsblk_t ext4_find_near(struct inode *inode, Indirect *ind)
* stores it in *@goal and returns zero.
*/

-static ext4_fsblk_t ext4_find_goal(struct inode *inode, long block,
+static ext4_fsblk_t ext4_find_goal(struct inode *inode, ext4_lblk_t block,
Indirect chain[4], Indirect *partial)
{
struct ext4_block_alloc_info *block_i;
@@ -590,7 +592,7 @@ failed_out:
*/
static int ext4_alloc_branch(handle_t *handle, struct inode *inode,
int indirect_blks, int *blks, ext4_fsblk_t goal,
- int *offsets, Indirect *branch)
+ ext4_lblk_t *offsets, Indirect *branch)
{
int blocksize = inode->i_sb->s_blocksize;
int i, n = 0;
@@ -680,7 +682,7 @@ failed:
* chain to new block and return 0.
*/
static int ext4_splice_branch(handle_t *handle, struct inode *inode,
- long block, Indirect *where, int num, int blks)
+ ext4_lblk_t block, Indirect *where, int num, int blks)
{
int i;
int err = 0;
@@ -784,12 +786,12 @@ err_out:
* return < 0, error case.
*/
int ext4_get_blocks_handle(handle_t *handle, struct inode *inode,
- sector_t iblock, unsigned long maxblocks,
+ ext4_lblk_t iblock, unsigned long maxblocks,
struct buffer_head *bh_result,
int create, int extend_disksize)
{
int err = -EIO;
- int offsets[4];
+ ext4_lblk_t offsets[4];
Indirect chain[4];
Indirect *partial;
ext4_fsblk_t goal;
@@ -803,7 +805,8 @@ int ext4_get_blocks_handle(handle_t *handle, struct inode *inode,

J_ASSERT(!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL));
J_ASSERT(handle != NULL || create == 0);
- depth = ext4_block_to_path(inode,iblock,offsets,&blocks_to_boundary);
+ depth = ext4_block_to_path(inode, iblock, offsets,
+ &blocks_to_boundary);

if (depth == 0)
goto out;
@@ -996,7 +999,7 @@ get_block:
* `handle' can be NULL if create is zero
*/
struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
- long block, int create, int *errp)
+ ext4_lblk_t block, int create, int *errp)
{
struct buffer_head dummy;
int fatal = 0, err;
@@ -1063,7 +1066,7 @@ err:
}

struct buffer_head *ext4_bread(handle_t *handle, struct inode *inode,
- int block, int create, int *err)
+ ext4_lblk_t block, int create, int *err)
{
struct buffer_head * bh;

@@ -1828,7 +1831,8 @@ int ext4_block_truncate_page(handle_t *handle, struct page *page,
{
ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned blocksize, iblock, length, pos;
+ unsigned blocksize, length, pos;
+ ext4_lblk_t iblock;
struct inode *inode = mapping->host;
struct buffer_head *bh;
int err = 0;
@@ -1964,7 +1968,7 @@ static inline int all_zeroes(__le32 *p, __le32 *q)
* (no partially truncated stuff there). */

static Indirect *ext4_find_shared(struct inode *inode, int depth,
- int offsets[4], Indirect chain[4], __le32 *top)
+ ext4_lblk_t offsets[4], Indirect chain[4], __le32 *top)
{
Indirect *partial, *p;
int k, err;
@@ -2289,12 +2293,12 @@ void ext4_truncate(struct inode *inode)
__le32 *i_data = ei->i_data;
int addr_per_block = EXT4_ADDR_PER_BLOCK(inode->i_sb);
struct address_space *mapping = inode->i_mapping;
- int offsets[4];
+ ext4_lblk_t offsets[4];
Indirect chain[4];
Indirect *partial;
__le32 nr = 0;
int n;
- long last_block;
+ ext4_lblk_t last_block;
unsigned blocksize = inode->i_sb->s_blocksize;
struct page *page;

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index d9a3a2f..fb673b1 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -51,7 +51,7 @@

static struct buffer_head *ext4_append(handle_t *handle,
struct inode *inode,
- u32 *block, int *err)
+ ext4_lblk_t *block, int *err)
{
struct buffer_head *bh;

@@ -144,8 +144,8 @@ struct dx_map_entry
u16 size;
};

-static inline unsigned dx_get_block (struct dx_entry *entry);
-static void dx_set_block (struct dx_entry *entry, unsigned value);
+static inline ext4_lblk_t dx_get_block(struct dx_entry *entry);
+static void dx_set_block(struct dx_entry *entry, ext4_lblk_t value);
static inline unsigned dx_get_hash (struct dx_entry *entry);
static void dx_set_hash (struct dx_entry *entry, unsigned value);
static unsigned dx_get_count (struct dx_entry *entries);
@@ -166,7 +166,8 @@ static void dx_sort_map(struct dx_map_entry *map, unsigned count);
static struct ext4_dir_entry_2 *dx_move_dirents (char *from, char *to,
struct dx_map_entry *offsets, int count);
static struct ext4_dir_entry_2* dx_pack_dirents (char *base, int size);
-static void dx_insert_block (struct dx_frame *frame, u32 hash, u32 block);
+static void dx_insert_block(struct dx_frame *frame,
+ u32 hash, ext4_lblk_t block);
static int ext4_htree_next_block(struct inode *dir, __u32 hash,
struct dx_frame *frame,
struct dx_frame *frames,
@@ -181,12 +182,12 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
* Mask them off for now.
*/

-static inline unsigned dx_get_block (struct dx_entry *entry)
+static inline ext4_lblk_t dx_get_block(struct dx_entry *entry)
{
return le32_to_cpu(entry->block) & 0x00ffffff;
}

-static inline void dx_set_block (struct dx_entry *entry, unsigned value)
+static inline void dx_set_block(struct dx_entry *entry, ext4_lblk_t value)
{
entry->block = cpu_to_le32(value);
}
@@ -243,8 +244,8 @@ static void dx_show_index (char * label, struct dx_entry *entries)
int i, n = dx_get_count (entries);
printk("%s index ", label);
for (i = 0; i < n; i++) {
- printk("%x->%u ", i? dx_get_hash(entries + i) :
- 0, dx_get_block(entries + i));
+ printk("%x->%lu ", i? dx_get_hash(entries + i) :
+ 0, (unsigned long)dx_get_block(entries + i));
}
printk("\n");
}
@@ -297,7 +298,8 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
printk("%i indexed blocks...\n", count);
for (i = 0; i < count; i++, entries++)
{
- u32 block = dx_get_block(entries), hash = i? dx_get_hash(entries): 0;
+ ext4_lblk_t block = dx_get_block(entries);
+ ext4_lblk_t hash = i ? dx_get_hash(entries): 0;
u32 range = i < count - 1? (dx_get_hash(entries + 1) - hash): ~hash;
struct stats stats;
printk("%s%3u:%03u hash %8x/%8x ",levels?"":" ", i, block, hash, range);
@@ -561,7 +563,7 @@ static inline struct ext4_dir_entry_2 *ext4_next_entry(struct ext4_dir_entry_2 *
* into the tree. If there is an error it is returned in err.
*/
static int htree_dirblock_to_tree(struct file *dir_file,
- struct inode *dir, int block,
+ struct inode *dir, ext4_lblk_t block,
struct dx_hash_info *hinfo,
__u32 start_hash, __u32 start_minor_hash)
{
@@ -569,7 +571,8 @@ static int htree_dirblock_to_tree(struct file *dir_file,
struct ext4_dir_entry_2 *de, *top;
int err, count = 0;

- dxtrace(printk("In htree dirblock_to_tree: block %d\n", block));
+ dxtrace(printk(KERN_INFO "In htree dirblock_to_tree: block %lu\n",
+ (unsigned long)block));
if (!(bh = ext4_bread (NULL, dir, block, 0, &err)))
return err;

@@ -621,9 +624,9 @@ int ext4_htree_fill_tree(struct file *dir_file, __u32 start_hash,
struct ext4_dir_entry_2 *de;
struct dx_frame frames[2], *frame;
struct inode *dir;
- int block, err;
+ ext4_lblk_t block;
int count = 0;
- int ret;
+ int ret, err;
__u32 hashval;

dxtrace(printk("In htree_fill_tree, start hash: %x:%x\n", start_hash,
@@ -753,7 +756,7 @@ static void dx_sort_map (struct dx_map_entry *map, unsigned count)
} while(more);
}

-static void dx_insert_block(struct dx_frame *frame, u32 hash, u32 block)
+static void dx_insert_block(struct dx_frame *frame, u32 hash, ext4_lblk_t block)
{
struct dx_entry *entries = frame->entries;
struct dx_entry *old = frame->at, *new = old + 1;
@@ -848,13 +851,14 @@ static struct buffer_head * ext4_find_entry (struct dentry *dentry,
struct super_block * sb;
struct buffer_head * bh_use[NAMEI_RA_SIZE];
struct buffer_head * bh, *ret = NULL;
- unsigned long start, block, b;
+ ext4_lblk_t start, block, b;
int ra_max = 0; /* Number of bh's in the readahead
buffer, bh_use[] */
int ra_ptr = 0; /* Current index into readahead
buffer */
int num = 0;
- int nblocks, i, err;
+ ext4_lblk_t nblocks;
+ int i, err;
struct inode *dir = dentry->d_parent->d_inode;
int namelen;
const u8 *name;
@@ -915,7 +919,8 @@ restart:
if (!buffer_uptodate(bh)) {
/* read error, skip block & hope for the best */
ext4_error(sb, __FUNCTION__, "reading directory #%lu "
- "offset %lu", dir->i_ino, block);
+ "offset %lu", dir->i_ino,
+ (unsigned long)block);
brelse(bh);
goto next;
}
@@ -962,7 +967,7 @@ static struct buffer_head * ext4_dx_find_entry(struct dentry *dentry,
struct dx_frame frames[2], *frame;
struct ext4_dir_entry_2 *de, *top;
struct buffer_head *bh;
- unsigned long block;
+ ext4_lblk_t block;
int retval;
int namelen = dentry->d_name.len;
const u8 *name = dentry->d_name.name;
@@ -1174,7 +1179,7 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
unsigned blocksize = dir->i_sb->s_blocksize;
unsigned count, continued;
struct buffer_head *bh2;
- u32 newblock;
+ ext4_lblk_t newblock;
u32 hash2;
struct dx_map_entry *map;
char *data1 = (*bh)->b_data, *data2;
@@ -1221,8 +1226,9 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
split = count - move;
hash2 = map[split].hash;
continued = hash2 == map[split - 1].hash;
- dxtrace(printk("Split block %i at %x, %i/%i\n",
- dx_get_block(frame->at), hash2, split, count-split));
+ dxtrace(printk(KERN_INFO "Split block %lu at %x, %i/%i\n",
+ (unsigned long)dx_get_block(frame->at),
+ hash2, split, count-split));

/* Fancy dance to stay within two buffers */
de2 = dx_move_dirents(data1, data2, map + split, count - split);
@@ -1374,7 +1380,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
int retval;
unsigned blocksize;
struct dx_hash_info hinfo;
- u32 block;
+ ext4_lblk_t block;
struct fake_dirent *fde;

blocksize = dir->i_sb->s_blocksize;
@@ -1455,7 +1461,7 @@ static int ext4_add_entry (handle_t *handle, struct dentry *dentry,
int retval;
int dx_fallback=0;
unsigned blocksize;
- u32 block, blocks;
+ ext4_lblk_t block, blocks;

sb = dir->i_sb;
blocksize = sb->s_blocksize;
@@ -1532,7 +1538,7 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
dx_get_count(entries), dx_get_limit(entries)));
/* Need to split index? */
if (dx_get_count(entries) == dx_get_limit(entries)) {
- u32 newblock;
+ ext4_lblk_t newblock;
unsigned icount = dx_get_count(entries);
int levels = frame - frames;
struct dx_entry *entries2;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index ab7010d..6302b03 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2914,7 +2914,7 @@ static ssize_t ext4_quota_read(struct super_block *sb, int type, char *data,
size_t len, loff_t off)
{
struct inode *inode = sb_dqopt(sb)->files[type];
- sector_t blk = off >> EXT4_BLOCK_SIZE_BITS(sb);
+ ext4_lblk_t blk = off >> EXT4_BLOCK_SIZE_BITS(sb);
int err = 0;
int offset = off & (sb->s_blocksize - 1);
int tocopy;
@@ -2952,7 +2952,7 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type,
const char *data, size_t len, loff_t off)
{
struct inode *inode = sb_dqopt(sb)->files[type];
- sector_t blk = off >> EXT4_BLOCK_SIZE_BITS(sb);
+ ext4_lblk_t blk = off >> EXT4_BLOCK_SIZE_BITS(sb);
int err = 0;
int offset = off & (sb->s_blocksize - 1);
int tocopy;
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index fb31c1a..5e2da09 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -935,11 +935,14 @@ extern unsigned long ext4_count_free (struct buffer_head *, unsigned);
/* inode.c */
int ext4_forget(handle_t *handle, int is_metadata, struct inode *inode,
struct buffer_head *bh, ext4_fsblk_t blocknr);
-struct buffer_head * ext4_getblk (handle_t *, struct inode *, long, int, int *);
-struct buffer_head * ext4_bread (handle_t *, struct inode *, int, int, int *);
+struct buffer_head *ext4_getblk(handle_t *, struct inode *,
+ ext4_lblk_t, int, int *);
+struct buffer_head *ext4_bread(handle_t *, struct inode *,
+ ext4_lblk_t, int, int *);
int ext4_get_blocks_handle(handle_t *handle, struct inode *inode,
- sector_t iblock, unsigned long maxblocks, struct buffer_head *bh_result,
- int create, int extend_disksize);
+ ext4_lblk_t iblock, unsigned long maxblocks,
+ struct buffer_head *bh_result,
+ int create, int extend_disksize);

extern void ext4_read_inode (struct inode *);
extern int ext4_write_inode (struct inode *, int);
@@ -1068,7 +1071,7 @@ extern const struct inode_operations ext4_fast_symlink_inode_operations;
extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
extern int ext4_ext_writepage_trans_blocks(struct inode *, int);
extern int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
- ext4_fsblk_t iblock,
+ ext4_lblk_t iblock,
unsigned long max_blocks, struct buffer_head *bh_result,
int create, int extend_disksize);
extern void ext4_ext_truncate(struct inode *, struct page *);
@@ -1081,11 +1084,17 @@ ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
unsigned long max_blocks, struct buffer_head *bh,
int create, int extend_disksize)
{
- if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
- return ext4_ext_get_blocks(handle, inode, block, max_blocks,
- bh, create, extend_disksize);
- return ext4_get_blocks_handle(handle, inode, block, max_blocks, bh,
- create, extend_disksize);
+ int retval;
+ if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) {
+ retval = ext4_ext_get_blocks(handle, inode,
+ (ext4_lblk_t)block, max_blocks,
+ bh, create, extend_disksize);
+ } else {
+ retval = ext4_get_blocks_handle(handle, inode,
+ (ext4_lblk_t)block, max_blocks,
+ bh, create, extend_disksize);
+ }
+ return retval;
}


diff --git a/include/linux/ext4_fs_extents.h b/include/linux/ext4_fs_extents.h
index d2045a2..023683b 100644
--- a/include/linux/ext4_fs_extents.h
+++ b/include/linux/ext4_fs_extents.h
@@ -124,20 +124,6 @@ struct ext4_ext_path {
#define EXT4_EXT_CACHE_GAP 1
#define EXT4_EXT_CACHE_EXTENT 2

-/*
- * to be called by ext4_ext_walk_space()
- * negative retcode - error
- * positive retcode - signal for ext4_ext_walk_space(), see below
- * callback must return valid extent (passed or newly created)
- */
-typedef int (*ext_prepare_callback)(struct inode *, struct ext4_ext_path *,
- struct ext4_ext_cache *,
- void *);
-
-#define EXT_CONTINUE 0
-#define EXT_BREAK 1
-#define EXT_REPEAT 2
-

#define EXT_MAX_BLOCK 0xffffffff

@@ -233,8 +219,7 @@ extern int ext4_ext_try_to_merge(struct inode *inode,
struct ext4_extent *);
extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
-extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
-extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct ext4_ext_path *);
-
+extern struct ext4_ext_path *ext4_ext_find_extent(struct inode *, ext4_lblk_t,
+ struct ext4_ext_path *);
#endif /* _LINUX_EXT4_EXTENTS */

diff --git a/include/linux/ext4_fs_i.h b/include/linux/ext4_fs_i.h
index 86ddfe2..6c610b6 100644
--- a/include/linux/ext4_fs_i.h
+++ b/include/linux/ext4_fs_i.h
@@ -27,6 +27,9 @@ typedef int ext4_grpblk_t;
/* data type for filesystem-wide blocks number */
typedef unsigned long long ext4_fsblk_t;

+/* data type for file logical block number */
+typedef __u32 ext4_lblk_t;
+
struct ext4_reserve_window {
ext4_fsblk_t _rsv_start; /* First byte reserved */
ext4_fsblk_t _rsv_end; /* Last byte reserved or 0 */
@@ -48,7 +51,7 @@ struct ext4_block_alloc_info {
* most-recently-allocated block in this file.
* We use this for detecting linearly ascending allocation requests.
*/
- __u32 last_alloc_logical_block;
+ ext4_lblk_t last_alloc_logical_block;
/*
* Was i_next_alloc_goal in ext4_inode_info
* is the *physical* companion to i_next_alloc_block.
@@ -67,7 +70,7 @@ struct ext4_block_alloc_info {
*/
struct ext4_ext_cache {
ext4_fsblk_t ec_start;
- __u32 ec_block;
+ ext4_lblk_t ec_block;
__u32 ec_len; /* must be 32bit to return holes */
__u32 ec_type;
};
@@ -95,7 +98,7 @@ struct ext4_inode_info {
/* block reservation info */
struct ext4_block_alloc_info *i_block_alloc_info;

- __u32 i_dir_start_lookup;
+ ext4_lblk_t i_dir_start_lookup;
#ifdef CONFIG_EXT4DEV_FS_XATTR
/*
* Extended attributes can be read independently of the main file
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:21:28

by Theodore Ts'o

Subject: [PATCH 35/49] ext4: Add inode version support in ext4

From: Jean Noel Cordenner <[email protected]>

This patch adds 64-bit inode version support to ext4. The lower 32 bits
are stored in the osd1.linux1.l_i_version field, while the upper 32 bits
are stored in the new i_version_hi field of the ext4_inode. The upper
half is only read and written when the on-disk inode is large enough to
contain it. An i_version mount option has been added to enable the
feature.
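
As a rough illustration of the split described above, here is a minimal
userspace sketch. The struct and the function names (disk_version,
version_store, version_load) are hypothetical stand-ins and are not part
of the patch; byte-order conversion (cpu_to_le32/le32_to_cpu in the real
code) is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two on-disk 32-bit fields. */
struct disk_version {
	uint32_t lo;	/* plays the role of osd1.linux1.l_i_version */
	uint32_t hi;	/* plays the role of i_version_hi */
	int has_hi;	/* nonzero if the inode is large enough to hold hi */
};

/* Store a 64-bit version; the high half is dropped on small inodes. */
static void version_store(struct disk_version *d, uint64_t v)
{
	assert(d != NULL);
	d->lo = (uint32_t)v;
	if (d->has_hi)
		d->hi = (uint32_t)(v >> 32);
}

/* Rebuild the 64-bit version from whichever halves are available. */
static uint64_t version_load(const struct disk_version *d)
{
	uint64_t v = d->lo;

	if (d->has_hi)
		v |= (uint64_t)d->hi << 32;
	return v;
}
```

With a small inode (has_hi == 0) only the low 32 bits survive a
store/load round trip, which is the trade-off the commit message
describes.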

Signed-off-by: Mingming Cao <[email protected]>
Signed-off-by: Andreas Dilger <[email protected]>
Signed-off-by: Kalpak Shah <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Jean Noel Cordenner <[email protected]>
---
fs/ext4/inode.c | 18 +++++++++++++++++-
fs/ext4/super.c | 10 ++++++++--
fs/inode.c | 17 -----------------
include/linux/ext4_fs.h | 6 +++++-
include/linux/fs.h | 16 +++++++++++++++-
5 files changed, 45 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ee0bc3a..3c013e5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2780,6 +2780,13 @@ void ext4_read_inode(struct inode * inode)
EXT4_INODE_GET_XTIME(i_atime, inode, raw_inode);
EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);

+ inode->i_version = le32_to_cpu(raw_inode->i_disk_version);
+ if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
+ if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
+ inode->i_version |=
+ (__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32;
+ }
+
if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext4_file_inode_operations;
inode->i_fop = &ext4_file_operations;
@@ -2962,8 +2969,14 @@ static int ext4_do_update_inode(handle_t *handle,
} else for (block = 0; block < EXT4_N_BLOCKS; block++)
raw_inode->i_block[block] = ei->i_data[block];

- if (ei->i_extra_isize)
+ raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
+ if (ei->i_extra_isize) {
+ if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
+ raw_inode->i_version_hi =
+ cpu_to_le32(inode->i_version >> 32);
raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
+ }
+

BUFFER_TRACE(bh, "call ext4_journal_dirty_metadata");
rc = ext4_journal_dirty_metadata(handle, bh);
@@ -3190,6 +3203,9 @@ int ext4_mark_iloc_dirty(handle_t *handle,
{
int err = 0;

+ if (test_opt(inode->i_sb, I_VERSION))
+ inode_inc_iversion(inode);
+
/* the do_update_inode consumes one bh->b_count */
get_bh(iloc->bh);

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f7479d3..aa22acd 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -732,6 +732,8 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
seq_puts(seq, ",nobh");
if (!test_opt(sb, EXTENTS))
seq_puts(seq, ",noextents");
+ if (test_opt(sb, I_VERSION))
+ seq_puts(seq, ",i_version");

if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
seq_puts(seq, ",data=journal");
@@ -874,7 +876,7 @@ enum {
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
- Opt_grpquota, Opt_extents, Opt_noextents,
+ Opt_grpquota, Opt_extents, Opt_noextents, Opt_i_version,
};

static match_table_t tokens = {
@@ -928,6 +930,7 @@ static match_table_t tokens = {
{Opt_barrier, "barrier=%u"},
{Opt_extents, "extents"},
{Opt_noextents, "noextents"},
+ {Opt_i_version, "i_version"},
{Opt_err, NULL},
{Opt_resize, "resize"},
};
@@ -1273,6 +1276,10 @@ clear_qf_name:
case Opt_noextents:
clear_opt (sbi->s_mount_opt, EXTENTS);
break;
+ case Opt_i_version:
+ set_opt(sbi->s_mount_opt, I_VERSION);
+ sb->s_flags |= MS_I_VERSION;
+ break;
default:
printk (KERN_ERR
"EXT4-fs: Unrecognized mount option \"%s\" "
@@ -3197,7 +3204,6 @@ out:
i_size_write(inode, off+len-towrite);
EXT4_I(inode)->i_disksize = inode->i_size;
}
- inode->i_version++;
inode->i_mtime = inode->i_ctime = CURRENT_TIME;
ext4_mark_inode_dirty(handle, inode);
mutex_unlock(&inode->i_mutex);
diff --git a/fs/inode.c b/fs/inode.c
index b48324a..276ffd6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1243,23 +1243,6 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
EXPORT_SYMBOL(touch_atime);

/**
- * inode_inc_iversion - increments i_version
- * @inode: inode that need to be updated
- *
- * Every time the inode is modified, the i_version field
- * will be incremented.
- * The filesystem has to be mounted with i_version flag
- *
- */
-
-void inode_inc_iversion(struct inode *inode)
-{
- spin_lock(&inode->i_lock);
- inode->i_version++;
- spin_unlock(&inode->i_lock);
-}
-
-/**
* file_update_time - update mtime and ctime time
* @file: file accessed
*
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index cd406db..b609294 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -292,7 +292,7 @@ struct ext4_inode {
__le32 i_flags; /* File flags */
union {
struct {
- __u32 l_i_reserved1;
+ __le32 l_i_version;
} linux1;
struct {
__u32 h_i_translator;
@@ -334,6 +334,7 @@ struct ext4_inode {
__le32 i_atime_extra; /* extra Access time (nsec << 2 | epoch) */
__le32 i_crtime; /* File Creation time */
__le32 i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
+ __le32 i_version_hi; /* high 32 bits for 64-bit version */
};


@@ -407,6 +408,8 @@ do { \
raw_inode->xtime ## _extra); \
} while (0)

+#define i_disk_version osd1.linux1.l_i_version
+
#if defined(__KERNEL__) || defined(__linux__)
#define i_reserved1 osd1.linux1.l_i_reserved1
#define i_file_acl_high osd2.linux2.l_i_file_acl_high
@@ -469,6 +472,7 @@ do { \
#define EXT4_MOUNT_EXTENTS 0x400000 /* Extents support */
#define EXT4_MOUNT_JOURNAL_CHECKSUM 0x800000 /* Journal checksums */
#define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x1000000 /* Journal Async Commit */
+#define EXT4_MOUNT_I_VERSION 0x2000000 /* i_version support */
/* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
#define clear_opt(o, opt) o &= ~EXT4_MOUNT_##opt
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 94cf5d8..2ac81ee 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1396,7 +1396,21 @@ static inline void inode_dec_link_count(struct inode *inode)
mark_inode_dirty(inode);
}

-extern void inode_inc_iversion(struct inode *inode);
+/**
+ * inode_inc_iversion - increments i_version
+ * @inode: inode that need to be updated
+ *
+ * Every time the inode is modified, the i_version field will be incremented.
+ * The filesystem has to be mounted with i_version flag
+ */
+
+static inline void inode_inc_iversion(struct inode *inode)
+{
+ spin_lock(&inode->i_lock);
+ inode->i_version++;
+ spin_unlock(&inode->i_lock);
+}
+
extern void touch_atime(struct vfsmount *mnt, struct dentry *dentry);
static inline void file_accessed(struct file *file)
{
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:21:47

by Theodore Ts'o

Subject: [PATCH 32/49] jbd2: jbd2 stats through procfs

From: Johann Lombardi <[email protected]>

The patch below updates the jbd stats patch to 2.6.20/jbd2.
The initial patch was posted by Alex Tomas in December 2005
(http://marc.info/?l=linux-ext4&m=113538565128617&w=2).
It provides statistics via procfs such as transaction lifetime and size.

When investigating performance problems, I sometimes find it useful to
have stats from jbd about a transaction's lifetime, size, etc. Here is a
patch for review and, possibly, inclusion.

For example, stats after creating 3M files in an htree directory:

[root@bob ~]# cat /proc/fs/jbd/sda/history
R/C  tid   wait  run   lock  flush log   hndls  block inlog ctime write drop  close
R    261   8260  2720  0     0     750   9892   8170  8187
C    259                                                    750   0     4885  1
R    262   20    2200  10    0     770   9836   8170  8187
R    263   30    2200  10    0     3070  9812   8170  8187
R    264   0     5000  10    0     1340  0      0     0
C    261                                                    8240  3212  4957  0
R    265   8260  1470  0     0     4640  9854   8170  8187
R    266   0     5000  10    0     1460  0      0     0
C    262                                                    8210  2989  4868  0
R    267   8230  1490  10    0     4440  9875   8171  8188
R    268   0     5000  10    0     1260  0      0     0
C    263                                                    7710  2937  4908  0
R    269   7730  1470  10    0     3330  9841   8170  8187
R    270   0     5000  10    0     830   0      0     0
C    265                                                    8140  3234  4898  0
C    267                                                    720   0     4849  1
R    271   8630  2740  20    0     740   9819   8170  8187
C    269                                                    800   0     4214  1
R    272   40    2170  10    0     830   9716   8170  8187
R    273   40    2280  0     0     3530  9799   8170  8187
R    274   0     5000  10    0     990   0      0     0


where,

R - line for transaction's life from T_RUNNING to T_FINISHED
C - line for transaction's checkpointing
tid - transaction's id
wait - for how long we were waiting for new transaction to start
(the longest period journal_start() took in this transaction)
run - real transaction's lifetime (from T_RUNNING to T_LOCKED)
lock - how long we were waiting for all handles to close
(time the transaction was in T_LOCKED)
flush - how long it took to flush all data (data=ordered)
log - how long it took to write the transaction to the log
hndls - how many handles got to the transaction
block - how many blocks got to the transaction
inlog - how many blocks are written to the log (block + descriptors)
ctime - how long it took to checkpoint the transaction
write - how many blocks have been written during checkpointing
drop - how many blocks have been dropped during checkpointing
close - how many running transactions have been closed to checkpoint this one

all times are in msec.
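
To make the run/lock/flush/log columns more concrete, here is a minimal
sketch of how per-phase durations can be derived from a tick counter
sampled at each state transition. The struct and function names
(phase_stamps, phase_times, ticks_between, phase_durations) are
hypothetical, and the plain unsigned subtraction used for the difference
is an assumption of this sketch, not necessarily what the patch's own
helper does.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Tick counter values sampled when the transaction enters each state. */
struct phase_stamps {
	unsigned long running, locked, flushing, logging, finished;
};

/* Time spent in each phase, in ticks. */
struct phase_times {
	unsigned long run, lock, flush, log;
};

/*
 * Unsigned subtraction is modulo 2^N, so the result is still correct
 * if the counter wrapped once between the two samples.
 */
static unsigned long ticks_between(unsigned long from, unsigned long to)
{
	return to - from;
}

/* Each column is the gap between two consecutive state transitions. */
static struct phase_times phase_durations(const struct phase_stamps *s)
{
	struct phase_times t;

	assert(s != NULL);
	t.run = ticks_between(s->running, s->locked);
	t.lock = ticks_between(s->locked, s->flushing);
	t.flush = ticks_between(s->flushing, s->logging);
	t.log = ticks_between(s->logging, s->finished);
	return t;
}
```

The reported values would then be converted from ticks to milliseconds
before printing.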


[root@bob ~]# cat /proc/fs/jbd/sda/info
280 transaction, each upto 8192 blocks
average:
1633ms waiting for transaction
3616ms running transaction
5ms transaction was being locked
1ms flushing data (in ordered mode)
1799ms logging transaction
11781 handles per transaction
5629 blocks per transaction
5641 logged blocks per transaction
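
The averages above can be thought of as running totals divided by the
number of committed transactions. A minimal sketch, using hypothetical
names (run_totals, totals_add, run_average) and only two of the tracked
quantities:

```c
#include <assert.h>
#include <stddef.h>

/* Running totals over all committed transactions. */
struct run_totals {
	unsigned long count;	/* number of transactions seen */
	unsigned long wait;	/* summed wait time */
	unsigned long blocks;	/* summed blocks per transaction */
};

/* Fold the stats of one committed transaction into the totals. */
static void totals_add(struct run_totals *t, unsigned long wait,
		       unsigned long blocks)
{
	assert(t != NULL);
	t->count++;
	t->wait += wait;
	t->blocks += blocks;
}

/* Integer average; returns 0 before any transaction has committed. */
static unsigned long run_average(unsigned long total, unsigned long count)
{
	return count ? total / count : 0;
}
```

Guarding against a zero count matters because the file can be read
before the first transaction has committed.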

Signed-off-by: Johann Lombardi <[email protected]>
Signed-off-by: Mariusz Kozlowski <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
Signed-off-by: Eric Sandeen <[email protected]>
---
fs/jbd2/checkpoint.c | 10 +-
fs/jbd2/commit.c | 49 +++++++
fs/jbd2/journal.c | 338 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/jbd2/transaction.c | 9 ++
include/linux/jbd2.h | 77 +++++++++++
5 files changed, 481 insertions(+), 2 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 7e958c8..1b7f282 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -232,7 +232,8 @@ __flush_batch(journal_t *journal, struct buffer_head **bhs, int *batch_count)
* Called under jbd_lock_bh_state(jh2bh(jh)), and drops it
*/
static int __process_buffer(journal_t *journal, struct journal_head *jh,
- struct buffer_head **bhs, int *batch_count)
+ struct buffer_head **bhs, int *batch_count,
+ transaction_t *transaction)
{
struct buffer_head *bh = jh2bh(jh);
int ret = 0;
@@ -250,6 +251,7 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh,
transaction_t *t = jh->b_transaction;
tid_t tid = t->t_tid;

+ transaction->t_chp_stats.cs_forced_to_close++;
spin_unlock(&journal->j_list_lock);
jbd_unlock_bh_state(bh);
jbd2_log_start_commit(journal, tid);
@@ -279,6 +281,7 @@ static int __process_buffer(journal_t *journal, struct journal_head *jh,
bhs[*batch_count] = bh;
__buffer_relink_io(jh);
jbd_unlock_bh_state(bh);
+ transaction->t_chp_stats.cs_written++;
(*batch_count)++;
if (*batch_count == NR_BATCH) {
spin_unlock(&journal->j_list_lock);
@@ -322,6 +325,8 @@ int jbd2_log_do_checkpoint(journal_t *journal)
if (!journal->j_checkpoint_transactions)
goto out;
transaction = journal->j_checkpoint_transactions;
+ if (transaction->t_chp_stats.cs_chp_time == 0)
+ transaction->t_chp_stats.cs_chp_time = jiffies;
this_tid = transaction->t_tid;
restart:
/*
@@ -346,7 +351,8 @@ restart:
retry = 1;
break;
}
- retry = __process_buffer(journal, jh, bhs,&batch_count);
+ retry = __process_buffer(journal, jh, bhs, &batch_count,
+ transaction);
if (!retry && lock_need_resched(&journal->j_list_lock)){
spin_unlock(&journal->j_list_lock);
retry = 1;
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 39b5cee..8749a86 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -20,6 +20,7 @@
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
+#include <linux/jiffies.h>

/*
* Default IO end handler for temporary BJ_IO buffer_heads.
@@ -290,6 +291,7 @@ static inline void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
*/
void jbd2_journal_commit_transaction(journal_t *journal)
{
+ struct transaction_stats_s stats;
transaction_t *commit_transaction;
struct journal_head *jh, *new_jh, *descriptor;
struct buffer_head **wbuf = journal->j_wbuf;
@@ -337,6 +339,11 @@ void jbd2_journal_commit_transaction(journal_t *journal)
spin_lock(&journal->j_state_lock);
commit_transaction->t_state = T_LOCKED;

+ stats.u.run.rs_wait = commit_transaction->t_max_wait;
+ stats.u.run.rs_locked = jiffies;
+ stats.u.run.rs_running = jbd2_time_diff(commit_transaction->t_start,
+ stats.u.run.rs_locked);
+
spin_lock(&commit_transaction->t_handle_lock);
while (commit_transaction->t_updates) {
DEFINE_WAIT(wait);
@@ -407,6 +414,10 @@ void jbd2_journal_commit_transaction(journal_t *journal)
*/
jbd2_journal_switch_revoke_table(journal);

+ stats.u.run.rs_flushing = jiffies;
+ stats.u.run.rs_locked = jbd2_time_diff(stats.u.run.rs_locked,
+ stats.u.run.rs_flushing);
+
commit_transaction->t_state = T_FLUSH;
journal->j_committing_transaction = commit_transaction;
journal->j_running_transaction = NULL;
@@ -498,6 +509,12 @@ void jbd2_journal_commit_transaction(journal_t *journal)
*/
commit_transaction->t_state = T_COMMIT;

+ stats.u.run.rs_logging = jiffies;
+ stats.u.run.rs_flushing = jbd2_time_diff(stats.u.run.rs_flushing,
+ stats.u.run.rs_logging);
+ stats.u.run.rs_blocks = commit_transaction->t_outstanding_credits;
+ stats.u.run.rs_blocks_logged = 0;
+
descriptor = NULL;
bufs = 0;
while (commit_transaction->t_buffers) {
@@ -646,6 +663,7 @@ start_journal_io:
submit_bh(WRITE, bh);
}
cond_resched();
+ stats.u.run.rs_blocks_logged += bufs;

/* Force a new descriptor to be generated next
time round the loop. */
@@ -816,6 +834,7 @@ restart_loop:
cp_transaction = jh->b_cp_transaction;
if (cp_transaction) {
JBUFFER_TRACE(jh, "remove from old cp transaction");
+ cp_transaction->t_chp_stats.cs_dropped++;
__jbd2_journal_remove_checkpoint(jh);
}

@@ -890,6 +909,36 @@ restart_loop:

J_ASSERT(commit_transaction->t_state == T_COMMIT);

+ commit_transaction->t_start = jiffies;
+ stats.u.run.rs_logging = jbd2_time_diff(stats.u.run.rs_logging,
+ commit_transaction->t_start);
+
+ /*
+ * File the transaction for history
+ */
+ stats.ts_type = JBD2_STATS_RUN;
+ stats.ts_tid = commit_transaction->t_tid;
+ stats.u.run.rs_handle_count = commit_transaction->t_handle_count;
+ spin_lock(&journal->j_history_lock);
+ memcpy(journal->j_history + journal->j_history_cur, &stats,
+ sizeof(stats));
+ if (++journal->j_history_cur == journal->j_history_max)
+ journal->j_history_cur = 0;
+
+ /*
+ * Calculate overall stats
+ */
+ journal->j_stats.ts_tid++;
+ journal->j_stats.u.run.rs_wait += stats.u.run.rs_wait;
+ journal->j_stats.u.run.rs_running += stats.u.run.rs_running;
+ journal->j_stats.u.run.rs_locked += stats.u.run.rs_locked;
+ journal->j_stats.u.run.rs_flushing += stats.u.run.rs_flushing;
+ journal->j_stats.u.run.rs_logging += stats.u.run.rs_logging;
+ journal->j_stats.u.run.rs_handle_count += stats.u.run.rs_handle_count;
+ journal->j_stats.u.run.rs_blocks += stats.u.run.rs_blocks;
+ journal->j_stats.u.run.rs_blocks_logged += stats.u.run.rs_blocks_logged;
+ spin_unlock(&journal->j_history_lock);
+
commit_transaction->t_state = T_FINISHED;
J_ASSERT(commit_transaction == journal->j_committing_transaction);
journal->j_commit_sequence = commit_transaction->t_tid;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 6ddc553..3667c91 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -36,6 +36,7 @@
#include <linux/poison.h>
#include <linux/proc_fs.h>
#include <linux/debugfs.h>
+#include <linux/seq_file.h>

#include <asm/uaccess.h>
#include <asm/page.h>
@@ -640,6 +641,312 @@ struct journal_head *jbd2_journal_get_descriptor_buffer(journal_t *journal)
return jbd2_journal_add_journal_head(bh);
}

+struct jbd2_stats_proc_session {
+ journal_t *journal;
+ struct transaction_stats_s *stats;
+ int start;
+ int max;
+};
+
+static void *jbd2_history_skip_empty(struct jbd2_stats_proc_session *s,
+ struct transaction_stats_s *ts,
+ int first)
+{
+ if (ts == s->stats + s->max)
+ ts = s->stats;
+ if (!first && ts == s->stats + s->start)
+ return NULL;
+ while (ts->ts_type == 0) {
+ ts++;
+ if (ts == s->stats + s->max)
+ ts = s->stats;
+ if (ts == s->stats + s->start)
+ return NULL;
+ }
+ return ts;
+
+}
+
+static void *jbd2_seq_history_start(struct seq_file *seq, loff_t *pos)
+{
+ struct jbd2_stats_proc_session *s = seq->private;
+ struct transaction_stats_s *ts;
+ int l = *pos;
+
+ if (l == 0)
+ return SEQ_START_TOKEN;
+ ts = jbd2_history_skip_empty(s, s->stats + s->start, 1);
+ if (!ts)
+ return NULL;
+ l--;
+ while (l) {
+ ts = jbd2_history_skip_empty(s, ++ts, 0);
+ if (!ts)
+ break;
+ l--;
+ }
+ return ts;
+}
+
+static void *jbd2_seq_history_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ struct jbd2_stats_proc_session *s = seq->private;
+ struct transaction_stats_s *ts = v;
+
+ ++*pos;
+ if (v == SEQ_START_TOKEN)
+ return jbd2_history_skip_empty(s, s->stats + s->start, 1);
+ else
+ return jbd2_history_skip_empty(s, ++ts, 0);
+}
+
+static int jbd2_seq_history_show(struct seq_file *seq, void *v)
+{
+ struct transaction_stats_s *ts = v;
+ if (v == SEQ_START_TOKEN) {
+ seq_printf(seq, "%-4s %-5s %-5s %-5s %-5s %-5s %-5s %-6s %-5s "
+ "%-5s %-5s %-5s %-5s %-5s\n", "R/C", "tid",
+ "wait", "run", "lock", "flush", "log", "hndls",
+ "block", "inlog", "ctime", "write", "drop",
+ "close");
+ return 0;
+ }
+ if (ts->ts_type == JBD2_STATS_RUN)
+ seq_printf(seq, "%-4s %-5lu %-5u %-5u %-5u %-5u %-5u "
+ "%-6lu %-5lu %-5lu\n", "R", ts->ts_tid,
+ jiffies_to_msecs(ts->u.run.rs_wait),
+ jiffies_to_msecs(ts->u.run.rs_running),
+ jiffies_to_msecs(ts->u.run.rs_locked),
+ jiffies_to_msecs(ts->u.run.rs_flushing),
+ jiffies_to_msecs(ts->u.run.rs_logging),
+ ts->u.run.rs_handle_count,
+ ts->u.run.rs_blocks,
+ ts->u.run.rs_blocks_logged);
+ else if (ts->ts_type == JBD2_STATS_CHECKPOINT)
+ seq_printf(seq, "%-4s %-5lu %48s %-5u %-5lu %-5lu %-5lu\n",
+ "C", ts->ts_tid, " ",
+ jiffies_to_msecs(ts->u.chp.cs_chp_time),
+ ts->u.chp.cs_written, ts->u.chp.cs_dropped,
+ ts->u.chp.cs_forced_to_close);
+ else
+ J_ASSERT(0);
+ return 0;
+}
+
+static void jbd2_seq_history_stop(struct seq_file *seq, void *v)
+{
+}
+
+static struct seq_operations jbd2_seq_history_ops = {
+ .start = jbd2_seq_history_start,
+ .next = jbd2_seq_history_next,
+ .stop = jbd2_seq_history_stop,
+ .show = jbd2_seq_history_show,
+};
+
+static int jbd2_seq_history_open(struct inode *inode, struct file *file)
+{
+ journal_t *journal = PDE(inode)->data;
+ struct jbd2_stats_proc_session *s;
+ int rc, size;
+
+ s = kmalloc(sizeof(*s), GFP_KERNEL);
+ if (s == NULL)
+ return -ENOMEM;
+ size = sizeof(struct transaction_stats_s) * journal->j_history_max;
+ s->stats = kmalloc(size, GFP_KERNEL);
+ if (s->stats == NULL) {
+ kfree(s);
+ return -ENOMEM;
+ }
+ spin_lock(&journal->j_history_lock);
+ memcpy(s->stats, journal->j_history, size);
+ s->max = journal->j_history_max;
+ s->start = journal->j_history_cur % s->max;
+ spin_unlock(&journal->j_history_lock);
+
+ rc = seq_open(file, &jbd2_seq_history_ops);
+ if (rc == 0) {
+ struct seq_file *m = file->private_data;
+ m->private = s;
+ } else {
+ kfree(s->stats);
+ kfree(s);
+ }
+ return rc;
+
+}
+
+static int jbd2_seq_history_release(struct inode *inode, struct file *file)
+{
+ struct seq_file *seq = file->private_data;
+ struct jbd2_stats_proc_session *s = seq->private;
+
+ kfree(s->stats);
+ kfree(s);
+ return seq_release(inode, file);
+}
+
+static struct file_operations jbd2_seq_history_fops = {
+ .owner = THIS_MODULE,
+ .open = jbd2_seq_history_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = jbd2_seq_history_release,
+};
+
+static void *jbd2_seq_info_start(struct seq_file *seq, loff_t *pos)
+{
+ return *pos ? NULL : SEQ_START_TOKEN;
+}
+
+static void *jbd2_seq_info_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ return NULL;
+}
+
+static int jbd2_seq_info_show(struct seq_file *seq, void *v)
+{
+ struct jbd2_stats_proc_session *s = seq->private;
+
+ if (v != SEQ_START_TOKEN)
+ return 0;
+ seq_printf(seq, "%lu transaction, each upto %u blocks\n",
+ s->stats->ts_tid,
+ s->journal->j_max_transaction_buffers);
+ if (s->stats->ts_tid == 0)
+ return 0;
+ seq_printf(seq, "average: \n %ums waiting for transaction\n",
+ jiffies_to_msecs(s->stats->u.run.rs_wait / s->stats->ts_tid));
+ seq_printf(seq, " %ums running transaction\n",
+ jiffies_to_msecs(s->stats->u.run.rs_running / s->stats->ts_tid));
+ seq_printf(seq, " %ums transaction was being locked\n",
+ jiffies_to_msecs(s->stats->u.run.rs_locked / s->stats->ts_tid));
+ seq_printf(seq, " %ums flushing data (in ordered mode)\n",
+ jiffies_to_msecs(s->stats->u.run.rs_flushing / s->stats->ts_tid));
+ seq_printf(seq, " %ums logging transaction\n",
+ jiffies_to_msecs(s->stats->u.run.rs_logging / s->stats->ts_tid));
+ seq_printf(seq, " %lu handles per transaction\n",
+ s->stats->u.run.rs_handle_count / s->stats->ts_tid);
+ seq_printf(seq, " %lu blocks per transaction\n",
+ s->stats->u.run.rs_blocks / s->stats->ts_tid);
+ seq_printf(seq, " %lu logged blocks per transaction\n",
+ s->stats->u.run.rs_blocks_logged / s->stats->ts_tid);
+ return 0;
+}
+
+static void jbd2_seq_info_stop(struct seq_file *seq, void *v)
+{
+}
+
+static struct seq_operations jbd2_seq_info_ops = {
+ .start = jbd2_seq_info_start,
+ .next = jbd2_seq_info_next,
+ .stop = jbd2_seq_info_stop,
+ .show = jbd2_seq_info_show,
+};
+
+static int jbd2_seq_info_open(struct inode *inode, struct file *file)
+{
+ journal_t *journal = PDE(inode)->data;
+ struct jbd2_stats_proc_session *s;
+ int rc, size;
+
+ s = kmalloc(sizeof(*s), GFP_KERNEL);
+ if (s == NULL)
+ return -ENOMEM;
+ size = sizeof(struct transaction_stats_s);
+ s->stats = kmalloc(size, GFP_KERNEL);
+ if (s->stats == NULL) {
+ kfree(s);
+ return -ENOMEM;
+ }
+ spin_lock(&journal->j_history_lock);
+ memcpy(s->stats, &journal->j_stats, size);
+ s->journal = journal;
+ spin_unlock(&journal->j_history_lock);
+
+ rc = seq_open(file, &jbd2_seq_info_ops);
+ if (rc == 0) {
+ struct seq_file *m = file->private_data;
+ m->private = s;
+ } else {
+ kfree(s->stats);
+ kfree(s);
+ }
+ return rc;
+
+}
+
+static int jbd2_seq_info_release(struct inode *inode, struct file *file)
+{
+ struct seq_file *seq = file->private_data;
+ struct jbd2_stats_proc_session *s = seq->private;
+ kfree(s->stats);
+ kfree(s);
+ return seq_release(inode, file);
+}
+
+static struct file_operations jbd2_seq_info_fops = {
+ .owner = THIS_MODULE,
+ .open = jbd2_seq_info_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = jbd2_seq_info_release,
+};
+
+static struct proc_dir_entry *proc_jbd2_stats;
+
+static void jbd2_stats_proc_init(journal_t *journal)
+{
+ char name[BDEVNAME_SIZE];
+
+ snprintf(name, sizeof(name) - 1, "%s", bdevname(journal->j_dev, name));
+ journal->j_proc_entry = proc_mkdir(name, proc_jbd2_stats);
+ if (journal->j_proc_entry) {
+ struct proc_dir_entry *p;
+ p = create_proc_entry("history", S_IRUGO,
+ journal->j_proc_entry);
+ if (p) {
+ p->proc_fops = &jbd2_seq_history_fops;
+ p->data = journal;
+ p = create_proc_entry("info", S_IRUGO,
+ journal->j_proc_entry);
+ if (p) {
+ p->proc_fops = &jbd2_seq_info_fops;
+ p->data = journal;
+ }
+ }
+ }
+}
+
+static void jbd2_stats_proc_exit(journal_t *journal)
+{
+ char name[BDEVNAME_SIZE];
+
+ snprintf(name, sizeof(name) - 1, "%s", bdevname(journal->j_dev, name));
+ remove_proc_entry("info", journal->j_proc_entry);
+ remove_proc_entry("history", journal->j_proc_entry);
+ remove_proc_entry(name, proc_jbd2_stats);
+}
+
+static void journal_init_stats(journal_t *journal)
+{
+ int size;
+
+ if (!proc_jbd2_stats)
+ return;
+
+ journal->j_history_max = 100;
+ size = sizeof(struct transaction_stats_s) * journal->j_history_max;
+ journal->j_history = kzalloc(size, GFP_KERNEL);
+ if (!journal->j_history) {
+ journal->j_history_max = 0;
+ return;
+ }
+ spin_lock_init(&journal->j_history_lock);
+}
+
/*
* Management for journal control blocks: functions to create and
* destroy journal_t structures, and to initialise and read existing
@@ -681,6 +988,9 @@ static journal_t * journal_init_common (void)
kfree(journal);
goto fail;
}
+
+ journal_init_stats(journal);
+
return journal;
fail:
return NULL;
@@ -735,6 +1045,7 @@ journal_t * jbd2_journal_init_dev(struct block_device *bdev,
journal->j_fs_dev = fs_dev;
journal->j_blk_offset = start;
journal->j_maxlen = len;
+ jbd2_stats_proc_init(journal);

bh = __getblk(journal->j_dev, start, journal->j_blocksize);
J_ASSERT(bh != NULL);
@@ -773,6 +1084,7 @@ journal_t * jbd2_journal_init_inode (struct inode *inode)

journal->j_maxlen = inode->i_size >> inode->i_sb->s_blocksize_bits;
journal->j_blocksize = inode->i_sb->s_blocksize;
+ jbd2_stats_proc_init(journal);

/* journal descriptor can store up to n blocks -bzzz */
n = journal->j_blocksize / sizeof(journal_block_tag_t);
@@ -1153,6 +1465,8 @@ void jbd2_journal_destroy(journal_t *journal)
brelse(journal->j_sb_buffer);
}

+ if (journal->j_proc_entry)
+ jbd2_stats_proc_exit(journal);
if (journal->j_inode)
iput(journal->j_inode);
if (journal->j_revoke)
@@ -1900,6 +2214,28 @@ static void __exit jbd2_remove_debugfs_entry(void)

#endif

+#ifdef CONFIG_PROC_FS
+
+#define JBD2_STATS_PROC_NAME "fs/jbd2"
+
+static void __init jbd2_create_jbd_stats_proc_entry(void)
+{
+ proc_jbd2_stats = proc_mkdir(JBD2_STATS_PROC_NAME, NULL);
+}
+
+static void __exit jbd2_remove_jbd_stats_proc_entry(void)
+{
+ if (proc_jbd2_stats)
+ remove_proc_entry(JBD2_STATS_PROC_NAME, NULL);
+}
+
+#else
+
+#define jbd2_create_jbd_stats_proc_entry() do {} while (0)
+#define jbd2_remove_jbd_stats_proc_entry() do {} while (0)
+
+#endif
+
struct kmem_cache *jbd2_handle_cache;

static int __init journal_init_handle_cache(void)
@@ -1955,6 +2291,7 @@ static int __init journal_init(void)
if (ret != 0)
jbd2_journal_destroy_caches();
jbd2_create_debugfs_entry();
+ jbd2_create_jbd_stats_proc_entry();
return ret;
}

@@ -1966,6 +2303,7 @@ static void __exit journal_exit(void)
printk(KERN_EMERG "JBD: leaked %d journal_heads!\n", n);
#endif
jbd2_remove_debugfs_entry();
+ jbd2_remove_jbd_stats_proc_entry();
jbd2_journal_destroy_caches();
}

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index b1fcf2b..f30802a 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -59,6 +59,8 @@ jbd2_get_transaction(journal_t *journal, transaction_t *transaction)

J_ASSERT(journal->j_running_transaction == NULL);
journal->j_running_transaction = transaction;
+ transaction->t_max_wait = 0;
+ transaction->t_start = jiffies;

return transaction;
}
@@ -85,6 +87,7 @@ static int start_this_handle(journal_t *journal, handle_t *handle)
int nblocks = handle->h_buffer_credits;
transaction_t *new_transaction = NULL;
int ret = 0;
+ unsigned long ts = jiffies;

if (nblocks > journal->j_max_transaction_buffers) {
printk(KERN_ERR "JBD: %s wants too many credits (%d > %d)\n",
@@ -217,6 +220,12 @@ repeat_locked:
/* OK, account for the buffers that this operation expects to
* use and add the handle to the running transaction. */

+ if (time_after(transaction->t_start, ts)) {
+ ts = jbd2_time_diff(ts, transaction->t_start);
+ if (ts > transaction->t_max_wait)
+ transaction->t_max_wait = ts;
+ }
+
handle->h_transaction = transaction;
transaction->t_outstanding_credits += nblocks;
transaction->t_updates++;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index d861ffd..6856400 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -395,6 +395,16 @@ struct handle_s
};


+/*
+ * Some stats for checkpoint phase
+ */
+struct transaction_chp_stats_s {
+ unsigned long cs_chp_time;
+ unsigned long cs_forced_to_close;
+ unsigned long cs_written;
+ unsigned long cs_dropped;
+};
+
/* The transaction_t type is the guts of the journaling mechanism. It
* tracks a compound transaction through its various states:
*
@@ -532,6 +542,21 @@ struct transaction_s
spinlock_t t_handle_lock;

/*
+ * Longest time some handle had to wait for running transaction
+ */
+ unsigned long t_max_wait;
+
+ /*
+ * When transaction started
+ */
+ unsigned long t_start;
+
+ /*
+ * Checkpointing stats [j_checkpoint_sem]
+ */
+ struct transaction_chp_stats_s t_chp_stats;
+
+ /*
* Number of outstanding updates running on this transaction
* [t_handle_lock]
*/
@@ -562,6 +587,39 @@ struct transaction_s

};

+struct transaction_run_stats_s {
+ unsigned long rs_wait;
+ unsigned long rs_running;
+ unsigned long rs_locked;
+ unsigned long rs_flushing;
+ unsigned long rs_logging;
+
+ unsigned long rs_handle_count;
+ unsigned long rs_blocks;
+ unsigned long rs_blocks_logged;
+};
+
+struct transaction_stats_s {
+ int ts_type;
+ unsigned long ts_tid;
+ union {
+ struct transaction_run_stats_s run;
+ struct transaction_chp_stats_s chp;
+ } u;
+};
+
+#define JBD2_STATS_RUN 1
+#define JBD2_STATS_CHECKPOINT 2
+
+static inline unsigned long
+jbd2_time_diff(unsigned long start, unsigned long end)
+{
+ if (end >= start)
+ return end - start;
+
+ return end + (MAX_JIFFY_OFFSET - start);
+}
+
/**
* struct journal_s - The journal_s type is the concrete type associated with
* journal_t.
@@ -623,6 +681,12 @@ struct transaction_s
* @j_wbufsize: maximum number of buffer_heads allowed in j_wbuf, the
* number that will fit in j_blocksize
* @j_last_sync_writer: most recent pid which did a synchronous write
+ * @j_history: Buffer storing the transactions statistics history
+ * @j_history_max: Maximum number of transactions in the statistics history
+ * @j_history_cur: Current number of transactions in the statistics history
+ * @j_history_lock: Protect the transactions statistics history
+ * @j_proc_entry: procfs entry for the jbd statistics directory
+ * @j_stats: Overall statistics
* @j_private: An opaque pointer to fs-private information.
*/

@@ -815,6 +879,19 @@ struct journal_s
pid_t j_last_sync_writer;

/*
+ * Journal statistics
+ */
+ struct transaction_stats_s *j_history;
+ int j_history_max;
+ int j_history_cur;
+ /*
+ * Protect the transactions statistics history
+ */
+ spinlock_t j_history_lock;
+ struct proc_dir_entry *j_proc_entry;
+ struct transaction_stats_s j_stats;
+
+ /*
* An opaque pointer to fs-private information. ext3 puts its
* superblock pointer here
*/
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:22:18

by Theodore Ts'o

Subject: [PATCH 06/49] ext4: fixes block group number being set to a negative value

From: Avantika Mathur <[email protected]>

This patch fixes various places where the group number is set to a negative
value.

Signed-off-by: Avantika Mathur <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/ialloc.c | 101 ++++++++++++++++++++++++++++-------------------------
1 files changed, 53 insertions(+), 48 deletions(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 64dea86..7b5cfa6 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -260,12 +260,14 @@ error_return:
* For other inodes, search forward from the parent directory\'s block
* group to find a free inode.
*/
-static ext4_group_t find_group_dir(struct super_block *sb, struct inode *parent)
+static int find_group_dir(struct super_block *sb, struct inode *parent,
+ ext4_group_t *best_group)
{
ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
unsigned int freei, avefreei;
struct ext4_group_desc *desc, *best_desc = NULL;
- ext4_group_t group, best_group = -1;
+ ext4_group_t group;
+ int ret = -1;

freei = percpu_counter_read_positive(&EXT4_SB(sb)->s_freeinodes_counter);
avefreei = freei / ngroups;
@@ -279,11 +281,12 @@ static ext4_group_t find_group_dir(struct super_block *sb, struct inode *parent)
if (!best_desc ||
(le16_to_cpu(desc->bg_free_blocks_count) >
le16_to_cpu(best_desc->bg_free_blocks_count))) {
- best_group = group;
+ *best_group = group;
best_desc = desc;
+ ret = 0;
}
}
- return best_group;
+ return ret;
}

/*
@@ -314,8 +317,8 @@ static ext4_group_t find_group_dir(struct super_block *sb, struct inode *parent)
#define INODE_COST 64
#define BLOCK_COST 256

-static ext4_group_t find_group_orlov(struct super_block *sb,
- struct inode *parent)
+static int find_group_orlov(struct super_block *sb, struct inode *parent,
+ ext4_group_t *group)
{
ext4_group_t parent_group = EXT4_I(parent)->i_block_group;
struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -328,7 +331,7 @@ static ext4_group_t find_group_orlov(struct super_block *sb,
unsigned int ndirs;
int max_debt, max_dirs, min_inodes;
ext4_grpblk_t min_blocks;
- ext4_group_t group = -1, i;
+ ext4_group_t i;
struct ext4_group_desc *desc;

freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
@@ -341,13 +344,14 @@ static ext4_group_t find_group_orlov(struct super_block *sb,
if ((parent == sb->s_root->d_inode) ||
(EXT4_I(parent)->i_flags & EXT4_TOPDIR_FL)) {
int best_ndir = inodes_per_group;
- ext4_group_t best_group = -1;
+ ext4_group_t grp;
+ int ret = -1;

- get_random_bytes(&group, sizeof(group));
- parent_group = (unsigned)group % ngroups;
+ get_random_bytes(&grp, sizeof(grp));
+ parent_group = (unsigned)grp % ngroups;
for (i = 0; i < ngroups; i++) {
- group = (parent_group + i) % ngroups;
- desc = ext4_get_group_desc (sb, group, NULL);
+ grp = (parent_group + i) % ngroups;
+ desc = ext4_get_group_desc(sb, grp, NULL);
if (!desc || !desc->bg_free_inodes_count)
continue;
if (le16_to_cpu(desc->bg_used_dirs_count) >= best_ndir)
@@ -356,11 +360,12 @@ static ext4_group_t find_group_orlov(struct super_block *sb,
continue;
if (le16_to_cpu(desc->bg_free_blocks_count) < avefreeb)
continue;
- best_group = group;
+ *group = grp;
+ ret = 0;
best_ndir = le16_to_cpu(desc->bg_used_dirs_count);
}
- if (best_group >= 0)
- return best_group;
+ if (ret == 0)
+ return ret;
goto fallback;
}

@@ -381,8 +386,8 @@ static ext4_group_t find_group_orlov(struct super_block *sb,
max_debt = 1;

for (i = 0; i < ngroups; i++) {
- group = (parent_group + i) % ngroups;
- desc = ext4_get_group_desc (sb, group, NULL);
+ *group = (parent_group + i) % ngroups;
+ desc = ext4_get_group_desc(sb, *group, NULL);
if (!desc || !desc->bg_free_inodes_count)
continue;
if (le16_to_cpu(desc->bg_used_dirs_count) >= max_dirs)
@@ -391,17 +396,16 @@ static ext4_group_t find_group_orlov(struct super_block *sb,
continue;
if (le16_to_cpu(desc->bg_free_blocks_count) < min_blocks)
continue;
- return group;
+ return 0;
}

fallback:
for (i = 0; i < ngroups; i++) {
- group = (parent_group + i) % ngroups;
- desc = ext4_get_group_desc (sb, group, NULL);
- if (!desc || !desc->bg_free_inodes_count)
- continue;
- if (le16_to_cpu(desc->bg_free_inodes_count) >= avefreei)
- return group;
+ *group = (parent_group + i) % ngroups;
+ desc = ext4_get_group_desc(sb, *group, NULL);
+ if (desc && desc->bg_free_inodes_count &&
+ le16_to_cpu(desc->bg_free_inodes_count) >= avefreei)
+ return 0;
}

if (avefreei) {
@@ -416,22 +420,22 @@ fallback:
return -1;
}

-static ext4_group_t find_group_other(struct super_block *sb,
- struct inode *parent)
+static int find_group_other(struct super_block *sb, struct inode *parent,
+ ext4_group_t *group)
{
ext4_group_t parent_group = EXT4_I(parent)->i_block_group;
ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
struct ext4_group_desc *desc;
- ext4_group_t group, i;
+ ext4_group_t i;

/*
* Try to place the inode in its parent directory
*/
- group = parent_group;
- desc = ext4_get_group_desc (sb, group, NULL);
+ *group = parent_group;
+ desc = ext4_get_group_desc(sb, *group, NULL);
if (desc && le16_to_cpu(desc->bg_free_inodes_count) &&
le16_to_cpu(desc->bg_free_blocks_count))
- return group;
+ return 0;

/*
* We're going to place this inode in a different blockgroup from its
@@ -442,33 +446,33 @@ static ext4_group_t find_group_other(struct super_block *sb,
*
* So add our directory's i_ino into the starting point for the hash.
*/
- group = (group + parent->i_ino) % ngroups;
+ *group = (*group + parent->i_ino) % ngroups;

/*
* Use a quadratic hash to find a group with a free inode and some free
* blocks.
*/
for (i = 1; i < ngroups; i <<= 1) {
- group += i;
- if (group >= ngroups)
- group -= ngroups;
- desc = ext4_get_group_desc (sb, group, NULL);
+ *group += i;
+ if (*group >= ngroups)
+ *group -= ngroups;
+ desc = ext4_get_group_desc(sb, *group, NULL);
if (desc && le16_to_cpu(desc->bg_free_inodes_count) &&
le16_to_cpu(desc->bg_free_blocks_count))
- return group;
+ return 0;
}

/*
* That failed: try linear search for a free inode, even if that group
* has no free blocks.
*/
- group = parent_group;
+ *group = parent_group;
for (i = 0; i < ngroups; i++) {
- if (++group >= ngroups)
- group = 0;
- desc = ext4_get_group_desc (sb, group, NULL);
+ if (++*group >= ngroups)
+ *group = 0;
+ desc = ext4_get_group_desc(sb, *group, NULL);
if (desc && le16_to_cpu(desc->bg_free_inodes_count))
- return group;
+ return 0;
}

return -1;
@@ -489,16 +493,17 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode * dir, int mode)
struct super_block *sb;
struct buffer_head *bitmap_bh = NULL;
struct buffer_head *bh2;
- ext4_group_t group;
+ ext4_group_t group = 0;
unsigned long ino = 0;
struct inode * inode;
struct ext4_group_desc * gdp = NULL;
struct ext4_super_block * es;
struct ext4_inode_info *ei;
struct ext4_sb_info *sbi;
- int err = 0;
+ int ret2, err = 0;
struct inode *ret;
- int i, free = 0;
+ ext4_group_t i;
+ int free = 0;

/* Cannot create files in a deleted directory */
if (!dir || !dir->i_nlink)
@@ -514,14 +519,14 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode * dir, int mode)
es = sbi->s_es;
if (S_ISDIR(mode)) {
if (test_opt (sb, OLDALLOC))
- group = find_group_dir(sb, dir);
+ ret2 = find_group_dir(sb, dir, &group);
else
- group = find_group_orlov(sb, dir);
+ ret2 = find_group_orlov(sb, dir, &group);
} else
- group = find_group_other(sb, dir);
+ ret2 = find_group_other(sb, dir, &group);

err = -ENOSPC;
- if (group == -1)
+ if (ret2 == -1)
goto out;

for (i = 0; i < sbi->s_groups_count; i++) {
--
1.5.4.rc3.31.g1271-dirty

2008-01-22 03:51:56

by Theodore Ts'o

Subject: [PATCH 49/49] jbd2: sparse pointer use of zero as null

From: Mingming Cao <[email protected]>

Get rid of sparse related warnings from places that use integer as NULL
pointer. (Ported from upstream ext3/jbd changes.)

Signed-off-by: Mingming Cao <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/jbd2/transaction.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 0c8adab..b9b0b6f 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1182,7 +1182,7 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
}

/* That test should have eliminated the following case: */
- J_ASSERT_JH(jh, jh->b_frozen_data == 0);
+ J_ASSERT_JH(jh, jh->b_frozen_data == NULL);

JBUFFER_TRACE(jh, "file as BJ_Metadata");
spin_lock(&journal->j_list_lock);
@@ -1532,7 +1532,7 @@ void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh)

J_ASSERT_JH(jh, jh->b_jlist < BJ_Types);
if (jh->b_jlist != BJ_None)
- J_ASSERT_JH(jh, transaction != 0);
+ J_ASSERT_JH(jh, transaction != NULL);

switch (jh->b_jlist) {
case BJ_None:
@@ -1601,11 +1601,11 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
if (buffer_locked(bh) || buffer_dirty(bh))
goto out;

- if (jh->b_next_transaction != 0)
+ if (jh->b_next_transaction != NULL)
goto out;

spin_lock(&journal->j_list_lock);
- if (jh->b_transaction != 0 && jh->b_cp_transaction == 0) {
+ if (jh->b_transaction != NULL && jh->b_cp_transaction == NULL) {
if (jh->b_jlist == BJ_SyncData || jh->b_jlist == BJ_Locked) {
/* A written-back ordered data buffer */
JBUFFER_TRACE(jh, "release data");
@@ -1613,7 +1613,7 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
jbd2_journal_remove_journal_head(bh);
__brelse(bh);
}
- } else if (jh->b_cp_transaction != 0 && jh->b_transaction == 0) {
+ } else if (jh->b_cp_transaction != NULL && jh->b_transaction == NULL) {
/* written-back checkpointed metadata buffer */
if (jh->b_jlist == BJ_None) {
JBUFFER_TRACE(jh, "remove from checkpoint list");
@@ -1973,7 +1973,7 @@ void __jbd2_journal_file_buffer(struct journal_head *jh,

J_ASSERT_JH(jh, jh->b_jlist < BJ_Types);
J_ASSERT_JH(jh, jh->b_transaction == transaction ||
- jh->b_transaction == 0);
+ jh->b_transaction == NULL);

if (jh->b_transaction && jh->b_jlist == jlist)
return;
--
1.5.4.rc3.31.g1271-dirty

2008-01-23 12:43:20

by Christoph Hellwig

Subject: Re: ext4 merge plans for 2.6.25

The log is pretty messy. Any chance you could fold the gazillions
of fixup patches into the few feature patches to give a properly
readable history in git?

2008-01-23 22:08:30

by Andrew Morton

Subject: Re: [PATCH 33/49] ext4: Add the journal checksum feature

> On Mon, 21 Jan 2008 22:02:12 -0500 "Theodore Ts'o" <[email protected]> wrote:
> From: Girish Shilamkar <[email protected]>
>
> The journal checksum feature adds two new flags i.e
> JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM.
>
> JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the
> checksum for the blocks described by the descriptor blocks.
> Due to checksums, writing of the commit record no longer needs to be
> synchronous. Now commit record can be sent to disk without waiting for
> descriptor blocks to be written to disk. This behavior is controlled
> using JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be
> able to recover the journal with _ASYNC_COMMIT hence it is made
> incompat.
> The commit header has been extended to hold the checksum along with the
> type of the checksum.
>
> For recovery in pass scan checksums are verified to ensure the sanity
> and completeness(in case of _ASYNC_COMMIT) of every transaction.
>
> ...
>
> +static inline __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh)

unneeded inlining.

> +{
> + struct page *page = bh->b_page;
> + char *addr;
> + __u32 checksum;
> +
> + addr = kmap_atomic(page, KM_USER0);
> + checksum = crc32_be(crc32_sum,
> + (void *)(addr + offset_in_page(bh->b_data)), bh->b_size);
> + kunmap_atomic(addr, KM_USER0);
> +
> + return checksum;
> +}

Can this buffer actually be in highmem?

> static inline void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
> unsigned long long block)

More unnecessary inlining.

> +/*
> + * jbd2_journal_clear_features () - Clear a given journal feature in the
> + * superblock
> + * @journal: Journal to act on.
> + * @compat: bitmask of compatible features
> + * @ro: bitmask of features that force read-only mount
> + * @incompat: bitmask of incompatible features
> + *
> + * Clear a given journal feature as present on the
> + * superblock. Returns true if the requested features could be reset.
> + */
> +int jbd2_journal_clear_features(journal_t *journal, unsigned long compat,
> + unsigned long ro, unsigned long incompat)
> +{
> + journal_superblock_t *sb;
> +
> + jbd_debug(1, "Clear features 0x%lx/0x%lx/0x%lx\n",
> + compat, ro, incompat);
> +
> + sb = journal->j_superblock;
> +
> + sb->s_feature_compat &= ~cpu_to_be32(compat);
> + sb->s_feature_ro_compat &= ~cpu_to_be32(ro);
> + sb->s_feature_incompat &= ~cpu_to_be32(incompat);
> +
> + return 1;
> +}
> +EXPORT_SYMBOL(jbd2_journal_clear_features);

Kernel code usually returns 0 on success, so that a useful errno can be
returned on failure.
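
The convention mentioned above can be sketched in user-space C. This is only an illustration: `struct demo_sb` and `demo_clear_features()` are hypothetical stand-ins rather than the jbd2 API, and the byte-order conversion done by the quoted code is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the journal superblock feature words. */
struct demo_sb {
	uint32_t feature_compat;
	uint32_t feature_ro_compat;
	uint32_t feature_incompat;
};

/*
 * Clear the requested feature bits.
 * Returns 0 on success, or -EINVAL if no superblock was supplied.
 */
static int demo_clear_features(struct demo_sb *sb, uint32_t compat,
			       uint32_t ro, uint32_t incompat)
{
	if (sb == NULL)
		return -EINVAL;

	sb->feature_compat &= ~compat;
	sb->feature_ro_compat &= ~ro;
	sb->feature_incompat &= ~incompat;
	return 0;
}
```

A caller can then write `if (demo_clear_features(...))` to catch any failure and still pass the specific errno up the stack.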

> +/*
> + * calc_chksums calculates the checksums for the blocks described in the
> + * descriptor block.
> + */
> +static int calc_chksums(journal_t *journal, struct buffer_head *bh,
> + unsigned long *next_log_block, __u32 *crc32_sum)
> +{
> + int i, num_blks, err;
> + unsigned io_block;
> + struct buffer_head *obh;
> +
> + num_blks = count_tags(journal, bh);
> + /* Calculate checksum of the descriptor block. */
> + *crc32_sum = crc32_be(*crc32_sum, (void *)bh->b_data, bh->b_size);
> +
> + for (i = 0; i < num_blks; i++) {
> + io_block = (*next_log_block)++;

unsigned <- unsigned long.

Are all the types appropriate in here?
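
To make the type concern concrete, here is a small user-space sketch. It assumes an LP64 target, where unsigned long is 64 bits and unsigned is 32 bits, and models those with fixed-width types; `demo_take_block()` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Models "io_block = (*next_log_block)++" with a 64-bit counter and a
 * 32-bit destination. The counter advances correctly, but the value
 * handed back has lost its upper 32 bits.
 */
static uint32_t demo_take_block(uint64_t *next_log_block)
{
	uint32_t io_block = (uint32_t)(*next_log_block)++;

	return io_block;
}
```

With a counter of 0x100000005 the function returns 5 while the counter moves on to 0x100000006. Whether log block numbers can get that large in practice is a separate question the quoted patch does not answer.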

> + wrap(journal, *next_log_block);
> + err = jread(&obh, journal, io_block);
> + if (err) {
> + printk(KERN_ERR "JBD: IO error %d recovering block "
> + "%u in log\n", err, io_block);
> + return 1;
> + } else {
> + *crc32_sum = crc32_be(*crc32_sum, (void *)obh->b_data,
> + obh->b_size);
> + }
> + }
> + return 0;
> +}
> +
> static int do_one_pass(journal_t *journal,
> struct recovery_info *info, enum passtype pass)
> {
> @@ -328,6 +360,7 @@ static int do_one_pass(journal_t *journal,
> unsigned int sequence;
> int blocktype;
> int tag_bytes = journal_tag_bytes(journal);
> + __u32 crc32_sum = ~0; /* Transactional Checksums */
>
> /* Precompute the maximum metadata descriptors in a descriptor block */
> int MAX_BLOCKS_PER_DESC;
> @@ -419,9 +452,23 @@ static int do_one_pass(journal_t *journal,
> switch(blocktype) {
> case JBD2_DESCRIPTOR_BLOCK:
> /* If it is a valid descriptor block, replay it
> - * in pass REPLAY; otherwise, just skip over the
> - * blocks it describes. */
> + * in pass REPLAY; if journal_checksums enabled, then
> + * calculate checksums in PASS_SCAN, otherwise,
> + * just skip over the blocks it describes. */
> if (pass != PASS_REPLAY) {
> + if (pass == PASS_SCAN &&
> + JBD2_HAS_COMPAT_FEATURE(journal,
> + JBD2_FEATURE_COMPAT_CHECKSUM) &&
> + !info->end_transaction) {
> + if (calc_chksums(journal, bh,
> + &next_log_block,
> + &crc32_sum)) {

put_bh()

> + brelse(bh);
> + break;
> + }
> + brelse(bh);
> + continue;

put_bh()

> + }
> next_log_block += count_tags(journal, bh);
> wrap(journal, next_log_block);
> brelse(bh);
> @@ -516,9 +563,96 @@ static int do_one_pass(journal_t *journal,
> continue;
>
> + brelse(bh);

etc

2008-01-23 22:08:44

by Andrew Morton

Subject: Re: [PATCH 23/49] Add buffer head related helper functions

> On Mon, 21 Jan 2008 22:02:02 -0500 "Theodore Ts'o" <[email protected]> wrote:
> +}
> +EXPORT_SYMBOL(bh_uptodate_or_lock);
> +/**

Missing newline.

> + * bh_submit_read: Submit a locked buffer for reading
> + * @bh: struct buffer_head
> + *
> + * Returns a negative error
> + */
> +int bh_submit_read(struct buffer_head *bh)
> +{
> + if (!buffer_locked(bh))
> + lock_buffer(bh);
> +
> + if (buffer_uptodate(bh))
> + return 0;

Here it can lock the buffer and then return zero.

> + get_bh(bh);
> + bh->b_end_io = end_buffer_read_sync;
> + submit_bh(READ, bh);
> + wait_on_buffer(bh);
> + if (buffer_uptodate(bh))
> + return 0;

Here it will unlock the buffer and return zero.

This function is unusable when passed an unlocked buffer.

The return value should (always) be documented.
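
The inconsistent lock state can be modelled in user space. In this sketch `struct demo_bh` and both functions are hypothetical; the flags stand in for the buffer lock, the uptodate bit, and whether the simulated read succeeds. The second function shows one possible contract (caller passes a locked buffer, the function always returns it unlocked), which is an assumption and not necessarily what ended up being merged.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of the buffer state that matters here. */
struct demo_bh {
	int locked;
	int uptodate;
	int io_ok;	/* whether the simulated read succeeds */
};

/* Mirrors the control flow of the quoted bh_submit_read(). */
static int demo_submit_read_quoted(struct demo_bh *bh)
{
	if (!bh->locked)
		bh->locked = 1;
	if (bh->uptodate)
		return 0;		/* returns with the buffer locked */

	/* submit + wait: the completion handler unlocks the buffer */
	bh->uptodate = bh->io_ok;
	bh->locked = 0;
	if (bh->uptodate)
		return 0;		/* returns with the buffer unlocked */
	return -EIO;
}

/*
 * One consistent alternative: the buffer must be locked on entry and
 * is always unlocked on return. Returns 0 on success, -EIO on failure.
 */
static int demo_submit_read_consistent(struct demo_bh *bh)
{
	assert(bh->locked);
	if (bh->uptodate) {
		bh->locked = 0;
		return 0;
	}
	bh->uptodate = bh->io_ok;
	bh->locked = 0;
	return bh->uptodate ? 0 : -EIO;
}
```

In the first model the caller cannot tell from a zero return whether it still holds the lock; in the second it never does.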

> + return -EIO;
> +}
> +EXPORT_SYMBOL(bh_submit_read);
> void __init buffer_init(void)

Missing newline.

2008-01-23 22:09:22

by Andrew Morton

Subject: Re: [PATCH 24/49] ext4: add block bitmap validation

> On Mon, 21 Jan 2008 22:02:03 -0500 "Theodore Ts'o" <[email protected]> wrote:
> + if (bh_submit_read(bh) < 0) {
> + brelse(bh);
> + ext4_error(sb, __FUNCTION__,
> "Cannot read block bitmap - "
> - "block_group = %lu, block_bitmap = %llu",
> - block_group, bitmap_blk);
> + "block_group = %d, block_bitmap = %llu",
> + (int)block_group, (unsigned long long)bitmap_blk);
> + return NULL;
> + }
> + if (!ext4_valid_block_bitmap(sb, desc, block_group, bh)) {
> + brelse(bh);
> + return NULL;
> + }

brelse() should only be used when the bh might be NULL - put_bh()
can be used here.

Please review all ext4/jbd2 code for this trivial speedup.
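
The difference between the two helpers can be sketched as a user-space model. `struct demo_bh`, `demo_put_bh()` and `demo_brelse()` are hypothetical stand-ins; the real `brelse()` does further checking beyond the NULL test, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buffer_head: only the refcount. */
struct demo_bh {
	int b_count;
};

/* Models put_bh(): plain decrement, the caller guarantees bh != NULL. */
static void demo_put_bh(struct demo_bh *bh)
{
	bh->b_count--;
}

/* Models brelse(): tolerates NULL, at the cost of an extra test. */
static void demo_brelse(struct demo_bh *bh)
{
	if (bh != NULL)
		demo_put_bh(bh);
}
```

In the quoted hunk bh has already been dereferenced by the time it is released, so the NULL test can never fire there.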

2008-01-23 22:10:24

by Andrew Morton

Subject: Re: [PATCH 41/49] ext4: Add multi block allocator for ext4

> On Mon, 21 Jan 2008 22:02:20 -0500 "Theodore Ts'o" <[email protected]> wrote:
> From: Alex Tomas <[email protected]>
>
> Signed-off-by: Alex Tomas <[email protected]>
> Signed-off-by: Andreas Dilger <[email protected]>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> Signed-off-by: Eric Sandeen <[email protected]>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
>
> ...
>
> +#if BITS_PER_LONG == 64
> +#define mb_correct_addr_and_bit(bit, addr) \
> +{ \
> + bit += ((unsigned long) addr & 7UL) << 3; \
> + addr = (void *) ((unsigned long) addr & ~7UL); \
> +}
> +#elif BITS_PER_LONG == 32
> +#define mb_correct_addr_and_bit(bit, addr) \
> +{ \
> + bit += ((unsigned long) addr & 3UL) << 3; \
> + addr = (void *) ((unsigned long) addr & ~3UL); \
> +}
> +#else
> +#error "how many bits you are?!"
> +#endif

Why do these exist?

> +static inline int mb_test_bit(int bit, void *addr)
> +{
> + mb_correct_addr_and_bit(bit, addr);
> + return ext4_test_bit(bit, addr);
> +}

ext2_test_bit() already handles bitnum > wordsize.

If mb_correct_addr_and_bit() is actually needed then some suitable comment
would help.
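
For readers wondering what the macro does, here is a user-space sketch. It rounds the address down to a long boundary and folds the skipped bytes into the bit number, so the same bit is addressed through an aligned pointer. The patch does not say why this is wanted; presumably some word-based bitops need long-aligned addresses, but that is an assumption. `demo_test_bit()` and `demo_correct_addr_and_bit()` are hypothetical names, and the bit test assumes the byte-wise little-endian layout of ext2-style bitmaps.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-wise little-endian bit test: bit n lives in byte n / 8. */
static int demo_test_bit(int bit, const void *addr)
{
	const unsigned char *p = addr;

	return (p[bit >> 3] >> (bit & 7)) & 1;
}

/*
 * Same adjustment as mb_correct_addr_and_bit(): align addr down to
 * sizeof(long) and add 8 bits for every byte that was skipped.
 */
static void demo_correct_addr_and_bit(int *bit, const void **addr)
{
	uintptr_t a = (uintptr_t)*addr;
	uintptr_t mask = sizeof(long) - 1;

	*bit += (int)(a & mask) << 3;
	*addr = (const void *)(a & ~mask);
}
```

Starting from a pointer three bytes past an aligned base and bit 2, the adjusted pair is the base pointer and bit 26, and a byte-wise test finds the same bit either way, even though 26 is larger than a byte.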

> +static inline void mb_set_bit(int bit, void *addr)
> +{
> + mb_correct_addr_and_bit(bit, addr);
> + ext4_set_bit(bit, addr);
> +}
> +
> +static inline void mb_set_bit_atomic(spinlock_t *lock, int bit, void *addr)
> +{
> + mb_correct_addr_and_bit(bit, addr);
> + ext4_set_bit_atomic(lock, bit, addr);
> +}
> +
> +static inline void mb_clear_bit(int bit, void *addr)
> +{
> + mb_correct_addr_and_bit(bit, addr);
> + ext4_clear_bit(bit, addr);
> +}
> +
> +static inline void mb_clear_bit_atomic(spinlock_t *lock, int bit, void *addr)
> +{
> + mb_correct_addr_and_bit(bit, addr);
> + ext4_clear_bit_atomic(lock, bit, addr);
> +}
> +
> +static inline void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)

uninlining this will save about eighty squigabytes of text.

Please review all of ext4/jbd2 with a view to removing unnecessary and wrong
inlinings.

> +{
> + char *bb;
> +
> + /* FIXME!! is this needed */
> + BUG_ON(EXT4_MB_BITMAP(e4b) == EXT4_MB_BUDDY(e4b));
> + BUG_ON(max == NULL);
> +
> + if (order > e4b->bd_blkbits + 1) {
> + *max = 0;
> + return NULL;
> + }
> +
> + /* at order 0 we see each particular block */
> + *max = 1 << (e4b->bd_blkbits + 3);
> + if (order == 0)
> + return EXT4_MB_BITMAP(e4b);
> +
> + bb = EXT4_MB_BUDDY(e4b) + EXT4_SB(e4b->bd_sb)->s_mb_offsets[order];
> + *max = EXT4_SB(e4b->bd_sb)->s_mb_maxs[order];
> +
> + return bb;
> +}
> +
>
> ...
>
> +#else
> +#define mb_free_blocks_double(a, b, c, d)
> +#define mb_mark_used_double(a, b, c)
> +#define mb_cmp_bitmaps(a, b)
> +#endif

Please use the do{}while(0) thing. Or, better, proper C functions which
have typechecking (unless this will cause undefined-var compile errors,
which happens sometimes).
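
A minimal sketch of the two suggested stub styles, using hypothetical `demo_` names rather than the mballoc ones. Both behave as a single statement inside an unbraced if/else; only the inline function has its arguments checked by the compiler.

```c
#include <assert.h>

/* Option 1: statement-shaped no-op macro; arguments are discarded. */
#define demo_cmp_bitmaps_macro(a, b) do { } while (0)

/* Option 2: empty inline function; arguments are still type-checked. */
static inline void demo_cmp_bitmaps(const void *a, const void *b)
{
	(void)a;
	(void)b;
}

/* Both stubs are safe as the sole statement of an if or else branch. */
static int demo_use(int check, const char *bitmap, const char *buddy)
{
	if (check)
		demo_cmp_bitmaps_macro(bitmap, buddy);
	else
		demo_cmp_bitmaps(bitmap, buddy);
	return check;
}
```

Passing an int where a pointer is expected compiles silently with the macro but draws a diagnostic with the function, which is the typechecking benefit referred to above.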

> +/* find most significant bit */
> +static int fmsb(unsigned short word)
> +{
> + int order;
> +
> + if (word > 255) {
> + order = 7;
> + word >>= 8;
> + } else {
> + order = -1;
> + }
> +
> + do {
> + order++;
> + word >>= 1;
> + } while (word != 0);
> +
> + return order;
> +}

Did we just reinvent fls()?
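
The overlap can be checked in user space. In this sketch `fmsb()` is copied from the quoted patch, `demo_fls()` is a portable stand-in with the usual fls() semantics (1-based index of the highest set bit, 0 for an input of 0), and `demo_count_mismatches()` is a hypothetical helper that compares the two over every non-zero 16-bit value.

```c
#include <assert.h>

/* fmsb() as quoted in the patch. */
static int fmsb(unsigned short word)
{
	int order;

	if (word > 255) {
		order = 7;
		word >>= 8;
	} else {
		order = -1;
	}

	do {
		order++;
		word >>= 1;
	} while (word != 0);

	return order;
}

/* Stand-in for fls(): 1-based position of the top set bit, 0 if none. */
static int demo_fls(unsigned int x)
{
	int r = 0;

	while (x != 0) {
		r++;
		x >>= 1;
	}
	return r;
}

/* Counts non-zero 16-bit inputs where fmsb(w) differs from fls(w) - 1. */
static int demo_count_mismatches(void)
{
	int bad = 0;
	unsigned int w;

	for (w = 1; w <= 0xffff; w++)
		if (fmsb((unsigned short)w) != demo_fls(w) - 1)
			bad++;
	return bad;
}
```

For every non-zero input the two agree up to the off-by-one, so fmsb(w) could be written as fls(w) - 1. The only difference is at zero, where fmsb() returns 0 and fls() - 1 would give -1.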

> +/* FIXME!! need more doc */
> +static void ext4_mb_mark_free_simple(struct super_block *sb,
> + void *buddy, unsigned first, int len,
> + struct ext4_group_info *grp)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> + unsigned short min;
> + unsigned short max;
> + unsigned short chunk;
> + unsigned short border;
> +
> + BUG_ON(len >= EXT4_BLOCKS_PER_GROUP(sb));
> +
> + border = 2 << sb->s_blocksize_bits;

Won't this explode with >= 32k blocksize?
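
The overflow is easy to reproduce outside the kernel. This sketch assumes a 16-bit unsigned short, as on all Linux targets; `demo_border()` is a hypothetical wrapper around the same expression, taking the log2 of the block size.

```c
#include <assert.h>
#include <limits.h>

/* Same expression as the patch, stored into an unsigned short. */
static unsigned short demo_border(int blocksize_bits)
{
	unsigned short border = (unsigned short)(2 << blocksize_bits);

	return border;
}
```

With 4k blocks (12 bits) the border is 8192 and with 16k blocks (14 bits) it is 32768, but with 32k blocks (15 bits) the value 65536 no longer fits and the border becomes 0, so it stops bounding the result of ffs(first | border).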

> + while (len > 0) {
> + /* find how many blocks can be covered since this position */
> + max = ffs(first | border) - 1;
> +
> + /* find how many blocks of power 2 we need to mark */
> + min = fmsb(len);
> +
> + if (max < min)
> + min = max;
> + chunk = 1 << min;
> +
> + /* mark multiblock chunks only */
> + grp->bb_counters[min]++;
> + if (min > 0)
> + mb_clear_bit(first >> min,
> + buddy + sbi->s_mb_offsets[min]);
> +
> + len -= chunk;
> + first += chunk;
> + }
> +}
> +
>
> ...
>
> +static int ext4_mb_init_cache(struct page *page, char *incore)
> +{
> + int blocksize;
> + int blocks_per_page;
> + int groups_per_page;
> + int err = 0;
> + int i;
> + ext4_group_t first_group;
> + int first_block;
> + struct super_block *sb;
> + struct buffer_head *bhs;
> + struct buffer_head **bh;
> + struct inode *inode;
> + char *data;
> + char *bitmap;
> +
> + mb_debug("init page %lu\n", page->index);
> +
> + inode = page->mapping->host;
> + sb = inode->i_sb;
> + blocksize = 1 << inode->i_blkbits;
> + blocks_per_page = PAGE_CACHE_SIZE / blocksize;
> +
> + groups_per_page = blocks_per_page >> 1;
> + if (groups_per_page == 0)
> + groups_per_page = 1;
> +
> + /* allocate buffer_heads to read bitmaps */
> + if (groups_per_page > 1) {
> + err = -ENOMEM;
> + i = sizeof(struct buffer_head *) * groups_per_page;
> + bh = kmalloc(i, GFP_NOFS);
> + if (bh == NULL)
> + goto out;
> + memset(bh, 0, i);

kzalloc()
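
In the kernel the two quoted lines collapse into a single `kzalloc(i, GFP_NOFS)` call. The same idea in user space, as a sketch with hypothetical names, is replacing malloc() plus memset() with calloc():

```c
#include <assert.h>
#include <stdlib.h>

struct demo_bh;	/* opaque stand-in for struct buffer_head */

/* Allocate a zero-filled array of n pointers; returns NULL on failure. */
static struct demo_bh **demo_alloc_bh_array(size_t n)
{
	return calloc(n, sizeof(struct demo_bh *));
}
```

Like the quoted code, this relies on an all-zero bit pattern being a NULL pointer, which holds on the platforms Linux supports.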

> + } else
> + bh = &bhs;
> +
> + first_group = page->index * blocks_per_page / 2;
> +
> + /* read all groups the page covers into the cache */
> + for (i = 0; i < groups_per_page; i++) {
> + struct ext4_group_desc *desc;
> +
> + if (first_group + i >= EXT4_SB(sb)->s_groups_count)
> + break;
> +
> + err = -EIO;
> + desc = ext4_get_group_desc(sb, first_group + i, NULL);
> + if (desc == NULL)
> + goto out;
> +
> + err = -ENOMEM;
> + bh[i] = sb_getblk(sb, ext4_block_bitmap(sb, desc));
> + if (bh[i] == NULL)
> + goto out;
> +
> + if (buffer_uptodate(bh[i]))
> + continue;
> +
> + lock_buffer(bh[i]);
> + if (buffer_uptodate(bh[i])) {
> + unlock_buffer(bh[i]);
> + continue;
> + }

Didn't we just add a helper in fs/buffer.c to do this?

> + if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> + ext4_init_block_bitmap(sb, bh[i],
> + first_group + i, desc);
> + set_buffer_uptodate(bh[i]);
> + unlock_buffer(bh[i]);
> + continue;
> + }
> + get_bh(bh[i]);
> + bh[i]->b_end_io = end_buffer_read_sync;
> + submit_bh(READ, bh[i]);
> + mb_debug("read bitmap for group %lu\n", first_group + i);
> + }
> +
> + /* wait for I/O completion */
> + for (i = 0; i < groups_per_page && bh[i]; i++)
> + wait_on_buffer(bh[i]);
> +
> + err = -EIO;
> + for (i = 0; i < groups_per_page && bh[i]; i++)
> + if (!buffer_uptodate(bh[i]))
> + goto out;
> +
> + first_block = page->index * blocks_per_page;
> + for (i = 0; i < blocks_per_page; i++) {
> + int group;
> + struct ext4_group_info *grinfo;
> +
> + group = (first_block + i) >> 1;
> + if (group >= EXT4_SB(sb)->s_groups_count)
> + break;
> +
> + /*
> + * data carry information regarding this
> + * particular group in the format specified
> + * above
> + *
> + */
> + data = page_address(page) + (i * blocksize);
> + bitmap = bh[group - first_group]->b_data;
> +
> + /*
> + * We place the buddy block and bitmap block
> + * close together
> + */
> + if ((first_block + i) & 1) {
> + /* this is block of buddy */
> + BUG_ON(incore == NULL);
> + mb_debug("put buddy for group %u in page %lu/%x\n",
> + group, page->index, i * blocksize);
> + memset(data, 0xff, blocksize);
> + grinfo = ext4_get_group_info(sb, group);
> + grinfo->bb_fragments = 0;
> + memset(grinfo->bb_counters, 0,
> + sizeof(unsigned short)*(sb->s_blocksize_bits+2));
> + /*
> + * incore got set to the group block bitmap below
> + */
> + ext4_mb_generate_buddy(sb, data, incore, group);
> + incore = NULL;
> + } else {
> + /* this is block of bitmap */
> + BUG_ON(incore != NULL);
> + mb_debug("put bitmap for group %u in page %lu/%x\n",
> + group, page->index, i * blocksize);
> +
> + /* see comments in ext4_mb_put_pa() */
> + ext4_lock_group(sb, group);
> + memcpy(data, bitmap, blocksize);
> +
> + /* mark all preallocated blks used in in-core bitmap */
> + ext4_mb_generate_from_pa(sb, data, group);
> + ext4_unlock_group(sb, group);
> +
> + /* set incore so that the buddy information can be
> + * generated using this
> + */
> + incore = data;
> + }
> + }
> + SetPageUptodate(page);

Is the page locked here?

> +out:
> + if (bh) {
> + for (i = 0; i < groups_per_page && bh[i]; i++)
> + brelse(bh[i]);

put_bh()

> + if (bh != &bhs)
> + kfree(bh);
> + }
> + return err;
> +}
> +
>
> ...
>
> +static void mb_set_bits(spinlock_t *lock, void *bm, int cur, int len)
> +{
> + __u32 *addr;
> +
> + len = cur + len;
> + while (cur < len) {
> + if ((cur & 31) == 0 && (len - cur) >= 32) {
> + /* fast path: clear whole word at once */

s/clear/set/

> + addr = bm + (cur >> 3);
> + *addr = 0xffffffff;
> + cur += 32;
> + continue;
> + }
> + mb_set_bit_atomic(lock, cur, bm);
> + cur++;
> + }
> +}
> +
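
For reference, the word-at-a-time fast path in the quoted mb_set_bits() can be tried out in user space. This is only a sketch under assumptions: set_bit_range() is a hypothetical stand-in, it takes no lock, and it assumes the little-endian bit numbering (bit n lives in byte n >> 3) that the quoted `cur >> 3` address arithmetic implies.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical user-space stand-in for mb_set_bits(): sets bits
 * [cur, cur + len) in the bitmap bm. No locking is done here.
 */
static void set_bit_range(uint8_t *bm, unsigned int cur, unsigned int len)
{
	unsigned int end = cur + len;

	assert(bm != NULL);
	while (cur < end) {
		if ((cur & 31) == 0 && end - cur >= 32) {
			/* fast path: set a whole 32-bit word at once */
			memset(bm + (cur >> 3), 0xff, 4);
			cur += 32;
			continue;
		}
		/* slow path: set a single bit */
		bm[cur >> 3] |= (uint8_t)(1u << (cur & 7));
		cur++;
	}
}
```

Calling set_bit_range(bm, 30, 36) on a zeroed bitmap takes the slow path for bits 30 and 31, the fast path for bits 32..63, and the slow path again for bits 64 and 65.
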
>
> ...
>
> +static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
> + struct ext4_buddy *e4b)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + int ret;
> +
> + BUG_ON(ac->ac_b_ex.fe_group != e4b->bd_group);
> + BUG_ON(ac->ac_status == AC_STATUS_FOUND);
> +
> + ac->ac_b_ex.fe_len = min(ac->ac_b_ex.fe_len, ac->ac_g_ex.fe_len);
> + ac->ac_b_ex.fe_logical = ac->ac_g_ex.fe_logical;
> + ret = mb_mark_used(e4b, &ac->ac_b_ex);
> +
> + /* preallocation can change ac_b_ex, thus we store actually
> + * allocated blocks for history */
> + ac->ac_f_ex = ac->ac_b_ex;
> +
> + ac->ac_status = AC_STATUS_FOUND;
> + ac->ac_tail = ret & 0xffff;
> + ac->ac_buddy = ret >> 16;
> +
> + /* XXXXXXX: SUCH A HORRIBLE **CK */
> + /*FIXME!! Why ? */

?

> + ac->ac_bitmap_page = e4b->bd_bitmap_page;
> + get_page(ac->ac_bitmap_page);
> + ac->ac_buddy_page = e4b->bd_buddy_page;
> + get_page(ac->ac_buddy_page);
> +
> + /* store last allocated for subsequent stream allocation */
> + if ((ac->ac_flags & EXT4_MB_HINT_DATA)) {
> + spin_lock(&sbi->s_md_lock);
> + sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> + sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
> + spin_unlock(&sbi->s_md_lock);
> + }
> +}
>
> ...
>
> +static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
> + ext4_group_t group)
> +{
> + struct ext4_group_info *grp = ext4_get_group_info(sb, group);
> + struct ext4_prealloc_space *pa;
> + struct list_head *cur;
> + ext4_group_t groupnr;
> + ext4_grpblk_t start;
> + int preallocated = 0;
> + int count = 0;
> + int len;
> +
> + /* all form of preallocation discards first load group,
> + * so the only competing code is preallocation use.
> + * we don't need any locking here
> + * notice we do NOT ignore preallocations with pa_deleted
> + * otherwise we could leave used blocks available for
> + * allocation in buddy when concurrent ext4_mb_put_pa()
> + * is dropping preallocation
> + */
> + list_for_each_rcu(cur, &grp->bb_prealloc_list) {
> + pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
> + spin_lock(&pa->pa_lock);
> + ext4_get_group_no_and_offset(sb, pa->pa_pstart,
> + &groupnr, &start);
> + len = pa->pa_len;
> + spin_unlock(&pa->pa_lock);
> + if (unlikely(len == 0))
> + continue;
> + BUG_ON(groupnr != group);
> + mb_set_bits(sb_bgl_lock(EXT4_SB(sb), group),
> + bitmap, start, len);
> + preallocated += len;
> + count++;
> + }

Seems to be missing rcu_read_lock()

> + mb_debug("prellocated %u for group %lu\n", preallocated, group);
> +}
> +
> +static void ext4_mb_pa_callback(struct rcu_head *head)
> +{
> + struct ext4_prealloc_space *pa;
> + pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu);
> + kmem_cache_free(ext4_pspace_cachep, pa);
> +}
> +#define mb_call_rcu(__pa) call_rcu(&(__pa)->u.pa_rcu, ext4_mb_pa_callback)

Is there any reason why this had to be implemented as a macro?

>
> ...
>
> +static int ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
> +{
> + struct super_block *sb = ac->ac_sb;
> + struct ext4_prealloc_space *pa;
> + struct ext4_group_info *grp;
> + struct ext4_inode_info *ei;
> +
> + /* preallocate only when found space is larger then requested */
> + BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len);
> + BUG_ON(ac->ac_status != AC_STATUS_FOUND);
> + BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
> +
> + pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);

Do all the GFP_NOFS's in this code really need to be GFP_NOFS?

> + if (pa == NULL)
> + return -ENOMEM;
> +
> + if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len) {
> + int winl;
> + int wins;
> + int win;
> + int offs;
> +
> + /* we can't allocate as much as normalizer wants.
> + * so, found space must get proper lstart
> + * to cover original request */
> + BUG_ON(ac->ac_g_ex.fe_logical > ac->ac_o_ex.fe_logical);
> + BUG_ON(ac->ac_g_ex.fe_len < ac->ac_o_ex.fe_len);
> +
> + /* we're limited by original request in that
> + * logical block must be covered any way
> + * winl is window we can move our chunk within */
> + winl = ac->ac_o_ex.fe_logical - ac->ac_g_ex.fe_logical;
> +
> + /* also, we should cover whole original request */
> + wins = ac->ac_b_ex.fe_len - ac->ac_o_ex.fe_len;
> +
> + /* the smallest one defines real window */
> + win = min(winl, wins);
> +
> + offs = ac->ac_o_ex.fe_logical % ac->ac_b_ex.fe_len;
> + if (offs && offs < win)
> + win = offs;
> +
> + ac->ac_b_ex.fe_logical = ac->ac_o_ex.fe_logical - win;
> + BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
> + BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
> + }
> +
> + /* preallocation can change ac_b_ex, thus we store actually
> + * allocated blocks for history */
> + ac->ac_f_ex = ac->ac_b_ex;
> +
> + pa->pa_lstart = ac->ac_b_ex.fe_logical;
> + pa->pa_pstart = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
> + pa->pa_len = ac->ac_b_ex.fe_len;
> + pa->pa_free = pa->pa_len;
> + atomic_set(&pa->pa_count, 1);
> + spin_lock_init(&pa->pa_lock);
> + pa->pa_deleted = 0;
> + pa->pa_linear = 0;
> +
> + mb_debug("new inode pa %p: %llu/%u for %u\n", pa,
> + pa->pa_pstart, pa->pa_len, pa->pa_lstart);
> +
> + ext4_mb_use_inode_pa(ac, pa);
> + atomic_add(pa->pa_free, &EXT4_SB(sb)->s_mb_preallocated);
> +
> + ei = EXT4_I(ac->ac_inode);
> + grp = ext4_get_group_info(sb, ac->ac_b_ex.fe_group);
> +
> + pa->pa_obj_lock = &ei->i_prealloc_lock;
> + pa->pa_inode = ac->ac_inode;
> +
> + ext4_lock_group(sb, ac->ac_b_ex.fe_group);
> + list_add_rcu(&pa->pa_group_list, &grp->bb_prealloc_list);
> + ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
> +
> + spin_lock(pa->pa_obj_lock);
> + list_add_rcu(&pa->pa_inode_list, &ei->i_prealloc_list);
> + spin_unlock(pa->pa_obj_lock);

hm. Strange to see list_add_rcu() inside spinlock like this.

> + return 0;
> +}
> +
>
> ...
>
> +static int ext4_mb_discard_group_preallocations(struct super_block *sb,
> + ext4_group_t group, int needed)
> +{
> + struct ext4_group_info *grp = ext4_get_group_info(sb, group);
> + struct buffer_head *bitmap_bh = NULL;
> + struct ext4_prealloc_space *pa, *tmp;
> + struct list_head list;
> + struct ext4_buddy e4b;
> + int err;
> + int busy = 0;
> + int free = 0;
> +
> + mb_debug("discard preallocation for group %lu\n", group);
> +
> + if (list_empty(&grp->bb_prealloc_list))
> + return 0;
> +
> + bitmap_bh = read_block_bitmap(sb, group);
> + if (bitmap_bh == NULL) {
> + /* error handling here */
> + ext4_mb_release_desc(&e4b);
> + BUG_ON(bitmap_bh == NULL);
> + }
> +
> + err = ext4_mb_load_buddy(sb, group, &e4b);
> + BUG_ON(err != 0); /* error handling here */
> +
> + if (needed == 0)
> + needed = EXT4_BLOCKS_PER_GROUP(sb) + 1;
> +
> + grp = ext4_get_group_info(sb, group);
> + INIT_LIST_HEAD(&list);
> +
> +repeat:
> + ext4_lock_group(sb, group);
> + list_for_each_entry_safe(pa, tmp,
> + &grp->bb_prealloc_list, pa_group_list) {
> + spin_lock(&pa->pa_lock);
> + if (atomic_read(&pa->pa_count)) {
> + spin_unlock(&pa->pa_lock);
> + busy = 1;
> + continue;
> + }
> + if (pa->pa_deleted) {
> + spin_unlock(&pa->pa_lock);
> + continue;
> + }
> +
> + /* seems this one can be freed ... */
> + pa->pa_deleted = 1;
> +
> + /* we can trust pa_free ... */
> + free += pa->pa_free;
> +
> + spin_unlock(&pa->pa_lock);
> +
> + list_del_rcu(&pa->pa_group_list);
> + list_add(&pa->u.pa_tmp_list, &list);
> + }

Strange to see rcu operations outside rcu_read_lock().

> + /* if we still need more blocks and some PAs were used, try again */
> + if (free < needed && busy) {
> + busy = 0;
> + ext4_unlock_group(sb, group);
> + /*
> + * Yield the CPU here so that we don't get soft lockup
> + * in non preempt case.
> + */
> + yield();

argh, no, yield() is basically unusable. schedule_timeout(1) is preferable.

Please test this code when there are lots of cpu-intensive tasks running.

> + goto repeat;
> + }
> +
> + /* found anything to free? */
> + if (list_empty(&list)) {
> + BUG_ON(free != 0);
> + goto out;
> + }
> +
> + /* now free all selected PAs */
> + list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
> +
> + /* remove from object (inode or locality group) */
> + spin_lock(pa->pa_obj_lock);
> + list_del_rcu(&pa->pa_inode_list);
> + spin_unlock(pa->pa_obj_lock);
> +
> + if (pa->pa_linear)
> + ext4_mb_release_group_pa(&e4b, pa);
> + else
> + ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
> +
> + list_del(&pa->u.pa_tmp_list);
> + mb_call_rcu(pa);
> + }
> +
> +out:
> + ext4_unlock_group(sb, group);
> + ext4_mb_release_desc(&e4b);
> + brelse(bitmap_bh);

put_bh()

> + return free;
> +}
> +
> +/*
> + * releases all non-used preallocated blocks for given inode
> + *
> + * It's important to discard preallocations under i_data_sem
> + * We don't want another block to be served from the prealloc
> + * space when we are discarding the inode prealloc space.
> + *
> + * FIXME!! Make sure it is valid at all the call sites
> + */
> +void ext4_mb_discard_inode_preallocations(struct inode *inode)
> +{
> + struct ext4_inode_info *ei = EXT4_I(inode);
> + struct super_block *sb = inode->i_sb;
> + struct buffer_head *bitmap_bh = NULL;
> + struct ext4_prealloc_space *pa, *tmp;
> + ext4_group_t group = 0;
> + struct list_head list;
> + struct ext4_buddy e4b;
> + int err;
> +
> + if (!test_opt(sb, MBALLOC) || !S_ISREG(inode->i_mode)) {
> + /*BUG_ON(!list_empty(&ei->i_prealloc_list));*/
> + return;
> + }
> +
> + mb_debug("discard preallocation for inode %lu\n", inode->i_ino);
> +
> + INIT_LIST_HEAD(&list);
> +
> +repeat:
> + /* first, collect all pa's in the inode */
> + spin_lock(&ei->i_prealloc_lock);
> + while (!list_empty(&ei->i_prealloc_list)) {
> + pa = list_entry(ei->i_prealloc_list.next,
> + struct ext4_prealloc_space, pa_inode_list);
> + BUG_ON(pa->pa_obj_lock != &ei->i_prealloc_lock);
> + spin_lock(&pa->pa_lock);
> + if (atomic_read(&pa->pa_count)) {
> + /* this shouldn't happen often - nobody should
> + * use preallocation while we're discarding it */
> + spin_unlock(&pa->pa_lock);
> + spin_unlock(&ei->i_prealloc_lock);
> + printk(KERN_ERR "uh-oh! used pa while discarding\n");
> + dump_stack();

WARN_ON(1) would be more conventional.

> + current->state = TASK_UNINTERRUPTIBLE;
> + schedule_timeout(HZ);

schedule_timeout_uninterruptible()

> + goto repeat;
> +
> + }
> + if (pa->pa_deleted == 0) {
> + pa->pa_deleted = 1;
> + spin_unlock(&pa->pa_lock);
> + list_del_rcu(&pa->pa_inode_list);
> + list_add(&pa->u.pa_tmp_list, &list);
> + continue;
> + }
> +
> + /* someone is deleting pa right now */
> + spin_unlock(&pa->pa_lock);
> + spin_unlock(&ei->i_prealloc_lock);
> +
> + /* we have to wait here because pa_deleted
> + * doesn't mean pa is already unlinked from
> + * the list. as we might be called from
> + * ->clear_inode() the inode will get freed
> + * and concurrent thread which is unlinking
> + * pa from inode's list may access already
> + * freed memory, bad-bad-bad */
> +
> + /* XXX: if this happens too often, we can
> + * add a flag to force wait only in case
> + * of ->clear_inode(), but not in case of
> + * regular truncate */
> + current->state = TASK_UNINTERRUPTIBLE;
> + schedule_timeout(HZ);

ditto

> + goto repeat;
> + }
> + spin_unlock(&ei->i_prealloc_lock);
> +
> + list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
> + BUG_ON(pa->pa_linear != 0);
> + ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, NULL);
> +
> + err = ext4_mb_load_buddy(sb, group, &e4b);
> + BUG_ON(err != 0); /* error handling here */
> +
> + bitmap_bh = read_block_bitmap(sb, group);
> + if (bitmap_bh == NULL) {
> + /* error handling here */
> + ext4_mb_release_desc(&e4b);
> + BUG_ON(bitmap_bh == NULL);
> + }
> +
> + ext4_lock_group(sb, group);
> + list_del_rcu(&pa->pa_group_list);
> + ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
> + ext4_unlock_group(sb, group);
> +
> + ext4_mb_release_desc(&e4b);
> + brelse(bitmap_bh);
> +
> + list_del(&pa->u.pa_tmp_list);
> + mb_call_rcu(pa);
> + }
> +}

Would be nice to ask Paul to review all the rcu usage in here. It looks odd.

>
> ...
>
> +#else
> +#define ext4_mb_show_ac(x)
> +#endif

static inlined C functions are preferred (+1e6 dittoes)

> +/*
> + * We use locality group preallocation for small size file. The size of the
> + * file is determined by the current size or the resulting size after
> + * allocation which ever is larger
> + *
> + * One can tune this size via /proc/fs/ext4/<partition>/stream_req
> + */
> +static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + int bsbits = ac->ac_sb->s_blocksize_bits;
> + loff_t size, isize;
> +
> + if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
> + return;
> +
> + size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
> + isize = i_size_read(ac->ac_inode) >> bsbits;
> + if (size < isize)
> + size = isize;

min()?

> + /* don't use group allocation for large files */
> + if (size >= sbi->s_mb_stream_request)
> + return;
> +
> + if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
> + return;
> +
> + BUG_ON(ac->ac_lg != NULL);
> + ac->ac_lg = &sbi->s_locality_groups[get_cpu()];
> + put_cpu();

Strange-looking code. I'd be interested in a description of the per-cpu
design here.

> + /* we're going to use group allocation */
> + ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;
> +
> + /* serialize all allocations in the group */
> + down(&ac->ac_lg->lg_sem);

This should be a mutex, shouldn't it?

> +}
> +
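
The size test in the quoted ext4_mb_group_or_file() keeps the larger of the request end and the current file size, so the helper that matches it would be max() rather than min(). A rough user-space model of the decision, leaving out the EXT4_MB_HINT_DATA and EXT4_MB_HINT_GOAL_ONLY checks and the locking; use_group_prealloc() and its parameter names are made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the size test: returns true when the request
 * would go to locality group preallocation. logical, len and
 * stream_request are in filesystem blocks; i_size_bytes is in bytes.
 */
static bool use_group_prealloc(uint64_t logical, uint64_t len,
			       uint64_t i_size_bytes, unsigned int bsbits,
			       uint64_t stream_request)
{
	uint64_t size = logical + len;            /* end of request, in blocks */
	uint64_t isize;

	assert(bsbits < 64);
	isize = i_size_bytes >> bsbits;           /* current size, in blocks */

	/* keep the larger of the two values, i.e. max(size, isize) */
	if (size < isize)
		size = isize;

	/* large files are not served from the locality group */
	return size < stream_request;
}
```

With a threshold of 16 blocks, a 4-block write at offset 0 into an empty file qualifies, while the same write into a file that is already 256 blocks long does not.
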
>
> ...
>
> +static int ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
> + ext4_group_t group, ext4_grpblk_t block, int count)
> +{
> + struct ext4_group_info *db = e4b->bd_info;
> + struct super_block *sb = e4b->bd_sb;
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> + struct ext4_free_metadata *md;
> + int i;
> +
> + BUG_ON(e4b->bd_bitmap_page == NULL);
> + BUG_ON(e4b->bd_buddy_page == NULL);
> +
> + ext4_lock_group(sb, group);
> + for (i = 0; i < count; i++) {
> + md = db->bb_md_cur;
> + if (md && db->bb_tid != handle->h_transaction->t_tid) {
> + db->bb_md_cur = NULL;
> + md = NULL;
> + }
> +
> + if (md == NULL) {
> + ext4_unlock_group(sb, group);
> + md = kmalloc(sizeof(*md), GFP_KERNEL);

Why was this one not GFP_NOFS?

> + if (md == NULL)
> + return -ENOMEM;

Did we just leak some memory?

> + md->num = 0;
> + md->group = group;
> +
> + ext4_lock_group(sb, group);
> + if (db->bb_md_cur == NULL) {
> + spin_lock(&sbi->s_md_lock);
> + list_add(&md->list, &sbi->s_active_transaction);
> + spin_unlock(&sbi->s_md_lock);
> + /* protect buddy cache from being freed,
> + * otherwise we'll refresh it from
> + * on-disk bitmap and lose not-yet-available
> + * blocks */
> + page_cache_get(e4b->bd_buddy_page);
> + page_cache_get(e4b->bd_bitmap_page);
> + db->bb_md_cur = md;
> + db->bb_tid = handle->h_transaction->t_tid;
> + mb_debug("new md 0x%p for group %lu\n",
> + md, md->group);
> + } else {
> + kfree(md);
> + md = db->bb_md_cur;
> + }
> + }
> +
> + BUG_ON(md->num >= EXT4_BB_MAX_BLOCKS);
> + md->blocks[md->num] = block + i;
> + md->num++;
> + if (md->num == EXT4_BB_MAX_BLOCKS) {
> + /* no more space, put full container on a sb's list */
> + db->bb_md_cur = NULL;
> + }
> + }
> + ext4_unlock_group(sb, group);
> + return 0;
> +}
> +
>
> ...
>
> + case Opt_mballoc:
> + set_opt(sbi->s_mount_opt, MBALLOC);
> + break;
> + case Opt_nomballoc:
> + clear_opt(sbi->s_mount_opt, MBALLOC);
> + break;
> + case Opt_stripe:
> + if (match_int(&args[0], &option))
> + return 0;
> + if (option < 0)
> + return 0;
> + sbi->s_stripe = option;
> + break;

These appear to be undocumented.

> default:
> printk (KERN_ERR
> "EXT4-fs: Unrecognized mount option \"%s\" "
> @@ -1742,6 +1762,33 @@ static ext4_fsblk_t descriptor_loc(struct super_block *sb,
> return (has_super + ext4_group_first_block_no(sb, bg));
> }
>
> +/**
> + * ext4_get_stripe_size: Get the stripe size.
> + * @sbi: In memory super block info
> + *
> + * If we have specified it via mount option, then
> + * use the mount option value. If the value specified at mount time is
> + * greater than the blocks per group use the super block value.
> + * If the super block value is greater than blocks per group return 0.
> + * Allocator needs it be less than blocks per group.
> + *
> + */
> +static unsigned long ext4_get_stripe_size(struct ext4_sb_info *sbi)
> +{
> + unsigned long stride = le16_to_cpu(sbi->s_es->s_raid_stride);
> + unsigned long stripe_width =
> + le32_to_cpu(sbi->s_es->s_raid_stripe_width);
> +
> + if (sbi->s_stripe && sbi->s_stripe <= sbi->s_blocks_per_group) {
> + return sbi->s_stripe;
> + } else if (stripe_width <= sbi->s_blocks_per_group) {
> + return stripe_width;
> + } else if (stride <= sbi->s_blocks_per_group) {
> + return stride;
> + }

unneeded braces.

> + return 0;
> +}
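
The fallback order in the quoted ext4_get_stripe_size() can be checked in isolation. A small sketch, written without the braces; pick_stripe_size() and its flat parameter list are illustrative assumptions, not the patch's interface:

```c
#include <assert.h>

/*
 * Hypothetical user-space copy of the selection order: mount option
 * first, then the superblock stripe width, then the superblock stride.
 * All values are in filesystem blocks.
 */
static unsigned long pick_stripe_size(unsigned long mount_stripe,
				      unsigned long sb_stripe_width,
				      unsigned long sb_stride,
				      unsigned long blocks_per_group)
{
	assert(blocks_per_group > 0);

	if (mount_stripe && mount_stripe <= blocks_per_group)
		return mount_stripe;
	else if (sb_stripe_width <= blocks_per_group)
		return sb_stripe_width;
	else if (sb_stride <= blocks_per_group)
		return sb_stride;
	return 0;
}
```

Note that, as quoted, an unset stripe width of 0 already passes the second test, so the stride is only consulted when the stripe width is larger than blocks per group.
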
>
> ...
>
> +static inline
> +struct ext4_group_info *ext4_get_group_info(struct super_block *sb,
> + ext4_group_t group)
> +{
> + struct ext4_group_info ***grp_info;
> + long indexv, indexh;
> + grp_info = EXT4_SB(sb)->s_group_info;
> + indexv = group >> (EXT4_DESC_PER_BLOCK_BITS(sb));
> + indexh = group & ((EXT4_DESC_PER_BLOCK(sb)) - 1);
> + return grp_info[indexv][indexh];
> +}

This should be uninlined.
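
The two-level lookup in the quoted ext4_get_group_info() splits the group number into a table index and a slot within that table. A user-space sketch of just the index arithmetic; split_group() and struct group_index are hypothetical names, and the 128 descriptors per block used in the note below is an assumed example value:

```c
#include <assert.h>

/* Hypothetical pair of indices into a two-level group info table. */
struct group_index {
	unsigned long v;	/* which second-level table */
	unsigned long h;	/* slot inside that table */
};

static struct group_index split_group(unsigned long group,
				      unsigned int desc_per_block_bits)
{
	struct group_index idx;
	unsigned long per_block;

	assert(desc_per_block_bits < 8 * sizeof(unsigned long));
	per_block = 1UL << desc_per_block_bits;

	idx.v = group >> desc_per_block_bits;	/* high bits pick the table */
	idx.h = group & (per_block - 1);	/* low bits pick the slot */
	return idx;
}
```

Assuming 128 descriptors per block (7 bits), group 300 lands in table 2, slot 44.
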



Gosh what a lot of code. Is it faster?

2008-01-23 22:12:30

by Andrew Morton

Subject: Re: [PATCH 30/49] ext4: Convert truncate_mutex to read write semaphore.

> On Mon, 21 Jan 2008 22:02:09 -0500 "Theodore Ts'o" <[email protected]> wrote:
> +int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
> + unsigned long max_blocks, struct buffer_head *bh,
> + int create, int extend_disksize)
> +{
> + int retval;
> + if (create) {
> + down_write((&EXT4_I(inode)->i_data_sem));
> + } else {
> + down_read((&EXT4_I(inode)->i_data_sem));
> + }
> + if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) {
> + retval = ext4_ext_get_blocks(handle, inode, block, max_blocks,
> + bh, create, extend_disksize);
> + } else {
> + retval = ext4_get_blocks_handle(handle, inode, block,
> + max_blocks, bh, create, extend_disksize);
> + }
> + if (create) {
> + up_write((&EXT4_I(inode)->i_data_sem));
> + } else {
> + up_read((&EXT4_I(inode)->i_data_sem));
> + }

This function has many unneeded braces. checkpatch used to detect this
but it seems to have broken.

> + return retval;
> +}
> static int ext4_get_block(struct inode *inode, sector_t iblock,
> struct buffer_head *bh_result, int create)

Missing newline.

2008-01-23 22:13:25

by Andrew Morton

Subject: Re: [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl

> On Mon, 21 Jan 2008 22:02:15 -0500 "Theodore Ts'o" <[email protected]> wrote:
> The below patch add ioctl for migrating ext3 indirect block mapped inode
> to ext4 extent mapped inode.

This patch adds lots of weird and inexplicable single- and double-newlines
in inappropriate places. However it frequently forgets to add newlines
between end-of-locals and start-of-code, which is usual practice.


+struct list_blocks_struct {
+ ext4_lblk_t first_block, last_block;
+ ext4_fsblk_t first_pblock, last_pblock;
+};

This structure would benefit from some code comments.

2008-01-23 22:41:09

by Andreas Dilger

Subject: Re: [PATCH 33/49] ext4: Add the journal checksum feature

On Jan 23, 2008 14:07 -0800, Andrew Morton wrote:
> > +{
> > + struct page *page = bh->b_page;
> > + char *addr;
> > + __u32 checksum;
> > +
> > + addr = kmap_atomic(page, KM_USER0);
> > + checksum = crc32_be(crc32_sum,
> > + (void *)(addr + offset_in_page(bh->b_data)), bh->b_size);
> > + kunmap_atomic(addr, KM_USER0);
> > +
> > + return checksum;
> > +}
>
> Can this buffer actually be in highmem?

Yes, this was found during system testing. While ext3/4 will only allocate
buffer heads in lowmem, the jbd/jbd2 code can allocate buffers in highmem.
I was surprised about this also.

Please see the thread in ext4-devel:
[PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2008-01-23 23:21:29

by Andreas Dilger

Subject: Re: [PATCH 41/49] ext4: Add multi block allocator for ext4

On Jan 23, 2008 14:07 -0800, Andrew Morton wrote:
> > +#define mb_correct_addr_and_bit(bit, addr) \
> > +{ \
> > + bit += ((unsigned long) addr & 3UL) << 3; \
> > + addr = (void *) ((unsigned long) addr & ~3UL); \
> > +}
>
> Why do these exist?

They seem to be a holdover from when mballoc stored the buddy bitmaps
on disk. That no longer happens (to avoid bitmap vs. buddy consistency
problems), so I suspect they can be removed.
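
Mechanically, the quoted macro rounds the address down to a 4-byte boundary and adds the skipped bytes to the bit number, so the same bit is still addressed afterwards. A user-space sketch of that arithmetic; correct_addr_and_bit() is a hypothetical function form of the macro, not something from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical function version of mb_correct_addr_and_bit():
 * align *addr down to 4 bytes and compensate in *bit.
 */
static void correct_addr_and_bit(unsigned int *bit, void **addr)
{
	uintptr_t a;

	assert(bit != NULL && addr != NULL);
	a = (uintptr_t)*addr;

	*bit += (unsigned int)(a & 3u) << 3;	/* skipped bytes, counted in bits */
	*addr = (void *)(a & ~(uintptr_t)3);	/* round down to a 4-byte boundary */
}
```

For an address one byte past a 4-byte boundary, bit 2 becomes bit 10 relative to the aligned address.
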

I can't comment on many of the other issues because Alex wrote most
of the code.

> Gosh what a lot of code. Is it faster?

Yes, and also importantly it uses a lot less CPU to do a given amount
of allocation, which is critical in our environments where there is
very high disk bandwidth on a single node and CPU becomes the limiting
factor of the IO speed. This of course also helps any write-intensive
environment where the CPU is doing something "useful".

Some older test results include:
https://ols2006.108.redhat.com/2007/Reprints/mathur-Reprint.pdf (Section 7)

In the linux-ext4 thread "compilebench numbers for ext4":
http://www.mail-archive.com/[email protected]/msg03834.html

http://oss.oracle.com/~mason/compilebench/ext4/ext-create-compare.png
http://oss.oracle.com/~mason/compilebench/ext4/ext-compile-compare.png
http://oss.oracle.com/~mason/compilebench/ext4/ext-read-compare.png
http://oss.oracle.com/~mason/compilebench/ext4/ext-rm-compare.png

Note that the ext-read-compare.png graph shows lower read performance, but
a couple of bugs in mballoc have since been fixed so that ext4 allocates
more contiguous extents.

In the old linux-ext4 thread "[RFC] delayed allocation testing on node zefir"
http://www.mail-archive.com/[email protected]/msg00587.html

        : dd2048rw
        :   REAL  UTIME  STIME    READ  WRITTEN  DETAILS
EXT3    :  58.46     23   1491    2572  2097292  17 extents
EXT4    :  44.56     19   1018      12  2097244  19 extents
REISERFS:  56.80     26   1370    2952  2097336  457 extents
JFS     :  45.77     22    984       0  2097216  1 extents
XFS     :  50.97     20   1394       0  2100825  7 extents

        : kernuntar
        :   REAL  UTIME  STIME    READ  WRITTEN  DETAILS
EXT3    :  56.99   5037    651      68   252016
EXT4    :  55.03   5034    553      36   249884
REISERFS:  52.55   4996    854      64   238068
JFS     :  70.15   5057    630     496   288116
XFS     :  72.84   5052    953     132   316798

        : kernstat
        :   REAL  UTIME  STIME    READ  WRITTEN  DETAILS
EXT3    :   2.83      8     15    5892        0
EXT4    :   0.51      9     10    5892        0
REISERFS:   0.81      7     49    2696        0
JFS     :   6.19     11     49   12552        0
XFS     :   2.09      9     61    6504        0

        : kerncat
        :   REAL  UTIME  STIME    READ  WRITTEN  DETAILS
EXT3    :   9.48     25    213  241624        0
EXT4    :   6.29     27    197  238560        0
REISERFS:  14.69     33    230  234744        0
JFS     :  23.51     23    231  244596        0
XFS     :  18.24     36    254  238548        0

        : kernrm
        :   REAL  UTIME  STIME    READ  WRITTEN  DETAILS
EXT3    :   4.82      4    108    9628     4672
EXT4    :   1.61      5    110    6536     4632
REISERFS:   3.15      8    276    2768      236
JFS     :  33.90      7    168   14400    33048
XFS     :  20.03      8    296    6632    86160


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2008-01-24 05:22:51

by Aneesh Kumar K.V

Subject: Re: [PATCH 23/49] Add buffer head related helper functions

On Wed, Jan 23, 2008 at 02:06:48PM -0800, Andrew Morton wrote:
> > On Mon, 21 Jan 2008 22:02:02 -0500 "Theodore Ts'o" <[email protected]> wrote:
> > +}
> > +EXPORT_SYMBOL(bh_uptodate_or_lock);
> > +/**
>
> Missing newline.
>
> > + * bh_submit_read: Submit a locked buffer for reading
> > + * @bh: struct buffer_head
> > + *
> > + * Returns a negative error
> > + */
> > +int bh_submit_read(struct buffer_head *bh)
> > +{
> > + if (!buffer_locked(bh))
> > + lock_buffer(bh);
> > +
> > + if (buffer_uptodate(bh))
> > + return 0;
>
> Here it can lock the buffer then return zero
>
> > + get_bh(bh);
> > + bh->b_end_io = end_buffer_read_sync;
> > + submit_bh(READ, bh);
> > + wait_on_buffer(bh);
> > + if (buffer_uptodate(bh))
> > + return 0;
>
> Here it will unlock the buffer and return zero.
>
> This function is unusable when passed an unlocked buffer.
>

Updated patch below.

commit 70d4ca32604e0935a8b9a49c5ac8b9c64c810693
Author: Aneesh Kumar K.V <[email protected]>
Date: Thu Jan 24 10:50:24 2008 +0530

Add buffer head related helper functions

Add buffer head related helper function bh_uptodate_or_lock and
bh_submit_read which can be used by file system

Signed-off-by: Aneesh Kumar K.V <[email protected]>

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..82aa2db 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3213,6 +3213,53 @@ static int buffer_cpu_notify(struct notifier_block *self,
return NOTIFY_OK;
}

+/**
+ * bh_uptodate_or_lock: Test whether the buffer is uptodate
+ * @bh: struct buffer_head
+ *
+ * Return true if the buffer is up-to-date and false,
+ * with the buffer locked, if not.
+ */
+int bh_uptodate_or_lock(struct buffer_head *bh)
+{
+ if (!buffer_uptodate(bh)) {
+ lock_buffer(bh);
+ if (!buffer_uptodate(bh))
+ return 0;
+ unlock_buffer(bh);
+ }
+ return 1;
+}
+EXPORT_SYMBOL(bh_uptodate_or_lock);
+
+/**
+ * bh_submit_read: Submit a locked buffer for reading
+ * @bh: struct buffer_head
+ *
+ * Returns zero on success and -EIO on error. If the input
+ * buffer is not locked, returns -EINVAL.
+ *
+ */
+int bh_submit_read(struct buffer_head *bh)
+{
+ if (!buffer_locked(bh))
+ return -EINVAL;
+
+ if (buffer_uptodate(bh)) {
+ unlock_buffer(bh);
+ return 0;
+ }
+
+ get_bh(bh);
+ bh->b_end_io = end_buffer_read_sync;
+ submit_bh(READ, bh);
+ wait_on_buffer(bh);
+ if (buffer_uptodate(bh))
+ return 0;
+ return -EIO;
+}
+EXPORT_SYMBOL(bh_submit_read);
+
void __init buffer_init(void)
{
int nrpages;
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index da0d83f..e98801f 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -192,6 +192,8 @@ int sync_dirty_buffer(struct buffer_head *bh);
int submit_bh(int, struct buffer_head *);
void write_boundary_block(struct block_device *bdev,
sector_t bblock, unsigned blocksize);
+int bh_uptodate_or_lock(struct buffer_head *bh);
+int bh_submit_read(struct buffer_head *bh);

extern int buffer_heads_over_limit;
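
The locking contract of the two updated helpers can be modelled in user space. This is a single-threaded sketch under assumptions: struct fake_bh, its io_ok flag and every fake_* function are hypothetical stand-ins, and the "I/O" is just a flag assignment that unlocks the buffer the way read completion does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct buffer_head. */
struct fake_bh {
	bool uptodate;
	bool locked;
	bool io_ok;	/* whether the simulated read succeeds */
};

static void fake_lock(struct fake_bh *bh)
{
	assert(!bh->locked);
	bh->locked = true;
}

static void fake_unlock(struct fake_bh *bh)
{
	assert(bh->locked);
	bh->locked = false;
}

/* Returns 1 if up to date (unlocked), 0 if not (buffer left locked). */
static int fake_uptodate_or_lock(struct fake_bh *bh)
{
	if (!bh->uptodate) {
		fake_lock(bh);
		if (!bh->uptodate)
			return 0;
		fake_unlock(bh);
	}
	return 1;
}

/* Needs a locked buffer; returns with it unlocked except for -EINVAL. */
static int fake_submit_read(struct fake_bh *bh)
{
	if (!bh->locked)
		return -EINVAL;
	if (bh->uptodate) {
		fake_unlock(bh);
		return 0;
	}
	/* stand-in for submit_bh() + wait_on_buffer(); completion unlocks */
	bh->uptodate = bh->io_ok;
	fake_unlock(bh);
	return bh->uptodate ? 0 : -EIO;
}

/* The intended caller pattern: test first, read only if needed. */
static int fake_read(struct fake_bh *bh)
{
	if (fake_uptodate_or_lock(bh))
		return 0;
	return fake_submit_read(bh);
}
```

In this model every path through fake_read() leaves the buffer unlocked, which is the property the review comments above were asking for.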

2008-01-24 05:29:26

by Aneesh Kumar K.V

Subject: Re: [PATCH 30/49] ext4: Convert truncate_mutex to read write semaphore.

On Wed, Jan 23, 2008 at 02:06:59PM -0800, Andrew Morton wrote:
> > On Mon, 21 Jan 2008 22:02:09 -0500 "Theodore Ts'o" <[email protected]> wrote:
> > +int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
> > + unsigned long max_blocks, struct buffer_head *bh,
> > + int create, int extend_disksize)
> > +{
> > + int retval;
> > + if (create) {
> > + down_write((&EXT4_I(inode)->i_data_sem));
> > + } else {
> > + down_read((&EXT4_I(inode)->i_data_sem));
> > + }
> > + if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) {
> > + retval = ext4_ext_get_blocks(handle, inode, block, max_blocks,
> > + bh, create, extend_disksize);
> > + } else {
> > + retval = ext4_get_blocks_handle(handle, inode, block,
> > + max_blocks, bh, create, extend_disksize);
> > + }
> > + if (create) {
> > + up_write((&EXT4_I(inode)->i_data_sem));
> > + } else {
> > + up_read((&EXT4_I(inode)->i_data_sem));
> > + }
>
> This function has many unneeded braces. checkpatch used to detect this
> but it seems to have broken.

The follow-up patch "ext4: Take read lock during overwrite case" removes
those single-line if statements.


>
> > + return retval;
> > +}
> > static int ext4_get_block(struct inode *inode, sector_t iblock,
> > struct buffer_head *bh_result, int create)
>
> Missing newline.

Fixed.

-aneesh

2008-01-24 05:57:30

by Aneesh Kumar K.V

Subject: Re: [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl

On Wed, Jan 23, 2008 at 02:07:16PM -0800, Andrew Morton wrote:
> > On Mon, 21 Jan 2008 22:02:15 -0500 "Theodore Ts'o" <[email protected]> wrote:
> > The below patch add ioctl for migrating ext3 indirect block mapped inode
> > to ext4 extent mapped inode.
>
> This patch adds lots of weird and inexplicable single- and double-newlines
> in inappropriate places. However it frequently forgets to add newlines
> between end-of-locals and start-of-code, which is usual practice.
>
>
> +struct list_blocks_struct {
> + ext4_lblk_t first_block, last_block;
> + ext4_fsblk_t first_pblock, last_pblock;
> +};
>

Updated patch

commit c4786b67cdc5b24d2548a69b62774fb54f8f1575
Author: Aneesh Kumar K.V <[email protected]>
Date: Tue Jan 22 09:28:55 2008 +0530

ext4: Add EXT4_IOC_MIGRATE ioctl

The below patch add ioctl for migrating ext3 indirect block mapped inode
to ext4 extent mapped inode.

Signed-off-by: Aneesh Kumar K.V <[email protected]>

diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index ae6e7e5..d5fd80b 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_EXT4DEV_FS) += ext4dev.o

ext4dev-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o \
- ext4_jbd2.o
+ ext4_jbd2.o migrate.o

ext4dev-$(CONFIG_EXT4DEV_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext4dev-$(CONFIG_EXT4DEV_FS_POSIX_ACL) += acl.o
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 03d1bbb..323cd76 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -75,7 +75,7 @@ static ext4_fsblk_t idx_pblock(struct ext4_extent_idx *ix)
* stores a large physical block number into an extent struct,
* breaking it into parts
*/
-static void ext4_ext_store_pblock(struct ext4_extent *ex, ext4_fsblk_t pb)
+void ext4_ext_store_pblock(struct ext4_extent *ex, ext4_fsblk_t pb)
{
ex->ee_start_lo = cpu_to_le32((unsigned long) (pb & 0xffffffff));
ex->ee_start_hi = cpu_to_le16((unsigned long) ((pb >> 31) >> 1) & 0xffff);
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index c0e5b8c..2ed7c37 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -254,6 +254,9 @@ flags_err:
return err;
}

+ case EXT4_IOC_MIGRATE:
+ return ext4_ext_migrate(inode, filp, cmd, arg);
+
default:
return -ENOTTY;
}
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
new file mode 100644
index 0000000..deb2327
--- /dev/null
+++ b/fs/ext4/migrate.c
@@ -0,0 +1,588 @@
+/*
+ * Copyright IBM Corporation, 2007
+ * Author Aneesh Kumar K.V <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/ext4_jbd2.h>
+#include <linux/ext4_fs_extents.h>
+
+/*
+ * The contiguous blocks details which can be
+ * represented by a single extent
+ */
+struct list_blocks_struct {
+ ext4_lblk_t first_block, last_block;
+ ext4_fsblk_t first_pblock, last_pblock;
+};
+
+static int finish_range(handle_t *handle, struct inode *inode,
+ struct list_blocks_struct *lb)
+
+{
+ int retval = 0, needed;
+ struct ext4_extent newext;
+ struct ext4_ext_path *path;
+ if (lb->first_pblock == 0)
+ return 0;
+
+ /* Add the extent to temp inode*/
+ newext.ee_block = cpu_to_le32(lb->first_block);
+ newext.ee_len = cpu_to_le16(lb->last_block - lb->first_block + 1);
+ ext4_ext_store_pblock(&newext, lb->first_pblock);
+ path = ext4_ext_find_extent(inode, lb->first_block, NULL);
+
+ if (IS_ERR(path)) {
+ retval = PTR_ERR(path);
+ goto err_out;
+ }
+
+ /*
+ * Calculate the credits needed to insert this extent.
+ * Since we are doing this in a loop we may accumulate extra
+ * credits, but below we try not to accumulate too many
+ * of them by restarting the journal.
+ */
+ needed = ext4_ext_calc_credits_for_insert(inode, path);
+
+ /*
+ * Make sure the credit we accumulated is not really high
+ */
+ if (needed && handle->h_buffer_credits >= EXT4_RESERVE_TRANS_BLOCKS) {
+
+ retval = ext4_journal_restart(handle, needed);
+ if (retval)
+ goto err_out;
+ }
+ if (needed) {
+ retval = ext4_journal_extend(handle, needed);
+ if (retval != 0) {
+ /*
+ * If we are not able to extend the journal, restart it
+ */
+ retval = ext4_journal_restart(handle, needed);
+ if (retval)
+ goto err_out;
+ }
+ }
+ retval = ext4_ext_insert_extent(handle, inode, path, &newext);
+
+err_out:
+ lb->first_pblock = 0;
+ return retval;
+}
+
+static int update_extent_range(handle_t *handle, struct inode *inode,
+ ext4_fsblk_t pblock, ext4_lblk_t blk_num,
+ struct list_blocks_struct *lb)
+{
+ int retval;
+ /*
+ * See if we can add on to the existing range (if it exists)
+ */
+ if (lb->first_pblock &&
+ (lb->last_pblock+1 == pblock) &&
+ (lb->last_block+1 == blk_num)) {
+ lb->last_pblock = pblock;
+ lb->last_block = blk_num;
+ return 0;
+ }
+ /*
+ * Start a new range.
+ */
+ retval = finish_range(handle, inode, lb);
+ lb->first_pblock = lb->last_pblock = pblock;
+ lb->first_block = lb->last_block = blk_num;
+
+ return retval;
+
+}
+
+static int update_ind_extent_range(handle_t *handle, struct inode *inode,
+ ext4_fsblk_t pblock, ext4_lblk_t *blk_nump,
+ struct list_blocks_struct *lb)
+{
+ struct buffer_head *bh;
+ __le32 *i_data;
+ int i, retval = 0;
+ ext4_lblk_t blk_count = *blk_nump;
+ unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
+
+ if (!pblock) {
+ /* Only update the file block number */
+ *blk_nump += max_entries;
+ return 0;
+ }
+
+ bh = sb_bread(inode->i_sb, pblock);
+ if (!bh)
+ return -EIO;
+
+ i_data = (__le32 *)bh->b_data;
+
+ for (i = 0; i < max_entries; i++, blk_count++) {
+ if (i_data[i]) {
+ retval = update_extent_range(handle, inode,
+ le32_to_cpu(i_data[i]),
+ blk_count, lb);
+ if (retval)
+ break;
+ }
+ }
+
+ /* Update the file block number */
+ *blk_nump = blk_count;
+ brelse(bh);
+ return retval;
+
+}
+
+static int update_dind_extent_range(handle_t *handle, struct inode *inode,
+ ext4_fsblk_t pblock, ext4_lblk_t *blk_nump,
+ struct list_blocks_struct *lb)
+{
+ struct buffer_head *bh;
+ __le32 *i_data;
+ int i, retval = 0;
+ ext4_lblk_t blk_count = *blk_nump;
+ unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
+
+ if (!pblock) {
+ /* Only update the file block number */
+ *blk_nump += max_entries * max_entries;
+ return 0;
+ }
+ bh = sb_bread(inode->i_sb, pblock);
+ if (!bh)
+ return -EIO;
+
+ i_data = (__le32 *)bh->b_data;
+ for (i = 0; i < max_entries; i++) {
+ if (i_data[i]) {
+ retval = update_ind_extent_range(handle, inode,
+ le32_to_cpu(i_data[i]),
+ &blk_count, lb);
+ if (retval)
+ break;
+ } else {
+ /* Only update the file block number */
+ blk_count += max_entries;
+ }
+ }
+
+ /* Update the file block number */
+ *blk_nump = blk_count;
+ brelse(bh);
+ return retval;
+
+}
+
+static int update_tind_extent_range(handle_t *handle, struct inode *inode,
+ ext4_fsblk_t pblock, ext4_lblk_t *blk_nump,
+ struct list_blocks_struct *lb)
+{
+ struct buffer_head *bh;
+ __le32 *i_data;
+ int i, retval = 0;
+ ext4_lblk_t blk_count = *blk_nump;
+ unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
+
+ if (!pblock) {
+ /* Only update the file block number */
+ *blk_nump += max_entries * max_entries * max_entries;
+ return 0;
+ }
+ bh = sb_bread(inode->i_sb, pblock);
+ if (!bh)
+ return -EIO;
+
+ i_data = (__le32 *)bh->b_data;
+ for (i = 0; i < max_entries; i++) {
+ if (i_data[i]) {
+ retval = update_dind_extent_range(handle, inode,
+ le32_to_cpu(i_data[i]),
+ &blk_count, lb);
+ if (retval)
+ break;
+ } else {
+ /* Only update the file block number */
+ blk_count += max_entries * max_entries;
+ }
+ }
+ /* Update the file block number */
+ *blk_nump = blk_count;
+ brelse(bh);
+ return retval;
+
+}
+
+static int free_dind_blocks(handle_t *handle,
+ struct inode *inode, __le32 i_data)
+{
+ int i;
+ __le32 *tmp_idata;
+ struct buffer_head *bh;
+ unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
+
+ bh = sb_bread(inode->i_sb, le32_to_cpu(i_data));
+ if (!bh)
+ return -EIO;
+
+ tmp_idata = (__le32 *)bh->b_data;
+ for (i = 0; i < max_entries; i++) {
+ if (tmp_idata[i]) {
+ ext4_free_blocks(handle, inode,
+ le32_to_cpu(tmp_idata[i]), 1);
+ }
+ }
+ brelse(bh);
+ ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1);
+ return 0;
+}
+
+static int free_tind_blocks(handle_t *handle,
+ struct inode *inode, __le32 i_data)
+{
+ int i, retval = 0;
+ __le32 *tmp_idata;
+ struct buffer_head *bh;
+ unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
+
+ bh = sb_bread(inode->i_sb, le32_to_cpu(i_data));
+ if (!bh)
+ return -EIO;
+
+ tmp_idata = (__le32 *)bh->b_data;
+ for (i = 0; i < max_entries; i++) {
+ if (tmp_idata[i]) {
+ retval = free_dind_blocks(handle,
+ inode, tmp_idata[i]);
+ if (retval) {
+ brelse(bh);
+ return retval;
+ }
+ }
+ }
+ brelse(bh);
+ ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1);
+ return 0;
+}
+
+static int free_ind_block(handle_t *handle, struct inode *inode)
+{
+ int retval;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+
+ if (ei->i_data[EXT4_IND_BLOCK]) {
+ ext4_free_blocks(handle, inode,
+ le32_to_cpu(ei->i_data[EXT4_IND_BLOCK]), 1);
+ }
+
+ if (ei->i_data[EXT4_DIND_BLOCK]) {
+ retval = free_dind_blocks(handle, inode,
+ ei->i_data[EXT4_DIND_BLOCK]);
+ if (retval)
+ return retval;
+ }
+
+ if (ei->i_data[EXT4_TIND_BLOCK]) {
+ retval = free_tind_blocks(handle, inode,
+ ei->i_data[EXT4_TIND_BLOCK]);
+ if (retval)
+ return retval;
+ }
+ return 0;
+}
+
+static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
+ struct inode *tmp_inode, int retval)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_inode_info *tmp_ei = EXT4_I(tmp_inode);
+
+ retval = free_ind_block(handle, inode);
+ if (retval)
+ goto err_out;
+
+ /*
+ * One credit accounted for writing the
+ * i_data field of the original inode
+ */
+ retval = ext4_journal_extend(handle, 1);
+ if (retval != 0) {
+ retval = ext4_journal_restart(handle, 1);
+ if (retval)
+ goto err_out;
+ }
+
+ /*
+ * We have the extent map built with the tmp inode.
+ * Now copy the i_data across
+ */
+ ei->i_flags |= EXT4_EXTENTS_FL;
+ memcpy(ei->i_data, tmp_ei->i_data, sizeof(ei->i_data));
+
+ /*
+ * Update i_blocks with the new blocks that got
+ * allocated while adding extents for extent index
+ * blocks.
+ *
+ * While converting to extents we need not
+ * update the original inode i_blocks for extent blocks
+ * via quota APIs. The quota update happened via tmp_inode already.
+ */
+ spin_lock(&inode->i_lock);
+ inode->i_blocks += tmp_inode->i_blocks;
+ spin_unlock(&inode->i_lock);
+
+ ext4_mark_inode_dirty(handle, inode);
+err_out:
+ return retval;
+}
+
+/* Will go away */
+static ext4_fsblk_t idx_pblock(struct ext4_extent_idx *ix)
+{
+ ext4_fsblk_t block;
+
+ block = le32_to_cpu(ix->ei_leaf_lo);
+ block |= ((ext4_fsblk_t) le16_to_cpu(ix->ei_leaf_hi) << 31) << 1;
+ return block;
+}
+
+static int free_ext_idx(handle_t *handle, struct inode *inode,
+ struct ext4_extent_idx *ix)
+{
+ int i, retval = 0;
+ ext4_fsblk_t block;
+ struct buffer_head *bh;
+ struct ext4_extent_header *eh;
+
+ block = idx_pblock(ix);
+ bh = sb_bread(inode->i_sb, block);
+ if (!bh)
+ return -EIO;
+
+ eh = (struct ext4_extent_header *)bh->b_data;
+ if (eh->eh_depth == 0) {
+ brelse(bh);
+ ext4_free_blocks(handle, inode, block, 1);
+ } else {
+ ix = EXT_FIRST_INDEX(eh);
+ for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
+ retval = free_ext_idx(handle, inode, ix);
+ if (retval)
+ return retval;
+ }
+ }
+ return retval;
+}
+
+/*
+ * Free the extent meta data blocks only
+ */
+static int free_ext_block(handle_t *handle, struct inode *inode)
+{
+ int i, retval = 0;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_extent_header *eh = (struct ext4_extent_header *)ei->i_data;
+ struct ext4_extent_idx *ix;
+ if (eh->eh_depth == 0) {
+ /*
+ * No extra blocks allocated for extent meta data
+ */
+ return 0;
+ }
+ ix = EXT_FIRST_INDEX(eh);
+ for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
+ retval = free_ext_idx(handle, inode, ix);
+ if (retval)
+ return retval;
+ }
+ return retval;
+
+}
+
+int ext4_ext_migrate(struct inode *inode, struct file *filp,
+ unsigned int cmd, unsigned long arg)
+{
+ handle_t *handle;
+ int retval = 0, i;
+ __le32 *i_data;
+ ext4_lblk_t blk_count = 0;
+ struct ext4_inode_info *ei;
+ struct inode *tmp_inode = NULL;
+ struct list_blocks_struct lb;
+ unsigned long max_entries;
+
+ if (!test_opt(inode->i_sb, EXTENTS)) {
+ /*
+ * if mounted with noextents
+ * we don't allow the migrate
+ */
+ return -EINVAL;
+ }
+
+ if ((EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+ return -EINVAL;
+
+ down_write(&EXT4_I(inode)->i_data_sem);
+ handle = ext4_journal_start(inode,
+ EXT4_DATA_TRANS_BLOCKS(inode->i_sb) +
+ EXT4_INDEX_EXTRA_TRANS_BLOCKS + 3 +
+ 2 * EXT4_QUOTA_INIT_BLOCKS(inode->i_sb)
+ + 1);
+ if (IS_ERR(handle)) {
+ retval = PTR_ERR(handle);
+ goto err_out;
+ }
+ tmp_inode = ext4_new_inode(handle,
+ inode->i_sb->s_root->d_inode,
+ S_IFREG);
+ if (IS_ERR(tmp_inode)) {
+ retval = -ENOMEM;
+ ext4_journal_stop(handle);
+ tmp_inode = NULL;
+ goto err_out;
+ }
+ i_size_write(tmp_inode, i_size_read(inode));
+ /*
+ * We don't want the inode to be reclaimed
+ * if we got interrupted in between. We have
+ * this tmp inode carrying reference to the
+ * data blocks of the original file. We set
+ * the i_nlink to zero at the last stage after
+ * switching the original file to extent format
+ */
+ tmp_inode->i_nlink = 1;
+
+ ext4_ext_tree_init(handle, tmp_inode);
+ ext4_orphan_add(handle, tmp_inode);
+ ext4_journal_stop(handle);
+
+ ei = EXT4_I(inode);
+ i_data = ei->i_data;
+ memset(&lb, 0, sizeof(lb));
+
+ /* 32 bit block address 4 bytes */
+ max_entries = inode->i_sb->s_blocksize >> 2;
+
+ /*
+ * start with one credit accounted for
+ * superblock modification.
+ *
+ * For the tmp_inode we have already committed the
+ * transaction that created the inode. Later, as and
+ * when we add extents, we extend the journal
+ */
+ handle = ext4_journal_start(inode, 1);
+ for (i = 0; i < EXT4_NDIR_BLOCKS; i++, blk_count++) {
+ if (i_data[i]) {
+ retval = update_extent_range(handle, tmp_inode,
+ le32_to_cpu(i_data[i]),
+ blk_count, &lb);
+ if (retval)
+ goto err_out;
+ }
+ }
+ if (i_data[EXT4_IND_BLOCK]) {
+ retval = update_ind_extent_range(handle, tmp_inode,
+ le32_to_cpu(i_data[EXT4_IND_BLOCK]),
+ &blk_count, &lb);
+ if (retval)
+ goto err_out;
+ } else {
+ blk_count += max_entries;
+ }
+ if (i_data[EXT4_DIND_BLOCK]) {
+ retval = update_dind_extent_range(handle, tmp_inode,
+ le32_to_cpu(i_data[EXT4_DIND_BLOCK]),
+ &blk_count, &lb);
+ if (retval)
+ goto err_out;
+ } else {
+ blk_count += max_entries * max_entries;
+ }
+ if (i_data[EXT4_TIND_BLOCK]) {
+ retval = update_tind_extent_range(handle, tmp_inode,
+ le32_to_cpu(i_data[EXT4_TIND_BLOCK]),
+ &blk_count, &lb);
+ if (retval)
+ goto err_out;
+ }
+ /*
+ * Build the last extent
+ */
+ retval = finish_range(handle, tmp_inode, &lb);
+err_out:
+ /*
+ * We are either freeing extent information or indirect
+ * blocks. During this we touch superblock, group descriptor
+ * and block bitmap. Later we mark the tmp_inode dirty
+ * via ext4_ext_tree_init. So allocate a credit of 4
+ * We may update quota (user and group).
+ *
+ * FIXME!! we may be touching bitmaps in different block groups.
+ */
+ if (ext4_journal_extend(handle,
+ 4 + 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb)) != 0) {
+
+ ext4_journal_restart(handle,
+ 4 + 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
+ }
+ if (retval) {
+ /*
+ * Failure case delete the extent information with the
+ * tmp_inode
+ */
+ free_ext_block(handle, tmp_inode);
+
+ } else {
+
+ retval = ext4_ext_swap_inode_data(handle, inode,
+ tmp_inode, retval);
+ }
+
+ /*
+ * Mark the tmp_inode as of size zero
+ */
+ i_size_write(tmp_inode, 0);
+
+ /*
+ * set the i_blocks count to zero
+ * so that the ext4_delete_inode does the
+ * right job
+ *
+ * We don't need to take the i_lock because
+ * the inode is not visible to user space.
+ */
+ tmp_inode->i_blocks = 0;
+
+ /* Reset the extent details */
+ ext4_ext_tree_init(handle, tmp_inode);
+
+ /*
+ * Set the i_nlink to zero so that
+ * generic_drop_inode really deletes the
+ * inode
+ */
+ tmp_inode->i_nlink = 0;
+
+ ext4_journal_stop(handle);
+
+ up_write(&EXT4_I(inode)->i_data_sem);
+
+ if (tmp_inode)
+ iput(tmp_inode);
+
+ return retval;
+}
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index b609294..213974f 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -243,6 +243,7 @@ struct ext4_new_group_data {
#endif
#define EXT4_IOC_GETRSVSZ _IOR('f', 5, long)
#define EXT4_IOC_SETRSVSZ _IOW('f', 6, long)
+#define EXT4_IOC_MIGRATE _IO('f', 7)

/*
* ioctl commands in 32 bit emulation
@@ -983,6 +984,9 @@ extern int ext4_ioctl (struct inode *, struct file *, unsigned int,
unsigned long);
extern long ext4_compat_ioctl (struct file *, unsigned int, unsigned long);

+/* migrate.c */
+extern int ext4_ext_migrate(struct inode *, struct file *, unsigned int,
+ unsigned long);
/* namei.c */
extern int ext4_orphan_add(handle_t *, struct inode *);
extern int ext4_orphan_del(handle_t *, struct inode *);
diff --git a/include/linux/ext4_fs_extents.h b/include/linux/ext4_fs_extents.h
index 023683b..db64509 100644
--- a/include/linux/ext4_fs_extents.h
+++ b/include/linux/ext4_fs_extents.h
@@ -212,6 +212,7 @@ static inline int ext4_ext_get_actual_len(struct ext4_extent *ext)
(le16_to_cpu(ext->ee_len) - EXT_INIT_MAX_LEN));
}

+extern void ext4_ext_store_pblock(struct ext4_extent *, ext4_fsblk_t);
extern int ext4_extent_tree_init(handle_t *, struct inode *);
extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
extern int ext4_ext_try_to_merge(struct inode *inode,
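
As an illustration only, the way update_extent_range() and finish_range() in migrate.c merge consecutive block pointers into extent-sized runs can be sketched in user space. The names below (block_range, range_list, close_range, add_block, coalesce) are hypothetical stand-ins, the extent tree is replaced by a fixed array, and journalling is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One run of blocks that is contiguous both logically and physically. */
struct block_range {
	uint32_t first_lblk, last_lblk;	/* logical (file) block numbers */
	uint64_t first_pblk, last_pblk;	/* physical blocks; first_pblk == 0 means no open run */
};

/* Hypothetical stand-in for the extent tree: a fixed array of finished runs. */
#define MAX_RANGES 16
struct range_list {
	struct block_range r[MAX_RANGES];
	size_t count;
};

/* Close the open run, if any, and record it (compare finish_range()). */
static void close_range(struct range_list *out, struct block_range *cur)
{
	if (cur->first_pblk == 0)
		return;
	assert(out->count < MAX_RANGES);
	out->r[out->count++] = *cur;
	cur->first_pblk = 0;
}

/* Feed one mapped block (compare update_extent_range()). */
static void add_block(struct range_list *out, struct block_range *cur,
		      uint64_t pblk, uint32_t lblk)
{
	/* Extend the open run when both numbers advance by exactly one. */
	if (cur->first_pblk &&
	    cur->last_pblk + 1 == pblk && cur->last_lblk + 1 == lblk) {
		cur->last_pblk = pblk;
		cur->last_lblk = lblk;
		return;
	}
	/* Otherwise flush the old run and start a new one. */
	close_range(out, cur);
	cur->first_pblk = cur->last_pblk = pblk;
	cur->first_lblk = cur->last_lblk = lblk;
}

/* Walk a block map in which 0 marks a hole; returns the number of runs. */
static size_t coalesce(const uint32_t *map, size_t n, struct range_list *out)
{
	struct block_range cur = {0};
	size_t i;

	out->count = 0;
	for (i = 0; i < n; i++)
		if (map[i])
			add_block(out, &cur, map[i], (uint32_t)i);
	close_range(out, &cur);	/* build the last run */
	return out->count;
}
```

Calling coalesce() on a map such as {100, 101, 102, 0, 200, 201, 103} yields three runs, because a run is broken both by the hole and by the physical discontinuity at the last block.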

2008-01-24 07:57:59

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 41/49] ext4: Add multi block allocator for ext4

On Wed, Jan 23, 2008 at 02:07:27PM -0800, Andrew Morton wrote:
> > On Mon, 21 Jan 2008 22:02:20 -0500 "Theodore Ts'o" <[email protected]> wrote:
> > From: Alex Tomas <[email protected]>
> >
> > Signed-off-by: Alex Tomas <[email protected]>
> > Signed-off-by: Andreas Dilger <[email protected]>
> > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> > Signed-off-by: Eric Sandeen <[email protected]>
> > Signed-off-by: "Theodore Ts'o" <[email protected]>
> >
> > ...
> >
> > +#if BITS_PER_LONG == 64
> > +#define mb_correct_addr_and_bit(bit, addr) \
> > +{ \
> > + bit += ((unsigned long) addr & 7UL) << 3; \
> > + addr = (void *) ((unsigned long) addr & ~7UL); \
> > +}
> > +#elif BITS_PER_LONG == 32
> > +#define mb_correct_addr_and_bit(bit, addr) \
> > +{ \
> > + bit += ((unsigned long) addr & 3UL) << 3; \
> > + addr = (void *) ((unsigned long) addr & ~3UL); \
> > +}
> > +#else
> > +#error "how many bits you are?!"
> > +#endif
>
> Why do these exist?

The initial version of mballoc was supported on 32-bit x86; this was there
to give a compile warning on 64-bit platforms. I guess we can remove that now.
Or maybe we can keep it as is, because it is harmless.


>
> > +static inline int mb_test_bit(int bit, void *addr)
> > +{
> > + mb_correct_addr_and_bit(bit, addr);
> > + return ext4_test_bit(bit, addr);
> > +}
>
> ext2_test_bit() already handles bitnum > wordsize.
>
> If mb_correct_addr_and_bit() is actually needed then some suitable comment
> would help.

ext4_test_bit on powerpc needs the addr to be 8-byte aligned. Otherwise
it fails.
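
A rough user-space sketch of what the correction does, assuming byte-wise little-endian bit numbering (bit n lives in byte n / 8). The helper names here are made up for illustration and are not the kernel's bitops; the point is only that rounding the address down to a word boundary and adding the skipped bits to the bit number names the same bit.

```c
#include <assert.h>
#include <stdint.h>

/* Little-endian bit numbering: bit n lives in byte n / 8, at position n % 8. */
static int test_bit_le(unsigned int bit, const void *addr)
{
	const unsigned char *p = addr;

	return (p[bit >> 3] >> (bit & 7)) & 1;
}

/*
 * Round *addr down to a multiple of sizeof(unsigned long) and add the
 * bits that were skipped to *bit, so the pair still names the same bit.
 */
static void correct_addr_and_bit(unsigned int *bit, const void **addr)
{
	uintptr_t a = (uintptr_t)*addr;
	uintptr_t mask = sizeof(unsigned long) - 1;

	*bit += (unsigned int)(a & mask) << 3;
	*addr = (const void *)(a & ~mask);
}

/* Test a bit through the aligned pointer, in the style of mb_test_bit(). */
static int aligned_test_bit(unsigned int bit, const void *addr)
{
	correct_addr_and_bit(&bit, &addr);
	assert(((uintptr_t)addr & (sizeof(unsigned long) - 1)) == 0);
	return test_bit_le(bit, addr);
}
```

With a word-aligned buffer buf, aligned_test_bit(20, buf + 3) and test_bit_le(20, buf + 3) look at the same bit (bit 4 of buf[5]), but only the former hands an aligned address to the underlying bit operation.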

>
> > +static inline void mb_set_bit(int bit, void *addr)
> > +{
> > + mb_correct_addr_and_bit(bit, addr);
> > + ext4_set_bit(bit, addr);
> > +}
> > +
> > +static inline void mb_set_bit_atomic(spinlock_t *lock, int bit, void *addr)
> > +{
> > + mb_correct_addr_and_bit(bit, addr);
> > + ext4_set_bit_atomic(lock, bit, addr);
> > +}
> > +
> > +static inline void mb_clear_bit(int bit, void *addr)
> > +{
> > + mb_correct_addr_and_bit(bit, addr);
> > + ext4_clear_bit(bit, addr);
> > +}
> > +
> > +static inline void mb_clear_bit_atomic(spinlock_t *lock, int bit, void *addr)
> > +{
> > + mb_correct_addr_and_bit(bit, addr);
> > + ext4_clear_bit_atomic(lock, bit, addr);
> > +}
> > +
> > +static inline void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
>
> uninlining this will save about eighty squigabytes of text.

Fixed


>
> Please review all of ext4/jbd2 with a view to removing unnecessary and wrong
> inlinings.
>
> > +{
> > + char *bb;
> > +
> > + /* FIXME!! is this needed */
> > + BUG_ON(EXT4_MB_BITMAP(e4b) == EXT4_MB_BUDDY(e4b));
> > + BUG_ON(max == NULL);
> > +
> > + if (order > e4b->bd_blkbits + 1) {
> > + *max = 0;
> > + return NULL;
> > + }
> > +
> > + /* at order 0 we see each particular block */
> > + *max = 1 << (e4b->bd_blkbits + 3);
> > + if (order == 0)
> > + return EXT4_MB_BITMAP(e4b);
> > +
> > + bb = EXT4_MB_BUDDY(e4b) + EXT4_SB(e4b->bd_sb)->s_mb_offsets[order];
> > + *max = EXT4_SB(e4b->bd_sb)->s_mb_maxs[order];
> > +
> > + return bb;
> > +}
> > +
> >
> > ...
> >
> > +#else
> > +#define mb_free_blocks_double(a, b, c, d)
> > +#define mb_mark_used_double(a, b, c)
> > +#define mb_cmp_bitmaps(a, b)
> > +#endif
>
> Please use the do{}while(0) thing. Or, better, proper C functions which
> have typechecking (unless this will cause undefined-var compile errors,
> which happens sometimes)

Made them static inline void functions.

>
> > +/* find most significant bit */
> > +static int fmsb(unsigned short word)
> > +{
> > + int order;
> > +
> > + if (word > 255) {
> > + order = 7;
> > + word >>= 8;
> > + } else {
> > + order = -1;
> > + }
> > +
> > + do {
> > + order++;
> > + word >>= 1;
> > + } while (word != 0);
> > +
> > + return order;
> > +}
>
> Did we just reinvent fls()?

Replaced by fls().
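
For what it is worth, the equivalence can be checked in user space. In the sketch below fmsb() is copied from the patch, while fls_sketch() is a portable stand-in written for this illustration (it is not the kernel's implementation) that follows the usual fls() convention: the 1-based index of the highest set bit, and 0 for an input of 0.

```c
#include <assert.h>

/* The helper from the patch, reproduced for comparison. */
static int fmsb(unsigned short word)
{
	int order;

	if (word > 255) {
		order = 7;
		word >>= 8;
	} else {
		order = -1;
	}

	do {
		order++;
		word >>= 1;
	} while (word != 0);

	return order;
}

/* Stand-in for fls(): 1-based index of the highest set bit, 0 for x == 0. */
static int fls_sketch(unsigned int x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* Count the nonzero 16-bit inputs where fmsb(x) differs from fls(x) - 1. */
static int count_mismatches(void)
{
	unsigned int x;
	int bad = 0;

	for (x = 1; x <= 0xffff; x++)
		if (fmsb((unsigned short)x) != fls_sketch(x) - 1)
			bad++;
	return bad;
}
```

Under these assumptions fmsb(x) equals fls(x) - 1 for every nonzero 16-bit x. The two differ only at x == 0, which the caller in ext4_mb_mark_free_simple() never passes because it loops while len > 0.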

>
> > +/* FIXME!! need more doc */
> > +static void ext4_mb_mark_free_simple(struct super_block *sb,
> > + void *buddy, unsigned first, int len,
> > + struct ext4_group_info *grp)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(sb);
> > + unsigned short min;
> > + unsigned short max;
> > + unsigned short chunk;
> > + unsigned short border;
> > +
> > + BUG_ON(len >= EXT4_BLOCKS_PER_GROUP(sb));
> > +
> > + border = 2 << sb->s_blocksize_bits;
>
> Won't this explode with >= 32k blocksize?
>
> > + while (len > 0) {
> > + /* find how many blocks can be covered since this position */
> > + max = ffs(first | border) - 1;
> > +
> > + /* find how many blocks of power 2 we need to mark */
> > + min = fmsb(len);
> > +
> > + if (max < min)
> > + min = max;
> > + chunk = 1 << min;
> > +
> > + /* mark multiblock chunks only */
> > + grp->bb_counters[min]++;
> > + if (min > 0)
> > + mb_clear_bit(first >> min,
> > + buddy + sbi->s_mb_offsets[min]);
> > +
> > + len -= chunk;
> > + first += chunk;
> > + }
> > +}
> > +
> >
> > ...
> >
> > +static int ext4_mb_init_cache(struct page *page, char *incore)
> > +{
> > + int blocksize;
> > + int blocks_per_page;
> > + int groups_per_page;
> > + int err = 0;
> > + int i;
> > + ext4_group_t first_group;
> > + int first_block;
> > + struct super_block *sb;
> > + struct buffer_head *bhs;
> > + struct buffer_head **bh;
> > + struct inode *inode;
> > + char *data;
> > + char *bitmap;
> > +
> > + mb_debug("init page %lu\n", page->index);
> > +
> > + inode = page->mapping->host;
> > + sb = inode->i_sb;
> > + blocksize = 1 << inode->i_blkbits;
> > + blocks_per_page = PAGE_CACHE_SIZE / blocksize;
> > +
> > + groups_per_page = blocks_per_page >> 1;
> > + if (groups_per_page == 0)
> > + groups_per_page = 1;
> > +
> > + /* allocate buffer_heads to read bitmaps */
> > + if (groups_per_page > 1) {
> > + err = -ENOMEM;
> > + i = sizeof(struct buffer_head *) * groups_per_page;
> > + bh = kmalloc(i, GFP_NOFS);
> > + if (bh == NULL)
> > + goto out;
> > + memset(bh, 0, i);
>
> kzalloc()

Fixed

>
> > + } else
> > + bh = &bhs;
> > +
> > + first_group = page->index * blocks_per_page / 2;
> > +
> > + /* read all groups the page covers into the cache */
> > + for (i = 0; i < groups_per_page; i++) {
> > + struct ext4_group_desc *desc;
> > +
> > + if (first_group + i >= EXT4_SB(sb)->s_groups_count)
> > + break;
> > +
> > + err = -EIO;
> > + desc = ext4_get_group_desc(sb, first_group + i, NULL);
> > + if (desc == NULL)
> > + goto out;
> > +
> > + err = -ENOMEM;
> > + bh[i] = sb_getblk(sb, ext4_block_bitmap(sb, desc));
> > + if (bh[i] == NULL)
> > + goto out;
> > +
> > + if (buffer_uptodate(bh[i]))
> > + continue;
> > +
> > + lock_buffer(bh[i]);
> > + if (buffer_uptodate(bh[i])) {
> > + unlock_buffer(bh[i]);
> > + continue;
> > + }
>
> Didn't we just add a helper in fs/buffer.c to do this?
>

Fixed


> > + if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> > + ext4_init_block_bitmap(sb, bh[i],
> > + first_group + i, desc);
> > + set_buffer_uptodate(bh[i]);
> > + unlock_buffer(bh[i]);
> > + continue;
> > + }
> > + get_bh(bh[i]);
> > + bh[i]->b_end_io = end_buffer_read_sync;

[... snip... ]
> > +
> > + /* set incore so that the buddy information can be
> > + * generated using this
> > + */
> > + incore = data;
> > + }
> > + }
> > + SetPageUptodate(page);
>
> Is the page locked here?


The page is locked via find_or_create_page

>
> > +out:
> > + if (bh) {
> > + for (i = 0; i < groups_per_page && bh[i]; i++)
> > + brelse(bh[i]);
>
> put_bh()
>
> > + if (bh != &bhs)
> > + kfree(bh);
> > + }
> > + return err;
> > +}
> > +
> >
> > ...
> >
> > +static void mb_set_bits(spinlock_t *lock, void *bm, int cur, int len)
> > +{
> > + __u32 *addr;
> > +
> > + len = cur + len;
> > + while (cur < len) {
> > + if ((cur & 31) == 0 && (len - cur) >= 32) {
> > + /* fast path: clear whole word at once */
>
> s/clear/set/

Fixed
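
For readers following along, the fast path in question fills a whole 32-bit word when the range is word-aligned and at least 32 bits remain, and otherwise falls back to setting one bit at a time. A minimal user-space sketch, with the spinlock and the atomic bit operation dropped and with illustrative helper names (set_bit_le, set_bits, count_bits) that are not part of the patch:

```c
#include <assert.h>
#include <string.h>

/* Set a single bit, little-endian bit numbering within each byte. */
static void set_bit_le(int bit, void *bm)
{
	unsigned char *p = bm;

	p[bit >> 3] |= (unsigned char)(1u << (bit & 7));
}

/* Set len bits starting at bit cur, a 32-bit word at a time where possible. */
static void set_bits(void *bm, int cur, int len)
{
	int end = cur + len;

	assert(cur >= 0 && len >= 0);
	while (cur < end) {
		if ((cur & 31) == 0 && (end - cur) >= 32) {
			/* fast path: set a whole 32-bit word at once */
			memset((unsigned char *)bm + (cur >> 3), 0xff, 4);
			cur += 32;
			continue;
		}
		/* slow path: unaligned head or short tail */
		set_bit_le(cur, bm);
		cur++;
	}
}

/* Count the set bits among the first nbits of the bitmap. */
static int count_bits(const void *bm, int nbits)
{
	const unsigned char *p = bm;
	int i, n = 0;

	for (i = 0; i < nbits; i++)
		n += (p[i >> 3] >> (i & 7)) & 1;
	return n;
}
```

Setting 70 bits from bit 30, for example, takes the slow path for bits 30 and 31, the fast path for the two words covering bits 32 to 95, and the slow path again for bits 96 to 99.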

>
> > + addr = bm + (cur >> 3);
> > + *addr = 0xffffffff;
> > + cur += 32;
> > + continue;
> > + }
> > + mb_set_bit_atomic(lock, cur, bm);
> > + cur++;
> > + }
> > +}
> > +
> >
> > ...
> >
> > +static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
> > + ext4_group_t group)
> > +{
> > + struct ext4_group_info *grp = ext4_get_group_info(sb, group);
> > + struct ext4_prealloc_space *pa;
> > + struct list_head *cur;
> > + ext4_group_t groupnr;
> > + ext4_grpblk_t start;
> > + int preallocated = 0;
> > + int count = 0;
> > + int len;
> > +
> > + /* all form of preallocation discards first load group,
> > + * so the only competing code is preallocation use.
> > + * we don't need any locking here
> > + * notice we do NOT ignore preallocations with pa_deleted
> > + * otherwise we could leave used blocks available for
> > + * allocation in buddy when concurrent ext4_mb_put_pa()
> > + * is dropping preallocation
> > + */
> > + list_for_each_rcu(cur, &grp->bb_prealloc_list) {
> > + pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
> > + spin_lock(&pa->pa_lock);
> > + ext4_get_group_no_and_offset(sb, pa->pa_pstart,
> > + &groupnr, &start);
> > + len = pa->pa_len;
> > + spin_unlock(&pa->pa_lock);
> > + if (unlikely(len == 0))
> > + continue;
> > + BUG_ON(groupnr != group);
> > + mb_set_bits(sb_bgl_lock(EXT4_SB(sb), group),
> > + bitmap, start, len);
> > + preallocated += len;
> > + count++;
> > + }
>
> Seems to be missing rcu_read_lock()
>


bb_prealloc_list is actually modified under ext4_group_lock, so it is
not actually an RCU list. I think we should be using list_for_each there.
The RCU-managed lists are i_prealloc_list and lg_prealloc_list.


> > + mb_debug("prellocated %u for group %lu\n", preallocated, group);
> > +}
> > +
> > +static void ext4_mb_pa_callback(struct rcu_head *head)
> > +{
> > + struct ext4_prealloc_space *pa;
> > + pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu);
> > + kmem_cache_free(ext4_pspace_cachep, pa);
> > +}
> > +#define mb_call_rcu(__pa) call_rcu(&(__pa)->u.pa_rcu, ext4_mb_pa_callback)
>
> Is there any reason why this had to be implemented as a macro?

Fixed

>
> >
> > ...
> >
> > +static int ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
> > +{
> > + struct super_block *sb = ac->ac_sb;
> > + struct ext4_prealloc_space *pa;
> > + struct ext4_group_info *grp;
> > + struct ext4_inode_info *ei;
> > +
> > + /* preallocate only when found space is larger then requested */
> > + BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len);
> > + BUG_ON(ac->ac_status != AC_STATUS_FOUND);
> > + BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
> > +
> > + pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
>
> Do all the GFP_NOFS's in this code really need to be GFP_NOFS?
>
> > + if (pa == NULL)
> > + return -ENOMEM;
> > + ext4_lock_group(sb, ac->ac_b_ex.fe_group);

....

> > + list_add_rcu(&pa->pa_group_list, &grp->bb_prealloc_list);
> > + ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
> > +
> > + spin_lock(pa->pa_obj_lock);
> > + list_add_rcu(&pa->pa_inode_list, &ei->i_prealloc_list);
> > + spin_unlock(pa->pa_obj_lock);
>
> hm. Strange to see list_add_rcu() inside spinlock like this.

A few lines above we have
pa->pa_obj_lock = &ei->i_prealloc_lock;
so the spin_lock is there to prevent multiple CPUs from adding to the
prealloc list at the same time.


>
> > + return 0;
> > +}
> > +
> >
> > ...
> >
> > +static int ext4_mb_discard_group_preallocations(struct super_block *sb,
> > + ext4_group_t group, int needed)
> > +{
> > + struct ext4_group_info *grp = ext4_get_group_info(sb, group);
> > + struct buffer_head *bitmap_bh = NULL;
> > + struct ext4_prealloc_space *pa, *tmp;
> > + struct list_head list;
> > + struct ext4_buddy e4b;
> > + int err;
> > + int busy = 0;
> > + int free = 0;
> > + /* seems this one can be freed ... */

....

> > + pa->pa_deleted = 1;
> > +
> > + /* we can trust pa_free ... */
> > + free += pa->pa_free;
> > +
> > + spin_unlock(&pa->pa_lock);
> > +
> > + list_del_rcu(&pa->pa_group_list);
> > + list_add(&pa->u.pa_tmp_list, &list);
> > + }
>
> Strange to see rcu operations outside rcu_read_lock().

That need not actually be list_del_rcu. As I stated above, that is
the bb_prealloc_list, which is updated under ext4_group_lock.



>
> > + /* if we still need more blocks and some PAs were used, try again */
> > + if (free < needed && busy) {
> > + busy = 0;
> > + ext4_unlock_group(sb, group);
> > + /*
> > + * Yield the CPU here so that we don't get soft lockup
> > + * in non preempt case.
> > + */
> > + yield();
>
> argh, no, yield() is basically unusable. schedule_timeout(1) is preferable.

I actually schedule_timeout(HZ); This was actually a bug fix for a soft
lockup that happened when we were running a non-preemptible kernel. We
just want to make sure the high-priority watchdog thread gets a chance
to run, and if there are no high-priority threads we would like to keep
running ourselves. My understanding was that yield() is the right choice
there.


>
> Please test this code when there are lots of cpu-intensive tasks running.
>
> > + goto repeat;
> > + }
> > +
> > + /* found anything to free? */
> > + if (list_empty(&list)) {
> > + BUG_ON(free != 0);
> > + goto out;
> > + }
> > +
> > + /* now free all selected PAs */
> > + if (atomic_read(&pa->pa_count)) {
> > + /* this shouldn't happen often - nobody should

.....


> > + * use preallocation while we're discarding it */
> > + spin_unlock(&pa->pa_lock);
> > + spin_unlock(&ei->i_prealloc_lock);
> > + printk(KERN_ERR "uh-oh! used pa while discarding\n");
> > + dump_stack();
>
> WARN_ON(1) would be more conventional.

Fixed

>
> > + current->state = TASK_UNINTERRUPTIBLE;
> > + schedule_timeout(HZ);
>
> schedule_timeout_uninterruptible()
>

Fixed

> > + goto repeat;
> > +
> > + }
> > + if (pa->pa_deleted == 0) {
> > + pa->pa_deleted = 1;
> > + spin_unlock(&pa->pa_lock);
> > + list_del_rcu(&pa->pa_inode_list);
> > + list_add(&pa->u.pa_tmp_list, &list);
> > + continue;
> > + }
> > +
> > + /* someone is deleting pa right now */
> > + spin_unlock(&pa->pa_lock);
> > + spin_unlock(&ei->i_prealloc_lock);
> > +
> > + /* we have to wait here because pa_deleted
> > + * doesn't mean pa is already unlinked from
> > + * the list. as we might be called from
> > + * ->clear_inode() the inode will get freed
> > + * and concurrent thread which is unlinking
> > + * pa from inode's list may access already
> > + * freed memory, bad-bad-bad */
> > +
> > + /* XXX: if this happens too often, we can
> > + * add a flag to force wait only in case
> > + * of ->clear_inode(), but not in case of
> > + * regular truncate */
> > + current->state = TASK_UNINTERRUPTIBLE;
> > + schedule_timeout(HZ);
>
> ditto
>

Fixed

> > + goto repeat;
> > + }
> > + spin_unlock(&ei->i_prealloc_lock);
> > +
> > + list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
> > + BUG_ON(pa->pa_linear != 0);
> > + ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, NULL);
> > +
> > + err = ext4_mb_load_buddy(sb, group, &e4b);
> > + BUG_ON(err != 0); /* error handling here */
> > +
> > + bitmap_bh = read_block_bitmap(sb, group);
> > + if (bitmap_bh == NULL) {
> > + /* error handling here */
> > + ext4_mb_release_desc(&e4b);
> > + BUG_ON(bitmap_bh == NULL);
> > + }
> > +
> > + ext4_lock_group(sb, group);
> > + list_del_rcu(&pa->pa_group_list);
> > + ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
> > + ext4_unlock_group(sb, group);
> > +
> > + ext4_mb_release_desc(&e4b);
> > + brelse(bitmap_bh);
> > +
> > + list_del(&pa->u.pa_tmp_list);
> > + mb_call_rcu(pa);
> > + }
> > +}
>
> Would be nice to ask Paul to review all the rcu usage in here. It looks odd.
>

Will add Paul to the CC

> >
> > ...
> >
> > +#else
> > +#define ext4_mb_show_ac(x)
> > +#endif
>
> static inlined C functions are preferred (+1e6 dittoes)

Fixed

>
> > +/*
> > + * We use locality group preallocation for small size file. The size of the
> > + * file is determined by the current size or the resulting size after
> > + * allocation which ever is larger
> > + *
> > + * One can tune this size via /proc/fs/ext4/<partition>/stream_req
> > + */
> > +static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> > + int bsbits = ac->ac_sb->s_blocksize_bits;
> > + loff_t size, isize;
> > +
> > + if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
> > + return;
> > +
> > + size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
> > + isize = i_size_read(ac->ac_inode) >> bsbits;
> > + if (size < isize)
> > + size = isize;
>
> min()?
>

updated as size = max(size, isize);


> > + /* don't use group allocation for large files */
> > + if (size >= sbi->s_mb_stream_request)
> > + return;
> > +
> > + if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
> > + return;
> > +
> > + BUG_ON(ac->ac_lg != NULL);
> > + ac->ac_lg = &sbi->s_locality_groups[get_cpu()];
> > + put_cpu();
>
> Strange-looking code. I'd be interested in a description of the per-cpu
> design here.

I added the below doc


/*
* locality group prealloc space are per cpu. The reason for
* having per cpu locality group is to reduce the contention
* between block request from multiple CPUs.
*/




>
> > + /* we're going to use group allocation */
> > + ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;
> > +
> > + /* serialize all allocations in the group */
> > + down(&ac->ac_lg->lg_sem);
>
> This should be a mutex, shouldn't it?
>

converted to mutex


> > +}
> > +
> >
> > ...
> >
> > +static int ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
> > + ext4_group_t group, ext4_grpblk_t block, int count)
> > +{
> > + struct ext4_group_info *db = e4b->bd_info;
> > + struct super_block *sb = e4b->bd_sb;
> > + struct ext4_sb_info *sbi = EXT4_SB(sb);
> > + struct ext4_free_metadata *md;
> > + int i;
> > +
> > + BUG_ON(e4b->bd_bitmap_page == NULL);
> > + BUG_ON(e4b->bd_buddy_page == NULL);
> > +
> > + ext4_lock_group(sb, group);
> > + for (i = 0; i < count; i++) {
> > + md = db->bb_md_cur;
> > + if (md && db->bb_tid != handle->h_transaction->t_tid) {
> > + db->bb_md_cur = NULL;
> > + md = NULL;
> > + }
> > +
> > + if (md == NULL) {
> > + ext4_unlock_group(sb, group);
> > + md = kmalloc(sizeof(*md), GFP_KERNEL);
>
> Why was this one not GFP_NOFS?
>
> > + if (md == NULL)
> > + return -ENOMEM;
>
> Did we just leak some memory?
>

No, the data is allocated to carry information regarding the free blocks.


> > + md->num = 0;
> > + md->group = group;
> > +
> > + ext4_lock_group(sb, group);
> > + if (db->bb_md_cur == NULL) {
> > + spin_lock(&sbi->s_md_lock);
> > + list_add(&md->list, &sbi->s_active_transaction);
> > + spin_unlock(&sbi->s_md_lock);
> > + /* protect buddy cache from being freed,
> > + * otherwise we'll refresh it from
> > + * on-disk bitmap and lose not-yet-available
> > + * blocks */
> > + page_cache_get(e4b->bd_buddy_page);
> > + page_cache_get(e4b->bd_bitmap_page);
> > + db->bb_md_cur = md;
> > + db->bb_tid = handle->h_transaction->t_tid;
> > + mb_debug("new md 0x%p for group %lu\n",
> > + md, md->group);
> > + } else {
> > + kfree(md);
> > + md = db->bb_md_cur;
> > + }
> > + }
> > +
> > + BUG_ON(md->num >= EXT4_BB_MAX_BLOCKS);
> > + md->blocks[md->num] = block + i;
> > + md->num++;
> > + if (md->num == EXT4_BB_MAX_BLOCKS) {
> > + /* no more space, put full container on a sb's list */
> > + db->bb_md_cur = NULL;
> > + }
> > + }
> > + ext4_unlock_group(sb, group);
> > + return 0;
> > +}
> > +
> >
> > ...
> >
> > + case Opt_mballoc:
> > + set_opt(sbi->s_mount_opt, MBALLOC);
> > + break;
> > + case Opt_nomballoc:
> > + clear_opt(sbi->s_mount_opt, MBALLOC);
> > + break;
> > + case Opt_stripe:
> > + if (match_int(&args[0], &option))
> > + return 0;
> > + if (option < 0)
> > + return 0;
> > + sbi->s_stripe = option;
> > + break;
>
> These appear to be undocumented.

Updated


>
> > default:
> > printk (KERN_ERR
> > "EXT4-fs: Unrecognized mount option \"%s\" "
> > @@ -1742,6 +1762,33 @@ static ext4_fsblk_t descriptor_loc(struct super_block *sb,
> > return (has_super + ext4_group_first_block_no(sb, bg));
> > }
> >
> > +/**
> > + * ext4_get_stripe_size: Get the stripe size.
> > + * @sbi: In memory super block info
> > + *
> > + * If we have specified it via mount option, then
> > + * use the mount option value. If the value specified at mount time is
> > + * greater than the blocks per group use the super block value.
> > + * If the super block value is greater than blocks per group return 0.
> > + * Allocator needs it be less than blocks per group.
> > + *
> > + */
> > +static unsigned long ext4_get_stripe_size(struct ext4_sb_info *sbi)
> > +{
> > + unsigned long stride = le16_to_cpu(sbi->s_es->s_raid_stride);
> > + unsigned long stripe_width =
> > + le32_to_cpu(sbi->s_es->s_raid_stripe_width);
> > +
> > + if (sbi->s_stripe && sbi->s_stripe <= sbi->s_blocks_per_group) {
> > + return sbi->s_stripe;
> > + } else if (stripe_width <= sbi->s_blocks_per_group) {
> > + return stripe_width;
> > + } else if (stride <= sbi->s_blocks_per_group) {
> > + return stride;
> > + }
>
> unneeded braces.

I thought this was OK these days: checkpatch didn't warn, and I had
multiple else-if arms. I could remove those else ifs.


>
> > + return 0;
> > +}
> >
> > ...
> >
> > +static inline
> > +struct ext4_group_info *ext4_get_group_info(struct super_block *sb,
> > + ext4_group_t group)
> > +{
> > + struct ext4_group_info ***grp_info;
> > + long indexv, indexh;
> > + grp_info = EXT4_SB(sb)->s_group_info;
> > + indexv = group >> (EXT4_DESC_PER_BLOCK_BITS(sb));
> > + indexh = group & ((EXT4_DESC_PER_BLOCK(sb)) - 1);
> > + return grp_info[indexv][indexh];
> > +}
>
> This should be uninlined.
>
>
>
> Gosh what a lot of code. Is it faster?

Performance numbers with compile bench: http://ext4.wiki.kernel.org/index.php/Performance_results

2008-01-24 08:54:37

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 23/49] Add buffer head related helper functions

> On Thu, 24 Jan 2008 10:52:27 +0530 "Aneesh Kumar K.V" <[email protected]> wrote:
> + * Returns zero on success and -EIO on error.If the input
> + * buffer is not locked returns -EINVAL
> + *
> + */
> +int bh_submit_read(struct buffer_head *bh)
> +{
> + if (!buffer_locked(bh))
> + return -EINVAL;

Is this case just catching a programming bug?

If so, a plain old BUG_ON would be better.

2008-01-24 09:05:06

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 41/49] ext4: Add multi block allocator for ext4

Updated patch. Waiting for the test results.

I am only attaching the diff; the mballoc patch is really large.

-aneesh
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 4f329af..ec7d349 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -89,6 +89,8 @@ When mounting an ext4 filesystem, the following option are accepted:
extents ext4 will use extents to address file data. The
file system will no longer be mountable by ext3.

+noextents ext4 will not use extents for new files created.
+
journal_checksum Enable checksumming of the journal transactions.
This will allow the recovery code in e2fsck and the
kernel to detect corruption in the kernel. It is a
@@ -206,6 +208,10 @@ nobh (a) cache disk block mapping information
"nobh" option tries to avoid associating buffer
heads (supported only for "writeback" mode).

+mballoc (*) Use the multiblock allocator for block allocation.
+nomballoc Disable the multiblock allocator for block allocation.
+stripe=n Filesystem blocks per stripe for a RAID configuration.
+

Data Mode
---------
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index dec9945..4413a2d 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -857,6 +857,45 @@ CPUs.
The "procs_blocked" line gives the number of processes currently blocked,
waiting for I/O to complete.

+1.9 Ext4 file system parameters
+------------------------------
+The ext4 file system has one directory per partition under /proc/fs/ext4/
+# ls /proc/fs/ext4/hdc/
+group_prealloc max_to_scan mb_groups mb_history min_to_scan order2_req
+stats stream_req
+
+mb_groups:
+This file shows the multiblock allocator's buddy cache of free blocks.
+
+mb_history:
+Multiblock allocation history.
+
+stats:
+This file indicates whether the multiblock allocator should collect
+statistics. The statistics are shown at unmount.
+
+group_prealloc:
+The multiblock allocator normalizes block allocation requests to
+group_prealloc filesystem blocks if no stripe value is set.
+The stripe value can be specified at mount time or during mke2fs.
+
+max_to_scan:
+How long the multiblock allocator can look for a best extent (in found extents).
+
+min_to_scan:
+How long the multiblock allocator must look for a best extent.
+
+order2_req:
+The multiblock allocator uses the 2^N search using buddies only for requests
+greater than or equal to order2_req. The request size is specified in file
+system blocks. A value of 2 means the buddy search is used only for requests
+of 4 blocks or more.
+
+stream_req:
+Files smaller than stream_req are served by the stream allocator, whose
+purpose is to pack requests as close to each other as possible to
+produce smooth I/O traffic. A value of 16 means that files smaller than 16
+filesystem blocks will use group-based preallocation.

------------------------------------------------------------------------------
Summary
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 0398aa0..310bad6 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -489,7 +489,7 @@ struct ext4_free_extent {
*/
struct ext4_locality_group {
/* for allocator */
- struct semaphore lg_sem; /* to serialize allocates */
+ struct mutex lg_sem; /* to serialize allocates */
struct list_head lg_prealloc_list;/* list of preallocations */
spinlock_t lg_prealloc_lock;
};
@@ -563,7 +563,10 @@ struct ext4_buddy {
#define EXT4_MB_BUDDY(e4b) ((e4b)->bd_buddy)

#ifndef EXT4_MB_HISTORY
-#define ext4_mb_store_history(ac)
+static inline void ext4_mb_store_history(struct ext4_allocation_context *ac)
+{
+ return;
+}
#else
static void ext4_mb_store_history(struct ext4_allocation_context *ac);
#endif
@@ -641,6 +644,10 @@ static ext4_fsblk_t ext4_grp_offs_to_block(struct super_block *sb,

static inline int mb_test_bit(int bit, void *addr)
{
+ /*
+ * ext4_test_bit on architecture like powerpc
+ * needs unsigned long aligned address
+ */
mb_correct_addr_and_bit(bit, addr);
return ext4_test_bit(bit, addr);
}
@@ -669,7 +676,7 @@ static inline void mb_clear_bit_atomic(spinlock_t *lock, int bit, void *addr)
ext4_clear_bit_atomic(lock, bit, addr);
}

-static inline void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
+static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
{
char *bb;

@@ -752,9 +759,20 @@ static void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap)
}

#else
-#define mb_free_blocks_double(a, b, c, d)
-#define mb_mark_used_double(a, b, c)
-#define mb_cmp_bitmaps(a, b)
+static inline void mb_free_blocks_double(struct inode *inode,
+ struct ext4_buddy *e4b, int first, int count)
+{
+ return;
+}
+static inline void mb_mark_used_double(struct ext4_buddy *e4b,
+ int first, int count)
+{
+ return;
+}
+static inline void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap)
+{
+ return;
+}
#endif

#ifdef AGGRESSIVE_CHECK
@@ -877,26 +895,6 @@ static int __mb_check_buddy(struct ext4_buddy *e4b, char *file,
#define mb_check_buddy(e4b)
#endif

-/* find most significant bit */
-static int fmsb(unsigned short word)
-{
- int order;
-
- if (word > 255) {
- order = 7;
- word >>= 8;
- } else {
- order = -1;
- }
-
- do {
- order++;
- word >>= 1;
- } while (word != 0);
-
- return order;
-}
-
/* FIXME!! need more doc */
static void ext4_mb_mark_free_simple(struct super_block *sb,
void *buddy, unsigned first, int len,
@@ -917,7 +915,7 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
max = ffs(first | border) - 1;

/* find how many blocks of power 2 we need to mark */
- min = fmsb(len);
+ min = fls(len);

if (max < min)
min = max;
@@ -1029,10 +1027,9 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
if (groups_per_page > 1) {
err = -ENOMEM;
i = sizeof(struct buffer_head *) * groups_per_page;
- bh = kmalloc(i, GFP_NOFS);
+ bh = kzalloc(i, GFP_NOFS);
if (bh == NULL)
goto out;
- memset(bh, 0, i);
} else
bh = &bhs;

@@ -1055,15 +1052,9 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
if (bh[i] == NULL)
goto out;

- if (buffer_uptodate(bh[i]))
+ if (bh_uptodate_or_lock(bh[i]))
continue;

- lock_buffer(bh[i]);
- if (buffer_uptodate(bh[i])) {
- unlock_buffer(bh[i]);
- continue;
- }
-
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
ext4_init_block_bitmap(sb, bh[i],
first_group + i, desc);
@@ -1302,7 +1293,7 @@ static void mb_set_bits(spinlock_t *lock, void *bm, int cur, int len)
len = cur + len;
while (cur < len) {
if ((cur & 31) == 0 && (len - cur) >= 32) {
- /* fast path: clear whole word at once */
+ /* fast path: set whole word at once */
addr = bm + (cur >> 3);
*addr = 0xffffffff;
cur += 32;
@@ -2675,7 +2666,7 @@ int ext4_mb_init(struct super_block *sb, int needs_recovery)
for (i = 0; i < NR_CPUS; i++) {
struct ext4_locality_group *lg;
lg = &sbi->s_locality_groups[i];
- sema_init(&lg->lg_sem, 1);
+ mutex_init(&lg->lg_sem);
INIT_LIST_HEAD(&lg->lg_prealloc_list);
spin_lock_init(&lg->lg_prealloc_lock);
}
@@ -2687,6 +2678,7 @@ int ext4_mb_init(struct super_block *sb, int needs_recovery)
return 0;
}

+/* needs to be called with ext4 group lock (ext4_lock_group) */
static void ext4_mb_cleanup_pa(struct ext4_group_info *grp)
{
struct ext4_prealloc_space *pa;
@@ -2695,7 +2687,7 @@ static void ext4_mb_cleanup_pa(struct ext4_group_info *grp)

list_for_each_safe(cur, tmp, &grp->bb_prealloc_list) {
pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
- list_del_rcu(&pa->pa_group_list);
+ list_del(&pa->pa_group_list);
count++;
kfree(pa);
}
@@ -3441,6 +3433,7 @@ static int ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
/*
* the function goes through all preallocation in this group and marks them
* used in in-core bitmap. buddy must be generated from this bitmap
+ * Needs to be called with ext4 group lock (ext4_lock_group)
*/
static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
ext4_group_t group)
@@ -3462,7 +3455,7 @@ static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
* allocation in buddy when concurrent ext4_mb_put_pa()
* is dropping preallocation
*/
- list_for_each_rcu(cur, &grp->bb_prealloc_list) {
+ list_for_each(cur, &grp->bb_prealloc_list) {
pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
spin_lock(&pa->pa_lock);
ext4_get_group_no_and_offset(sb, pa->pa_pstart,
@@ -3486,7 +3479,6 @@ static void ext4_mb_pa_callback(struct rcu_head *head)
pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu);
kmem_cache_free(ext4_pspace_cachep, pa);
}
-#define mb_call_rcu(__pa) call_rcu(&(__pa)->u.pa_rcu, ext4_mb_pa_callback)

/*
* drops a reference to preallocated space descriptor
@@ -3528,14 +3520,14 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
* against that pair
*/
ext4_lock_group(sb, grp);
- list_del_rcu(&pa->pa_group_list);
+ list_del(&pa->pa_group_list);
ext4_unlock_group(sb, grp);

spin_lock(pa->pa_obj_lock);
list_del_rcu(&pa->pa_inode_list);
spin_unlock(pa->pa_obj_lock);

- mb_call_rcu(pa);
+ call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
}

/*
@@ -3615,7 +3607,7 @@ static int ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
pa->pa_inode = ac->ac_inode;

ext4_lock_group(sb, ac->ac_b_ex.fe_group);
- list_add_rcu(&pa->pa_group_list, &grp->bb_prealloc_list);
+ list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
ext4_unlock_group(sb, ac->ac_b_ex.fe_group);

spin_lock(pa->pa_obj_lock);
@@ -3672,7 +3664,7 @@ static int ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
pa->pa_inode = NULL;

ext4_lock_group(sb, ac->ac_b_ex.fe_group);
- list_add_rcu(&pa->pa_group_list, &grp->bb_prealloc_list);
+ list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
ext4_unlock_group(sb, ac->ac_b_ex.fe_group);

spin_lock(pa->pa_obj_lock);
@@ -3853,7 +3845,7 @@ repeat:

spin_unlock(&pa->pa_lock);

- list_del_rcu(&pa->pa_group_list);
+ list_del(&pa->pa_group_list);
list_add(&pa->u.pa_tmp_list, &list);
}

@@ -3889,7 +3881,7 @@ repeat:
ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);

list_del(&pa->u.pa_tmp_list);
- mb_call_rcu(pa);
+ call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
}

out:
@@ -3942,9 +3934,8 @@ repeat:
spin_unlock(&pa->pa_lock);
spin_unlock(&ei->i_prealloc_lock);
printk(KERN_ERR "uh-oh! used pa while discarding\n");
- dump_stack();
- current->state = TASK_UNINTERRUPTIBLE;
- schedule_timeout(HZ);
+ WARN_ON(1);
+ schedule_timeout_uninterruptible(HZ);
goto repeat;

}
@@ -3972,8 +3963,7 @@ repeat:
* add a flag to force wait only in case
* of ->clear_inode(), but not in case of
* regular truncate */
- current->state = TASK_UNINTERRUPTIBLE;
- schedule_timeout(HZ);
+ schedule_timeout_uninterruptible(HZ);
goto repeat;
}
spin_unlock(&ei->i_prealloc_lock);
@@ -3993,7 +3983,7 @@ repeat:
}

ext4_lock_group(sb, group);
- list_del_rcu(&pa->pa_group_list);
+ list_del(&pa->pa_group_list);
ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
ext4_unlock_group(sb, group);

@@ -4001,7 +3991,7 @@ repeat:
brelse(bitmap_bh);

list_del(&pa->u.pa_tmp_list);
- mb_call_rcu(pa);
+ call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
}
}

@@ -4051,7 +4041,8 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
struct ext4_prealloc_space *pa;
ext4_grpblk_t start;
struct list_head *cur;
- list_for_each_rcu(cur, &grp->bb_prealloc_list) {
+ ext4_lock_group(sb, i);
+ list_for_each(cur, &grp->bb_prealloc_list) {
pa = list_entry(cur, struct ext4_prealloc_space,
pa_group_list);
spin_lock(&pa->pa_lock);
@@ -4061,6 +4052,7 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
printk(KERN_ERR "PA:%lu:%d:%u \n", i,
start, pa->pa_len);
}
+ ext4_unlock_group(sb, i);

if (grp->bb_free == 0)
continue;
@@ -4070,7 +4062,10 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
printk(KERN_ERR "\n");
}
#else
-#define ext4_mb_show_ac(x)
+static inline void ext4_mb_show_ac(struct ext4_allocation_context *ac)
+{
+ return;
+}
#endif

/*
@@ -4091,8 +4086,7 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)

size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
isize = i_size_read(ac->ac_inode) >> bsbits;
- if (size < isize)
- size = isize;
+ size = max(size, isize);

/* don't use group allocation for large files */
if (size >= sbi->s_mb_stream_request)
@@ -4102,6 +4096,11 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
return;

BUG_ON(ac->ac_lg != NULL);
+ /*
+ * locality group prealloc space are per cpu. The reason for having
+ * per cpu locality group is to reduce the contention between block
+ * request from multiple CPUs.
+ */
ac->ac_lg = &sbi->s_locality_groups[get_cpu()];
put_cpu();

@@ -4109,7 +4108,7 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;

/* serialize all allocations in the group */
- down(&ac->ac_lg->lg_sem);
+ mutex_lock(&ac->ac_lg->lg_sem);
}

static int ext4_mb_initialize_context(struct ext4_allocation_context *ac,
@@ -4202,7 +4201,7 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
if (ac->ac_buddy_page)
page_cache_release(ac->ac_buddy_page);
if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC)
- up(&ac->ac_lg->lg_sem);
+ mutex_unlock(&ac->ac_lg->lg_sem);
ext4_mb_collect_stats(ac);
return 0;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 136d095..3a51ffc 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1779,13 +1779,14 @@ static unsigned long ext4_get_stripe_size(struct ext4_sb_info *sbi)
unsigned long stripe_width =
le32_to_cpu(sbi->s_es->s_raid_stripe_width);

- if (sbi->s_stripe && sbi->s_stripe <= sbi->s_blocks_per_group) {
+ if (sbi->s_stripe && sbi->s_stripe <= sbi->s_blocks_per_group)
return sbi->s_stripe;
- } else if (stripe_width <= sbi->s_blocks_per_group) {
+
+ if (stripe_width <= sbi->s_blocks_per_group)
return stripe_width;
- } else if (stride <= sbi->s_blocks_per_group) {
+
+ if (stride <= sbi->s_blocks_per_group)
return stride;
- }

return 0;
}
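
For readers following the selection order in ext4_get_stripe_size(), here is a minimal userspace sketch of the same three checks. It is only an illustration: struct stripe_inputs and pick_stripe() are hypothetical names, and plain unsigned long fields stand in for the mount option and the two superblock fields.

```c
/* Hypothetical bundle of the values the kernel function reads. */
struct stripe_inputs {
	unsigned long mount_stripe;	/* stripe= mount option, 0 if not given */
	unsigned long sb_stripe_width;	/* s_raid_stripe_width from the superblock */
	unsigned long sb_stride;	/* s_raid_stride from the superblock */
	unsigned long blocks_per_group;
};

/* Same order of checks as the patched ext4_get_stripe_size(). */
static unsigned long pick_stripe(const struct stripe_inputs *in)
{
	/* The mount option wins if it is set and fits in a block group. */
	if (in->mount_stripe && in->mount_stripe <= in->blocks_per_group)
		return in->mount_stripe;

	/* Otherwise use the superblock stripe width if it fits. */
	if (in->sb_stripe_width <= in->blocks_per_group)
		return in->sb_stripe_width;

	/* Only reached when the stripe width is larger than a block group. */
	if (in->sb_stride <= in->blocks_per_group)
		return in->sb_stride;

	return 0;
}
```

One consequence of this ordering, as far as the sketch goes: a stripe width of 0 in the superblock passes the second check, so the stride is consulted only when the stripe width is larger than the number of blocks per group.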

2008-01-24 13:01:18

by Andy Whitcroft

[permalink] [raw]
Subject: Re: [PATCH 30/49] ext4: Convert truncate_mutex to read write semaphore.

On Wed, Jan 23, 2008 at 02:06:59PM -0800, Andrew Morton wrote:
> > On Mon, 21 Jan 2008 22:02:09 -0500 "Theodore Ts'o" <[email protected]> wrote:
> > +int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
> > + unsigned long max_blocks, struct buffer_head *bh,
> > + int create, int extend_disksize)
> > +{
> > + int retval;
> > + if (create) {
> > + down_write((&EXT4_I(inode)->i_data_sem));
> > + } else {
> > + down_read((&EXT4_I(inode)->i_data_sem));
> > + }
> > + if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) {
> > + retval = ext4_ext_get_blocks(handle, inode, block, max_blocks,
> > + bh, create, extend_disksize);
> > + } else {
> > + retval = ext4_get_blocks_handle(handle, inode, block,
> > + max_blocks, bh, create, extend_disksize);
> > + }
> > + if (create) {
> > + up_write((&EXT4_I(inode)->i_data_sem));
> > + } else {
> > + up_read((&EXT4_I(inode)->i_data_sem));
> > + }
>
> This function has many unneeded braces. checkpatch used to detect this
> but it seems to have broken.

This is a side effect of this rule:

This does not apply if one branch of a conditional statement
is a single statement. Use braces in both branches.

Basically each arm is being considered in isolation, each arm is seen as
having a "sibling" arm with braces so it is permitted to have braces.
Bugger.

I guess I'll try and see if I can detect this.
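
To make the two cases concrete, here is a small sketch of how the brace rule is usually read. The functions sign_of() and clamp_count() are hypothetical and exist only to show the layout; they are not taken from the patch.

```c
/* Every arm is a single statement: no braces are needed anywhere. */
static int sign_of(int v)
{
	if (v < 0)
		return -1;
	else if (v > 0)
		return 1;
	else
		return 0;
}

/*
 * One arm needs several statements, so it must be braced; the rule quoted
 * above then asks for braces on the single-statement arm as well.
 */
static int clamp_count(int v, int *clamped)
{
	if (v < 0) {
		(*clamped)++;
		return 0;
	} else {
		return v;
	}
}
```

The quoted ext4_get_blocks_wrap() matches the first shape: every arm is a single statement, so none of its braces are required.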

> > + return retval;
> > +}
> > static int ext4_get_block(struct inode *inode, sector_t iblock,
> > struct buffer_head *bh_result, int create)
>
> Missing newline.

We could check for those ... will look to add in the next release.

-apw

2008-01-24 14:53:19

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 41/49] ext4: Add multi block allocator for ext4

On Thu, Jan 24, 2008 at 01:26:14PM +0530, Aneesh Kumar K.V wrote:
>
> >
> > > +/* find most significant bit */
> > > +static int fmsb(unsigned short word)
> > > +{
> > > + int order;
> > > +
> > > + if (word > 255) {
> > > + order = 7;
> > > + word >>= 8;
> > > + } else {
> > > + order = -1;
> > > + }
> > > +
> > > + do {
> > > + order++;
> > > + word >>= 1;
> > > + } while (word != 0);
> > > +
> > > + return order;
> > > +}
> >
> > Did we just reinvent fls()?
>
> replaced by fls.
>
> >

That should be fls() - 1;
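
A quick userspace check of that off-by-one, as a sketch only: fls_sketch() is a hypothetical stand-in that assumes the kernel's fls() convention (1-based position of the most significant set bit, and 0 for an input of 0), and fmsb() is copied from the removed helper quoted above.

```c
/* Stand-in for the kernel's fls(): 1-based index of the highest set bit. */
static int fls_sketch(unsigned int x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* The helper removed by the patch: 0-based index of the highest set bit. */
static int fmsb(unsigned short word)
{
	int order;

	if (word > 255) {
		order = 7;
		word >>= 8;
	} else {
		order = -1;
	}

	do {
		order++;
		word >>= 1;
	} while (word != 0);

	return order;
}
```

For every non-zero 16-bit value, fmsb(w) equals fls_sketch(w) - 1. The two differ only for an input of 0, where fmsb() returns 0 and fls_sketch(0) - 1 is -1.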

The full patch is at

http://www.radian.org/~kvaneesh/ext4/jan-24-2008/mballoc-core.patch

The patch is too big to inline.

-aneesh

2008-01-24 21:24:28

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH 33/49] ext4: Add the journal checksum feature

On Wed, 2008-01-23 at 14:07 -0800, Andrew Morton wrote:
> > On Mon, 21 Jan 2008 22:02:12 -0500 "Theodore Ts'o" <[email protected]> wrote:
> > From: Girish Shilamkar <[email protected]>
> >
> > The journal checksum feature adds two new flags i.e
> > JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM.
> >
> > JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the
> > checksum for the blocks described by the descriptor blocks.
> > Due to checksums, writing of the commit record no longer needs to be
> > synchronous. Now commit record can be sent to disk without waiting for
> > descriptor blocks to be written to disk. This behavior is controlled
> > using JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be
> > able to recover the journal with _ASYNC_COMMIT hence it is made
> > incompat.
> > The commit header has been extended to hold the checksum along with the
> > type of the checksum.
> >
> > For recovery in pass scan checksums are verified to ensure the sanity
> > and completeness(in case of _ASYNC_COMMIT) of every transaction.
> >
> > ...
> >
> > +static inline __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh)
>
> unneeded inlining.
>
> > +{
> > + struct page *page = bh->b_page;
> > + char *addr;
> > + __u32 checksum;
> > +
> > + addr = kmap_atomic(page, KM_USER0);
> > + checksum = crc32_be(crc32_sum,
> > + (void *)(addr + offset_in_page(bh->b_data)), bh->b_size);
> > + kunmap_atomic(addr, KM_USER0);
> > +
> > + return checksum;
> > +}
>
> Can this buffer actually be in highmem?
>
> > static inline void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
> > unsigned long long block)
>
> More unnecessary inlining.
>
> > +/*
> > + * jbd2_journal_clear_features () - Clear a given journal feature in the
> > + * superblock
> > + * @journal: Journal to act on.
> > + * @compat: bitmask of compatible features
> > + * @ro: bitmask of features that force read-only mount
> > + * @incompat: bitmask of incompatible features
> > + *
> > + * Clear a given journal feature as present on the
> > + * superblock. Returns true if the requested features could be reset.
> > + */
> > +int jbd2_journal_clear_features(journal_t *journal, unsigned long compat,
> > + unsigned long ro, unsigned long incompat)
> > +{
> > + journal_superblock_t *sb;
> > +
> > + jbd_debug(1, "Clear features 0x%lx/0x%lx/0x%lx\n",
> > + compat, ro, incompat);
> > +
> > + sb = journal->j_superblock;
> > +
> > + sb->s_feature_compat &= ~cpu_to_be32(compat);
> > + sb->s_feature_ro_compat &= ~cpu_to_be32(ro);
> > + sb->s_feature_incompat &= ~cpu_to_be32(incompat);
> > +
> > + return 1;
> > +}
> > +EXPORT_SYMBOL(jbd2_journal_clear_features);
>
> Kernel usually returns 0 on success. So we can return a useful errno on
> failure.
>
> > +/*
> > + * calc_chksums calculates the checksums for the blocks described in the
> > + * descriptor block.
> > + */
> > +static int calc_chksums(journal_t *journal, struct buffer_head *bh,
> > + unsigned long *next_log_block, __u32 *crc32_sum)
> > +{
> > + int i, num_blks, err;
> > + unsigned io_block;
> > + struct buffer_head *obh;
> > +
> > + num_blks = count_tags(journal, bh);
> > + /* Calculate checksum of the descriptor block. */
> > + *crc32_sum = crc32_be(*crc32_sum, (void *)bh->b_data, bh->b_size);
> > +
> > + for (i = 0; i < num_blks; i++) {
> > + io_block = (*next_log_block)++;
>
> unsigned <- unsigned long.
>
> Are all the types appropriate in here?
>
> > + wrap(journal, *next_log_block);
> > + err = jread(&obh, journal, io_block);
> > + if (err) {
> > + printk(KERN_ERR "JBD: IO error %d recovering block "
> > + "%u in log\n", err, io_block);
> > + return 1;
> > + } else {
> > + *crc32_sum = crc32_be(*crc32_sum, (void *)obh->b_data,
> > + obh->b_size);
> > + }
> > + }
> > + return 0;
> > +}
> > +
> > static int do_one_pass(journal_t *journal,
> > struct recovery_info *info, enum passtype pass)
> > {
> > @@ -328,6 +360,7 @@ static int do_one_pass(journal_t *journal,
> > unsigned int sequence;
> > int blocktype;
> > int tag_bytes = journal_tag_bytes(journal);
> > + __u32 crc32_sum = ~0; /* Transactional Checksums */
> >
> > /* Precompute the maximum metadata descriptors in a descriptor block */
> > int MAX_BLOCKS_PER_DESC;
> > @@ -419,9 +452,23 @@ static int do_one_pass(journal_t *journal,
> > switch(blocktype) {
> > case JBD2_DESCRIPTOR_BLOCK:
> > /* If it is a valid descriptor block, replay it
> > - * in pass REPLAY; otherwise, just skip over the
> > - * blocks it describes. */
> > + * in pass REPLAY; if journal_checksums enabled, then
> > + * calculate checksums in PASS_SCAN, otherwise,
> > + * just skip over the blocks it describes. */
> > if (pass != PASS_REPLAY) {
> > + if (pass == PASS_SCAN &&
> > + JBD2_HAS_COMPAT_FEATURE(journal,
> > + JBD2_FEATURE_COMPAT_CHECKSUM) &&
> > + !info->end_transaction) {
> > + if (calc_chksums(journal, bh,
> > + &next_log_block,
> > + &crc32_sum)) {
>
> put_bh()
>
> > + brelse(bh);
> > + break;
> > + }
> > + brelse(bh);
> > + continue;
>
> put_bh()
>
> > + }
> > next_log_block += count_tags(journal, bh);
> > wrap(journal, next_log_block);
> > brelse(bh);
> > @@ -516,9 +563,96 @@ static int do_one_pass(journal_t *journal,
> > continue;
> >
> > + brelse(bh);
>
> etc
>

Thanks, updated patch below:
ext4: Add the journal checksum feature

From: Girish Shilamkar <[email protected]>

The journal checksum feature adds two new flags, i.e.
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM.

The JBD2_FEATURE_COMPAT_CHECKSUM flag indicates that the commit block
contains the checksum for the blocks described by the descriptor blocks.
Because of the checksums, writing the commit record no longer needs to be
synchronous: the commit record can now be sent to disk without waiting for
the descriptor blocks to be written to disk. This behavior is controlled
by the JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT flag. Older kernels/e2fsck should
not be able to recover a journal with _ASYNC_COMMIT, hence it is made
incompat.
The commit header has been extended to hold the checksum along with the
type of the checksum.

During recovery, checksums are verified in PASS_SCAN to ensure the sanity
and completeness (in the case of _ASYNC_COMMIT) of every transaction.
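
The running checksum described above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the jbd2 code: crc32_be_sketch() is a hypothetical bitwise CRC modelled on the kernel's crc32_be() (MSB first, polynomial 0x04c11db7, no final inversion), and checksum_blocks() is a made-up helper that mirrors how calc_chksums() feeds one block after another into the same sum, seeded with ~0 as in do_one_pass().

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY_BE 0x04c11db7u

/* Bitwise, MSB-first CRC32 update over 'len' bytes starting from 'crc'. */
static uint32_t crc32_be_sketch(uint32_t crc, const unsigned char *p,
				size_t len)
{
	while (len--) {
		crc ^= (uint32_t)*p++ << 24;
		for (int i = 0; i < 8; i++)
			crc = (crc << 1) ^
			      ((crc & 0x80000000u) ? CRC32_POLY_BE : 0);
	}
	return crc;
}

/* Fold every block of a transaction into one running checksum. */
static uint32_t checksum_blocks(const unsigned char *const *blocks,
				size_t nblocks, size_t block_size)
{
	uint32_t sum = ~0u;	/* same seed as crc32_sum in do_one_pass() */

	for (size_t i = 0; i < nblocks; i++)
		sum = crc32_be_sketch(sum, blocks[i], block_size);

	return sum;
}
```

Because each call continues from the previous value, the result over several blocks is the same as one CRC over their concatenation, which is what lets a single checksum cover a whole transaction.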

Signed-off-by: Andreas Dilger <[email protected]>
Signed-off-by: Girish Shilamkar <[email protected]>
Signed-off-by: Dave Kleikamp <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---

Documentation/filesystems/ext4.txt | 10 +
fs/Kconfig | 1
fs/ext4/super.c | 25 ++++
fs/jbd2/commit.c | 198 +++++++++++++++++++++++++++----------
fs/jbd2/journal.c | 26 ++++
fs/jbd2/recovery.c | 151 ++++++++++++++++++++++++++--
include/linux/ext4_fs.h | 3
include/linux/jbd2.h | 36 +++++-
8 files changed, 388 insertions(+), 62 deletions(-)


Index: linux-2.6.24-rc8/Documentation/filesystems/ext4.txt
===================================================================
--- linux-2.6.24-rc8.orig/Documentation/filesystems/ext4.txt 2008-01-24 11:18:08.000000000 -0800
+++ linux-2.6.24-rc8/Documentation/filesystems/ext4.txt 2008-01-24 13:00:44.000000000 -0800
@@ -89,6 +89,16 @@ When mounting an ext4 filesystem, the fo
extents ext4 will use extents to address file data. The
file system will no longer be mountable by ext3.

+journal_checksum Enable checksumming of the journal transactions.
+ This will allow the recovery code in e2fsck and the
+ kernel to detect corruption in the kernel. It is a
+ compatible change and will be ignored by older kernels.
+
+journal_async_commit Commit block can be written to disk without waiting
+ for descriptor blocks. If enabled older kernels cannot
+ mount the device. This will enable 'journal_checksum'
+ internally.
+
journal=update Update the ext4 file system's journal to the current
format.

Index: linux-2.6.24-rc8/fs/Kconfig
===================================================================
--- linux-2.6.24-rc8.orig/fs/Kconfig 2008-01-24 11:18:08.000000000 -0800
+++ linux-2.6.24-rc8/fs/Kconfig 2008-01-24 11:18:55.000000000 -0800
@@ -236,6 +236,7 @@ config JBD_DEBUG

config JBD2
tristate
+ select CRC32
help
This is a generic journaling layer for block devices that support
both 32-bit and 64-bit block numbers. It is currently used by
Index: linux-2.6.24-rc8/fs/ext4/super.c
===================================================================
--- linux-2.6.24-rc8.orig/fs/ext4/super.c 2008-01-24 11:18:52.000000000 -0800
+++ linux-2.6.24-rc8/fs/ext4/super.c 2008-01-24 13:00:45.000000000 -0800
@@ -869,6 +869,7 @@ enum {
Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
+ Opt_journal_checksum, Opt_journal_async_commit,
Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
@@ -908,6 +909,8 @@ static match_table_t tokens = {
{Opt_journal_update, "journal=update"},
{Opt_journal_inum, "journal=%u"},
{Opt_journal_dev, "journal_dev=%u"},
+ {Opt_journal_checksum, "journal_checksum"},
+ {Opt_journal_async_commit, "journal_async_commit"},
{Opt_abort, "abort"},
{Opt_data_journal, "data=journal"},
{Opt_data_ordered, "data=ordered"},
@@ -1095,6 +1098,13 @@ static int parse_options (char *options,
return 0;
*journal_devnum = option;
break;
+ case Opt_journal_checksum:
+ set_opt(sbi->s_mount_opt, JOURNAL_CHECKSUM);
+ break;
+ case Opt_journal_async_commit:
+ set_opt(sbi->s_mount_opt, JOURNAL_ASYNC_COMMIT);
+ set_opt(sbi->s_mount_opt, JOURNAL_CHECKSUM);
+ break;
case Opt_noload:
set_opt (sbi->s_mount_opt, NOLOAD);
break;
@@ -2114,6 +2124,21 @@ static int ext4_fill_super (struct super
goto failed_mount4;
}

+ if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
+ jbd2_journal_set_features(sbi->s_journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM, 0,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
+ } else if (test_opt(sb, JOURNAL_CHECKSUM)) {
+ jbd2_journal_set_features(sbi->s_journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0);
+ jbd2_journal_clear_features(sbi->s_journal, 0, 0,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
+ } else {
+ jbd2_journal_clear_features(sbi->s_journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM, 0,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
+ }
+
/* We have now updated the journal if required, so we can
* validate the data journaling mode. */
switch (test_opt(sb, DATA_FLAGS)) {
Index: linux-2.6.24-rc8/fs/jbd2/commit.c
===================================================================
--- linux-2.6.24-rc8.orig/fs/jbd2/commit.c 2008-01-24 11:18:54.000000000 -0800
+++ linux-2.6.24-rc8/fs/jbd2/commit.c 2008-01-24 13:02:43.000000000 -0800
@@ -21,6 +21,7 @@
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/jiffies.h>
+#include <linux/crc32.h>

/*
* Default IO end handler for temporary BJ_IO buffer_heads.
@@ -93,19 +94,23 @@ static int inverted_lock(journal_t *jour
return 1;
}

-/* Done it all: now write the commit record. We should have
+/*
+ * Done it all: now submit the commit record. We should have
* cleaned up our previous buffers by now, so if we are in abort
* mode we can now just skip the rest of the journal write
* entirely.
*
* Returns 1 if the journal needs to be aborted or 0 on success
*/
-static int journal_write_commit_record(journal_t *journal,
- transaction_t *commit_transaction)
+static int journal_submit_commit_record(journal_t *journal,
+ transaction_t *commit_transaction,
+ struct buffer_head **cbh,
+ __u32 crc32_sum)
{
struct journal_head *descriptor;
+ struct commit_header *tmp;
struct buffer_head *bh;
- int i, ret;
+ int ret;
int barrier_done = 0;

if (is_journal_aborted(journal))
@@ -117,21 +122,33 @@ static int journal_write_commit_record(j

bh = jh2bh(descriptor);

- /* AKPM: buglet - add `i' to tmp! */
- for (i = 0; i < bh->b_size; i += 512) {
- journal_header_t *tmp = (journal_header_t*)bh->b_data;
- tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
- tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
- tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid);
+ tmp = (struct commit_header *)bh->b_data;
+ tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
+ tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
+ tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid);
+
+ if (JBD2_HAS_COMPAT_FEATURE(journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM)) {
+ tmp->h_chksum_type = JBD2_CRC32_CHKSUM;
+ tmp->h_chksum_size = JBD2_CRC32_CHKSUM_SIZE;
+ tmp->h_chksum[0] = cpu_to_be32(crc32_sum);
}

- JBUFFER_TRACE(descriptor, "write commit block");
+ JBUFFER_TRACE(descriptor, "submit commit block");
+ lock_buffer(bh);
+
set_buffer_dirty(bh);
- if (journal->j_flags & JBD2_BARRIER) {
+ set_buffer_uptodate(bh);
+ bh->b_end_io = journal_end_buffer_io_sync;
+
+ if (journal->j_flags & JBD2_BARRIER &&
+ !JBD2_HAS_COMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
set_buffer_ordered(bh);
barrier_done = 1;
}
- ret = sync_dirty_buffer(bh);
+ ret = submit_bh(WRITE, bh);
+
/* is it possible for another commit to fail at roughly
* the same time as this one? If so, we don't want to
* trust the barrier flag in the super, but instead want
@@ -152,14 +169,72 @@ static int journal_write_commit_record(j
clear_buffer_ordered(bh);
set_buffer_uptodate(bh);
set_buffer_dirty(bh);
- ret = sync_dirty_buffer(bh);
+ ret = submit_bh(WRITE, bh);
}
- put_bh(bh); /* One for getblk() */
- jbd2_journal_put_journal_head(descriptor);
+ *cbh = bh;
+ return ret;
+}
+
+/*
+ * This function along with journal_submit_commit_record
+ * allows to write the commit record asynchronously.
+ */
+static int journal_wait_on_commit_record(struct buffer_head *bh)
+{
+ int ret = 0;
+
+ clear_buffer_dirty(bh);
+ wait_on_buffer(bh);
+
+ if (unlikely(!buffer_uptodate(bh)))
+ ret = -EIO;
+ put_bh(bh); /* One for getblk() */
+ jbd2_journal_put_journal_head(bh2jh(bh));

- return (ret == -EIO);
+ return ret;
}

+/*
+ * Wait for all submitted IO to complete.
+ */
+static int journal_wait_on_locked_list(journal_t *journal,
+ transaction_t *commit_transaction)
+{
+ int ret = 0;
+ struct journal_head *jh;
+
+ while (commit_transaction->t_locked_list) {
+ struct buffer_head *bh;
+
+ jh = commit_transaction->t_locked_list->b_tprev;
+ bh = jh2bh(jh);
+ get_bh(bh);
+ if (buffer_locked(bh)) {
+ spin_unlock(&journal->j_list_lock);
+ wait_on_buffer(bh);
+ if (unlikely(!buffer_uptodate(bh)))
+ ret = -EIO;
+ spin_lock(&journal->j_list_lock);
+ }
+ if (!inverted_lock(journal, bh)) {
+ put_bh(bh);
+ spin_lock(&journal->j_list_lock);
+ continue;
+ }
+ if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
+ __jbd2_journal_unfile_buffer(jh);
+ jbd_unlock_bh_state(bh);
+ jbd2_journal_remove_journal_head(bh);
+ put_bh(bh);
+ } else {
+ jbd_unlock_bh_state(bh);
+ }
+ put_bh(bh);
+ cond_resched_lock(&journal->j_list_lock);
+ }
+ return ret;
+ }
+
static void journal_do_submit_data(struct buffer_head **wbuf, int bufs)
{
int i;
@@ -275,7 +350,21 @@ write_out_data:
journal_do_submit_data(wbuf, bufs);
}

-static inline void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
+static __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh)
+{
+ struct page *page = bh->b_page;
+ char *addr;
+ __u32 checksum;
+
+ addr = kmap_atomic(page, KM_USER0);
+ checksum = crc32_be(crc32_sum,
+ (void *)(addr + offset_in_page(bh->b_data)), bh->b_size);
+ kunmap_atomic(addr, KM_USER0);
+
+ return checksum;
+}
+
+static void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
unsigned long long block)
{
tag->t_blocknr = cpu_to_be32(block & (u32)~0);
@@ -307,6 +396,8 @@ void jbd2_journal_commit_transaction(jou
int tag_flag;
int i;
int tag_bytes = journal_tag_bytes(journal);
+ struct buffer_head *cbh = NULL; /* For transactional checksums */
+ __u32 crc32_sum = ~0;

/*
* First job: lock down the current transaction and wait for
@@ -451,38 +542,15 @@ void jbd2_journal_commit_transaction(jou
journal_submit_data_buffers(journal, commit_transaction);

/*
- * Wait for all previously submitted IO to complete.
+ * Wait for all previously submitted IO to complete if commit
+ * record is to be written synchronously.
*/
spin_lock(&journal->j_list_lock);
- while (commit_transaction->t_locked_list) {
- struct buffer_head *bh;
+ if (!JBD2_HAS_INCOMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT))
+ err = journal_wait_on_locked_list(journal,
+ commit_transaction);

- jh = commit_transaction->t_locked_list->b_tprev;
- bh = jh2bh(jh);
- get_bh(bh);
- if (buffer_locked(bh)) {
- spin_unlock(&journal->j_list_lock);
- wait_on_buffer(bh);
- if (unlikely(!buffer_uptodate(bh)))
- err = -EIO;
- spin_lock(&journal->j_list_lock);
- }
- if (!inverted_lock(journal, bh)) {
- put_bh(bh);
- spin_lock(&journal->j_list_lock);
- continue;
- }
- if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
- __jbd2_journal_unfile_buffer(jh);
- jbd_unlock_bh_state(bh);
- jbd2_journal_remove_journal_head(bh);
- put_bh(bh);
- } else {
- jbd_unlock_bh_state(bh);
- }
- put_bh(bh);
- cond_resched_lock(&journal->j_list_lock);
- }
spin_unlock(&journal->j_list_lock);

if (err)
@@ -656,6 +724,15 @@ void jbd2_journal_commit_transaction(jou
start_journal_io:
for (i = 0; i < bufs; i++) {
struct buffer_head *bh = wbuf[i];
+ /*
+ * Compute checksum.
+ */
+ if (JBD2_HAS_COMPAT_FEATURE(journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM)) {
+ crc32_sum =
+ jbd2_checksum_data(crc32_sum, bh);
+ }
+
lock_buffer(bh);
clear_buffer_dirty(bh);
set_buffer_uptodate(bh);
@@ -672,6 +749,23 @@ start_journal_io:
}
}

+ /* Done it all: now write the commit record asynchronously. */
+
+ if (JBD2_HAS_INCOMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
+ err = journal_submit_commit_record(journal, commit_transaction,
+ &cbh, crc32_sum);
+ if (err)
+ __jbd2_journal_abort_hard(journal);
+
+ spin_lock(&journal->j_list_lock);
+ err = journal_wait_on_locked_list(journal,
+ commit_transaction);
+ spin_unlock(&journal->j_list_lock);
+ if (err)
+ __jbd2_journal_abort_hard(journal);
+ }
+
/* Lo and behold: we have just managed to send a transaction to
the log. Before we can commit it, wait for the IO so far to
complete. Control buffers being written are on the
@@ -771,8 +865,14 @@ wait_for_iobuf:

jbd_debug(3, "JBD: commit phase 6\n");

- if (journal_write_commit_record(journal, commit_transaction))
- err = -EIO;
+ if (!JBD2_HAS_INCOMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
+ err = journal_submit_commit_record(journal, commit_transaction,
+ &cbh, crc32_sum);
+ if (err)
+ __jbd2_journal_abort_hard(journal);
+ }
+ err = journal_wait_on_commit_record(cbh);

if (err)
jbd2_journal_abort(journal, err);
Index: linux-2.6.24-rc8/fs/jbd2/journal.c
===================================================================
--- linux-2.6.24-rc8.orig/fs/jbd2/journal.c 2008-01-24 11:18:54.000000000 -0800
+++ linux-2.6.24-rc8/fs/jbd2/journal.c 2008-01-24 13:06:53.000000000 -0800
@@ -1578,6 +1578,32 @@ int jbd2_journal_set_features (journal_t
return 1;
}

+/*
+ * jbd2_journal_clear_features () - Clear a given journal feature in the
+ * superblock
+ * @journal: Journal to act on.
+ * @compat: bitmask of compatible features
+ * @ro: bitmask of features that force read-only mount
+ * @incompat: bitmask of incompatible features
+ *
+ * Clear a given journal feature as present on the
+ * superblock.
+ */
+int jbd2_journal_clear_features(journal_t *journal, unsigned long compat,
+ unsigned long ro, unsigned long incompat)
+{
+ journal_superblock_t *sb;
+
+ jbd_debug(1, "Clear features 0x%lx/0x%lx/0x%lx\n",
+ compat, ro, incompat);
+
+ sb = journal->j_superblock;
+
+ sb->s_feature_compat &= ~cpu_to_be32(compat);
+ sb->s_feature_ro_compat &= ~cpu_to_be32(ro);
+ sb->s_feature_incompat &= ~cpu_to_be32(incompat);
+}
+EXPORT_SYMBOL(jbd2_journal_clear_features);

/**
* int jbd2_journal_update_format () - Update on-disk journal structure.
Index: linux-2.6.24-rc8/fs/jbd2/recovery.c
===================================================================
--- linux-2.6.24-rc8.orig/fs/jbd2/recovery.c 2008-01-24 11:18:08.000000000 -0800
+++ linux-2.6.24-rc8/fs/jbd2/recovery.c 2008-01-24 13:16:07.000000000 -0800
@@ -21,6 +21,7 @@
#include <linux/jbd2.h>
#include <linux/errno.h>
#include <linux/slab.h>
+#include <linux/crc32.h>
#endif

/*
@@ -316,6 +317,37 @@ static inline unsigned long long read_ta
return block;
}

+/*
+ * calc_chksums calculates the checksums for the blocks described in the
+ * descriptor block.
+ */
+static int calc_chksums(journal_t *journal, struct buffer_head *bh,
+ unsigned long *next_log_block, __u32 *crc32_sum)
+{
+ int i, num_blks, err;
+ unsigned long io_block;
+ struct buffer_head *obh;
+
+ num_blks = count_tags(journal, bh);
+ /* Calculate checksum of the descriptor block. */
+ *crc32_sum = crc32_be(*crc32_sum, (void *)bh->b_data, bh->b_size);
+
+ for (i = 0; i < num_blks; i++) {
+ io_block = (*next_log_block)++;
+ wrap(journal, *next_log_block);
+ err = jread(&obh, journal, io_block);
+ if (err) {
+ printk(KERN_ERR "JBD: IO error %d recovering block "
+ "%lu in log\n", err, io_block);
+ return 1;
+ } else {
+ *crc32_sum = crc32_be(*crc32_sum, (void *)obh->b_data,
+ obh->b_size);
+ }
+ }
+ return 0;
+}
+
static int do_one_pass(journal_t *journal,
struct recovery_info *info, enum passtype pass)
{
@@ -328,6 +360,7 @@ static int do_one_pass(journal_t *journa
unsigned int sequence;
int blocktype;
int tag_bytes = journal_tag_bytes(journal);
+ __u32 crc32_sum = ~0; /* Transactional Checksums */

/* Precompute the maximum metadata descriptors in a descriptor block */
int MAX_BLOCKS_PER_DESC;
@@ -419,12 +452,26 @@ static int do_one_pass(journal_t *journa
switch(blocktype) {
case JBD2_DESCRIPTOR_BLOCK:
/* If it is a valid descriptor block, replay it
- * in pass REPLAY; otherwise, just skip over the
- * blocks it describes. */
+ * in pass REPLAY; if journal_checksums enabled, then
+ * calculate checksums in PASS_SCAN, otherwise,
+ * just skip over the blocks it describes. */
if (pass != PASS_REPLAY) {
+ if (pass == PASS_SCAN &&
+ JBD2_HAS_COMPAT_FEATURE(journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM) &&
+ !info->end_transaction) {
+ if (calc_chksums(journal, bh,
+ &next_log_block,
+ &crc32_sum)) {
+ put_bh(bh);
+ break;
+ }
+ put_bh(bh);
+ continue;
+ }
next_log_block += count_tags(journal, bh);
wrap(journal, next_log_block);
- brelse(bh);
+ put_bh(bh);
continue;
}

@@ -516,9 +563,96 @@ static int do_one_pass(journal_t *journa
continue;

case JBD2_COMMIT_BLOCK:
- /* Found an expected commit block: not much to
- * do other than move on to the next sequence
+ /* How to differentiate between interrupted commit
+ * and journal corruption ?
+ *
+ * {nth transaction}
+ * Checksum Verification Failed
+ * |
+ * ____________________
+ * | |
+ * async_commit sync_commit
+ * | |
+ * | GO TO NEXT "Journal Corruption"
+ * | TRANSACTION
+ * |
+ * {(n+1)th transaction}
+ * |
+ * _______|______________
+ * | |
+ * Commit block found Commit block not found
+ * | |
+ * "Journal Corruption" |
+ * _____________|_________
+ * | |
+ * nth trans corrupt OR nth trans
+ * and (n+1)th interrupted interrupted
+ * before commit block
+ * could reach the disk.
+ * (Cannot find the difference in above
+ * mentioned conditions. Hence assume
+ * "Interrupted Commit".)
+ */
+
+ /* Found an expected commit block: if checksums
+ * are present verify them in PASS_SCAN; else not
+ * much to do other than move on to the next sequence
* number. */
+ if (pass == PASS_SCAN &&
+ JBD2_HAS_COMPAT_FEATURE(journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM)) {
+ int chksum_err, chksum_seen;
+ struct commit_header *cbh =
+ (struct commit_header *)bh->b_data;
+ unsigned found_chksum =
+ be32_to_cpu(cbh->h_chksum[0]);
+
+ chksum_err = chksum_seen = 0;
+
+ if (info->end_transaction) {
+ printk(KERN_ERR "JBD: Transaction %u "
+ "found to be corrupt.\n",
+ next_commit_ID - 1);
+ brelse(bh);
+ break;
+ }
+
+ if (crc32_sum == found_chksum &&
+ cbh->h_chksum_type == JBD2_CRC32_CHKSUM &&
+ cbh->h_chksum_size ==
+ JBD2_CRC32_CHKSUM_SIZE)
+ chksum_seen = 1;
+ else if (!(cbh->h_chksum_type == 0 &&
+ cbh->h_chksum_size == 0 &&
+ found_chksum == 0 &&
+ !chksum_seen))
+ /*
+ * If fs is mounted using an old kernel and then
+ * kernel with journal_chksum is used then we
+ * get a situation where the journal flag has
+ * checksum flag set but checksums are not
+ * present i.e chksum = 0, in the individual
+ * commit blocks.
+ * Hence to avoid checksum failures, in this
+ * situation, this extra check is added.
+ */
+ chksum_err = 1;
+
+ if (chksum_err) {
+ info->end_transaction = next_commit_ID;
+
+ if (!JBD2_HAS_COMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)){
+ printk(KERN_ERR
+ "JBD: Transaction %u "
+ "found to be corrupt.\n",
+ next_commit_ID);
+ brelse(bh);
+ break;
+ }
+ }
+ crc32_sum = ~0;
+ }
brelse(bh);
next_commit_ID++;
continue;
@@ -554,9 +688,10 @@ static int do_one_pass(journal_t *journa
* transaction marks the end of the valid log.
*/

- if (pass == PASS_SCAN)
- info->end_transaction = next_commit_ID;
- else {
+ if (pass == PASS_SCAN) {
+ if (!info->end_transaction)
+ info->end_transaction = next_commit_ID;
+ } else {
/* It's really bad news if different passes end up at
* different places (but possible due to IO errors). */
if (info->end_transaction != next_commit_ID) {
Index: linux-2.6.24-rc8/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.24-rc8.orig/include/linux/ext4_fs.h 2008-01-24 11:18:52.000000000 -0800
+++ linux-2.6.24-rc8/include/linux/ext4_fs.h 2008-01-24 13:00:45.000000000 -0800
@@ -467,7 +467,8 @@ do { \
#define EXT4_MOUNT_USRQUOTA 0x100000 /* "old" user quota */
#define EXT4_MOUNT_GRPQUOTA 0x200000 /* "old" group quota */
#define EXT4_MOUNT_EXTENTS 0x400000 /* Extents support */
-
+#define EXT4_MOUNT_JOURNAL_CHECKSUM 0x800000 /* Journal checksums */
+#define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x1000000 /* Journal Async Commit */
/* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
#define clear_opt(o, opt) o &= ~EXT4_MOUNT_##opt
Index: linux-2.6.24-rc8/include/linux/jbd2.h
===================================================================
--- linux-2.6.24-rc8.orig/include/linux/jbd2.h 2008-01-24 11:18:54.000000000 -0800
+++ linux-2.6.24-rc8/include/linux/jbd2.h 2008-01-24 11:46:16.000000000 -0800
@@ -149,6 +149,28 @@ typedef struct journal_header_s
__be32 h_sequence;
} journal_header_t;

+/*
+ * Checksum types.
+ */
+#define JBD2_CRC32_CHKSUM 1
+#define JBD2_MD5_CHKSUM 2
+#define JBD2_SHA1_CHKSUM 3
+
+#define JBD2_CRC32_CHKSUM_SIZE 4
+
+#define JBD2_CHECKSUM_BYTES (32 / sizeof(u32))
+/*
+ * Commit block header for storing transactional checksums:
+ */
+struct commit_header {
+ __be32 h_magic;
+ __be32 h_blocktype;
+ __be32 h_sequence;
+ unsigned char h_chksum_type;
+ unsigned char h_chksum_size;
+ unsigned char h_padding[2];
+ __be32 h_chksum[JBD2_CHECKSUM_BYTES];
+};

/*
* The block tag: used to describe a single buffer in the journal.
@@ -242,14 +264,18 @@ typedef struct journal_superblock_s
((j)->j_format_version >= 2 && \
((j)->j_superblock->s_feature_incompat & cpu_to_be32((mask))))

-#define JBD2_FEATURE_INCOMPAT_REVOKE 0x00000001
-#define JBD2_FEATURE_INCOMPAT_64BIT 0x00000002
+#define JBD2_FEATURE_COMPAT_CHECKSUM 0x00000001
+
+#define JBD2_FEATURE_INCOMPAT_REVOKE 0x00000001
+#define JBD2_FEATURE_INCOMPAT_64BIT 0x00000002
+#define JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT 0x00000004

/* Features known to this kernel version: */
-#define JBD2_KNOWN_COMPAT_FEATURES 0
+#define JBD2_KNOWN_COMPAT_FEATURES JBD2_FEATURE_COMPAT_CHECKSUM
#define JBD2_KNOWN_ROCOMPAT_FEATURES 0
#define JBD2_KNOWN_INCOMPAT_FEATURES (JBD2_FEATURE_INCOMPAT_REVOKE | \
- JBD2_FEATURE_INCOMPAT_64BIT)
+ JBD2_FEATURE_INCOMPAT_64BIT | \
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)

#ifdef __KERNEL__

@@ -997,6 +1023,8 @@ extern int jbd2_journal_check_availab
(journal_t *, unsigned long, unsigned long, unsigned long);
extern int jbd2_journal_set_features
(journal_t *, unsigned long, unsigned long, unsigned long);
+extern int jbd2_journal_clear_features
+ (journal_t *, unsigned long, unsigned long, unsigned long);
extern int jbd2_journal_create (journal_t *);
extern int jbd2_journal_load (journal_t *journal);
extern void jbd2_journal_destroy (journal_t *);
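
As a rough illustration of the checksum chaining used above (not part of the patch): the commit path seeds a running CRC with ~0 and folds each journal block into it, and recovery repeats the same walk and compares the result with the value stored in the commit block. The userspace sketch below models that idea under stated assumptions. crc32_be_sketch() is a hand-written bitwise stand-in for the kernel's crc32_be(), and struct sketch_block and the sketch_* helpers are hypothetical names that do not exist in jbd2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise big-endian CRC-32 (polynomial 0x04c11db7), with no final
 * inversion. Stand-in for the kernel's crc32_be() so that the sketch
 * builds in userspace.
 */
static uint32_t crc32_be_sketch(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--) {
		crc ^= (uint32_t)*p++ << 24;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04c11db7u
						  : crc << 1;
	}
	return crc;
}

/* Hypothetical stand-in for one journal block (descriptor or data). */
struct sketch_block {
	const unsigned char *data;
	size_t size;
};

/*
 * Fold every block of one transaction into a single running checksum.
 * The seed is ~0, matching the crc32_sum initialiser in the patch.
 */
static uint32_t sketch_transaction_checksum(const struct sketch_block *blks,
					    size_t n)
{
	uint32_t sum = ~0u;

	assert(blks != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		sum = crc32_be_sketch(sum, blks[i].data, blks[i].size);
	return sum;
}

/*
 * Recovery side: recompute the checksum over the same blocks and compare
 * it with the value read back from the commit block. Returns 1 on match.
 */
static int sketch_checksum_matches(const struct sketch_block *blks, size_t n,
				   uint32_t stored)
{
	return sketch_transaction_checksum(blks, n) == stored;
}
```

Because the CRC state is carried from block to block, checksumming the blocks one at a time gives the same result as checksumming their concatenation. That is why the commit side can update the sum as each buffer is submitted.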



2008-01-26 04:15:41

by Theodore Ts'o

Subject: Re: [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl

On Thu, Jan 24, 2008 at 11:25:32AM +0530, Aneesh Kumar K.V wrote:
> +static int free_ext_idx(handle_t *handle, struct inode *inode,
> + struct ext4_extent_idx *ix)
> +{
> + int i, retval = 0;
> + ext4_fsblk_t block;
> + struct buffer_head *bh;
> + struct ext4_extent_header *eh;
> +
> + block = idx_pblock(ix);
> + bh = sb_bread(inode->i_sb, block);
> + if (!bh)
> + return -EIO;
> +
> + eh = (struct ext4_extent_header *)bh->b_data;
> + if (eh->eh_depth == 0) {
> + brelse(bh);
> + ext4_free_blocks(handle, inode, block, 1);
> + } else {
> + ix = EXT_FIRST_INDEX(eh);
> + for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
> + retval = free_ext_idx(handle, inode, ix);
> + if (retval)
> + return retval;
> + }
> + }
> + return retval;
> +}

Aneesh, looks like if eh->eh_depth is != 0, bh gets leaked. This is
how I plan to fix it up:

+static int free_ext_idx(handle_t *handle, struct inode *inode,
+ struct ext4_extent_idx *ix)
+{
+ int i, retval = 0;
+ ext4_fsblk_t block;
+ struct buffer_head *bh;
+ struct ext4_extent_header *eh;
+
+ block = idx_pblock(ix);
+ bh = sb_bread(inode->i_sb, block);
+ if (!bh)
+ return -EIO;
+
+ eh = (struct ext4_extent_header *)bh->b_data;
+ if (eh->eh_depth == 0)
+ ext4_free_blocks(handle, inode, block, 1);
+ else {
+ ix = EXT_FIRST_INDEX(eh);
+ for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
+ retval = free_ext_idx(handle, inode, ix);
+ if (retval)
+ break;
+ }
+ }
+ put_bh(bh);
+ return retval;
+}

- Ted
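
A minimal userspace sketch of the pattern in the fixed version above: once the buffer reference has been taken, every path leaves through a single release point, and the loop uses break rather than return so that an error in a child does not skip the release. It models only the reference counting, not the freeing of blocks. struct sketch_buf, struct sketch_node and the sketch_* functions are hypothetical stand-ins, not ext4 code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer_head reference count. */
struct sketch_buf {
	int refcount;
};

static void sketch_get(struct sketch_buf *b)
{
	b->refcount++;
}

static void sketch_put(struct sketch_buf *b)
{
	assert(b->refcount > 0);
	b->refcount--;
}

/* Hypothetical tree node: depth 0 is a leaf, anything else has children. */
struct sketch_node {
	struct sketch_buf buf;
	int depth;
	int unreadable;		/* simulate a failed read of this node */
	size_t nr_children;
	struct sketch_node *children;
};

static int sketch_free_tree(struct sketch_node *node)
{
	int retval = 0;

	if (node->unreadable)
		return -5;	/* like a failed sb_bread(): no reference taken */
	sketch_get(&node->buf);	/* reference held from here on */

	if (node->depth > 0) {
		for (size_t i = 0; i < node->nr_children; i++) {
			retval = sketch_free_tree(&node->children[i]);
			if (retval)
				break;	/* fall through to the put below */
		}
	}
	sketch_put(&node->buf);	/* single release point for every path */
	return retval;
}
```

With an early return inside the loop instead of break, the reference taken on the parent node would be left behind whenever a child failed.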

2008-01-26 08:43:12

by Aneesh Kumar K.V

Subject: Re: [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl

On Fri, Jan 25, 2008 at 11:15:00PM -0500, Theodore Tso wrote:
> On Thu, Jan 24, 2008 at 11:25:32AM +0530, Aneesh Kumar K.V wrote:
> > +static int free_ext_idx(handle_t *handle, struct inode *inode,
> > + struct ext4_extent_idx *ix)
> > +{
> > + int i, retval = 0;
> > + ext4_fsblk_t block;
> > + struct buffer_head *bh;
> > + struct ext4_extent_header *eh;
> > +
> > + block = idx_pblock(ix);
> > + bh = sb_bread(inode->i_sb, block);
> > + if (!bh)
> > + return -EIO;
> > +
> > + eh = (struct ext4_extent_header *)bh->b_data;
> > + if (eh->eh_depth == 0) {
> > + brelse(bh);
> > + ext4_free_blocks(handle, inode, block, 1);
> > + } else {
> > + ix = EXT_FIRST_INDEX(eh);
> > + for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
> > + retval = free_ext_idx(handle, inode, ix);
> > + if (retval)
> > + return retval;
> > + }
> > + }
> > + return retval;
> > +}
>
> Aneesh, looks like if eh->eh_depth is != 0, bh gets leaked. This is
> how I plan to fix it up:
>
> +static int free_ext_idx(handle_t *handle, struct inode *inode,
> + struct ext4_extent_idx *ix)
> +{
> + int i, retval = 0;
> + ext4_fsblk_t block;
> + struct buffer_head *bh;
> + struct ext4_extent_header *eh;
> +
> + block = idx_pblock(ix);
> + bh = sb_bread(inode->i_sb, block);
> + if (!bh)
> + return -EIO;
> +
> + eh = (struct ext4_extent_header *)bh->b_data;
> + if (eh->eh_depth == 0)
> + ext4_free_blocks(handle, inode, block, 1);
> + else {
> + ix = EXT_FIRST_INDEX(eh);
> + for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
> + retval = free_ext_idx(handle, inode, ix);
> + if (retval)
> + break;
> + }
> + }
> + put_bh(bh);


We need to mark the index block as free via
ext4_free_blocks(handle, inode, block, 1);

I remember making this change. Maybe it was related to dind/tind
blocks.


-aneesh

2008-01-26 13:26:26

by Theodore Ts'o

Subject: Re: [PATCH 24/49] ext4: add block bitmap validation

On Wed, Jan 23, 2008 at 02:06:54PM -0800, Andrew Morton wrote:
> brelse() should only be used when the bh might be NULL - put_bh()
> can be used here.
>
> Please review all ext4/jbd2 code for this trivial speedup.

I've reviewed all of the pending patches in the stable queue for this
speedup, and applied them where necessary; it was useful, since I
detected a buffer head leak in one of the patches while I was at it.

The ext4/jbd2 code as a whole still needs to be reviewed for this
speedup, but I don't want to fix this in the initial stable push, lest
I break something by accident. I'll put it in the "TO DO" queue.

Regards,

- Ted

2008-01-28 18:46:23

by Eric Sandeen

Subject: Re: [PATCH 41/49] ext4: Add multi block allocator for ext4

The latest version of the mballoc patch in the ext4dev git patch queue
has a potential uninitialized use:

CC [M] fs/ext4/mballoc.o
fs/ext4/mballoc.c: In function 'ext4_mb_free_blocks':
fs/ext4/mballoc.c:4408: warning: 'bitmap_bh' may be used uninitialized in this function

There are 2 gotos which will call put_bh on a NULL bitmap_bh.

Signed-off-by: Eric Sandeen <[email protected]>

Index: linux-2.6.24-rc6-mm1/fs/ext4/mballoc.c
===================================================================
--- linux-2.6.24-rc6-mm1.orig/fs/ext4/mballoc.c
+++ linux-2.6.24-rc6-mm1/fs/ext4/mballoc.c
@@ -4405,7 +4405,7 @@ void ext4_mb_free_blocks(handle_t *handl
unsigned long block, unsigned long count,
int metadata, unsigned long *freed)
{
- struct buffer_head *bitmap_bh;
+ struct buffer_head *bitmap_bh = NULL;
struct super_block *sb = inode->i_sb;
struct ext4_allocation_context ac;
struct ext4_group_desc *gdp;
@@ -4546,7 +4546,8 @@ do_more:
}
sb->s_dirt = 1;
error_return:
- put_bh(bitmap_bh);
+ if (bitmap_bh)
+ put_bh(bitmap_bh);
ext4_std_error(sb, err);
return;
}
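
A small userspace model of why both hunks are needed, assuming a release helper that, like put_bh(), does not tolerate NULL: the pointer has to start out as NULL so the shared error label can test it, and the label has to test it before releasing. All names here (struct sketch_buf, sketch_put, sketch_release, sketch_free_blocks) are hypothetical and are not the mballoc code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer_head reference count. */
struct sketch_buf {
	int refcount;
};

/* Like put_bh(): the caller must pass a valid pointer. */
static void sketch_put(struct sketch_buf *b)
{
	assert(b != NULL && b->refcount > 0);
	b->refcount--;
}

/* Like brelse(): tolerates NULL, at the cost of an extra test. */
static void sketch_release(struct sketch_buf *b)
{
	if (b)
		sketch_put(b);
}

/*
 * Hypothetical function with an early exit that jumps to the shared
 * error label before the buffer has been obtained.
 */
static int sketch_free_blocks(struct sketch_buf *source, int bad_args)
{
	struct sketch_buf *bitmap = NULL;	/* so the error path can test it */
	int err = 0;

	if (bad_args) {
		err = -22;
		goto error_return;	/* bitmap is still NULL here */
	}

	bitmap = source;
	bitmap->refcount++;		/* reference obtained */
	/* ... the real work would happen here ... */

error_return:
	if (bitmap)
		sketch_put(bitmap);
	return err;
}
```

Calling the NULL-tolerant helper at the label would work equally well in this model; the patch above keeps put_bh() and adds the explicit test.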

2008-02-01 20:55:42

by Girish Shilamkar

Subject: Re: [PATCH 33/49] ext4: Add the journal checksum feature

Hi,

On Thu, 2008-01-24 at 13:24 -0800, Mingming Cao wrote:
> -static int journal_write_commit_record(journal_t *journal,
> - transaction_t *commit_transaction)
> +static int journal_submit_commit_record(journal_t *journal,
> + transaction_t *commit_transaction,
> + struct buffer_head **cbh,
> + __u32 crc32_sum)
> {
> struct journal_head *descriptor;
> + struct commit_header *tmp;
> struct buffer_head *bh;
> - int i, ret;
> + int ret;
> int barrier_done = 0;
>
> if (is_journal_aborted(journal))
> @@ -117,21 +122,33 @@ static int journal_write_commit_record(j
>
> bh = jh2bh(descriptor);
>
> - /* AKPM: buglet - add `i' to tmp! */
> - for (i = 0; i < bh->b_size; i += 512) {
> - journal_header_t *tmp = (journal_header_t*)bh->b_data;
> - tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
> - tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
> - tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid);
> + tmp = (struct commit_header *)bh->b_data;
> + tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
> + tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
> + tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid);
> +
> + if (JBD2_HAS_COMPAT_FEATURE(journal,
> + JBD2_FEATURE_COMPAT_CHECKSUM)) {
> + tmp->h_chksum_type = JBD2_CRC32_CHKSUM;
> + tmp->h_chksum_size = JBD2_CRC32_CHKSUM_SIZE;
> + tmp->h_chksum[0] = cpu_to_be32(crc32_sum);
> }
>
> - JBUFFER_TRACE(descriptor, "write commit block");
> + JBUFFER_TRACE(descriptor, "submit commit block");
> + lock_buffer(bh);
> +

get_bh() is missing here. The bh refcount is decremented in
journal_wait_on_commit_record(), but it is never incremented in
journal_submit_commit_record(). Thanks to Johann Lombardi for pointing
this out.

Comments?

> set_buffer_dirty(bh);
> - if (journal->j_flags & JBD2_BARRIER) {
> + set_buffer_uptodate(bh);
> + bh->b_end_io = journal_end_buffer_io_sync;
> +
> + if (journal->j_flags & JBD2_BARRIER &&
> + !JBD2_HAS_COMPAT_FEATURE(journal,
> + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
> set_buffer_ordered(bh);
> barrier_done = 1;
> }
> - ret = sync_dirty_buffer(bh);
> + ret = submit_bh(WRITE, bh);
> +
> /* is it possible for another commit to fail at roughly
> * the same time as this one? If so, we don't want to
> * trust the barrier flag in the super, but instead want
> @@ -152,14 +169,72 @@ static int journal_write_commit_record(j
> clear_buffer_ordered(bh);
> set_buffer_uptodate(bh);
> set_buffer_dirty(bh);
> - ret = sync_dirty_buffer(bh);
> + ret = submit_bh(WRITE, bh);
> }
> - put_bh(bh); /* One for getblk() */
> - jbd2_journal_put_journal_head(descriptor);
> + *cbh = bh;
> + return ret;
> +}
> +
> +/*
> + * This function along with journal_submit_commit_record
> + * allows to write the commit record asynchronously.
> + */
> +static int journal_wait_on_commit_record(struct buffer_head *bh)
> +{
> + int ret = 0;
> +
> + clear_buffer_dirty(bh);
> + wait_on_buffer(bh);
> +
> + if (unlikely(!buffer_uptodate(bh)))
> + ret = -EIO;
> + put_bh(bh); /* One for getblk() */
> + jbd2_journal_put_journal_head(bh2jh(bh));
>
> - return (ret == -EIO);
> + return ret;
> }
>
-Girish
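
The pairing described in this review can be sketched in userspace as follows. This is only a model of the reference counting, assuming the submit side takes the reference that the wait side later drops. struct sketch_buf, sketch_submit and sketch_wait are hypothetical names, and no real I/O is performed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer_head reference count. */
struct sketch_buf {
	int refcount;
};

/*
 * Submit side: take a reference for the in-flight write and hand the
 * buffer back to the caller so it can be waited on later.
 */
static void sketch_submit(struct sketch_buf *bh, struct sketch_buf **out)
{
	bh->refcount++;		/* the get_bh() the review asks for */
	*out = bh;
}

/*
 * Wait side: the write is assumed to have completed; drop the reference
 * taken at submit time.
 */
static int sketch_wait(struct sketch_buf *bh)
{
	assert(bh != NULL && bh->refcount > 0);
	bh->refcount--;		/* pairs with the increment in sketch_submit() */
	return 0;
}
```

If the increment in sketch_submit() were missing, each submit/wait cycle would leave the count one lower than it started, which is the imbalance the comment above points out.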