Hi all
Please see the file linked below for details of two similar crashes
observed on two different commits. Both crashes occurred with ext4 on
the server while the export was being accessed through NFSv3.
http://www.gelato.unsw.edu.au/~shehjart/docs/jbd_ext4_crash.txt
Shehjar
On Mon, Jul 21, 2008 at 02:28:39PM +1000, Shehjar Tikoo wrote:
> Hi all
>
> Please see the contents of the file
> for details of two similar crashes observed
> on two different commits. Both crashes were
> observed while using ext4 on the server while
> accessing the export through NFSv3. The link
> has more details.
>
> http://www.gelato.unsw.edu.au/~shehjart/docs/jbd_ext4_crash.txt
>
I guess we cannot use
#define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS
We do multiple block allocations in ext4_da_writepages(), and the
blocks can come from different block groups, so we may need to update
the bitmaps of multiple groups.
-aneesh
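For context, the fixed DIO_CREDITS value is derived in a comment that appears later in this thread: 4 blocks (sb, group descriptor, bitmap, inode) plus the indirect-block terms for B = 4096 blocks and A = 256 pointers per block. A minimal sketch of that arithmetic follows. Both function names are hypothetical, and the per-group variant assumes, purely for illustration, that every additional block group costs one more bitmap block and one more group descriptor block; it is not the kernel's calculation.

```c
#include <assert.h>

/*
 * Reproduces the arithmetic from the DIO_CREDITS comment:
 *   sb + group descriptor + bitmap + inode           -> 4
 *   1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect)
 * B is the number of blocks, A the block pointers per block.
 * Hypothetical helper, not a kernel function.
 */
int dio_credits_estimate(int nblocks, int ptrs_per_block)
{
	int fixed = 4;
	int tind = 1;
	int dind, ind;

	assert(nblocks >= 0 && ptrs_per_block > 0);

	dind = nblocks / ptrs_per_block / ptrs_per_block + 2;
	ind = nblocks / ptrs_per_block + 2;
	return fixed + tind + dind + ind;
}

/*
 * Illustration only (assumption): the base estimate already covers one
 * block group, and each further group touched adds one bitmap block
 * and one group descriptor block.
 */
int credits_with_groups(int base, int ngroups)
{
	assert(ngroups >= 1);
	return base + 2 * (ngroups - 1);
}
```

With B = 4096 and A = 256 the first helper gives the 25 used for DIO_CREDITS. The second shows how quickly a constant sized for one group falls short once allocations spread over several groups.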
On Mon, 2008-07-21 at 13:50 +0530, Aneesh Kumar K.V wrote:
> On Mon, Jul 21, 2008 at 02:28:39PM +1000, Shehjar Tikoo wrote:
> > Hi all
> >
> > Please see the contents of the file
> > for details of two similar crashes observed
> > on two different commits. Both crashes were
> > observed while using ext4 on the server while
> > accessing the export through NFSv3. The link
> > has more details.
> >
> > http://www.gelato.unsw.edu.au/~shehjart/docs/jbd_ext4_crash.txt
> >
>
> I guess we cannot use
> #define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS
>
> We do multiple block allocation in ext4_da_writepages and we can
> get these blocks from different block groups. So we may need to update
> bitmap of multiple groups.
>
> -aneesh
Yeah, the journal credit reservation for writepages() needs a little
attention. I will send an RFC patch shortly.
Mingming
Ext4: journal credits reservation fixes for DIO, fallocate and delalloc writepages
From: Mingming Cao <[email protected]>
With delalloc, at writepages() time we need to reserve enough credits when
starting a new handle to allow for multiple segments of block allocation under
a single call to mpage_da_writepages(), so that the metadata updates fit into a
single transaction. This patch fixes this by calculating the credits needed to
write out a given number of dirty pages, taking discontiguous block
allocations into account. It covers both extent and non-extent files.
This patch also fixes the journal credit reservation for DIO. Currently the
estimated credits for DIO are based only on the non-extent file format. That
estimate is not enough for mballoc to allocate a single extent on an
extent-based file.
fallocate double-books the credits for modifying the super block etc.; this
patch fixes that.
This also fixes credit reservation in the migration and defrag code.
Signed-off-by: Mingming Cao <[email protected]>
---
fs/ext4/defrag.c | 4 +-
fs/ext4/ext4.h | 4 +-
fs/ext4/extents.c | 37 ++++++++++++---------
fs/ext4/inode.c | 93 ++++++++++++++++++++++++------------------------------
fs/ext4/migrate.c | 4 +-
5 files changed, 72 insertions(+), 70 deletions(-)
Index: linux-2.6.26-git6/fs/ext4/ext4.h
===================================================================
--- linux-2.6.26-git6.orig/fs/ext4/ext4.h 2008-07-21 17:35:17.000000000 -0700
+++ linux-2.6.26-git6/fs/ext4/ext4.h 2008-07-21 17:36:17.000000000 -0700
@@ -1149,7 +1149,7 @@ extern void ext4_truncate (struct inode
extern void ext4_set_inode_flags(struct inode *);
extern void ext4_get_inode_flags(struct ext4_inode_info *);
extern void ext4_set_aops(struct inode *inode);
-extern int ext4_writepage_trans_blocks(struct inode *);
+extern int ext4_writepages_trans_blocks(struct inode *, int num);
extern int ext4_block_truncate_page(handle_t *handle,
struct address_space *mapping, loff_t from);
extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page);
@@ -1314,7 +1314,7 @@ extern const struct inode_operations ext
/* extents.c */
extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
-extern int ext4_ext_writepage_trans_blocks(struct inode *, int);
+extern int ext4_ext_writepages_trans_blocks(struct inode *inode, int);
extern int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ext4_lblk_t iblock,
unsigned long max_blocks, struct buffer_head *bh_result,
Index: linux-2.6.26-git6/fs/ext4/extents.c
===================================================================
--- linux-2.6.26-git6.orig/fs/ext4/extents.c 2008-07-21 17:35:17.000000000 -0700
+++ linux-2.6.26-git6/fs/ext4/extents.c 2008-07-21 17:36:17.000000000 -0700
@@ -1887,11 +1887,12 @@ static int ext4_ext_rm_idx(handle_t *han
/*
* ext4_ext_calc_credits_for_insert:
- * This routine returns max. credits that the extent tree can consume.
+ * This routine returns max. credits that needed to insert an extent
+ * to the extent tree.
* It should be OK for low-performance paths like ->writepage()
* To allow many writing processes to fit into a single transaction,
- * the caller should calculate credits under i_data_sem and
- * pass the actual path.
+ * When pass the actual path, the caller should calculate credits
+ * under i_data_sem.
*/
int ext4_ext_calc_credits_for_insert(struct inode *inode,
struct ext4_ext_path *path)
@@ -1930,9 +1931,6 @@ int ext4_ext_calc_credits_for_insert(str
*/
needed += (depth * 2) + (depth * 2);
- /* any allocation modifies superblock */
- needed += 1;
-
return needed;
}
@@ -2940,8 +2938,8 @@ void ext4_ext_truncate(struct inode *ino
/*
* probably first extent we're gonna free will be last in block
*/
- err = ext4_writepage_trans_blocks(inode) + 3;
- handle = ext4_journal_start(inode, err);
+ handle = ext4_journal_start(inode,
+ ext4_writepages_trans_blocks(inode, 1) + 3);
if (IS_ERR(handle))
return;
@@ -2994,18 +2992,28 @@ out_stop:
}
/*
- * ext4_ext_writepage_trans_blocks:
+ * ext4_ext_writepages_trans_blocks:
* calculate max number of blocks we could modify
* in order to allocate new block for an inode
*/
-int ext4_ext_writepage_trans_blocks(struct inode *inode, int num)
+int ext4_ext_writepages_trans_blocks(struct inode *inode, int num)
{
int needed;
+ /* cost of adding a single extent:
+ * index blocks, leafs, bitmaps,
+ * groupdescp
+ */
needed = ext4_ext_calc_credits_for_insert(inode, NULL);
- /* caller wants to allocate num blocks, but note it includes sb */
- needed = needed * num - (num - 1);
+ /*
+ * For data=journalled mode need to account for the data blocks
+ * Also need to add super block and inode block
+ */
+ if (ext4_should_journal_data(inode))
+ needed = num * (needed + 1) + 2;
+ else
+ needed = num * needed + 2;
#ifdef CONFIG_QUOTA
needed += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
@@ -3074,10 +3082,9 @@ long ext4_fallocate(struct inode *inode,
max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
- block;
/*
- * credits to insert 1 extent into extent tree + buffers to be able to
- * modify 1 super block, 1 block bitmap and 1 group descriptor.
+ * credits to insert 1 extent into extent tree
*/
- credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
+ credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
mutex_lock(&inode->i_mutex);
retry:
while (ret >= 0 && ret < max_blocks) {
Index: linux-2.6.26-git6/fs/ext4/inode.c
===================================================================
--- linux-2.6.26-git6.orig/fs/ext4/inode.c 2008-07-21 17:35:17.000000000 -0700
+++ linux-2.6.26-git6/fs/ext4/inode.c 2008-07-21 17:44:45.000000000 -0700
@@ -1015,15 +1015,6 @@ static void ext4_da_update_reserve_space
/* Maximum number of blocks we map for direct IO at once. */
#define DIO_MAX_BLOCKS 4096
-/*
- * Number of credits we need for writing DIO_MAX_BLOCKS:
- * We need sb + group descriptor + bitmap + inode -> 4
- * For B blocks with A block pointers per block we need:
- * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect).
- * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25.
- */
-#define DIO_CREDITS 25
-
/*
*
@@ -1142,13 +1133,13 @@ int ext4_get_block(struct inode *inode,
handle_t *handle = ext4_journal_current_handle();
int ret = 0, started = 0;
unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
+ int dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
if (create && !handle) {
/* Direct IO write... */
if (max_blocks > DIO_MAX_BLOCKS)
max_blocks = DIO_MAX_BLOCKS;
- handle = ext4_journal_start(inode, DIO_CREDITS +
- 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
+ handle = ext4_journal_start(inode, dio_credits);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
goto out;
@@ -1327,7 +1318,7 @@ static int ext4_write_begin(struct file
struct page **pagep, void **fsdata)
{
struct inode *inode = mapping->host;
- int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
+ int ret, needed_blocks = ext4_writepages_trans_blocks(inode, 1);
handle_t *handle;
int retries = 0;
struct page *page;
@@ -2179,18 +2170,7 @@ static int ext4_da_writepage(struct page
return ret;
}
-/*
- * For now just follow the DIO way to estimate the max credits
- * needed to write out EXT4_MAX_WRITEBACK_PAGES.
- * todo: need to calculate the max credits need for
- * extent based files, currently the DIO credits is based on
- * indirect-blocks mapping way.
- *
- * Probably should have a generic way to calculate credits
- * for DIO, writepages, and truncate
- */
#define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
-#define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS
static int ext4_da_writepages(struct address_space *mapping,
struct writeback_control *wbc)
@@ -2210,13 +2190,8 @@ static int ext4_da_writepages(struct add
if (!mapping->nrpages)
return 0;
- /*
- * Estimate the worse case needed credits to write out
- * EXT4_MAX_BUF_BLOCKS pages
- */
- needed_blocks = EXT4_MAX_WRITEBACK_CREDITS;
-
to_write = wbc->nr_to_write;
+
if (!wbc->range_cyclic) {
/*
* If range_cyclic is not set force range_cont
@@ -2227,6 +2202,20 @@ static int ext4_da_writepages(struct add
}
while (!ret && to_write) {
+ /*
+ * set the max dirty pages could be write at a time
+ * to fit into the reserved transaction credits
+ */
+ if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
+ wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
+
+ /*
+ * Estimate the worse case needed credits to write out
+ * to_write pages
+ */
+ needed_blocks = ext4_writepages_trans_blocks(inode,
+ wbc->nr_to_write);
+
/* start a new transaction*/
handle = ext4_journal_start(inode, needed_blocks);
if (IS_ERR(handle)) {
@@ -2246,12 +2235,6 @@ static int ext4_da_writepages(struct add
}
}
- /*
- * set the max dirty pages could be write at a time
- * to fit into the reserved transaction credits
- */
- if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
- wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
to_write -= wbc->nr_to_write;
ret = mpage_da_writepages(mapping, wbc,
@@ -2612,7 +2595,8 @@ static int __ext4_journalled_writepage(s
* references to buffers so we are safe */
unlock_page(page);
- handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
+ handle = ext4_journal_start(inode,
+ ext4_writepages_trans_blocks(inode, 1));
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
goto out;
@@ -4286,20 +4270,20 @@ int ext4_getattr(struct vfsmount *mnt, s
/*
* How many blocks doth make a writepage()?
*
- * With N blocks per page, it may be:
- * N data blocks
+ * With N blocks per page, and P pages, it may be:
+ * N*P data blocks
* 2 indirect block
* 2 dindirect
* 1 tindirect
- * N+5 bitmap blocks (from the above)
- * N+5 group descriptor summary blocks
+ * N*P+5 bitmap blocks (from the above)
+ * N*P+5 group descriptor summary blocks
* 1 inode block
* 1 superblock.
* 2 * EXT4_SINGLEDATA_TRANS_BLOCKS for the quote files
*
- * 3 * (N + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
+ * 3 * (N*P + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
*
- * With ordered or writeback data it's the same, less the N data blocks.
+ * With ordered or writeback data it's the same, less the N*P data blocks.
*
* If the inode's direct blocks can hold an integral number of pages then a
* page cannot straddle two indirect blocks, and we can only touch one indirect
@@ -4310,19 +4294,15 @@ int ext4_getattr(struct vfsmount *mnt, s
* block and work out the exact number of indirects which are touched. Pah.
*/
-int ext4_writepage_trans_blocks(struct inode *inode)
+static int ext4_writepages_trans_blocks_old(struct inode *inode, int num)
{
- int bpp = ext4_journal_blocks_per_page(inode);
- int indirects = (EXT4_NDIR_BLOCKS % bpp) ? 5 : 3;
+ int indirects = (EXT4_NDIR_BLOCKS % num) ? 5 : 3;
int ret;
- if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
- return ext4_ext_writepage_trans_blocks(inode, bpp);
-
if (ext4_should_journal_data(inode))
- ret = 3 * (bpp + indirects) + 2;
+ ret = 3 * (num + indirects) + 2;
else
- ret = 2 * (bpp + indirects) + 2;
+ ret = 2 * (num + indirects) + 2;
#ifdef CONFIG_QUOTA
/* We know that structure was already allocated during DQUOT_INIT so
@@ -4334,6 +4314,19 @@ int ext4_writepage_trans_blocks(struct i
}
/*
+ * Calulate the total number of credits to reserve to fit
+ * the modification of @num pages into a single transaction
+ */
+int ext4_writepages_trans_blocks(struct inode *inode, int num)
+{
+ int bpp = ext4_journal_blocks_per_page(inode);
+ int nrblocks = num * bpp;
+
+ if (!EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
+ return ext4_writepages_trans_blocks_old(inode, nrblocks);
+ return ext4_ext_writepages_trans_blocks(inode, nrblocks);
+}
+/*
* The caller must have previously called ext4_reserve_inode_write().
* Give this, we know that the caller already has write access to iloc->bh.
*/
Index: linux-2.6.26-git6/fs/ext4/defrag.c
===================================================================
--- linux-2.6.26-git6.orig/fs/ext4/defrag.c 2008-07-21 17:43:27.000000000 -0700
+++ linux-2.6.26-git6/fs/ext4/defrag.c 2008-07-22 17:44:08.000000000 -0700
@@ -1385,7 +1385,7 @@ ext4_defrag_alloc_blocks(handle_t *handl
struct buffer_head *bh = NULL;
int err, i, credits = 0;
- credits = ext4_ext_calc_credits_for_insert(dest_inode, dest_path);
+ credits = ext4_ext_calc_credits_for_insert(dest_inode, dest_path) + 4;
err = ext4_ext_journal_restart(handle,
credits + EXT4_TRANS_META_BLOCKS);
if (err)
@@ -1494,7 +1494,7 @@ ext4_defrag_partial(struct inode *tmp_in
* It needs twice the amount of ordinary journal buffers because
* inode and tmp_inode may change each different metadata blocks.
*/
- jblocks = ext4_writepage_trans_blocks(org_inode) * 2;
+ jblocks = ext4_writepages_trans_blocks(org_inode, 1) * 2;
handle = ext4_journal_start(org_inode, jblocks);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
Index: linux-2.6.26-git6/fs/ext4/migrate.c
===================================================================
--- linux-2.6.26-git6.orig/fs/ext4/migrate.c 2008-07-22 17:41:59.000000000 -0700
+++ linux-2.6.26-git6/fs/ext4/migrate.c 2008-07-22 17:43:43.000000000 -0700
@@ -52,8 +52,10 @@ static int finish_range(handle_t *handle
* Since we are doing this in loop we may accumalate extra
* credit. But below we try to not accumalate too much
* of them by restarting the journal.
+ *
+ * extra 4 credits for: 1 superblock, 1 inode block, 2 quotas
*/
- needed = ext4_ext_calc_credits_for_insert(inode, path);
+ needed = ext4_ext_calc_credits_for_insert(inode, path) + 4;
/*
* Make sure the credit we accumalated is not really high
On Jul 22, 2008 17:51 -0700, Mingming Cao wrote:
> + * Calulate the total number of credits to reserve to fit
> + * the modification of @num pages into a single transaction
> + */
> +int ext4_writepages_trans_blocks(struct inode *inode, int num)
> +{
> + int bpp = ext4_journal_blocks_per_page(inode);
> + int nrblocks = num * bpp;
> +
> + if (!EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
> + return ext4_writepages_trans_blocks_old(inode, nrblocks);
This should be "if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))", and
we should probably make it "unlikely()" since we expect most new files
in an ext4 filesystem to be extent mapped.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
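The precedence problem Andreas points out can be reproduced in isolation. The sketch below is only an illustration: DEMO_EXTENTS_FL is a stand-in for EXT4_EXTENTS_FL with an assumed value, and the function names are hypothetical. In C, unary `!` binds tighter than binary `&`, so the unparenthesized test evaluates `(!flags) & mask`.

```c
#include <assert.h>

/* Stand-in for EXT4_EXTENTS_FL; the value is assumed for illustration. */
#define DEMO_EXTENTS_FL 0x00080000u

/*
 * How the test in the patch parses: `!flags & mask` is `(!flags) & mask`.
 * !flags is 0 or 1, and bit 0 is not set in the mask, so the result is
 * 0 for every input and the "old mapping" branch is never taken.
 */
int not_extent_mapped_buggy(unsigned int flags)
{
	return ((!flags) & DEMO_EXTENTS_FL) != 0;
}

/* The intended test: true when the extents flag is clear. */
int not_extent_mapped_fixed(unsigned int flags)
{
	return !(flags & DEMO_EXTENTS_FL);
}

/* Small self-check showing where the two versions disagree. */
void precedence_demo(void)
{
	assert(not_extent_mapped_buggy(0) != not_extent_mapped_fixed(0));
	assert(not_extent_mapped_buggy(DEMO_EXTENTS_FL) ==
	       not_extent_mapped_fixed(DEMO_EXTENTS_FL));
}
```

The two versions agree only when the flag is set. For any inode without the extents flag the buggy test still reports "extent mapped", which would send indirect-mapped files down the extent credit path.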
Hi Mingming,
Comments below
On Tue, Jul 22, 2008 at 05:51:51PM -0700, Mingming Cao wrote:
> Ext4: journal credits reservation fixes for DIO, fallocate and delalloc writepages
>
> From: Mingming Cao <[email protected]>
>
> With delalloc, at writepages() time we need to reserve enough credits when
> starting a new handle to allow for multiple segments of block allocation under
> a single call to mpage_da_writepages(), so that the metadata updates fit into a
> single transaction. This patch fixes this by calculating the credits needed to
> write out a given number of dirty pages, taking discontiguous block
> allocations into account. It covers both extent and non-extent files.
>
> This patch also fixes the journal credit reservation for DIO. Currently the
> estimated credits for DIO are based only on the non-extent file format. That
> estimate is not enough for mballoc to allocate a single extent on an
> extent-based file.
>
> fallocate double-books the credits for modifying the super block etc.; this
> patch fixes that.
>
> This also fixes credit reservation in the migration and defrag code.
>
> Signed-off-by: Mingming Cao <[email protected]>
> ---
> fs/ext4/defrag.c | 4 +-
> fs/ext4/ext4.h | 4 +-
> fs/ext4/extents.c | 37 ++++++++++++---------
> fs/ext4/inode.c | 93 ++++++++++++++++++++++++------------------------------
> fs/ext4/migrate.c | 4 +-
> 5 files changed, 72 insertions(+), 70 deletions(-)
>
> Index: linux-2.6.26-git6/fs/ext4/ext4.h
> ===================================================================
> --- linux-2.6.26-git6.orig/fs/ext4/ext4.h 2008-07-21 17:35:17.000000000 -0700
> +++ linux-2.6.26-git6/fs/ext4/ext4.h 2008-07-21 17:36:17.000000000 -0700
> @@ -1149,7 +1149,7 @@ extern void ext4_truncate (struct inode
> extern void ext4_set_inode_flags(struct inode *);
> extern void ext4_get_inode_flags(struct ext4_inode_info *);
> extern void ext4_set_aops(struct inode *inode);
> -extern int ext4_writepage_trans_blocks(struct inode *);
> +extern int ext4_writepages_trans_blocks(struct inode *, int num);
> extern int ext4_block_truncate_page(handle_t *handle,
> struct address_space *mapping, loff_t from);
> extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page);
> @@ -1314,7 +1314,7 @@ extern const struct inode_operations ext
>
> /* extents.c */
> extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
> -extern int ext4_ext_writepage_trans_blocks(struct inode *, int);
> +extern int ext4_ext_writepages_trans_blocks(struct inode *inode, int);
> extern int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
> ext4_lblk_t iblock,
> unsigned long max_blocks, struct buffer_head *bh_result,
> Index: linux-2.6.26-git6/fs/ext4/extents.c
> ===================================================================
> --- linux-2.6.26-git6.orig/fs/ext4/extents.c 2008-07-21 17:35:17.000000000 -0700
> +++ linux-2.6.26-git6/fs/ext4/extents.c 2008-07-21 17:36:17.000000000 -0700
> @@ -1887,11 +1887,12 @@ static int ext4_ext_rm_idx(handle_t *han
>
> /*
> * ext4_ext_calc_credits_for_insert:
> - * This routine returns max. credits that the extent tree can consume.
> + * This routine returns max. credits that needed to insert an extent
> + * to the extent tree.
> * It should be OK for low-performance paths like ->writepage()
> * To allow many writing processes to fit into a single transaction,
> - * the caller should calculate credits under i_data_sem and
> - * pass the actual path.
> + * When pass the actual path, the caller should calculate credits
> + * under i_data_sem.
> */
> int ext4_ext_calc_credits_for_insert(struct inode *inode,
> struct ext4_ext_path *path)
> @@ -1930,9 +1931,6 @@ int ext4_ext_calc_credits_for_insert(str
> */
> needed += (depth * 2) + (depth * 2);
>
> - /* any allocation modifies superblock */
> - needed += 1;
> -
Why are we dropping the super block modification credit? Inserting an
extent can also result in a super block modification. If the goal is to
use ext4_writepages_trans_blocks() everywhere then this change is correct,
but I see many places that do not use ext4_writepages_trans_blocks().
You also need to update the function comment to say that the super block
update is not accounted for. It also doesn't account for the block bitmap,
group descriptor and inode block updates.
> return needed;
> }
>
> @@ -2940,8 +2938,8 @@ void ext4_ext_truncate(struct inode *ino
> /*
> * probably first extent we're gonna free will be last in block
> */
> - err = ext4_writepage_trans_blocks(inode) + 3;
> - handle = ext4_journal_start(inode, err);
> + handle = ext4_journal_start(inode,
> + ext4_writepages_trans_blocks(inode, 1) + 3);
What is the +3 for? Can you add a comment on that? I understand that it
was there before, but I guess the +3 is not needed.
> if (IS_ERR(handle))
> return;
>
> @@ -2994,18 +2992,28 @@ out_stop:
> }
>
> /*
> - * ext4_ext_writepage_trans_blocks:
> + * ext4_ext_writepages_trans_blocks:
> * calculate max number of blocks we could modify
> * in order to allocate new block for an inode
> */
> -int ext4_ext_writepage_trans_blocks(struct inode *inode, int num)
> +int ext4_ext_writepages_trans_blocks(struct inode *inode, int num)
> {
I guess the name is misleading: @num above is the number of blocks. How
about ext4_ext_writeblocks_trans(struct inode *inode, int nrblocks)?
Also add a comment stating that we consider the worst case, where each
block can result in an extent.
> int needed;
>
> + /* cost of adding a single extent:
> + * index blocks, leafs, bitmaps,
> + * groupdescp
> + */
> needed = ext4_ext_calc_credits_for_insert(inode, NULL);
>
> - /* caller wants to allocate num blocks, but note it includes sb */
> - needed = needed * num - (num - 1);
> + /*
> + * For data=journalled mode need to account for the data blocks
> + * Also need to add super block and inode block
> + */
> + if (ext4_should_journal_data(inode))
> + needed = num * (needed + 1) + 2;
> + else
> + needed = num * needed + 2;
>
We are not accounting here for the bitmap and group descriptor:
ext4_ext_calc_credits_for_insert() does not account for the credits
needed to update them. We also need to account for updating the
bitmap and group descriptor for new blocks that would be allocated
as part of the extent insert.
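One way to read this objection is as a per-extent cost that includes the allocation metadata. The sketch below is a hypothetical worst-case estimate, not the kernel's formula: it assumes every block lands in its own extent and its own block group, takes the tree cost as a parameter (the value ext4_ext_calc_credits_for_insert() would return), and does not model the bitmap and group descriptor updates for newly allocated tree blocks that the comment above also asks for.

```c
#include <assert.h>

/*
 * Hypothetical worst-case credit estimate for writing `nrblocks` blocks
 * to an extent-mapped file.
 *
 * tree_credits: extent-tree blocks touched by one extent insert
 * journal_data: nonzero for data=journal, where data blocks are
 *               journalled as well
 */
int ext_write_credits_sketch(int nrblocks, int tree_credits, int journal_data)
{
	int per_extent;

	assert(nrblocks >= 0 && tree_credits >= 0);

	/* tree blocks + one block bitmap + one group descriptor */
	per_extent = tree_credits + 2;
	if (journal_data)
		per_extent += 1;	/* the data block itself */

	/* plus the super block and the inode block, counted once */
	return nrblocks * per_extent + 2;
}
```

For example, with a tree cost of 5, a single block in ordered mode comes to 1 * (5 + 2) + 2 = 9 credits, where the formula in the patch would reserve 1 * 5 + 2 = 7.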
> #ifdef CONFIG_QUOTA
> needed += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
> @@ -3074,10 +3082,9 @@ long ext4_fallocate(struct inode *inode,
> max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> - block;
> /*
> - * credits to insert 1 extent into extent tree + buffers to be able to
> - * modify 1 super block, 1 block bitmap and 1 group descriptor.
> + * credits to insert 1 extent into extent tree
> */
> - credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
> + credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
Why this change? I guess something like the following is much better:
credits = ext4_ext_calc_credits_for_insert(inode, NULL) + 4 +
	2 * EXT4_SINGLEDATA_TRANS_BLOCKS;
with the +4 covering 1 super block, 1 block bitmap, 1 group descriptor
and 1 inode block. Or how about
credits = ext4_writeblocks_trans(inode, 1);
> mutex_lock(&inode->i_mutex);
> retry:
> while (ret >= 0 && ret < max_blocks) {
> Index: linux-2.6.26-git6/fs/ext4/inode.c
> ===================================================================
> --- linux-2.6.26-git6.orig/fs/ext4/inode.c 2008-07-21 17:35:17.000000000 -0700
> +++ linux-2.6.26-git6/fs/ext4/inode.c 2008-07-21 17:44:45.000000000 -0700
> @@ -1015,15 +1015,6 @@ static void ext4_da_update_reserve_space
>
> /* Maximum number of blocks we map for direct IO at once. */
> #define DIO_MAX_BLOCKS 4096
> -/*
> - * Number of credits we need for writing DIO_MAX_BLOCKS:
> - * We need sb + group descriptor + bitmap + inode -> 4
> - * For B blocks with A block pointers per block we need:
> - * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect).
> - * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25.
> - */
> -#define DIO_CREDITS 25
> -
>
> /*
> *
> @@ -1142,13 +1133,13 @@ int ext4_get_block(struct inode *inode,
> handle_t *handle = ext4_journal_current_handle();
> int ret = 0, started = 0;
> unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> + int dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
Again, this should be
int dio_credits = ext4_writeblocks_trans(inode, DIO_MAX_BLOCKS);
where ext4_writeblocks_trans() would be
int ext4_writeblocks_trans(struct inode *inode, int nrblocks)
{
	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
		return ext4_writeblocks_trans_old(inode, nrblocks);
	return ext4_ext_writeblocks_trans(inode, nrblocks);
}
>
> if (create && !handle) {
> /* Direct IO write... */
> if (max_blocks > DIO_MAX_BLOCKS)
> max_blocks = DIO_MAX_BLOCKS;
> - handle = ext4_journal_start(inode, DIO_CREDITS +
> - 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
> + handle = ext4_journal_start(inode, dio_credits);
> if (IS_ERR(handle)) {
> ret = PTR_ERR(handle);
> goto out;
> @@ -1327,7 +1318,7 @@ static int ext4_write_begin(struct file
> struct page **pagep, void **fsdata)
> {
> struct inode *inode = mapping->host;
> - int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
> + int ret, needed_blocks = ext4_writepages_trans_blocks(inode, 1);
> handle_t *handle;
> int retries = 0;
> struct page *page;
> @@ -2179,18 +2170,7 @@ static int ext4_da_writepage(struct page
> return ret;
> }
>
> -/*
> - * For now just follow the DIO way to estimate the max credits
> - * needed to write out EXT4_MAX_WRITEBACK_PAGES.
> - * todo: need to calculate the max credits need for
> - * extent based files, currently the DIO credits is based on
> - * indirect-blocks mapping way.
> - *
> - * Probably should have a generic way to calculate credits
> - * for DIO, writepages, and truncate
> - */
> #define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
> -#define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS
>
> static int ext4_da_writepages(struct address_space *mapping,
> struct writeback_control *wbc)
> @@ -2210,13 +2190,8 @@ static int ext4_da_writepages(struct add
> if (!mapping->nrpages)
> return 0;
>
> - /*
> - * Estimate the worse case needed credits to write out
> - * EXT4_MAX_BUF_BLOCKS pages
> - */
> - needed_blocks = EXT4_MAX_WRITEBACK_CREDITS;
> -
> to_write = wbc->nr_to_write;
> +
> if (!wbc->range_cyclic) {
> /*
> * If range_cyclic is not set force range_cont
> @@ -2227,6 +2202,20 @@ static int ext4_da_writepages(struct add
> }
>
> while (!ret && to_write) {
> + /*
> + * set the max dirty pages could be write at a time
> + * to fit into the reserved transaction credits
> + */
> + if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
> + wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
> +
> + /*
> + * Estimate the worse case needed credits to write out
> + * to_write pages
> + */
> + needed_blocks = ext4_writepages_trans_blocks(inode,
> + wbc->nr_to_write);
> +
> /* start a new transaction*/
> handle = ext4_journal_start(inode, needed_blocks);
> if (IS_ERR(handle)) {
> @@ -2246,12 +2235,6 @@ static int ext4_da_writepages(struct add
> }
>
> }
> - /*
> - * set the max dirty pages could be write at a time
> - * to fit into the reserved transaction credits
> - */
> - if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
> - wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
>
> to_write -= wbc->nr_to_write;
> ret = mpage_da_writepages(mapping, wbc,
> @@ -2612,7 +2595,8 @@ static int __ext4_journalled_writepage(s
> * references to buffers so we are safe */
> unlock_page(page);
>
> - handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
> + handle = ext4_journal_start(inode,
> + ext4_writepages_trans_blocks(inode, 1));
> if (IS_ERR(handle)) {
> ret = PTR_ERR(handle);
> goto out;
> @@ -4286,20 +4270,20 @@ int ext4_getattr(struct vfsmount *mnt, s
> /*
> * How many blocks doth make a writepage()?
> *
> - * With N blocks per page, it may be:
> - * N data blocks
> + * With N blocks per page, and P pages, it may be:
> + * N*P data blocks
> * 2 indirect block
> * 2 dindirect
> * 1 tindirect
> - * N+5 bitmap blocks (from the above)
> - * N+5 group descriptor summary blocks
> + * N*P+5 bitmap blocks (from the above)
> + * N*P+5 group descriptor summary blocks
> * 1 inode block
> * 1 superblock.
> * 2 * EXT4_SINGLEDATA_TRANS_BLOCKS for the quote files
> *
> - * 3 * (N + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
> + * 3 * (N*P + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
> *
> - * With ordered or writeback data it's the same, less the N data blocks.
> + * With ordered or writeback data it's the same, less the N*P data blocks.
> *
> * If the inode's direct blocks can hold an integral number of pages then a
> * page cannot straddle two indirect blocks, and we can only touch one indirect
> @@ -4310,19 +4294,15 @@ int ext4_getattr(struct vfsmount *mnt, s
> * block and work out the exact number of indirects which are touched. Pah.
> */
>
> -int ext4_writepage_trans_blocks(struct inode *inode)
> +static int ext4_writepages_trans_blocks_old(struct inode *inode, int num)
> {
A better name would be
static int ext4_writeblocks_trans_old(struct inode *inode, int nrblocks)
> - int bpp = ext4_journal_blocks_per_page(inode);
> - int indirects = (EXT4_NDIR_BLOCKS % bpp) ? 5 : 3;
> + int indirects = (EXT4_NDIR_BLOCKS % num) ? 5 : 3;
> int ret;
>
> - if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
> - return ext4_ext_writepage_trans_blocks(inode, bpp);
> -
> if (ext4_should_journal_data(inode))
> - ret = 3 * (bpp + indirects) + 2;
> + ret = 3 * (num + indirects) + 2;
> else
> - ret = 2 * (bpp + indirects) + 2;
> + ret = 2 * (num + indirects) + 2;
In non-journalled mode we should just subtract num. I guess the above
should be
	if (ext4_should_journal_data(inode))
		ret = 3 * (num + indirects) + 2;
	else
		ret = 3 * (num + indirects) + 2 - num;
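The two non-journalled formulas differ by exactly `indirects` credits: the patch computes 2 * (num + indirects) + 2, while subtracting only the data blocks gives 2 * num + 3 * indirects + 2. A small sketch makes the gap concrete. The function names are hypothetical and the quota term is left out.

```c
#include <assert.h>

/* Formula as written in the patch (quota credits omitted). */
int credits_patch(int num, int indirects, int journal_data)
{
	assert(num >= 0 && indirects >= 0);
	return (journal_data ? 3 : 2) * (num + indirects) + 2;
}

/*
 * Formula suggested in the review: start from the data=journal total
 * and drop only the `num` data blocks when data is not journalled.
 */
int credits_suggested(int num, int indirects, int journal_data)
{
	int ret;

	assert(num >= 0 && indirects >= 0);
	ret = 3 * (num + indirects) + 2;
	return journal_data ? ret : ret - num;
}
```

With num = 4 and indirects = 5, the patch reserves 20 credits in ordered mode and the suggested form reserves 25. In data=journal mode both give 29.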
>
> #ifdef CONFIG_QUOTA
> /* We know that structure was already allocated during DQUOT_INIT so
> @@ -4334,6 +4314,19 @@ int ext4_writepage_trans_blocks(struct i
> }
>
> /*
> + * Calulate the total number of credits to reserve to fit
> + * the modification of @num pages into a single transaction
> + */
> +int ext4_writepages_trans_blocks(struct inode *inode, int num)
> +{
> + int bpp = ext4_journal_blocks_per_page(inode);
> + int nrblocks = num * bpp;
> +
> + if (!EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
> + return ext4_writepages_trans_blocks_old(inode, nrblocks);
> + return ext4_ext_writepages_trans_blocks(inode, nrblocks);
> +}
This can be
int ext4_writepages_trans_blocks(struct inode *inode, int nrpages)
{
	int blocks_per_page = 1 << (PAGE_CACHE_SHIFT - inode->i_blkbits);
	return ext4_writeblocks_trans(inode, nrpages * blocks_per_page);
}
I am not sure why we have a journal callback
(ext4_journal_blocks_per_page) for finding the blocks per page.
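The shift expression is plain arithmetic on the page and block sizes. A minimal sketch, with the page shift and block-size bits passed as parameters instead of being read from PAGE_CACHE_SHIFT and inode->i_blkbits (both helper names are hypothetical):

```c
#include <assert.h>

/*
 * Blocks per page for a page of (1 << page_shift) bytes and a block of
 * (1 << blkbits) bytes. Assumes the block is no larger than the page.
 */
int blocks_per_page(unsigned int page_shift, unsigned int blkbits)
{
	assert(blkbits <= page_shift);
	return 1 << (page_shift - blkbits);
}

/* Convert a page count into the block count the credit helper takes. */
int pages_to_blocks(int nrpages, unsigned int page_shift, unsigned int blkbits)
{
	assert(nrpages >= 0);
	return nrpages * blocks_per_page(page_shift, blkbits);
}
```

Assuming 4 KiB pages (shift 12), a 1 KiB block size (10 bits) gives 4 blocks per page, so a full batch of 4096 pages maps to 16384 blocks. With 4 KiB blocks the two counts are equal.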
> +/*
> * The caller must have previously called ext4_reserve_inode_write().
> * Give this, we know that the caller already has write access to iloc->bh.
> */
> Index: linux-2.6.26-git6/fs/ext4/defrag.c
> ===================================================================
> --- linux-2.6.26-git6.orig/fs/ext4/defrag.c 2008-07-21 17:43:27.000000000 -0700
> +++ linux-2.6.26-git6/fs/ext4/defrag.c 2008-07-22 17:44:08.000000000 -0700
> @@ -1385,7 +1385,7 @@ ext4_defrag_alloc_blocks(handle_t *handl
> struct buffer_head *bh = NULL;
> int err, i, credits = 0;
>
> - credits = ext4_ext_calc_credits_for_insert(dest_inode, dest_path);
> + credits = ext4_ext_calc_credits_for_insert(dest_inode, dest_path) + 4;
> err = ext4_ext_journal_restart(handle,
> credits + EXT4_TRANS_META_BLOCKS);
> if (err)
> @@ -1494,7 +1494,7 @@ ext4_defrag_partial(struct inode *tmp_in
> * It needs twice the amount of ordinary journal buffers because
> * inode and tmp_inode may change each different metadata blocks.
> */
> - jblocks = ext4_writepage_trans_blocks(org_inode) * 2;
> + jblocks = ext4_writepages_trans_blocks(org_inode, 1) * 2;
> handle = ext4_journal_start(org_inode, jblocks);
> if (IS_ERR(handle)) {
> ret = PTR_ERR(handle);
> Index: linux-2.6.26-git6/fs/ext4/migrate.c
> ===================================================================
> --- linux-2.6.26-git6.orig/fs/ext4/migrate.c 2008-07-22 17:41:59.000000000 -0700
> +++ linux-2.6.26-git6/fs/ext4/migrate.c 2008-07-22 17:43:43.000000000 -0700
> @@ -52,8 +52,10 @@ static int finish_range(handle_t *handle
> * Since we are doing this in loop we may accumalate extra
> * credit. But below we try to not accumalate too much
> * of them by restarting the journal.
> + *
> + * extra 4 credits for: 1 superblock, 1 inode block, 2 quotas
> */
> - needed = ext4_ext_calc_credits_for_insert(inode, path);
> + needed = ext4_ext_calc_credits_for_insert(inode, path) + 4;
>
> /*
> * Make sure the credit we accumalated is not really high
>
>
On Tue, Jul 22, 2008 at 07:18:02PM -0600, Andreas Dilger wrote:
> On Jul 22, 2008 17:51 -0700, Mingming Cao wrote:
> > + * Calulate the total number of credits to reserve to fit
> > + * the modification of @num pages into a single transaction
> > + */
> > +int ext4_writepages_trans_blocks(struct inode *inode, int num)
> > +{
> > + int bpp = ext4_journal_blocks_per_page(inode);
> > + int nrblocks = num * bpp;
> > +
> > + if (!EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
> > + return ext4_writepages_trans_blocks_old(inode, nrblocks);
>
> This should be "if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))", and
> we should probably make it "unlikely()" since we expect most new files
> in an ext4 filesystem are extent mapped.
The cost of unlikely() can be pretty bad; the rule of thumb I've heard
is that unless it's less than 1% vs. 99%, you should probably avoid
using likely() and unlikely(). Given that there will be a fair number
of people who will be doing upgrades of existing ext3 filesystems, I
don't think using unlikely() would be a good choice here.
- Ted
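For readers following the precedence issue raised in the review: in C, unary "!" binds tighter than binary "&", so "!flags & FL" is parsed as "(!flags) & FL". The sketch below shows both forms side by side. DEMO_EXTENTS_FL and the function names are hypothetical and exist only for this illustration; the only property that matters is that the flag is not bit 0.

```c
/* Hypothetical flag value, used only for this illustration. */
#define DEMO_EXTENTS_FL 0x00080000u

/*
 * Form used in the first version of the patch. "!flags" evaluates to
 * 0 or 1, and masking that with a flag that is not bit 0 always gives 0,
 * so the "not extent mapped" branch is never taken.
 */
static int is_not_extent_buggy(unsigned int flags)
{
	return !flags & DEMO_EXTENTS_FL;
}

/* Form with explicit parentheses: test the flag first, then negate. */
static int is_not_extent_fixed(unsigned int flags)
{
	return !(flags & DEMO_EXTENTS_FL);
}
```

With the buggy form every inode would be routed to the extent-based credit calculation, whether or not it is extent mapped.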
On Jul 23, 2008 13:12 +0530, Aneesh Kumar wrote:
> On Tue, Jul 22, 2008 at 05:51:51PM -0700, Mingming Cao wrote:
> > @@ -1930,9 +1931,6 @@ int ext4_ext_calc_credits_for_insert(str
> > */
> > needed += (depth * 2) + (depth * 2);
> >
> > - /* any allocation modifies superblock */
> > - needed += 1;
>
> Why are we dropping the super block modification credit ? An insert of
> an extent can result in super block modification also. If the goal is to
> use ext4_writepages_trans_blocks everywhere then this change is correct.
> But i see many place not using ext4_writepages_trans_blocks.
We used to need to update the superblock on every allocation operation
(block or inode). That was removed a long time ago, and we rarely
update the superblock these days.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Wed, 2008-07-23 at 13:12 +0530, Aneesh Kumar K.V wrote:
> Hi Mingming,
>
> Comments below
> On Tue, Jul 22, 2008 at 05:51:51PM -0700, Mingming Cao wrote:
> > Ext4: journal credits reservation fixes for DIO, fallocate and delalloc writepages
> >
> > From: Mingming Cao <[email protected]>
> >
> > With delalloc, at writepages() time, we need to reserve enough credits to start
> > a new handle, to allow for possibly multiple segments of block allocation under a
> > single call to mpage_da_writepages(), so that the metadata updates fit into a single
> > transaction. This patch fixes this by calculating the credits needed to
> > write out a given number of dirty pages, taking discontiguous
> > block allocations into account. It fixes both extent files and non-extent files.
> >
> > This patch also fixes the journal credit reservation for DIO. Currently the
> > estimated credits for DIO are based only on the non-extent file format. That credit
> > is not enough for mballoc to allocate a single extent on an extent-based file.
> >
> > fallocate double-books the credits for modifying the superblock etc.; this patch
> > fixes that.
> >
> > This also fixes the credit reservation in the migration and defrag code.
> >
> > Signed-off-by: Mingming Cao <[email protected]>
> > ---
> > fs/ext4/defrag.c | 4 +-
> > fs/ext4/ext4.h | 4 +-
> > fs/ext4/extents.c | 37 ++++++++++++---------
> > fs/ext4/inode.c | 93 ++++++++++++++++++++++++------------------------------
> > fs/ext4/migrate.c | 4 +-
> > 5 files changed, 72 insertions(+), 70 deletions(-)
> >
> > Index: linux-2.6.26-git6/fs/ext4/ext4.h
> > ===================================================================
> > --- linux-2.6.26-git6.orig/fs/ext4/ext4.h 2008-07-21 17:35:17.000000000 -0700
> > +++ linux-2.6.26-git6/fs/ext4/ext4.h 2008-07-21 17:36:17.000000000 -0700
> > @@ -1149,7 +1149,7 @@ extern void ext4_truncate (struct inode
> > extern void ext4_set_inode_flags(struct inode *);
> > extern void ext4_get_inode_flags(struct ext4_inode_info *);
> > extern void ext4_set_aops(struct inode *inode);
> > -extern int ext4_writepage_trans_blocks(struct inode *);
> > +extern int ext4_writepages_trans_blocks(struct inode *, int num);
> > extern int ext4_block_truncate_page(handle_t *handle,
> > struct address_space *mapping, loff_t from);
> > extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page);
> > @@ -1314,7 +1314,7 @@ extern const struct inode_operations ext
> >
> > /* extents.c */
> > extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
> > -extern int ext4_ext_writepage_trans_blocks(struct inode *, int);
> > +extern int ext4_ext_writepages_trans_blocks(struct inode *inode, int);
> > extern int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
> > ext4_lblk_t iblock,
> > unsigned long max_blocks, struct buffer_head *bh_result,
> > Index: linux-2.6.26-git6/fs/ext4/extents.c
> > ===================================================================
> > --- linux-2.6.26-git6.orig/fs/ext4/extents.c 2008-07-21 17:35:17.000000000 -0700
> > +++ linux-2.6.26-git6/fs/ext4/extents.c 2008-07-21 17:36:17.000000000 -0700
> > @@ -1887,11 +1887,12 @@ static int ext4_ext_rm_idx(handle_t *han
> >
> > /*
> > * ext4_ext_calc_credits_for_insert:
> > - * This routine returns max. credits that the extent tree can consume.
> > + * This routine returns max. credits that needed to insert an extent
> > + * to the extent tree.
> > * It should be OK for low-performance paths like ->writepage()
> > * To allow many writing processes to fit into a single transaction,
> > - * the caller should calculate credits under i_data_sem and
> > - * pass the actual path.
> > + * When pass the actual path, the caller should calculate credits
> > + * under i_data_sem.
> > */
> > int ext4_ext_calc_credits_for_insert(struct inode *inode,
> > struct ext4_ext_path *path)
> > @@ -1930,9 +1931,6 @@ int ext4_ext_calc_credits_for_insert(str
> > */
> > needed += (depth * 2) + (depth * 2);
> >
> > - /* any allocation modifies superblock */
> > - needed += 1;
> > -
>
>
> Why are we dropping the super block modification credit ? An insert of
> an extent can result in super block modification also. If the goal is to
> use ext4_writepages_trans_blocks everywhere then this change is correct.
> But i see many place not using ext4_writepages_trans_blocks.
>
ext4_writepages_trans_blocks() takes care of the credits needed for
updating the superblock, just once. ext4_ext_calc_credits_for_insert()
calculates the credits needed for inserting a single extent. As you can
see in ext4_writepages_trans_blocks(), that value is multiplied by the
total number of blocks. If we accounted for the superblock credit in
ext4_ext_calc_credits_for_insert(), then the superblock updates for
multiple extent allocations would be overaccounted, and we would have to
remove them later in ext4_writepages_trans_blocks().
Other places calling ext4_ext_calc_credits_for_insert() (mostly
migrate.c and defrag.c) are updated to add 4 extra credits, covering the
superblock, inode block and quota blocks.
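The accounting argument can be sketched in plain C. This is an illustration under assumptions, not kernel code: the function names are hypothetical, the old formula is the line removed by the patch ("needed * num - (num - 1)"), and the new formula is the non-journalled branch of the patch ("num * needed + 2"), with quota credits left out of both.

```c
/*
 * Old scheme: the per-extent figure already contains one superblock
 * credit, so the total for nrblocks extents subtracts the duplicates.
 */
static int credits_old_scheme(int per_extent_with_sb, int nrblocks)
{
	return per_extent_with_sb * nrblocks - (nrblocks - 1);
}

/*
 * New scheme: the per-extent figure excludes the superblock; the
 * superblock and the inode block are added once for the whole call.
 */
static int credits_new_scheme(int per_extent_no_sb, int nrblocks)
{
	return per_extent_no_sb * nrblocks + 2;
}
```

For any per-extent cost p (excluding the superblock) the old total is p * n + 1 and the new total is p * n + 2, so the only difference is the inode block credit that is now added explicitly.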
> You also need to update the function comment saying that super block
> update is not accounted.Also it doesn't account for block bitmap,
> group descriptor and inode block update.
>
Credits for the block bitmap and group descriptors are accounted for in
ext4_ext_calc_credits_for_insert().
The inode block update is only accounted for once per writepages call, so
it is accounted for in
ext4_writepages_trans_blocks()/ext4_ext_writepages_trans_blocks().
> > return needed;
> > }
> >
> > @@ -2940,8 +2938,8 @@ void ext4_ext_truncate(struct inode *ino
> > /*
> > * probably first extent we're gonna free will be last in block
> > */
> > - err = ext4_writepage_trans_blocks(inode) + 3;
> > - handle = ext4_journal_start(inode, err);
> > + handle = ext4_journal_start(inode,
> > + ext4_writepages_trans_blocks(inode, 1) + 3);
>
>
> What is +3 for ? Can you add a comment on that. I understand that it
> was there before. I guess +3 is not needed.?
>
I guess it was there for the superblock + inode block + quota block, but
the superblock, inode block and quota blocks are already accounted for,
so it could probably be removed.
I did not pay a lot of attention to it. I guess a little overbooking is
probably safer, but I could remove it if you feel strongly about it.
>
>
> > if (IS_ERR(handle))
> > return;
> >
> > @@ -2994,18 +2992,28 @@ out_stop:
> > }
> >
> > /*
> > - * ext4_ext_writepage_trans_blocks:
> > + * ext4_ext_writepages_trans_blocks:
> > * calculate max number of blocks we could modify
> > * in order to allocate new block for an inode
> > */
> > -int ext4_ext_writepage_trans_blocks(struct inode *inode, int num)
> > +int ext4_ext_writepages_trans_blocks(struct inode *inode, int num)
> > {
>
> I guess the name is misleading. @num above is number of blocks. how
> about ext4_ext_writeblocks_trans(struct inode *node, int nrblocks)
>
>
> Also add a comment stating we consider the worst case where each block
> can result in an extent.
>
I will add a comment about the worst case, and change num to nrblocks.
> > int needed;
> >
> > + /* cost of adding a single extent:
> > + * index blocks, leafs, bitmaps,
> > + * groupdescp
> > + */
> > needed = ext4_ext_calc_credits_for_insert(inode, NULL);
> >
> > - /* caller wants to allocate num blocks, but note it includes sb */
> > - needed = needed * num - (num - 1);
> > + /*
> > + * For data=journalled mode need to account for the data blocks
> > + * Also need to add super block and inode block
> > + */
> > + if (ext4_should_journal_data(inode))
> > + needed = num * (needed + 1) + 2;
> > + else
> > + needed = num * needed + 2;
> >
>
>
> We are not accounting here for the bitmap and group descriptor.
> ext4_ext_calc_credits_for_insert is not accounting for the credit need
> to update bitmap and group descriptor.
> We also need to account updating
> bitmap and group descriptor for new blocks that would be allocated
> as a part of extent insert.
>
>
No need for that. It is already accounted for in
ext4_ext_calc_credits_for_insert(). In the worst case the tree depth is
5, and inserting an extent would require 2 updates (bitmap and group
descriptor) for each level (including the leaf, the new blocks), for both
the old tree and the new tree.
/*
* Index split can happen, we would need:
* allocate intermediate indexes (bitmap + group)
* + change two blocks at each level, but root (already included)
*/
needed += (depth * 2) + (depth * 2);
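As a worked instance of the index-split term quoted above, the following sketch evaluates it for the worst-case depth of 5 mentioned in the reply. The function name is hypothetical, and the split into two sub-terms follows the code comment; the rest of ext4_ext_calc_credits_for_insert() is not reproduced here.

```c
/*
 * Credits for a worst-case index split at the given tree depth,
 * following the quoted comment:
 *   depth * 2  for allocating intermediate index blocks
 *              (one bitmap + one group descriptor each)
 *   depth * 2  for the two tree blocks changed at each level
 */
static int index_split_credits(int depth)
{
	int alloc_metadata = depth * 2;
	int changed_blocks = depth * 2;

	return alloc_metadata + changed_blocks;
}
```

For depth 5 this term contributes 20 credits per inserted extent, which is why multiplying it by the number of blocks already covers the bitmap and group descriptor updates.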
> > #ifdef CONFIG_QUOTA
> > needed += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
> > @@ -3074,10 +3082,9 @@ long ext4_fallocate(struct inode *inode,
> > max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > - block;
> > /*
> > - * credits to insert 1 extent into extent tree + buffers to be able to
> > - * modify 1 super block, 1 block bitmap and 1 group descriptor.
> > + * credits to insert 1 extent into extent tree
> > */
> > - credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
> > + credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
>
>
>
> Why is this change ?
The credits for modifying the block bitmap and group descriptor are
already included in EXT4_DATA_TRANS_BLOCKS(inode->i_sb), via
EXT4_SINGLEDATA_TRANS_BLOCKS(sb):
/* Define the number of blocks we need to account to a transaction to
* modify one block of data.
*
* We may have to touch one inode, one bitmap buffer, up to three
* indirection blocks, the group and superblock summaries, and the data
* block to complete the transaction.
*
* For extents-enabled fs we may have to allocate and modify up to
* 5 levels of tree + root which are stored in the inode. */
#define EXT4_SINGLEDATA_TRANS_BLOCKS(sb) \
(EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS) \
|| test_opt(sb, EXTENTS) ? 27U : 8U)
The others, including the superblock, inode block, quota and xattr
blocks, are also included in:
/* Define the minimum size for a transaction which modifies data. This
* needs to take into account the fact that we may end up modifying two
* quota files too (one for the group, one for the user quota). The
* superblock only gets updated once, of course, so don't bother
* counting that again for the quota updates. */
#define EXT4_DATA_TRANS_BLOCKS(sb) \
(EXT4_SINGLEDATA_TRANS_BLOCKS(sb) + \
EXT4_XATTR_TRANS_BLOCKS - 2 + \
2*EXT4_QUOTA_TRANS_BLOCKS(sb))
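To make the macro arithmetic concrete, here is a small user-space sketch that mirrors the two definitions quoted above. The 27U/8U values come from the quoted EXT4_SINGLEDATA_TRANS_BLOCKS; xattr_blocks and quota_blocks are parameters standing in for EXT4_XATTR_TRANS_BLOCKS and EXT4_QUOTA_TRANS_BLOCKS(sb), whose definitions are not shown in this thread, so any values passed for them are assumptions.

```c
/* Mirrors the quoted EXT4_SINGLEDATA_TRANS_BLOCKS: 27 with extents, else 8. */
static unsigned int singledata_trans_blocks(int extents_enabled)
{
	return extents_enabled ? 27U : 8U;
}

/*
 * Mirrors the quoted EXT4_DATA_TRANS_BLOCKS. xattr_blocks and
 * quota_blocks are placeholders for EXT4_XATTR_TRANS_BLOCKS and
 * EXT4_QUOTA_TRANS_BLOCKS(sb); their real values are not part of
 * this thread.
 */
static unsigned int data_trans_blocks(int extents_enabled,
				      unsigned int xattr_blocks,
				      unsigned int quota_blocks)
{
	return singledata_trans_blocks(extents_enabled)
		+ xattr_blocks - 2
		+ 2 * quota_blocks;
}
```

For example, with extents enabled and placeholder values of 6 xattr blocks and 2 quota blocks, the total is 27 + 6 - 2 + 4 = 35 credits for a single chunk of allocation.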
>
> > mutex_lock(&inode->i_mutex);
> > retry:
> > while (ret >= 0 && ret < max_blocks) {
> > Index: linux-2.6.26-git6/fs/ext4/inode.c
> > ===================================================================
> > --- linux-2.6.26-git6.orig/fs/ext4/inode.c 2008-07-21 17:35:17.000000000 -0700
> > +++ linux-2.6.26-git6/fs/ext4/inode.c 2008-07-21 17:44:45.000000000 -0700
> > @@ -1015,15 +1015,6 @@ static void ext4_da_update_reserve_space
> >
> > /* Maximum number of blocks we map for direct IO at once. */
> > #define DIO_MAX_BLOCKS 4096
> > -/*
> > - * Number of credits we need for writing DIO_MAX_BLOCKS:
> > - * We need sb + group descriptor + bitmap + inode -> 4
> > - * For B blocks with A block pointers per block we need:
> > - * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect).
> > - * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25.
> > - */
> > -#define DIO_CREDITS 25
> > -
> >
> > /*
> > *
> > @@ -1142,13 +1133,13 @@ int ext4_get_block(struct inode *inode,
> > handle_t *handle = ext4_journal_current_handle();
> > int ret = 0, started = 0;
> > unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> > + int dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
>
> Again this should be
> int dio_credits = ext4_writeblocks_trans(inode, DIO_MAX_BLOCKS);
>
No. The DIO case is different from writepages(). When get_block() is
called from the DIO path, it is called in a loop, and the credits are
reserved inside the loop. Each call to get_block() returns a single
extent, so we should use EXT4_DATA_TRANS_BLOCKS(inode->i_sb), which
calculates the credits for a single chunk of allocation.
ext4_da_writepages() is different: we have to reserve the credits
outside the loop that calls get_block(). Since the underlying
get_block() could be called multiple times, we need to cover the worst
case, so ext4_writepages_trans_blocks() is called to handle it.
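The difference between the two reservation patterns can be modelled with a toy calculation. This is only a sketch under assumptions: the function names are hypothetical, no journal is involved, and the writepages formula is the non-journalled branch from the patch without quota credits.

```c
/*
 * DIO-style: a handle is started and stopped around every get_block()
 * call, so the largest reservation held at any time is the cost of a
 * single extent, however many calls are made.
 */
static int dio_peak_reservation(int ncalls, int per_extent_credits)
{
	int peak = 0;
	int i;

	for (i = 0; i < ncalls; i++) {
		int handle_credits = per_extent_credits; /* handle started */

		if (handle_credits > peak)
			peak = handle_credits;
		/* handle stopped here; nothing carries over */
	}
	return peak;
}

/*
 * writepages-style: one handle is started before the loop and must
 * cover the worst case of one extent per block, plus the superblock
 * and inode block once.
 */
static int writepages_reservation(int nrblocks, int per_extent_credits)
{
	return nrblocks * per_extent_credits + 2;
}
```

The first function stays constant as the number of calls grows, while the second grows linearly with the number of blocks, which is why a fixed constant was not enough for the writepages path.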
> >
> > if (create && !handle) {
> > /* Direct IO write... */
> > if (max_blocks > DIO_MAX_BLOCKS)
> > max_blocks = DIO_MAX_BLOCKS;
> > - handle = ext4_journal_start(inode, DIO_CREDITS +
> > - 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
> > + handle = ext4_journal_start(inode, dio_credits);
> > if (IS_ERR(handle)) {
> > ret = PTR_ERR(handle);
> > goto out;
> > @@ -1327,7 +1318,7 @@ static int ext4_write_begin(struct file
> > struct page **pagep, void **fsdata)
> > {
> > struct inode *inode = mapping->host;
> > - int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
> > + int ret, needed_blocks = ext4_writepages_trans_blocks(inode, 1);
> > handle_t *handle;
> > int retries = 0;
> > struct page *page;
> > @@ -2179,18 +2170,7 @@ static int ext4_da_writepage(struct page
> > return ret;
> > }
> >
> > -/*
> > - * For now just follow the DIO way to estimate the max credits
> > - * needed to write out EXT4_MAX_WRITEBACK_PAGES.
> > - * todo: need to calculate the max credits need for
> > - * extent based files, currently the DIO credits is based on
> > - * indirect-blocks mapping way.
> > - *
> > - * Probably should have a generic way to calculate credits
> > - * for DIO, writepages, and truncate
> > - */
> > #define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
> > -#define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS
> >
> > static int ext4_da_writepages(struct address_space *mapping,
> > struct writeback_control *wbc)
> > @@ -2210,13 +2190,8 @@ static int ext4_da_writepages(struct add
> > if (!mapping->nrpages)
> > return 0;
> >
> > - /*
> > - * Estimate the worse case needed credits to write out
> > - * EXT4_MAX_BUF_BLOCKS pages
> > - */
> > - needed_blocks = EXT4_MAX_WRITEBACK_CREDITS;
> > -
> > to_write = wbc->nr_to_write;
> > +
> > if (!wbc->range_cyclic) {
> > /*
> > * If range_cyclic is not set force range_cont
> > @@ -2227,6 +2202,20 @@ static int ext4_da_writepages(struct add
> > }
> >
> > while (!ret && to_write) {
> > + /*
> > + * set the max dirty pages could be write at a time
> > + * to fit into the reserved transaction credits
> > + */
> > + if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
> > + wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
> > +
> > + /*
> > + * Estimate the worse case needed credits to write out
> > + * to_write pages
> > + */
> > + needed_blocks = ext4_writepages_trans_blocks(inode,
> > + wbc->nr_to_write);
> > +
> > /* start a new transaction*/
> > handle = ext4_journal_start(inode, needed_blocks);
> > if (IS_ERR(handle)) {
> > @@ -2246,12 +2235,6 @@ static int ext4_da_writepages(struct add
> > }
> >
> > }
> > - /*
> > - * set the max dirty pages could be write at a time
> > - * to fit into the reserved transaction credits
> > - */
> > - if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
> > - wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
> >
> > to_write -= wbc->nr_to_write;
> > ret = mpage_da_writepages(mapping, wbc,
> > @@ -2612,7 +2595,8 @@ static int __ext4_journalled_writepage(s
> > * references to buffers so we are safe */
> > unlock_page(page);
> >
> > - handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
> > + handle = ext4_journal_start(inode,
> > + ext4_writepages_trans_blocks(inode, 1));
> > if (IS_ERR(handle)) {
> > ret = PTR_ERR(handle);
> > goto out;
> > @@ -4286,20 +4270,20 @@ int ext4_getattr(struct vfsmount *mnt, s
> > /*
> > * How many blocks doth make a writepage()?
> > *
> > - * With N blocks per page, it may be:
> > - * N data blocks
> > + * With N blocks per page, and P pages, it may be:
> > + * N*P data blocks
> > * 2 indirect block
> > * 2 dindirect
> > * 1 tindirect
> > - * N+5 bitmap blocks (from the above)
> > - * N+5 group descriptor summary blocks
> > + * N*P+5 bitmap blocks (from the above)
> > + * N*P+5 group descriptor summary blocks
> > * 1 inode block
> > * 1 superblock.
> > * 2 * EXT4_SINGLEDATA_TRANS_BLOCKS for the quote files
> > *
> > - * 3 * (N + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
> > + * 3 * (N*P + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
> > *
> > - * With ordered or writeback data it's the same, less the N data blocks.
> > + * With ordered or writeback data it's the same, less the N*P data blocks.
> > *
> > * If the inode's direct blocks can hold an integral number of pages then a
> > * page cannot straddle two indirect blocks, and we can only touch one indirect
> > @@ -4310,19 +4294,15 @@ int ext4_getattr(struct vfsmount *mnt, s
> > * block and work out the exact number of indirects which are touched. Pah.
> > */
> >
> > -int ext4_writepage_trans_blocks(struct inode *inode)
> > +static int ext4_writepages_trans_blocks_old(struct inode *inode, int num)
> > {
>
> a better name would be
>
> static int ext4_writeblocks_trans_old(struct inode *inode, int nrblocks)
>
>
>
> > - int bpp = ext4_journal_blocks_per_page(inode);
> > - int indirects = (EXT4_NDIR_BLOCKS % bpp) ? 5 : 3;
> > + int indirects = (EXT4_NDIR_BLOCKS % num) ? 5 : 3;
> > int ret;
> >
> > - if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
> > - return ext4_ext_writepage_trans_blocks(inode, bpp);
> > -
> > if (ext4_should_journal_data(inode))
> > - ret = 3 * (bpp + indirects) + 2;
> > + ret = 3 * (num + indirects) + 2;
> > else
> > - ret = 2 * (bpp + indirects) + 2;
> > + ret = 2 * (num + indirects) + 2;
>
>
> With non journalled moded we should just decrease num. I guess the above
> should be
> if (ext4_should_journal_data(inode))
> ret = 3 * (num + indirects) + 2;
> else
> ret = 3 * (num + indirects) + 2 - num;
>
>
>
Well, I think in journalled mode we need to journal the contents of
the indirect/double indirect blocks as well, no?
Mingming
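For reference, the two candidate formulas discussed above can be compared numerically. This is a sketch with hypothetical function names; both bodies are copied from the expressions in the patch and in the review comment, with quota credits left out.

```c
/* Formula in the patch for non-extent files. */
static int patch_formula(int nrblocks, int indirects, int journal_data)
{
	if (journal_data)
		return 3 * (nrblocks + indirects) + 2;
	return 2 * (nrblocks + indirects) + 2;
}

/*
 * Formula suggested in the review: start from the journalled figure
 * and subtract only the data blocks when data is not journalled.
 */
static int suggested_formula(int nrblocks, int indirects, int journal_data)
{
	int ret = 3 * (nrblocks + indirects) + 2;

	if (!journal_data)
		ret -= nrblocks;
	return ret;
}
```

The two agree in journalled mode. In ordered or writeback mode the suggested version reserves exactly "indirects" more credits than the patch, so the question is whether those extra indirect-block credits are needed outside journalled mode.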
On Tue, 2008-07-22 at 19:18 -0600, Andreas Dilger wrote:
> On Jul 22, 2008 17:51 -0700, Mingming Cao wrote:
> > + * Calulate the total number of credits to reserve to fit
> > + * the modification of @num pages into a single transaction
> > + */
> > +int ext4_writepages_trans_blocks(struct inode *inode, int num)
> > +{
> > + int bpp = ext4_journal_blocks_per_page(inode);
> > + int nrblocks = num * bpp;
> > +
> > + if (!EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
> > + return ext4_writepages_trans_blocks_old(inode, nrblocks);
>
> This should be "if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))",
Thanks for catching this.
Mingming
Updated patch with Andreas's and Aneesh's review comments.
I added this patch to the patch queue; let me know if you have other concerns.
Ext4: journal credits reservation fixes for DIO, fallocate and delalloc writepages
From: Mingming Cao <[email protected]>
With delalloc, at writepages() time, we need to reserve enough credits to start
a new handle, to allow for possibly multiple segments of block allocation under a
single call to mpage_da_writepages(), so that the metadata updates fit into a single
transaction. This patch fixes this by calculating the credits needed to
write out a given number of dirty pages, taking discontiguous
block allocations into account. It fixes both extent files and non-extent files.
This patch also fixes the journal credit reservation for DIO. Currently the
estimated credits for DIO are based only on the non-extent file format. That credit
is not enough for mballoc to allocate a single extent on an extent-based file.
fallocate double-books the credits for modifying the superblock etc.; this patch
fixes that.
This also fixes the credit reservation in the migration and defrag code.
Signed-off-by: Mingming Cao <[email protected]>
---
fs/ext4/defrag.c | 5 +-
fs/ext4/ext4.h | 4 -
fs/ext4/ext4_extents.h | 3 -
fs/ext4/extents.c | 53 +++++++++++++++---------
fs/ext4/inode.c | 105 +++++++++++++++++++++++++------------------------
fs/ext4/migrate.c | 4 +
6 files changed, 99 insertions(+), 75 deletions(-)
Index: linux-2.6.26-git6/fs/ext4/ext4.h
===================================================================
--- linux-2.6.26-git6.orig/fs/ext4/ext4.h 2008-07-21 17:35:17.000000000 -0700
+++ linux-2.6.26-git6/fs/ext4/ext4.h 2008-07-25 17:32:21.000000000 -0700
@@ -1149,7 +1149,7 @@ extern void ext4_truncate (struct inode
extern void ext4_set_inode_flags(struct inode *);
extern void ext4_get_inode_flags(struct ext4_inode_info *);
extern void ext4_set_aops(struct inode *inode);
-extern int ext4_writepage_trans_blocks(struct inode *);
+extern int ext4_writepages_trans_blocks(struct inode *, int nrpages);
extern int ext4_block_truncate_page(handle_t *handle,
struct address_space *mapping, loff_t from);
extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page);
@@ -1314,7 +1314,7 @@ extern const struct inode_operations ext
/* extents.c */
extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
-extern int ext4_ext_writepage_trans_blocks(struct inode *, int);
+extern int ext4_ext_writeblocks_trans_credits(struct inode *inode, int);
extern int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ext4_lblk_t iblock,
unsigned long max_blocks, struct buffer_head *bh_result,
Index: linux-2.6.26-git6/fs/ext4/extents.c
===================================================================
--- linux-2.6.26-git6.orig/fs/ext4/extents.c 2008-07-21 17:35:17.000000000 -0700
+++ linux-2.6.26-git6/fs/ext4/extents.c 2008-07-25 17:33:47.000000000 -0700
@@ -1886,14 +1886,20 @@ static int ext4_ext_rm_idx(handle_t *han
}
/*
- * ext4_ext_calc_credits_for_insert:
- * This routine returns max. credits that the extent tree can consume.
+ * ext4_ext_calc_credits_for_single_extent:
+ * This routine returns max. credits that needed to insert an extent
+ * to the extent tree.
* It should be OK for low-performance paths like ->writepage()
* To allow many writing processes to fit into a single transaction,
- * the caller should calculate credits under i_data_sem and
- * pass the actual path.
+ * When pass the actual path, the caller should calculate credits
+ * under i_data_sem.
+ *
+ * For inserting a single extent, in the worst case the extent tree depth is 5
+ * for both the old tree and the new tree; for every level we need to reserve
+ * credits to log the bitmap and block group descriptors
+ *
*/
-int ext4_ext_calc_credits_for_insert(struct inode *inode,
+int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
struct ext4_ext_path *path)
{
int depth, needed;
@@ -1930,9 +1936,6 @@ int ext4_ext_calc_credits_for_insert(str
*/
needed += (depth * 2) + (depth * 2);
- /* any allocation modifies superblock */
- needed += 1;
-
return needed;
}
@@ -2940,8 +2943,8 @@ void ext4_ext_truncate(struct inode *ino
/*
* probably first extent we're gonna free will be last in block
*/
- err = ext4_writepage_trans_blocks(inode) + 3;
- handle = ext4_journal_start(inode, err);
+ handle = ext4_journal_start(inode,
+ ext4_writepages_trans_blocks(inode, 1) + 3);
if (IS_ERR(handle))
return;
@@ -2994,18 +2997,31 @@ out_stop:
}
/*
- * ext4_ext_writepage_trans_blocks:
+ * ext4_ext_writeblocks_trans_credits:
* calculate max number of blocks we could modify
- * in order to allocate new block for an inode
+ * in order to allocate the required number of new blocks
+ *
+ * In the worst case, one block per extent.
+ *
*/
-int ext4_ext_writepage_trans_blocks(struct inode *inode, int num)
+int ext4_ext_writeblocks_trans_credits(struct inode *inode, int nrblocks)
{
int needed;
- needed = ext4_ext_calc_credits_for_insert(inode, NULL);
+ /* cost of adding a single extent:
+ * index blocks, leafs, bitmaps,
+ * groupdescp
+ */
+ needed = ext4_ext_calc_credits_for_single_extent(inode, NULL);
- /* caller wants to allocate num blocks, but note it includes sb */
- needed = needed * num - (num - 1);
+ /*
+ * For data=journalled mode need to account for the data blocks
+ * Also need to add super block and inode block
+ */
+ if (ext4_should_journal_data(inode))
+ needed = nrblocks * (needed + 1) + 2;
+ else
+ needed = nrblocks * needed + 2;
#ifdef CONFIG_QUOTA
needed += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
@@ -3074,10 +3090,9 @@ long ext4_fallocate(struct inode *inode,
max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
- block;
/*
- * credits to insert 1 extent into extent tree + buffers to be able to
- * modify 1 super block, 1 block bitmap and 1 group descriptor.
+ * credits to insert 1 extent into extent tree
*/
- credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
+ credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
mutex_lock(&inode->i_mutex);
retry:
while (ret >= 0 && ret < max_blocks) {
Index: linux-2.6.26-git6/fs/ext4/inode.c
===================================================================
--- linux-2.6.26-git6.orig/fs/ext4/inode.c 2008-07-21 17:35:17.000000000 -0700
+++ linux-2.6.26-git6/fs/ext4/inode.c 2008-07-25 17:36:22.000000000 -0700
@@ -1015,15 +1015,6 @@ static void ext4_da_update_reserve_space
/* Maximum number of blocks we map for direct IO at once. */
#define DIO_MAX_BLOCKS 4096
-/*
- * Number of credits we need for writing DIO_MAX_BLOCKS:
- * We need sb + group descriptor + bitmap + inode -> 4
- * For B blocks with A block pointers per block we need:
- * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect).
- * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25.
- */
-#define DIO_CREDITS 25
-
/*
*
@@ -1142,13 +1133,13 @@ int ext4_get_block(struct inode *inode,
handle_t *handle = ext4_journal_current_handle();
int ret = 0, started = 0;
unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
+ int dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
if (create && !handle) {
/* Direct IO write... */
if (max_blocks > DIO_MAX_BLOCKS)
max_blocks = DIO_MAX_BLOCKS;
- handle = ext4_journal_start(inode, DIO_CREDITS +
- 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
+ handle = ext4_journal_start(inode, dio_credits);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
goto out;
@@ -1327,7 +1318,7 @@ static int ext4_write_begin(struct file
struct page **pagep, void **fsdata)
{
struct inode *inode = mapping->host;
- int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
+ int ret, needed_blocks = ext4_writepages_trans_blocks(inode, 1);
handle_t *handle;
int retries = 0;
struct page *page;
@@ -2179,18 +2170,7 @@ static int ext4_da_writepage(struct page
return ret;
}
-/*
- * For now just follow the DIO way to estimate the max credits
- * needed to write out EXT4_MAX_WRITEBACK_PAGES.
- * todo: need to calculate the max credits need for
- * extent based files, currently the DIO credits is based on
- * indirect-blocks mapping way.
- *
- * Probably should have a generic way to calculate credits
- * for DIO, writepages, and truncate
- */
#define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
-#define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS
static int ext4_da_writepages(struct address_space *mapping,
struct writeback_control *wbc)
@@ -2210,13 +2190,8 @@ static int ext4_da_writepages(struct add
if (!mapping->nrpages)
return 0;
- /*
- * Estimate the worse case needed credits to write out
- * EXT4_MAX_BUF_BLOCKS pages
- */
- needed_blocks = EXT4_MAX_WRITEBACK_CREDITS;
-
to_write = wbc->nr_to_write;
+
if (!wbc->range_cyclic) {
/*
* If range_cyclic is not set force range_cont
@@ -2227,6 +2202,20 @@ static int ext4_da_writepages(struct add
}
while (!ret && to_write) {
+ /*
+ * set the max dirty pages could be write at a time
+ * to fit into the reserved transaction credits
+ */
+ if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
+ wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
+
+ /*
+ * Estimate the worse case needed credits to write out
+ * to_write pages
+ */
+ needed_blocks = ext4_writepages_trans_blocks(inode,
+ wbc->nr_to_write);
+
/* start a new transaction*/
handle = ext4_journal_start(inode, needed_blocks);
if (IS_ERR(handle)) {
@@ -2246,12 +2235,6 @@ static int ext4_da_writepages(struct add
}
}
- /*
- * set the max dirty pages could be write at a time
- * to fit into the reserved transaction credits
- */
- if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
- wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
to_write -= wbc->nr_to_write;
ret = mpage_da_writepages(mapping, wbc,
@@ -2612,7 +2595,8 @@ static int __ext4_journalled_writepage(s
* references to buffers so we are safe */
unlock_page(page);
- handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
+ handle = ext4_journal_start(inode,
+ ext4_writepages_trans_blocks(inode, 1));
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
goto out;
@@ -4286,20 +4270,20 @@ int ext4_getattr(struct vfsmount *mnt, s
/*
* How many blocks doth make a writepage()?
*
- * With N blocks per page, it may be:
- * N data blocks
+ * With N blocks per page, and P pages, it may be:
+ * N*P data blocks
* 2 indirect block
* 2 dindirect
* 1 tindirect
- * N+5 bitmap blocks (from the above)
- * N+5 group descriptor summary blocks
+ * N*P+5 bitmap blocks (from the above)
+ * N*P+5 group descriptor summary blocks
* 1 inode block
* 1 superblock.
* 2 * EXT4_SINGLEDATA_TRANS_BLOCKS for the quote files
*
- * 3 * (N + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
+ * 3 * (N*P + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
*
- * With ordered or writeback data it's the same, less the N data blocks.
+ * With ordered or writeback data it's the same, less the N*P data blocks.
*
* If the inode's direct blocks can hold an integral number of pages then a
* page cannot straddle two indirect blocks, and we can only touch one indirect
@@ -4310,19 +4294,15 @@ int ext4_getattr(struct vfsmount *mnt, s
* block and work out the exact number of indirects which are touched. Pah.
*/
-int ext4_writepage_trans_blocks(struct inode *inode)
+static int ext4_writeblocks_trans_credits_old(struct inode *inode, int nrblocks)
{
- int bpp = ext4_journal_blocks_per_page(inode);
- int indirects = (EXT4_NDIR_BLOCKS % bpp) ? 5 : 3;
+ int indirects = (EXT4_NDIR_BLOCKS % nrblocks) ? 5 : 3;
int ret;
- if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
- return ext4_ext_writepage_trans_blocks(inode, bpp);
-
if (ext4_should_journal_data(inode))
- ret = 3 * (bpp + indirects) + 2;
+ ret = 3 * (nrblocks + indirects) + 2;
else
- ret = 2 * (bpp + indirects) + 2;
+ ret = 2 * (nrblocks + indirects) + 2;
#ifdef CONFIG_QUOTA
/* We know that structure was already allocated during DQUOT_INIT so
@@ -4334,6 +4314,31 @@ int ext4_writepage_trans_blocks(struct i
}
/*
+ * Calculate the total number of credits to reserve to fit
+ * the modification of @nrpages pages into a single transaction.
+ *
+ * This could be called via ext4_write_begin() or later via
+ * ext4_da_writepages() in the delayed allocation case.
+ *
+ * In both cases it is possible that we allocate multiple
+ * chunks of blocks. We need to consider the worst case, which
+ * is one new block per extent.
+ *
+ * For Direct IO and fallocate, the journal credits reservation
+ * is based on a single extent allocation, so they can use
+ * EXT4_DATA_TRANS_BLOCKS to get the credits needed to log a
+ * single chunk of allocation.
+ */
+int ext4_writepages_trans_blocks(struct inode *inode, int nrpages)
+{
+ int bpp = ext4_journal_blocks_per_page(inode);
+ int nrblocks = nrpages * bpp;
+
+ if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+ return ext4_writeblocks_trans_credits_old(inode, nrblocks);
+ return ext4_ext_writeblocks_trans_credits(inode, nrblocks);
+}
+/*
* The caller must have previously called ext4_reserve_inode_write().
* Give this, we know that the caller already has write access to iloc->bh.
*/
Index: linux-2.6.26-git6/fs/ext4/defrag.c
===================================================================
--- linux-2.6.26-git6.orig/fs/ext4/defrag.c 2008-07-21 17:43:27.000000000 -0700
+++ linux-2.6.26-git6/fs/ext4/defrag.c 2008-07-25 17:27:50.000000000 -0700
@@ -1385,7 +1385,8 @@ ext4_defrag_alloc_blocks(handle_t *handl
struct buffer_head *bh = NULL;
int err, i, credits = 0;
- credits = ext4_ext_calc_credits_for_insert(dest_inode, dest_path);
+ credits = ext4_ext_calc_credits_for_single_extent(dest_inode, dest_path)
+ + 4;
err = ext4_ext_journal_restart(handle,
credits + EXT4_TRANS_META_BLOCKS);
if (err)
@@ -1494,7 +1495,7 @@ ext4_defrag_partial(struct inode *tmp_in
* It needs twice the amount of ordinary journal buffers because
* inode and tmp_inode may change each different metadata blocks.
*/
- jblocks = ext4_writepage_trans_blocks(org_inode) * 2;
+ jblocks = ext4_writepages_trans_blocks(org_inode, 1) * 2;
handle = ext4_journal_start(org_inode, jblocks);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
Index: linux-2.6.26-git6/fs/ext4/migrate.c
===================================================================
--- linux-2.6.26-git6.orig/fs/ext4/migrate.c 2008-07-22 17:41:59.000000000 -0700
+++ linux-2.6.26-git6/fs/ext4/migrate.c 2008-07-25 17:26:56.000000000 -0700
@@ -52,8 +52,10 @@ static int finish_range(handle_t *handle
* Since we are doing this in loop we may accumalate extra
* credit. But below we try to not accumalate too much
* of them by restarting the journal.
+ *
+ * extra 4 credits for: 1 superblock, 1 inode block, 2 quotas
*/
- needed = ext4_ext_calc_credits_for_insert(inode, path);
+ needed = ext4_ext_calc_credits_for_single_extent(inode, path) + 4;
/*
* Make sure the credit we accumalated is not really high
Index: linux-2.6.26-git6/fs/ext4/ext4_extents.h
===================================================================
--- linux-2.6.26-git6.orig/fs/ext4/ext4_extents.h 2008-07-25 17:28:14.000000000 -0700
+++ linux-2.6.26-git6/fs/ext4/ext4_extents.h 2008-07-25 17:34:06.000000000 -0700
@@ -229,7 +229,8 @@ extern int ext4_ext_calc_metadata_amount
extern ext4_fsblk_t idx_pblock(struct ext4_extent_idx *);
extern void ext4_ext_store_pblock(struct ext4_extent *, ext4_fsblk_t);
extern int ext4_extent_tree_init(handle_t *, struct inode *);
-extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
+ struct ext4_ext_path *path);
extern int ext4_ext_try_to_merge(struct inode *inode,
struct ext4_ext_path *path,
struct ext4_extent *);
On Fri, Jul 25, 2008 at 12:26:42PM -0700, Mingming Cao wrote:
>
......
> > > */
> > > int ext4_ext_calc_credits_for_insert(struct inode *inode,
> > > struct ext4_ext_path *path)
> > > @@ -1930,9 +1931,6 @@ int ext4_ext_calc_credits_for_insert(str
> > > */
> > > needed += (depth * 2) + (depth * 2);
> > >
> > > - /* any allocation modifies superblock */
> > > - needed += 1;
> > > -
> >
> >
> > Why are we dropping the super block modification credit ? An insert of
> > an extent can result in super block modification also. If the goal is to
> > use ext4_writepages_trans_blocks everywhere then this change is correct.
> > But i see many place not using ext4_writepages_trans_blocks.
> >
> ext4_writepages_trans_blocks() will take care of the credits needed for
> updating the superblock, just once. ext4_ext_calc_credits_for_insert()
> calculates the credits needed for inserting a single extent. As you can
> see in ext4_writepages_trans_blocks(), it multiplies that by the
> total number of blocks. If we accounted for the superblock credit in
> ext4_ext_calc_credits_for_insert(), then superblock updates for
> multiple extent allocations would be over-accounted and would have to be
> removed later in ext4_writepages_trans_blocks().
ext4_ext_calc_credits_for_insert() currently does not do that when
path != NULL. I am attaching a patch which reflects some of the changes
I suggested.
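As a rough illustration of the over-accounting argument quoted above (a
user-space sketch, not kernel code; the function names and the per_extent
parameter are made up for this example): if the superblock credit is folded
into the per-extent estimate it is counted once per extent, whereas adding it
in the caller counts it once per writepages() call.

```c
/* Hypothetical user-space model of the credit arithmetic; not kernel code. */

/* Superblock credit folded into every per-extent estimate:
 * it ends up being counted nrblocks times. */
static int credits_sb_per_extent(int per_extent, int nrblocks)
{
	return nrblocks * (per_extent + 1);
}

/* Superblock credit added by the caller, once per call. */
static int credits_sb_once(int per_extent, int nrblocks)
{
	return nrblocks * per_extent + 1;
}
```

The two results differ by nrblocks - 1, which is the amount the old
"needed * num - (num - 1)" expression had to subtract again.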
>
> Other places calling ext4_ext_calc_credits_for_insert() (mostly
> migrate.c and defrag.c) are updated to add 4 extra credits, covering the
> superblock, inode block and quota blocks.
I guess it should not be +4 but should be +2 + 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
>
> > You also need to update the function comment to say that the super block
> > update is not accounted for. Also, it doesn't account for the block bitmap,
> > group descriptor and inode block updates.
> >
>
> Credits for block bitmap and group descriptors are accounted in
> ext4_ext_calc_credits_for_insert()
>
> inode block update is only accounted once per writepages, so it's
> accounted in
> ext4_writepages_trans_blocks()/ext4_ext_writepages_trans_blocks()
>
....
> > > *
> > > @@ -1142,13 +1133,13 @@ int ext4_get_block(struct inode *inode,
> > > handle_t *handle = ext4_journal_current_handle();
> > > int ret = 0, started = 0;
> > > unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> > > + int dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
> >
> > Again this should be
> > int dio_credits = ext4_writeblocks_trans(inode, DIO_MAX_BLOCKS);
> >
> No. The DIO case is different from writepages(). When get_block() is called
> from the DIO path, it is called in a loop, and the credits are
> reserved inside the loop. Each call to get_block() returns a single
> extent, so we should use EXT4_DATA_TRANS_BLOCKS(inode->i_sb), which
> calculates the credits for a single chunk of allocation.
That is true only for the extent format. With noextents we need something
like the following:
if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
dio_credits = ext4_writeblocks_trans_credits_old(inode, max_blocks);
else {
/*
* For extent format get_block return only one extent
*/
dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
}
>
> ext4_da_writepages() is different: we have to reserve the credits
> outside the loop that calls get_block(). Since the underlying
> get_block() could be called multiple times, we need to cover the worst
> case, so ext4_writepages_trans_blocks() is called to handle it.
>
> > >
> > > if (create && !handle) {
> > > /* Direct IO write... */
> > > if (max_blocks > DIO_MAX_BLOCKS)
> > > max_blocks = DIO_MAX_BLOCKS;
> > > - handle = ext4_journal_start(inode, DIO_CREDITS +
> > > - 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
> > > + handle = ext4_journal_start(inode, dio_credits);
> > > if (IS_ERR(handle)) {
> > > ret = PTR_ERR(handle);
> > > goto out;
> > > @@ -1327,7 +1318,7 @@ static int ext4_write_begin(struct file
> > > struct page **pagep, void **fsdata)
> > > {
> > > struct inode *inode = mapping->host;
....
> >
> > > - int bpp = ext4_journal_blocks_per_page(inode);
> > > - int indirects = (EXT4_NDIR_BLOCKS % bpp) ? 5 : 3;
> > > + int indirects = (EXT4_NDIR_BLOCKS % num) ? 5 : 3;
> > > int ret;
> > >
> > > - if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
> > > - return ext4_ext_writepage_trans_blocks(inode, bpp);
> > > -
> > > if (ext4_should_journal_data(inode))
> > > - ret = 3 * (bpp + indirects) + 2;
> > > + ret = 3 * (num + indirects) + 2;
> > > else
> > > - ret = 2 * (bpp + indirects) + 2;
> > > + ret = 2 * (num + indirects) + 2;
> >
> >
> > With non-journalled mode we should just subtract num. I guess the above
> > should be
> > if (ext4_should_journal_data(inode))
> > ret = 3 * (num + indirects) + 2;
> > else
> > ret = 3 * (num + indirects) + 2 - num;
> >
> >
> >
>
> Well, I think in the journalled mode we need to journal the content of
> the indirect/double indirect blocks also, no?
>
With non-journalled mode we still need to journal the indirect, dindirect
and tindirect blocks, so this should be
if (ext4_should_journal_data(inode))
ret = 3 * (num + indirects) + 2;
else
ret = 3 * (num + indirects) + 2 - num;
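A small stand-alone sketch of the two formulas may make the difference easier
to see. It models only the arithmetic for non-extent files, leaves out the
quota credits, and uses hypothetical function names; num and indirects have
the same meaning as in the snippet above.

```c
#include <stdbool.h>

/* Hypothetical stand-alone model; quota credits are left out. */

/* Estimate as in the posted patch: factor 2 when data is not journalled. */
static int old_estimate(int num, int indirects, bool journal_data)
{
	return (journal_data ? 3 : 2) * (num + indirects) + 2;
}

/* Estimate as proposed in this reply. */
static int proposed_estimate(int num, int indirects, bool journal_data)
{
	/* data and indirect blocks, plus one bitmap and one group
	 * descriptor block for each, plus inode block and superblock */
	int ret = 3 * (num + indirects) + 2;

	if (!journal_data)
		ret -= num;	/* only the data blocks are not journalled */
	return ret;
}
```

For num = 1 and indirects = 5 the old non-journalled estimate is 14 credits
and the proposed one is 19; the gap is always exactly the value of indirects,
i.e. the indirect blocks that still have to be journalled.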
Attaching the patch for easy review
diff --git a/fs/ext4/defrag.c b/fs/ext4/defrag.c
index 7c819fd..46e9600 100644
--- a/fs/ext4/defrag.c
+++ b/fs/ext4/defrag.c
@@ -1385,8 +1385,9 @@ ext4_defrag_alloc_blocks(handle_t *handle, struct inode *org_inode,
struct buffer_head *bh = NULL;
int err, i, credits = 0;
- credits = ext4_ext_calc_credits_for_single_extent(dest_inode, dest_path)
- + 4;
+ credits = ext4_ext_calc_credits_for_single_extent(dest_inode,
+ dest_path) + 2 +
+ 2 * EXT4_QUOTA_TRANS_BLOCKS(dest_inode->i_sb);
err = ext4_ext_journal_restart(handle,
credits + EXT4_TRANS_META_BLOCKS);
if (err)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 77f4f94..969b1e6 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1898,6 +1898,9 @@ static int ext4_ext_rm_idx(handle_t *handle, struct inode *inode,
* for old tree and new tree, for every level we need to reserve
* credits to log the bitmap and block group descriptors
*
+ * This doesn't take into account credit needed for the update of
+ * super block + inode block + quota files
+ *
*/
int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
struct ext4_ext_path *path)
@@ -1909,7 +1912,8 @@ int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
depth = ext_depth(inode);
if (le16_to_cpu(path[depth].p_hdr->eh_entries)
< le16_to_cpu(path[depth].p_hdr->eh_max))
- return 1;
+ /* 1 group desc + 1 block bitmap for allocated blocks */
+ return 2;
}
/*
@@ -1919,7 +1923,7 @@ int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
*/
depth = 5;
- /* allocation of new data block(s) */
+ /* 1 group desc + 1 block bitmap for allocated blocks */
needed = 2;
/*
@@ -2059,9 +2063,7 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
correct_index = 1;
credits += (ext_depth(inode)) + 1;
}
-#ifdef CONFIG_QUOTA
credits += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
-#endif
err = ext4_ext_journal_restart(handle, credits);
if (err)
@@ -3023,9 +3025,7 @@ int ext4_ext_writeblocks_trans_credits(struct inode *inode, int nrblocks)
else
needed = nrblocks * needed + 2;
-#ifdef CONFIG_QUOTA
needed += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
-#endif
return needed;
}
@@ -3092,7 +3092,7 @@ long ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
/*
* credits to insert 1 extent into extent tree
*/
- credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
+ credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
mutex_lock(&inode->i_mutex);
retry:
while (ret >= 0 && ret < max_blocks) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5a394c8..d2832a1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1126,18 +1126,29 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
up_write((&EXT4_I(inode)->i_data_sem));
return retval;
}
+static int ext4_writeblocks_trans_credits_old(struct inode *, int);
int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
handle_t *handle = ext4_journal_current_handle();
int ret = 0, started = 0;
unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
- int dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
+ int dio_credits;
if (create && !handle) {
/* Direct IO write... */
if (max_blocks > DIO_MAX_BLOCKS)
max_blocks = DIO_MAX_BLOCKS;
+ if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+ dio_credits = ext4_writeblocks_trans_credits_old(inode,
+ max_blocks);
+ else {
+ /*
+ * For extent format get_block return only one extent
+ */
+ dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
+ }
+
handle = ext4_journal_start(inode, dio_credits);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
@@ -4310,14 +4321,18 @@ static int ext4_writeblocks_trans_credits_old(struct inode *inode, int nrblocks)
if (ext4_should_journal_data(inode))
ret = 3 * (nrblocks + indirects) + 2;
- else
- ret = 2 * (nrblocks + indirects) + 2;
+ else {
+ /*
+ * We still need to journal updates to the
+ * indirect, dindirect, and tindirect blocks;
+ * only data blocks are not journalled.
+ */
+ ret = 3 * (nrblocks + indirects) + 2 - nrblocks;
+ }
-#ifdef CONFIG_QUOTA
/* We know that structure was already allocated during DQUOT_INIT so
* we will be updating only the data blocks + inodes */
- ret += 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
-#endif
+ ret += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
return ret;
}
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 3456094..72488ab 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -55,7 +55,8 @@ static int finish_range(handle_t *handle, struct inode *inode,
*
* extra 4 credits for: 1 superblock, 1 inode block, 2 quotas
*/
- needed = ext4_ext_calc_credits_for_single_extent(inode, path) + 4;
+ needed = ext4_ext_calc_credits_for_single_extent(inode, path) + 2 +
+ 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
/*
* Make sure the credit we accumalated is not really high
On Mon, 2008-07-28 at 21:41 +0530, Aneesh Kumar K.V wrote:
> On Fri, Jul 25, 2008 at 12:26:42PM -0700, Mingming Cao wrote:
> >
>
> ......
>
> > > > */
> > > > int ext4_ext_calc_credits_for_insert(struct inode *inode,
> > > > struct ext4_ext_path *path)
> > > > @@ -1930,9 +1931,6 @@ int ext4_ext_calc_credits_for_insert(str
> > > > */
> > > > needed += (depth * 2) + (depth * 2);
> > > >
> > > > - /* any allocation modifies superblock */
> > > > - needed += 1;
> > > > -
> > >
> > >
> > > Why are we dropping the super block modification credit ? An insert of
> > > an extent can result in super block modification also. If the goal is to
> > > use ext4_writepages_trans_blocks everywhere then this change is correct.
> > > But i see many place not using ext4_writepages_trans_blocks.
> > >
> > ext4_writepages_trans_blocks() will take care of the credits needed for
> > updating the superblock, just once. ext4_ext_calc_credits_for_insert()
> > calculates the credits needed for inserting a single extent. As you can
> > see in ext4_writepages_trans_blocks(), it multiplies that by the
> > total number of blocks. If we accounted for the superblock credit in
> > ext4_ext_calc_credits_for_insert(), then superblock updates for
> > multiple extent allocations would be over-accounted and would have to be
> > removed later in ext4_writepages_trans_blocks().
>
>
>
> ext4_ext_calc_credits_for_insert() currently does not do that when
> path != NULL. I am attaching a patch which reflects some of the changes
> I suggested.
>
>
Acked. This fixes another bug in the existing code.
> >
> > Other places calling ext4_ext_calc_credits_for_insert() (mostly
> > migrate.c and defrag.c) are updated to add 4 extra credits, covering the
> > superblock, inode block and quota blocks.
>
>
> I guess it should not be +4 but should be +2 + 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
>
Yes.
> >
> > > You also need to update the function comment to say that the super block
> > > update is not accounted for. Also, it doesn't account for the block bitmap,
> > > group descriptor and inode block updates.
> > >
> >
> > Credits for block bitmap and group descriptors are accounted in
> > ext4_ext_calc_credits_for_insert()
> >
> > inode block update is only accounted once per writepages, so it's
> > accounted in
> > ext4_writepages_trans_blocks()/ext4_ext_writepages_trans_blocks()
> >
>
> ....
>
> > > > *
> > > > @@ -1142,13 +1133,13 @@ int ext4_get_block(struct inode *inode,
> > > > handle_t *handle = ext4_journal_current_handle();
> > > > int ret = 0, started = 0;
> > > > unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> > > > + int dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
> > >
> > > Again this should be
> > > int dio_credits = ext4_writeblocks_trans(inode, DIO_MAX_BLOCKS);
> > >
> > No. The DIO case is different from writepages(). When get_block() is called
> > from the DIO path, it is called in a loop, and the credits are
> > reserved inside the loop. Each call to get_block() returns a single
> > extent, so we should use EXT4_DATA_TRANS_BLOCKS(inode->i_sb), which
> > calculates the credits for a single chunk of allocation.
>
>
> That is true only for the extent format. With noextents we need something
> like the following:
>
Not really. Even with non-extent block allocation, ext4_get_block() will only allocate/map a chunk of contiguous blocks (i.e. an extent), so it will not hit the worst case.
> if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> dio_credits = ext4_writeblocks_trans_credits_old(inode, max_blocks);
> else {
> /*
> * For extent format get_block return only one extent
> */
> dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
> }
>
>
...
>
> > >
> > > > - int bpp = ext4_journal_blocks_per_page(inode);
> > > > - int indirects = (EXT4_NDIR_BLOCKS % bpp) ? 5 : 3;
> > > > + int indirects = (EXT4_NDIR_BLOCKS % num) ? 5 : 3;
> > > > int ret;
> > > >
> > > > - if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
> > > > - return ext4_ext_writepage_trans_blocks(inode, bpp);
> > > > -
> > > > if (ext4_should_journal_data(inode))
> > > > - ret = 3 * (bpp + indirects) + 2;
> > > > + ret = 3 * (num + indirects) + 2;
> > > > else
> > > > - ret = 2 * (bpp + indirects) + 2;
> > > > + ret = 2 * (num + indirects) + 2;
> > >
> > >
> > > With non-journalled mode we should just subtract num. I guess the above
> > > should be
> > > if (ext4_should_journal_data(inode))
> > > ret = 3 * (num + indirects) + 2;
> > > else
> > > ret = 3 * (num + indirects) + 2 - num;
> > >
> > >
> > >
> >
> > Well, I think in the journalled mode we need to journal the content of
> > the indirect/double indirect blocks also, no?
> >
>
> With non journalled mode we still need to journal the indirect, dindirect
> and tindirect block so this should be
>
Yes, changes to indirect blocks are logged in all journalling
modes.
Mingming
On Mon, Jul 28, 2008 at 12:07:33PM -0700, Mingming Cao wrote:
.....
> > > > > *
> > > > > @@ -1142,13 +1133,13 @@ int ext4_get_block(struct inode *inode,
> > > > > handle_t *handle = ext4_journal_current_handle();
> > > > > int ret = 0, started = 0;
> > > > > unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> > > > > + int dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
> > > >
> > > > Again this should be
> > > > int dio_credits = ext4_writeblocks_trans(inode, DIO_MAX_BLOCKS);
> > > >
> > > No. The DIO case is different from writepages(). When get_block() is called
> > > from the DIO path, it is called in a loop, and the credits are
> > > reserved inside the loop. Each call to get_block() returns a single
> > > extent, so we should use EXT4_DATA_TRANS_BLOCKS(inode->i_sb), which
> > > calculates the credits for a single chunk of allocation.
> >
> >
> > That is true only for the extent format. With noextents we need something
> > like the following:
> >
>
> Not really. Even with non-extent block allocation, ext4_get_block() will only allocate/map a chunk of contiguous blocks (i.e. an extent), so it will not hit the worst case.
>
> > if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > dio_credits = ext4_writeblocks_trans_credits_old(inode, max_blocks);
> > else {
> > /*
> > * For extent format get_block return only one extent
> > */
> > dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
> > }
> >
> >
>
But dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb); doesn't account
for updating the bitmaps and group descriptors of all the indirect,
dindirect and tindirect blocks allocated.
We actually need
a) 1 bitmap for the blocks allocated
b) 1 group desc for the blocks allocated
c) 2 indirect
d) 2 dindirect
e) 1 tindirect
f) 5 bitmap for the indirect blocks allocated (2 + 2 + 1)
g) 5 group desc for the indirect blocks allocated
h) 1 inode
i) 1 super block
j) 2 * EXT4_QUOTA_TRANS_BLOCKS for quota
-aneesh
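Adding up items a) to j) as a quick sanity check (a user-space sketch;
quota_trans_blocks is a stand-in for the value of
EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb), which depends on the kernel
configuration, and the function name is hypothetical):

```c
/* Hypothetical tally of items a) to j) above; not kernel code. */
static int dio_noextent_credits(int quota_trans_blocks)
{
	int data_bitmap = 1;			/* a) */
	int data_group_desc = 1;		/* b) */
	int indirect_blocks = 2 + 2 + 1;	/* c) + d) + e) */
	int indirect_bitmaps = 5;		/* f) */
	int indirect_group_descs = 5;		/* g) */
	int inode_block = 1;			/* h) */
	int super_block = 1;			/* i) */

	return data_bitmap + data_group_desc + indirect_blocks +
	       indirect_bitmaps + indirect_group_descs +
	       inode_block + super_block +
	       2 * quota_trans_blocks;		/* j) */
}
```

Without quota that comes to 19 credits for one DIO get_block() call on a
non-extent file.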
Ext4: journal credits reservation fixes for DIO, fallocate and delalloc writepages
From: Mingming Cao <[email protected]>
With delalloc, at writepages() time we need to reserve enough credits when
starting a new handle to allow for multiple discontiguous block allocations
under a single call to mpage_da_writepages(), so that the metadata updates fit
into a single transaction. This patch fixes this by calculating the credits
needed to write out a given number of dirty pages, taking discontiguous block
allocations into account. It covers both extent and non-extent files.
This patch also fixes the journal credit reservation for DIO. Currently the
estimated credits for DIO are based only on the non-extent file format. That is
not enough for mballoc to allocate a single extent on an extent-based file.
This patch fixes that.
fallocate double-books credits for modifying the super block etc.; this patch
fixes that as well.
This also fixes the credit reservation in the migration and defrag code.
Changes since v2:
1) Fix a writepages() inefficiency. sync() invokes writepages()
twice (not sure exactly why); the second time all the pages are clean, so
CPU time is wasted walking through all the pages only to find that none is
dirty. This is simple to work around by skipping writepages() if the
mapping has no dirty pages.
2) The extent-based credit calculation is quite conservative. It always uses
the max possible depth to estimate the credits needed to support an extent
insert/tree split. In fact the depth of each inode is quite easy
to get, so we can use that more accurate information in the calculation.
3) Limit the max number of pages that can be flushed at once from
ext4_da_writepages(), so that the max possible transaction credits
fit under the credits allowed for starting a new transaction. Reduce
the number of pages to flush if necessary. Currently with 4K page size
and 4K block size, with an extent file, it's possible to flush about 1K
pages under a single transaction.
Verified with the memory pressure case and the umount case.
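For illustration, point 3 can be sketched in user space as follows. The
aggregate formula mirrors the new ext4_ext_writeblocks_trans_credits() in this
patch; the per-extent cost, the credit budget and all function names here are
assumptions supplied by the caller, since the real limit comes from the
journal and is not modelled.

```c
/* Hypothetical user-space model; not kernel code. */

/* Worst case of one extent per block: per-extent cost times the number
 * of blocks, plus inode block and superblock, plus quota credits. */
static int writeblocks_credits(int per_extent, int nrblocks,
			       int journal_data, int quota_trans_blocks)
{
	int needed = per_extent;

	if (journal_data)
		needed += 1;	/* the data block itself is journalled too */
	return nrblocks * needed + 2 + 2 * quota_trans_blocks;
}

/* Largest number of pages whose worst-case credits still fit in budget. */
static int max_pages_for_budget(int budget, int per_extent,
				int blocks_per_page, int journal_data,
				int quota_trans_blocks)
{
	int fixed = 2 + 2 * quota_trans_blocks;
	int per_page = (per_extent + (journal_data ? 1 : 0)) * blocks_per_page;

	if (budget <= fixed || per_page <= 0)
		return 0;
	return (budget - fixed) / per_page;
}
```

With an assumed budget of 1002 credits, a per-extent cost of 10 and one block
per page, max_pages_for_budget() yields 100 pages, and
writeblocks_credits(10, 100, 0, 0) is exactly 1002.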
Signed-off-by: Mingming Cao <[email protected]>
---
fs/ext4/ext4.h | 4 -
fs/ext4/ext4_extents.h | 3 -
fs/ext4/ext4_jbd2.h | 10 ++++
fs/ext4/extents.c | 78 ++++++++++++++++++-------------
fs/ext4/inode.c | 120 ++++++++++++++++++++++++++-----------------------
fs/ext4/migrate.c | 6 +-
6 files changed, 129 insertions(+), 92 deletions(-)
Index: linux-2.6.26git6/fs/ext4/ext4.h
===================================================================
--- linux-2.6.26git6.orig/fs/ext4/ext4.h 2008-07-28 22:47:22.000000000 -0700
+++ linux-2.6.26git6/fs/ext4/ext4.h 2008-07-29 17:40:40.000000000 -0700
@@ -1072,7 +1072,7 @@ extern void ext4_truncate (struct inode
extern void ext4_set_inode_flags(struct inode *);
extern void ext4_get_inode_flags(struct ext4_inode_info *);
extern void ext4_set_aops(struct inode *inode);
-extern int ext4_writepage_trans_blocks(struct inode *);
+extern int ext4_writepages_trans_blocks(struct inode *, int nrpages);
extern int ext4_block_truncate_page(handle_t *handle,
struct address_space *mapping, loff_t from);
extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page);
@@ -1227,7 +1227,7 @@ extern const struct inode_operations ext
/* extents.c */
extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
-extern int ext4_ext_writepage_trans_blocks(struct inode *, int);
+extern int ext4_ext_writeblocks_trans_credits(struct inode *inode, int);
extern int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ext4_lblk_t iblock,
unsigned long max_blocks, struct buffer_head *bh_result,
Index: linux-2.6.26git6/fs/ext4/extents.c
===================================================================
--- linux-2.6.26git6.orig/fs/ext4/extents.c 2008-07-28 22:53:20.000000000 -0700
+++ linux-2.6.26git6/fs/ext4/extents.c 2008-07-29 17:40:50.000000000 -0700
@@ -1747,34 +1747,43 @@ static int ext4_ext_rm_idx(handle_t *han
}
/*
- * ext4_ext_calc_credits_for_insert:
- * This routine returns max. credits that the extent tree can consume.
+ * ext4_ext_calc_credits_for_single_extent:
+ * This routine returns the max. credits needed to insert an extent
+ * into the extent tree.
* It should be OK for low-performance paths like ->writepage()
* To allow many writing processes to fit into a single transaction,
- * the caller should calculate credits under i_data_sem and
- * pass the actual path.
+ * When passing the actual path, the caller should calculate credits
+ * under i_data_sem.
+ *
+ * For inserting a single extent, in the worst case the extent tree depth is 5
+ * for the old tree and the new tree; for every level we need to reserve
+ * credits to log the bitmap and block group descriptors.
+ *
+ * Credits needed for the update of super block + inode block + quota files
+ * are not included here. The caller of this function needs to take care of this.
*/
-int ext4_ext_calc_credits_for_insert(struct inode *inode,
+int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
struct ext4_ext_path *path)
{
int depth, needed;
+ depth = ext_depth(inode);
+
if (path) {
/* probably there is space in leaf? */
- depth = ext_depth(inode);
if (le16_to_cpu(path[depth].p_hdr->eh_entries)
< le16_to_cpu(path[depth].p_hdr->eh_max))
- return 1;
+ /* 1 for block bitmap, 1 for group descriptor */
+ return 2;
}
- /*
- * given 32-bit logical block (4294967296 blocks), max. tree
- * can be 4 levels in depth -- 4 * 340^4 == 53453440000.
- * Let's also add one more level for imbalance.
- */
- depth = 5;
+ /* add one more level in case the tree grows when inserting an extent */
+ depth += 1;
- /* allocation of new data block(s) */
+ /*
+ * bitmap blocks and group descriptor block for
+ * allocation of new extent
+ */
needed = 2;
/*
@@ -1791,9 +1800,6 @@ int ext4_ext_calc_credits_for_insert(str
*/
needed += (depth * 2) + (depth * 2);
- /* any allocation modifies superblock */
- needed += 1;
-
return needed;
}
@@ -1917,9 +1923,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
correct_index = 1;
credits += (ext_depth(inode)) + 1;
}
-#ifdef CONFIG_QUOTA
credits += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
-#endif
err = ext4_ext_journal_restart(handle, credits);
if (err)
@@ -2801,8 +2805,8 @@ void ext4_ext_truncate(struct inode *ino
/*
* probably first extent we're gonna free will be last in block
*/
- err = ext4_writepage_trans_blocks(inode) + 3;
- handle = ext4_journal_start(inode, err);
+ handle = ext4_journal_start(inode,
+ ext4_writepages_trans_blocks(inode, 1) + 3);
if (IS_ERR(handle))
return;
@@ -2855,22 +2859,32 @@ out_stop:
}
/*
- * ext4_ext_writepage_trans_blocks:
+ * ext4_ext_writeblocks_trans_credits:
* calculate max number of blocks we could modify
- * in order to allocate new block for an inode
+ * in order to allocate the required number of new blocks
+ *
+ * In the worst case, one block per extent.
+ *
*/
-int ext4_ext_writepage_trans_blocks(struct inode *inode, int num)
+int ext4_ext_writeblocks_trans_credits(struct inode *inode, int nrblocks)
{
int needed;
- needed = ext4_ext_calc_credits_for_insert(inode, NULL);
-
- /* caller wants to allocate num blocks, but note it includes sb */
- needed = needed * num - (num - 1);
+ /* cost of adding a single extent:
+ * index blocks, leafs, bitmaps,
+ * groupdescp
+ */
+ needed = ext4_ext_calc_credits_for_single_extent(inode, NULL);
+ /*
+ * For data=journalled mode need to account for the data blocks
+ * Also need to add super block and inode block
+ */
+ if (ext4_should_journal_data(inode))
+ needed = nrblocks * (needed + 1) + 2;
+ else
+ needed = nrblocks * needed + 2;
-#ifdef CONFIG_QUOTA
needed += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
-#endif
return needed;
}
@@ -2935,10 +2949,9 @@ long ext4_fallocate(struct inode *inode,
max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
- block;
/*
- * credits to insert 1 extent into extent tree + buffers to be able to
- * modify 1 super block, 1 block bitmap and 1 group descriptor.
+ * credits to insert 1 extent into extent tree
*/
- credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
+ credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
mutex_lock(&inode->i_mutex);
retry:
while (ret >= 0 && ret < max_blocks) {
Index: linux-2.6.26git6/fs/ext4/inode.c
===================================================================
--- linux-2.6.26git6.orig/fs/ext4/inode.c 2008-07-28 22:53:21.000000000 -0700
+++ linux-2.6.26git6/fs/ext4/inode.c 2008-07-29 17:45:43.000000000 -0700
@@ -1,5 +1,5 @@
/*
- * linux/fs/ext4/inode.c
+ * linux/fs/ext4/inode.c
*
* Copyright (C) 1992, 1993, 1994, 1995
* Remy Card ([email protected])
@@ -954,15 +954,6 @@ out:
/* Maximum number of blocks we map for direct IO at once. */
#define DIO_MAX_BLOCKS 4096
-/*
- * Number of credits we need for writing DIO_MAX_BLOCKS:
- * We need sb + group descriptor + bitmap + inode -> 4
- * For B blocks with A block pointers per block we need:
- * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect).
- * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25.
- */
-#define DIO_CREDITS 25
-
/*
*
@@ -1082,13 +1073,13 @@ static int ext4_get_block(struct inode *
handle_t *handle = ext4_journal_current_handle();
int ret = 0, started = 0;
unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
+ int dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
if (create && !handle) {
/* Direct IO write... */
if (max_blocks > DIO_MAX_BLOCKS)
max_blocks = DIO_MAX_BLOCKS;
- handle = ext4_journal_start(inode, DIO_CREDITS +
- 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
+ handle = ext4_journal_start(inode, dio_credits);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
goto out;
@@ -1267,7 +1258,7 @@ static int ext4_write_begin(struct file
struct page **pagep, void **fsdata)
{
struct inode *inode = mapping->host;
- int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
+ int ret, needed_blocks = ext4_writepages_trans_blocks(inode, 1);
handle_t *handle;
int retries = 0;
struct page *page;
@@ -2153,20 +2144,6 @@ static int ext4_da_writepage(struct page
return ret;
}
-
-/*
- * For now just follow the DIO way to estimate the max credits
- * needed to write out EXT4_MAX_WRITEBACK_PAGES.
- * todo: need to calculate the max credits need for
- * extent based files, currently the DIO credits is based on
- * indirect-blocks mapping way.
- *
- * Probably should have a generic way to calculate credits
- * for DIO, writepages, and truncate
- */
-#define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
-#define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS
-
static int ext4_da_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
@@ -2176,22 +2153,24 @@ static int ext4_da_writepages(struct add
int ret = 0;
long to_write;
loff_t range_start = 0;
+ int blocks_per_page = PAGE_CACHE_SIZE >> inode->i_blkbits;
+ int max_credit_blocks = ext4_journal_max_transaction_buffers(inode);
+ int need_credits_per_page = ext4_writepages_trans_blocks(inode, 1);
+ int max_writeback_pages = (max_credit_blocks / blocks_per_page) / need_credits_per_page;
/*
* No pages to write? This is mainly a kludge to avoid starting
* a transaction for special inodes like journal inode on last iput()
* because that could violate lock ordering on umount
*/
- if (!mapping->nrpages)
+ if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
return 0;
- /*
- * Estimate the worse case needed credits to write out
- * EXT4_MAX_BUF_BLOCKS pages
- */
- needed_blocks = EXT4_MAX_WRITEBACK_CREDITS;
+ if (wbc->nr_to_write > mapping->nrpages)
+ wbc->nr_to_write = mapping->nrpages;
to_write = wbc->nr_to_write;
+
if (!wbc->range_cyclic) {
/*
* If range_cyclic is not set force range_cont
@@ -2202,10 +2181,31 @@ static int ext4_da_writepages(struct add
}
while (!ret && to_write) {
+ /*
+ * limit the number of dirty pages written at a time
+ * so that they fit into the reserved transaction credits
+ */
+ if (wbc->nr_to_write > max_writeback_pages)
+ wbc->nr_to_write = max_writeback_pages;
+
+ /*
+ * Estimate the worst-case credits needed to write out
+ * to_write pages
+ */
+ needed_blocks = ext4_writepages_trans_blocks(inode,
+ wbc->nr_to_write);
+ while (needed_blocks > max_credit_blocks) {
+ wbc->nr_to_write --;
+ needed_blocks = ext4_writepages_trans_blocks(inode,
+ wbc->nr_to_write);
+ }
/* start a new transaction*/
handle = ext4_journal_start(inode, needed_blocks);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
+ printk(KERN_EMERG "%s: Not enough credits to flush %ld pages\n", __func__,
+ wbc->nr_to_write);
+ dump_stack();
goto out_writepages;
}
if (ext4_should_order_data(inode)) {
@@ -2221,12 +2221,6 @@ static int ext4_da_writepages(struct add
}
}
- /*
- * set the max dirty pages could be write at a time
- * to fit into the reserved transaction credits
- */
- if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
- wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
to_write -= wbc->nr_to_write;
ret = mpage_da_writepages(mapping, wbc,
@@ -2587,7 +2581,8 @@ static int __ext4_journalled_writepage(s
* references to buffers so we are safe */
unlock_page(page);
- handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
+ handle = ext4_journal_start(inode,
+ ext4_writepages_trans_blocks(inode, 1));
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
goto out;
@@ -4271,20 +4266,20 @@ int ext4_getattr(struct vfsmount *mnt, s
/*
* How many blocks doth make a writepage()?
*
- * With N blocks per page, it may be:
- * N data blocks
+ * With N blocks per page, and P pages, it may be:
+ * N*P data blocks
* 2 indirect block
* 2 dindirect
* 1 tindirect
- * N+5 bitmap blocks (from the above)
- * N+5 group descriptor summary blocks
+ * N*P+5 bitmap blocks (from the above)
+ * N*P+5 group descriptor summary blocks
* 1 inode block
* 1 superblock.
* 2 * EXT4_SINGLEDATA_TRANS_BLOCKS for the quote files
*
- * 3 * (N + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
+ * 3 * (N*P + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
*
- * With ordered or writeback data it's the same, less the N data blocks.
+ * With ordered or writeback data it's the same, less the N*P data blocks.
*
* If the inode's direct blocks can hold an integral number of pages then a
* page cannot straddle two indirect blocks, and we can only touch one indirect
@@ -4295,30 +4290,49 @@ int ext4_getattr(struct vfsmount *mnt, s
* block and work out the exact number of indirects which are touched. Pah.
*/
-int ext4_writepage_trans_blocks(struct inode *inode)
+static int ext4_writeblocks_trans_credits_old(struct inode *inode, int nrblocks)
{
- int bpp = ext4_journal_blocks_per_page(inode);
- int indirects = (EXT4_NDIR_BLOCKS % bpp) ? 5 : 3;
+ int indirects = (EXT4_NDIR_BLOCKS % nrblocks) ? 5 : 3;
int ret;
- if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
- return ext4_ext_writepage_trans_blocks(inode, bpp);
-
if (ext4_should_journal_data(inode))
- ret = 3 * (bpp + indirects) + 2;
+ ret = 3 * (nrblocks + indirects) + 2;
else
- ret = 2 * (bpp + indirects) + 2;
+ ret = 2 * nrblocks + 3* indirects + 2;
-#ifdef CONFIG_QUOTA
/* We know that structure was already allocated during DQUOT_INIT so
* we will be updating only the data blocks + inodes */
ret += 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
-#endif
return ret;
}
/*
+ * Calculate the total number of credits to reserve to fit
+ * the modification of @nrpages pages into a single transaction
+ *
+ * This could be called via ext4_write_begin() or later
+ * ext4_da_writepages() in the delayed allocation case.
+ *
+ * In both cases it's possible that we could be allocating multiple
+ * chunks of blocks. We need to consider the worst case, when
+ * there is one new block per extent.
+ *
+ * For Direct IO and fallocate, the journal credits reservation
+ * is based on one single extent allocation, so they could use
+ * EXT4_DATA_TRANS_BLOCKS to get the credits needed to log a single
+ * chunk of allocation.
+ */
+int ext4_writepages_trans_blocks(struct inode *inode, int nrpages)
+{
+ int bpp = ext4_journal_blocks_per_page(inode);
+ int nrblocks = nrpages * bpp;
+
+ if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+ return ext4_writeblocks_trans_credits_old(inode, nrblocks);
+ return ext4_ext_writeblocks_trans_credits(inode, nrblocks);
+}
+/*
* The caller must have previously called ext4_reserve_inode_write().
* Give this, we know that the caller already has write access to iloc->bh.
*/
Index: linux-2.6.26git6/fs/ext4/migrate.c
===================================================================
--- linux-2.6.26git6.orig/fs/ext4/migrate.c 2008-07-13 14:51:29.000000000 -0700
+++ linux-2.6.26git6/fs/ext4/migrate.c 2008-07-28 22:53:21.000000000 -0700
@@ -52,9 +52,11 @@ static int finish_range(handle_t *handle
* Since we are doing this in loop we may accumalate extra
* credit. But below we try to not accumalate too much
* of them by restarting the journal.
+ *
+ * extra 4 credits for: 1 superblock, 1 inode block, 2 quotas
*/
- needed = ext4_ext_calc_credits_for_insert(inode, path);
-
+ needed = ext4_ext_calc_credits_for_single_extent(inode, path) + 2
+ + 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
/*
* Make sure the credit we accumalated is not really high
*/
Index: linux-2.6.26git6/fs/ext4/ext4_extents.h
===================================================================
--- linux-2.6.26git6.orig/fs/ext4/ext4_extents.h 2008-07-28 22:47:22.000000000 -0700
+++ linux-2.6.26git6/fs/ext4/ext4_extents.h 2008-07-28 22:55:40.000000000 -0700
@@ -216,7 +216,8 @@ extern int ext4_ext_calc_metadata_amount
extern ext4_fsblk_t idx_pblock(struct ext4_extent_idx *);
extern void ext4_ext_store_pblock(struct ext4_extent *, ext4_fsblk_t);
extern int ext4_extent_tree_init(handle_t *, struct inode *);
-extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
+ struct ext4_ext_path *path);
extern int ext4_ext_try_to_merge(struct inode *inode,
struct ext4_ext_path *path,
struct ext4_extent *);
Index: linux-2.6.26git6/fs/ext4/ext4_jbd2.h
===================================================================
--- linux-2.6.26git6.orig/fs/ext4/ext4_jbd2.h 2008-07-28 22:47:22.000000000 -0700
+++ linux-2.6.26git6/fs/ext4/ext4_jbd2.h 2008-07-28 22:53:21.000000000 -0700
@@ -231,4 +231,14 @@ static inline int ext4_should_writeback_
return 0;
}
+static inline int ext4_journal_max_transaction_buffers(struct inode *inode)
+{
+ /*
+ * max transaction buffers
+ * calculation based on
+ * journal->j_max_transaction_buffers = journal->j_maxlen / 4;
+ */
+ return (EXT4_JOURNAL(inode))->j_maxlen / 4;
+}
+
#endif /* _EXT4_JBD2_H */
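
To make the extent-file arithmetic above easier to check by hand, here is a minimal user-space sketch of it. It is not kernel code: the function names `single_extent_credits` and `ext_writeblocks_credits` are made up for illustration, and the tree depth, the data=journal flag and the quota cost are passed in as plain integers instead of being read from the inode and superblock. It models only the no-path case of ext4_ext_calc_credits_for_single_extent() and the per-block multiplication in ext4_ext_writeblocks_trans_credits().

```c
#include <assert.h>

/*
 * Stand-alone model of the extent-file credit arithmetic in this patch.
 * Assumption: depth, journalling mode and quota cost are plain parameters.
 */

/* No-path case of ext4_ext_calc_credits_for_single_extent(). */
static int single_extent_credits(int tree_depth)
{
	int depth, needed;

	assert(tree_depth >= 0);

	/* one more level in case the tree grows on insert */
	depth = tree_depth + 1;

	/* block bitmap + group descriptor for the new extent */
	needed = 2;

	/* the (depth * 2) + (depth * 2) term kept from the existing code */
	needed += (depth * 2) + (depth * 2);

	return needed;
}

/* ext4_ext_writeblocks_trans_credits(): worst case, one extent per block. */
static int ext_writeblocks_credits(int tree_depth, int nrblocks,
				   int journal_data, int quota_trans_blocks)
{
	int needed = single_extent_credits(tree_depth);

	assert(nrblocks >= 0 && quota_trans_blocks >= 0);

	if (journal_data)
		needed = nrblocks * (needed + 1) + 2;	/* one data block each */
	else
		needed = nrblocks * needed + 2;		/* + sb and inode block */

	return needed + 2 * quota_trans_blocks;
}
```

Under these assumptions a depth-0 tree costs 6 credits per extent, so a single block in ordered or writeback mode comes to 6 + 2 = 8 credits before the quota term, and a depth-4 tree costs 22 credits per extent.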
While running some performance tests on flex_bg, I tried bonnie++ on
2.6.27-rc1 plus the patch queue, including your journal credit fix, but I hit
a very similar crash. Here are the details; I hope this helps:
kernel 2.6.27-rc1
patch queue snapshot:
ext4-patch-queue-25fb9834f3814b3aa567c5af090fba688a86eea9
With latest e2fsprogs:
mkfs.ext4 -t ext4dev -b1024 -G256 /dev/sdb1 4G
mount -t ext4dev /dev/sdb1 /mnt/test
bonnie++ -u root -s 2g:256 -r 1024 -n 200 -d /mnt/test/
After a while, it ends up with:
kernel BUG at fs/jbd2/transaction.c:984!
invalid opcode: 0000 [#1] SMP
Modules linked in: ext4dev jbd2 crc16 kvm_intel kvm megaraid_mbox megaraid_mm
Pid: 13965, comm: bonnie++ Not tainted (2.6.27-rc1 #3)
EIP: 0060:[<f8b186a6>] EFLAGS: 00010246 CPU: 4
EIP is at jbd2_journal_dirty_metadata+0xc6/0xd0 [jbd2]
EAX: 00000000 EBX: f0acc380 ECX: f0acc380 EDX: f0069f80
ESI: f3964700 EDI: f5daa1b0 EBP: f6dd7e00 ESP: f5949ebc
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process bonnie++ (pid: 13965, ti=f5948000 task=f5404ba0 task.ti=f5948000)
Stack: f7cb0100 f5daa1b0 f0acc380 f8b8ca12 f8b7ef62 f7cb0000 f68a5d00 f7cb0100
       00000000 f7183e00 f5daa1b0 f8b6a06e 00000040 f8b736db f7cb2134 f2c94238
       0000000b 00000000 00008000 00000000 f0acc380 f7cb0000 f08b2ac0 f2c942c8
Call Trace:
[<f8b7ef62>] __ext4_journal_dirty_metadata+0x22/0x60 [ext4dev]
[<f8b6a06e>] ext4_free_inode+0x26e/0x2f0 [ext4dev]
[<f8b736db>] ext4_orphan_del+0xcb/0x180 [ext4dev]
[<f8b6fb3c>] ext4_delete_inode+0x11c/0x140 [ext4dev]
[<f8b6fa20>] ext4_delete_inode+0x0/0x140 [ext4dev]
[<c018fe6a>] generic_delete_inode+0x5a/0xc0
[<c018f4a4>] iput+0x44/0x50
[<c0186271>] do_unlinkat+0xd1/0x150
[<c017cdd6>] vfs_write+0x106/0x140
[<c02aa7b0>] tty_write+0x0/0x1e0
[<c017d2d1>] sys_write+0x41/0x70
[<c0102fc9>] sysenter_do_call+0x12/0x25
=======================
Code: 55 2c 8d 76 00 74 aa 0f 0b eb fe 0f 0b eb fe 8d b6 00 00 00 00 0f 0b eb fe f6 43 02 20 0f 84 5d ff ff ff f3 90 eb f2 0f 0b eb fe <0f> 0b eb fe 8d b6 00 00 00 00 55 57 56 53 89 d3 83 ec 10 89 44
EIP: [<f8b186a6>] jbd2_journal_dirty_metadata+0xc6/0xd0 [jbd2] SS:ESP 0068:f5949ebc
Fred
On Tuesday, 29 July 2008 at 18:58 -0700, Mingming Cao wrote:
> Ext4: journal credits reservation fixes for DIO, fallocate and delalloc writepages
>
> From: Mingming Cao <[email protected]>
>
> With delalloc, at writepages() time, we need to reserve enough credits to start
> a new handle, to allow for possibly multiple segments of block allocation under a
> single call to mpage_da_writepages(), so that the metadata updates fit into a single
> transaction. This patch fixes this by calculating the credits needed to
> write out a given number of dirty pages, taking discontiguous
> block allocations into account. It fixes both extent files and non-extent files.
>
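
For the non-extent (indirect-mapped) side, the arithmetic in ext4_writeblocks_trans_credits_old() can be tried out in user space. The following is only a sketch under stated assumptions: `indirect_writeblocks_credits` and `NDIR_BLOCKS` are names invented here, the data=journal flag and the quota cost are plain parameters, and the check that nrblocks is positive is an addition of this sketch (the value is used as a modulus), not something the patch itself does.

```c
#include <assert.h>

/* Direct block pointers held in the inode (EXT4_NDIR_BLOCKS in the kernel). */
#define NDIR_BLOCKS 12

/*
 * Model of the indirect-mapped credit formula from the patch.
 * Assumption: journalling mode and quota cost are passed in directly.
 */
static int indirect_writeblocks_credits(int nrblocks, int journal_data,
					int quota_trans_blocks)
{
	int indirects, ret;

	/* nrblocks is used as a modulus below, so it must be positive */
	assert(nrblocks > 0);
	assert(quota_trans_blocks >= 0);

	/* 3 indirect blocks if the direct blocks divide evenly, else 5 */
	indirects = (NDIR_BLOCKS % nrblocks) ? 5 : 3;

	if (journal_data)
		ret = 3 * (nrblocks + indirects) + 2;
	else
		ret = 2 * nrblocks + 3 * indirects + 2;

	return ret + 2 * quota_trans_blocks;
}
```

With these assumptions, 4 blocks (one 4K page on a 1K-block filesystem) come to 19 credits in ordered or writeback mode and 23 in data=journal mode, before the quota term.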
> This patch also fixes the journal credit reservation for DIO. Currently the
> estimated credits for DIO are based only on the non-extent file format. That is
> not enough credits for mballoc to allocate a single extent on an extent-based file. This patch
> fixes that.
>
> fallocate double-booked the credits for modifying the super block etc.; this patch
> fixes that.
>
> This also fixes the credit reservation in the migration and defrag code.
>
>
> Changes since v2:
>
> 1) Fix a writepages() inefficiency issue. sync() will invoke writepages()
> twice (not sure exactly why); the second time all the pages are clean, so
> it wastes CPU time walking through all the pages only to find they are not
> dirty. But it's simple to work around by skipping writepages() if there are
> no dirty pages in the mapping.
>
>
> 2) The extent-based credit calculation is quite conservative. It always uses the
> max possible depth to estimate the credits needed to support extent
> insert/tree split. In fact the depth info for each inode is quite easy
> to get, so we could use more accurate info for the calculation.
>
> 3) Limit the max number of pages that can be flushed at once from
> ext4_da_writepages(), so that the max possible transaction credits
> fit under the credits allowed for starting a new transaction. Reduce
> the number of pages to flush if necessary. Currently with 4K page size
> and 4K block size, with an extent file, it's possible to flush about 1K
> pages under a single transaction.
>
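
The page-limiting step in item 3 can be sketched as a small loop that shrinks the page count until the estimated credits fit under the transaction limit. This is an illustration only: `clamp_pages_to_credits`, `credits_fn` and `example_credits` are hypothetical names, the cost model of 8 credits per page plus 2 fixed is an assumed figure rather than a measured one, and the stop-at-zero guard is an addition of this sketch.

```c
#include <assert.h>

/* Hypothetical stand-in for ext4_writepages_trans_blocks(inode, nrpages). */
typedef int (*credits_fn)(int nrpages);

/*
 * Reduce nr_to_write until the worst-case credit estimate fits into
 * max_credit_blocks. Returns the clamped page count, possibly 0.
 */
static long clamp_pages_to_credits(long nr_to_write, int max_credit_blocks,
				   credits_fn credits_for)
{
	assert(nr_to_write >= 0 && max_credit_blocks >= 0);

	/* stop at zero pages instead of letting the count go negative */
	while (nr_to_write > 0 &&
	       credits_for((int)nr_to_write) > max_credit_blocks)
		nr_to_write--;

	return nr_to_write;
}

/* Assumed cost model for the example: 8 credits per page plus 2 fixed. */
static int example_credits(int nrpages)
{
	return 8 * nrpages + 2;
}
```

With the assumed cost model and a limit of 8192 credits, a request for 2000 pages is cut down to 1023, while a request for 1000 pages passes through unchanged. A return value of 0 means not even one page fits, which the caller would have to treat as an error rather than start a transaction.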
>
> Verified with a memory pressure case and a umount case.
>
> Signed-off-by: Mingming Cao <[email protected]>
> ---
> fs/ext4/ext4.h | 4 -
> fs/ext4/ext4_extents.h | 3 -
> fs/ext4/ext4_jbd2.h | 10 ++++
> fs/ext4/extents.c | 78 ++++++++++++++++++-------------
> fs/ext4/inode.c | 120 ++++++++++++++++++++++++++-----------------------
> fs/ext4/migrate.c | 6 +-
> 6 files changed, 129 insertions(+), 92 deletions(-)
>
> Index: linux-2.6.26git6/fs/ext4/ext4.h
> ===================================================================
> --- linux-2.6.26git6.orig/fs/ext4/ext4.h 2008-07-28 22:47:22.000000000 -0700
> +++ linux-2.6.26git6/fs/ext4/ext4.h 2008-07-29 17:40:40.000000000 -0700
> @@ -1072,7 +1072,7 @@ extern void ext4_truncate (struct inode
> extern void ext4_set_inode_flags(struct inode *);
> extern void ext4_get_inode_flags(struct ext4_inode_info *);
> extern void ext4_set_aops(struct inode *inode);
> -extern int ext4_writepage_trans_blocks(struct inode *);
> +extern int ext4_writepages_trans_blocks(struct inode *, int nrpages);
> extern int ext4_block_truncate_page(handle_t *handle,
> struct address_space *mapping, loff_t from);
> extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page);
> @@ -1227,7 +1227,7 @@ extern const struct inode_operations ext
>
> /* extents.c */
> extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
> -extern int ext4_ext_writepage_trans_blocks(struct inode *, int);
> +extern int ext4_ext_writeblocks_trans_credits(struct inode *inode, int);
> extern int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
> ext4_lblk_t iblock,
> unsigned long max_blocks, struct buffer_head *bh_result,
> Index: linux-2.6.26git6/fs/ext4/extents.c
> ===================================================================
> --- linux-2.6.26git6.orig/fs/ext4/extents.c 2008-07-28 22:53:20.000000000 -0700
> +++ linux-2.6.26git6/fs/ext4/extents.c 2008-07-29 17:40:50.000000000 -0700
> @@ -1747,34 +1747,43 @@ static int ext4_ext_rm_idx(handle_t *han
> }
>
> /*
> - * ext4_ext_calc_credits_for_insert:
> - * This routine returns max. credits that the extent tree can consume.
> + * ext4_ext_calc_credits_for_single_extent:
> + * This routine returns the max. credits needed to insert an extent
> + * into the extent tree.
> * It should be OK for low-performance paths like ->writepage()
> * To allow many writing processes to fit into a single transaction,
> - * the caller should calculate credits under i_data_sem and
> - * pass the actual path.
> + * When passing the actual path, the caller should calculate credits
> + * under i_data_sem.
> + *
> + * For inserting a single extent, in the worst case the extent tree depth
> + * is 5 for both the old tree and the new tree; for every level we need to
> + * reserve credits to log the bitmap and block group descriptors.
> + *
> + * Credits needed for updating the super block, inode block and quota files
> + * are not included here. The caller of this function needs to account for them.
> */
> -int ext4_ext_calc_credits_for_insert(struct inode *inode,
> +int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
> struct ext4_ext_path *path)
> {
> int depth, needed;
>
> + depth = ext_depth(inode);
> +
> if (path) {
> /* probably there is space in leaf? */
> - depth = ext_depth(inode);
> if (le16_to_cpu(path[depth].p_hdr->eh_entries)
> < le16_to_cpu(path[depth].p_hdr->eh_max))
> - return 1;
> + /* 1 for block bitmap, 1 for group descriptor */
> + return 2;
> }
>
> - /*
> - * given 32-bit logical block (4294967296 blocks), max. tree
> - * can be 4 levels in depth -- 4 * 340^4 == 53453440000.
> - * Let's also add one more level for imbalance.
> - */
> - depth = 5;
> + /* add one more level in case the tree grows when inserting an extent */
> + depth += 1;
>
> - /* allocation of new data block(s) */
> + /*
> + * bitmap blocks and group descriptor block for
> + * allocation of new extent
> + */
> needed = 2;
>
> /*
> @@ -1791,9 +1800,6 @@ int ext4_ext_calc_credits_for_insert(str
> */
> needed += (depth * 2) + (depth * 2);
>
> - /* any allocation modifies superblock */
> - needed += 1;
> -
> return needed;
> }
>
> @@ -1917,9 +1923,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
> correct_index = 1;
> credits += (ext_depth(inode)) + 1;
> }
> -#ifdef CONFIG_QUOTA
> credits += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
> -#endif
>
> err = ext4_ext_journal_restart(handle, credits);
> if (err)
> @@ -2801,8 +2805,8 @@ void ext4_ext_truncate(struct inode *ino
> /*
> * probably first extent we're gonna free will be last in block
> */
> - err = ext4_writepage_trans_blocks(inode) + 3;
> - handle = ext4_journal_start(inode, err);
> + handle = ext4_journal_start(inode,
> + ext4_writepages_trans_blocks(inode, 1) + 3);
> if (IS_ERR(handle))
> return;
>
> @@ -2855,22 +2859,32 @@ out_stop:
> }
>
> /*
> - * ext4_ext_writepage_trans_blocks:
> + * ext4_ext_writeblocks_trans_credits:
> * calculate max number of blocks we could modify
> - * in order to allocate new block for an inode
> + * in order to allocate the required number of new blocks
> + *
> + * In the worst case, one block per extent.
> + *
> */
> -int ext4_ext_writepage_trans_blocks(struct inode *inode, int num)
> +int ext4_ext_writeblocks_trans_credits(struct inode *inode, int nrblocks)
> {
> int needed;
>
> - needed = ext4_ext_calc_credits_for_insert(inode, NULL);
> -
> - /* caller wants to allocate num blocks, but note it includes sb */
> - needed = needed * num - (num - 1);
> + /* cost of adding a single extent:
> + * index blocks, leafs, bitmaps,
> + * groupdescp
> + */
> + needed = ext4_ext_calc_credits_for_single_extent(inode, NULL);
> + /*
> + * For data=journalled mode need to account for the data blocks
> + * Also need to add super block and inode block
> + */
> + if (ext4_should_journal_data(inode))
> + needed = nrblocks * (needed + 1) + 2;
> + else
> + needed = nrblocks * needed + 2;
>
> -#ifdef CONFIG_QUOTA
> needed += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
> -#endif
>
> return needed;
> }
> @@ -2935,10 +2949,9 @@ long ext4_fallocate(struct inode *inode,
> max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> - block;
> /*
> - * credits to insert 1 extent into extent tree + buffers to be able to
> - * modify 1 super block, 1 block bitmap and 1 group descriptor.
> + * credits to insert 1 extent into extent tree
> */
> - credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
> + credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
> mutex_lock(&inode->i_mutex);
> retry:
> while (ret >= 0 && ret < max_blocks) {
> Index: linux-2.6.26git6/fs/ext4/inode.c
> ===================================================================
> --- linux-2.6.26git6.orig/fs/ext4/inode.c 2008-07-28 22:53:21.000000000 -0700
> +++ linux-2.6.26git6/fs/ext4/inode.c 2008-07-29 17:45:43.000000000 -0700
> @@ -1,5 +1,5 @@
> /*
> - * linux/fs/ext4/inode.c
> + * linux/fs/ext4/inode.c
> *
> * Copyright (C) 1992, 1993, 1994, 1995
> * Remy Card ([email protected])
> @@ -954,15 +954,6 @@ out:
>
> /* Maximum number of blocks we map for direct IO at once. */
> #define DIO_MAX_BLOCKS 4096
> -/*
> - * Number of credits we need for writing DIO_MAX_BLOCKS:
> - * We need sb + group descriptor + bitmap + inode -> 4
> - * For B blocks with A block pointers per block we need:
> - * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect).
> - * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25.
> - */
> -#define DIO_CREDITS 25
> -
>
> /*
> *
> @@ -1082,13 +1073,13 @@ static int ext4_get_block(struct inode *
> handle_t *handle = ext4_journal_current_handle();
> int ret = 0, started = 0;
> unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> + int dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
>
> if (create && !handle) {
> /* Direct IO write... */
> if (max_blocks > DIO_MAX_BLOCKS)
> max_blocks = DIO_MAX_BLOCKS;
> - handle = ext4_journal_start(inode, DIO_CREDITS +
> - 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
> + handle = ext4_journal_start(inode, dio_credits);
> if (IS_ERR(handle)) {
> ret = PTR_ERR(handle);
> goto out;
> @@ -1267,7 +1258,7 @@ static int ext4_write_begin(struct file
> struct page **pagep, void **fsdata)
> {
> struct inode *inode = mapping->host;
> - int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
> + int ret, needed_blocks = ext4_writepages_trans_blocks(inode, 1);
> handle_t *handle;
> int retries = 0;
> struct page *page;
> @@ -2153,20 +2144,6 @@ static int ext4_da_writepage(struct page
>
> return ret;
> }
> -
> -/*
> - * For now just follow the DIO way to estimate the max credits
> - * needed to write out EXT4_MAX_WRITEBACK_PAGES.
> - * todo: need to calculate the max credits need for
> - * extent based files, currently the DIO credits is based on
> - * indirect-blocks mapping way.
> - *
> - * Probably should have a generic way to calculate credits
> - * for DIO, writepages, and truncate
> - */
> -#define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
> -#define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS
> -
> static int ext4_da_writepages(struct address_space *mapping,
> struct writeback_control *wbc)
> {
> @@ -2176,22 +2153,24 @@ static int ext4_da_writepages(struct add
> int ret = 0;
> long to_write;
> loff_t range_start = 0;
> + int blocks_per_page = PAGE_CACHE_SIZE >> inode->i_blkbits;
> + int max_credit_blocks = ext4_journal_max_transaction_buffers(inode);
> + int need_credits_per_page = ext4_writepages_trans_blocks(inode, 1);
> + int max_writeback_pages = (max_credit_blocks / blocks_per_page) / need_credits_per_page;
>
> /*
> * No pages to write? This is mainly a kludge to avoid starting
> * a transaction for special inodes like journal inode on last iput()
> * because that could violate lock ordering on umount
> */
> - if (!mapping->nrpages)
> + if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> return 0;
>
> - /*
> - * Estimate the worse case needed credits to write out
> - * EXT4_MAX_BUF_BLOCKS pages
> - */
> - needed_blocks = EXT4_MAX_WRITEBACK_CREDITS;
> + if (wbc->nr_to_write > mapping->nrpages)
> + wbc->nr_to_write = mapping->nrpages;
>
> to_write = wbc->nr_to_write;
> +
> if (!wbc->range_cyclic) {
> /*
> * If range_cyclic is not set force range_cont
> @@ -2202,10 +2181,31 @@ static int ext4_da_writepages(struct add
> }
>
> while (!ret && to_write) {
> + /*
> + * limit the number of dirty pages written at a time
> + * so that they fit into the reserved transaction credits
> + */
> + if (wbc->nr_to_write > max_writeback_pages)
> + wbc->nr_to_write = max_writeback_pages;
> +
> + /*
> + * Estimate the worst-case credits needed to write out
> + * to_write pages
> + */
> + needed_blocks = ext4_writepages_trans_blocks(inode,
> + wbc->nr_to_write);
> + while (needed_blocks > max_credit_blocks) {
> + wbc->nr_to_write --;
> + needed_blocks = ext4_writepages_trans_blocks(inode,
> + wbc->nr_to_write);
> + }
> /* start a new transaction*/
> handle = ext4_journal_start(inode, needed_blocks);
> if (IS_ERR(handle)) {
> ret = PTR_ERR(handle);
> + printk(KERN_EMERG "%s: Not enough credits to flush %ld pages\n", __func__,
> + wbc->nr_to_write);
> + dump_stack();
> goto out_writepages;
> }
> if (ext4_should_order_data(inode)) {
> @@ -2221,12 +2221,6 @@ static int ext4_da_writepages(struct add
> }
>
> }
> - /*
> - * set the max dirty pages could be write at a time
> - * to fit into the reserved transaction credits
> - */
> - if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
> - wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
>
> to_write -= wbc->nr_to_write;
> ret = mpage_da_writepages(mapping, wbc,
> @@ -2587,7 +2581,8 @@ static int __ext4_journalled_writepage(s
> * references to buffers so we are safe */
> unlock_page(page);
>
> - handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
> + handle = ext4_journal_start(inode,
> + ext4_writepages_trans_blocks(inode, 1));
> if (IS_ERR(handle)) {
> ret = PTR_ERR(handle);
> goto out;
> @@ -4271,20 +4266,20 @@ int ext4_getattr(struct vfsmount *mnt, s
> /*
> * How many blocks doth make a writepage()?
> *
> - * With N blocks per page, it may be:
> - * N data blocks
> + * With N blocks per page, and P pages, it may be:
> + * N*P data blocks
> * 2 indirect block
> * 2 dindirect
> * 1 tindirect
> - * N+5 bitmap blocks (from the above)
> - * N+5 group descriptor summary blocks
> + * N*P+5 bitmap blocks (from the above)
> + * N*P+5 group descriptor summary blocks
> * 1 inode block
> * 1 superblock.
> * 2 * EXT4_SINGLEDATA_TRANS_BLOCKS for the quote files
> *
> - * 3 * (N + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
> + * 3 * (N*P + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
> *
> - * With ordered or writeback data it's the same, less the N data blocks.
> + * With ordered or writeback data it's the same, less the N*P data blocks.
> *
> * If the inode's direct blocks can hold an integral number of pages then a
> * page cannot straddle two indirect blocks, and we can only touch one indirect
> @@ -4295,30 +4290,49 @@ int ext4_getattr(struct vfsmount *mnt, s
> * block and work out the exact number of indirects which are touched. Pah.
> */
>
> -int ext4_writepage_trans_blocks(struct inode *inode)
> +static int ext4_writeblocks_trans_credits_old(struct inode *inode, int nrblocks)
> {
> - int bpp = ext4_journal_blocks_per_page(inode);
> - int indirects = (EXT4_NDIR_BLOCKS % bpp) ? 5 : 3;
> + int indirects = (EXT4_NDIR_BLOCKS % nrblocks) ? 5 : 3;
> int ret;
>
> - if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
> - return ext4_ext_writepage_trans_blocks(inode, bpp);
> -
> if (ext4_should_journal_data(inode))
> - ret = 3 * (bpp + indirects) + 2;
> + ret = 3 * (nrblocks + indirects) + 2;
> else
> - ret = 2 * (bpp + indirects) + 2;
> + ret = 2 * nrblocks + 3* indirects + 2;
>
> -#ifdef CONFIG_QUOTA
> /* We know that structure was already allocated during DQUOT_INIT so
> * we will be updating only the data blocks + inodes */
> ret += 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
> -#endif
>
> return ret;
> }
>
> /*
> + * Calculate the total number of credits to reserve to fit
> + * the modification of @nrpages pages into a single transaction
> + *
> + * This could be called via ext4_write_begin() or later
> + * ext4_da_writepages() in the delayed allocation case.
> + *
> + * In both cases it's possible that we could be allocating multiple
> + * chunks of blocks. We need to consider the worst case, when
> + * there is one new block per extent.
> + *
> + * For Direct IO and fallocate, the journal credits reservation
> + * is based on one single extent allocation, so they could use
> + * EXT4_DATA_TRANS_BLOCKS to get the credits needed to log a single
> + * chunk of allocation.
> + */
> +int ext4_writepages_trans_blocks(struct inode *inode, int nrpages)
> +{
> + int bpp = ext4_journal_blocks_per_page(inode);
> + int nrblocks = nrpages * bpp;
> +
> + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> + return ext4_writeblocks_trans_credits_old(inode, nrblocks);
> + return ext4_ext_writeblocks_trans_credits(inode, nrblocks);
> +}
> +/*
> * The caller must have previously called ext4_reserve_inode_write().
> * Give this, we know that the caller already has write access to iloc->bh.
> */
> Index: linux-2.6.26git6/fs/ext4/migrate.c
> ===================================================================
> --- linux-2.6.26git6.orig/fs/ext4/migrate.c 2008-07-13 14:51:29.000000000 -0700
> +++ linux-2.6.26git6/fs/ext4/migrate.c 2008-07-28 22:53:21.000000000 -0700
> @@ -52,9 +52,11 @@ static int finish_range(handle_t *handle
> * Since we are doing this in loop we may accumalate extra
> * credit. But below we try to not accumalate too much
> * of them by restarting the journal.
> + *
> + * extra 4 credits for: 1 superblock, 1 inode block, 2 quotas
> */
> - needed = ext4_ext_calc_credits_for_insert(inode, path);
> -
> + needed = ext4_ext_calc_credits_for_single_extent(inode, path) + 2
> + + 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
> /*
> * Make sure the credit we accumalated is not really high
> */
> Index: linux-2.6.26git6/fs/ext4/ext4_extents.h
> ===================================================================
> --- linux-2.6.26git6.orig/fs/ext4/ext4_extents.h 2008-07-28 22:47:22.000000000 -0700
> +++ linux-2.6.26git6/fs/ext4/ext4_extents.h 2008-07-28 22:55:40.000000000 -0700
> @@ -216,7 +216,8 @@ extern int ext4_ext_calc_metadata_amount
> extern ext4_fsblk_t idx_pblock(struct ext4_extent_idx *);
> extern void ext4_ext_store_pblock(struct ext4_extent *, ext4_fsblk_t);
> extern int ext4_extent_tree_init(handle_t *, struct inode *);
> -extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
> +extern int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
> + struct ext4_ext_path *path);
> extern int ext4_ext_try_to_merge(struct inode *inode,
> struct ext4_ext_path *path,
> struct ext4_extent *);
> Index: linux-2.6.26git6/fs/ext4/ext4_jbd2.h
> ===================================================================
> --- linux-2.6.26git6.orig/fs/ext4/ext4_jbd2.h 2008-07-28 22:47:22.000000000 -0700
> +++ linux-2.6.26git6/fs/ext4/ext4_jbd2.h 2008-07-28 22:53:21.000000000 -0700
> @@ -231,4 +231,14 @@ static inline int ext4_should_writeback_
> return 0;
> }
>
> +static inline int ext4_journal_max_transaction_buffers(struct inode *inode)
> +{
> + /*
> + * max transaction buffers
> + * calculation based on
> + * journal->j_max_transaction_buffers = journal->j_maxlen / 4;
> + */
> + return (EXT4_JOURNAL(inode))->j_maxlen / 4;
> +}
> +
> #endif /* _EXT4_JBD2_H */
>
>
On Jul 29, 2008 18:58 -0700, Mingming Cao wrote:
> + * For inserting a single extent, in the worse case extent tree depth is 5
> + * for old tree and new tree, for every level we need to reserve
> + * credits to log the bitmap and block group descriptors
> + *
> + * credit needed for the update of super block + inode block + quota files
> + * are not included here. The caller of this function need to take care of this.
> */
> -int ext4_ext_calc_credits_for_insert(struct inode *inode,
> +int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
> struct ext4_ext_path *path)
> {
> int depth, needed;
>
> + depth = ext_depth(inode);
> +
> if (path) {
> /* probably there is space in leaf? */
> if (le16_to_cpu(path[depth].p_hdr->eh_entries)
> < le16_to_cpu(path[depth].p_hdr->eh_max))
Please fix code style here - '<' at end of previous line, align with "if (".
> + /* 1 for block bitmap, 1 for group descriptor */
> + return 2;
> }
>
> + /* add one more level in case of tree increase when insert a extent */
> + depth += 1;
Shouldn't this only be if depth < 5?
> /*
> * tree can be full, so it would need to grow in depth:
> * we need one credit to modify old root, credits for
> * new root will be added in split accounting
> */
> needed += 1;
Similarly, this should only be done if depth < 5? We should have a constant for this:
/*
* given 32-bit logical block (4294967296 blocks), max. tree
* can be 4 levels in depth -- 4 * 340^4 == 53453440000.
* Let's also add one more level for imbalance.
*/
#define EXT4_EXT_MAX_DEPTH 5
> /*
> * Index split can happen, we would need:
> * allocate intermediate indexes (bitmap + group)
> * + change two blocks at each level, but root (already included)
> */
> needed += (depth * 2) + (depth * 2);
Again, this can only happen if < EXT4_EXT_MAX_DEPTH.
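A minimal standalone sketch of the depth clamp raised in the comments above, written as plain C rather than kernel code. EXT4_EXT_MAX_DEPTH is the constant proposed in this mail; single_extent_credits() and its cur_depth parameter are hypothetical stand-ins for ext4_ext_calc_credits_for_single_extent() with path == NULL and for ext_depth(inode). How the index-split term itself should be bounded is left exactly as in the posted patch.

```c
#include <assert.h>

/* Proposed in this mail; not part of the posted patch. */
#define EXT4_EXT_MAX_DEPTH 5

/*
 * Hypothetical stand-in for ext4_ext_calc_credits_for_single_extent()
 * when no path is passed. cur_depth plays the role of ext_depth(inode).
 */
static int single_extent_credits(int cur_depth)
{
	int depth = cur_depth;
	int needed;

	assert(cur_depth >= 0 && cur_depth <= EXT4_EXT_MAX_DEPTH);

	/* block bitmap + group descriptor for the newly allocated extent */
	needed = 2;

	if (depth < EXT4_EXT_MAX_DEPTH) {
		/* the tree can still grow by one level ... */
		depth += 1;
		/* ... which costs one credit to modify the old root */
		needed += 1;
	}

	/* index split accounting, same arithmetic as the patch */
	needed += (depth * 2) + (depth * 2);

	return needed;
}
```

With this sketch an empty tree (depth 0) asks for 7 credits, a depth-4 tree for 23, and a tree already at the maximum depth for 22, because no further growth is charged.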
> +int ext4_ext_writeblocks_trans_credits(struct inode *inode, int nrblocks)
please remove duplicate space after "int"
> {
> int needed;
>
> + /* cost of adding a single extent:
> + * index blocks, leafs, bitmaps,
> + * groupdescp
> + */
> + needed = ext4_ext_calc_credits_for_single_extent(inode, NULL);
> + /*
> + * For data=journalled mode need to account for the data blocks
> + * Also need to add super block and inode block
> + */
> + if (ext4_should_journal_data(inode))
> + needed = nrblocks * (needed + 1) + 2;
> + else
> + needed = nrblocks * needed + 2;
It is also hard to understand why we need "nrblocks" times a single extent
insert. That would assume we need a full tree split on EVERY block
inserted, which I don't think is reasonable. Have you printed out some
of these values to see how large they actually are?
Instead, we shouldn't have ext4_ext_calc_credits_for_single_extent()
at all, and instead pass in the nrblocks parameter and work it out
on this basis. That means at most nrblocks/340 extents/bitmaps/groups
at one time, plus at most 4 more levels of split in worst case. We
don't need 4 * nrblocks for each write.
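To put rough numbers on the difference, here is a small standalone C sketch. patch_estimate() follows the arithmetic of the posted patch for the non-journalled case. bounded_estimate() is only one possible reading of the suggestion above, not code from any patch: it assumes one leaf block, one bitmap and one group descriptor per 340 blocks, plus up to 4 levels of split charged the same way, plus the superblock and inode block. Quota credits are left out of both. The figure 340 is the number of 12-byte extents that fit in a 4 KiB block after the 12-byte header.

```c
#include <assert.h>

/* 4 KiB block: (4096 - 12 byte header) / 12 byte extent entry = 340 */
#define EXTENTS_PER_LEAF 340
/* "at most 4 more levels of split in worst case" */
#define MAX_SPLIT_LEVELS 4

/* Arithmetic of the posted patch, ordered/writeback mode, no quota. */
static long patch_estimate(long nrblocks, long per_extent_credits)
{
	assert(nrblocks >= 0 && per_extent_credits >= 0);
	return nrblocks * per_extent_credits + 2;
}

/*
 * Hypothetical reading of the review suggestion: charge per leaf
 * block touched instead of per data block written.
 */
static long bounded_estimate(long nrblocks)
{
	long leaves;

	assert(nrblocks >= 0);
	/* round up: a partially filled leaf still has to be logged */
	leaves = (nrblocks + EXTENTS_PER_LEAF - 1) / EXTENTS_PER_LEAF;

	/* leaf + bitmap + group descriptor per leaf,
	 * index block + bitmap + group descriptor per split level,
	 * + 2 for superblock and inode block */
	return 3 * leaves + 3 * MAX_SPLIT_LEVELS + 2;
}
```

For 1024 blocks and 7 credits per single-extent insert, patch_estimate() gives 7170 credits while bounded_estimate() gives 26, which is the order-of-magnitude gap the review comment is pointing at.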
> @@ -2202,10 +2181,31 @@ static int ext4_da_writepages(struct add
> + /*
> + * Estimate the worse case needed credits to write out
> + * to_write pages
> + */
> + needed_blocks = ext4_writepages_trans_blocks(inode,
> + wbc->nr_to_write);
> + while (needed_blocks > max_credit_blocks) {
> + wbc->nr_to_write --;
> + needed_blocks = ext4_writepages_trans_blocks(inode,
> + wbc->nr_to_write);
> + }
This isn't a very efficient loop to decrement nr_to_write by one each time.
It is best to pass multiples of the RAID stripe size to mballoc to avoid
extra overhead. I'd check something like below:
	needed_blocks = ~0U;
	while (needed_blocks > max_credit_blocks) {
		needed_blocks = ext4_writepages_trans_blocks(inode,
							wbc->nr_to_write);
		/* We are more than twice the max_credit_blocks */
		if (needed_blocks + max_credit_blocks / 2 >
		    2 * max_credit_blocks)
			wbc->nr_to_write /= 2;
		else
			wbc->nr_to_write -=
				(needed_blocks-max_credit_blocks+3) / 4;
	}
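The shrinking loop suggested above can be tried outside the kernel with a stand-in cost function. This is a sketch under stated assumptions: credits_for_pages() is a hypothetical linear replacement for ext4_writepages_trans_blocks(), and fit_nr_to_write() and its steps counter are illustrative names. Unlike the snippet in the mail, the sketch checks the estimate before adjusting, so a request that already fits is left alone, and it never goes below one page.

```c
#include <assert.h>

/* Hypothetical stand-in for ext4_writepages_trans_blocks(). */
static long credits_for_pages(long nr_pages)
{
	return nr_pages * 7 + 2;
}

/*
 * Shrink nr_to_write until its credit estimate fits in max_credits.
 * *steps counts how many times the estimate had to be recomputed.
 */
static long fit_nr_to_write(long nr_to_write, long max_credits, int *steps)
{
	long needed;

	assert(nr_to_write > 0 && max_credits > 0 && steps);

	*steps = 0;
	needed = credits_for_pages(nr_to_write);
	while (needed > max_credits && nr_to_write > 1) {
		if (needed + max_credits / 2 > 2 * max_credits)
			/* far over the limit: halve the request */
			nr_to_write /= 2;
		else
			/* close to the limit: trim by a quarter of the excess */
			nr_to_write -= (needed - max_credits + 3) / 4;
		if (nr_to_write < 1)
			nr_to_write = 1;
		needed = credits_for_pages(nr_to_write);
		(*steps)++;
	}
	return nr_to_write;
}
```

Starting from 4096 pages with a 2048-credit limit, this settles on 256 pages after 4 recomputations. Decrementing by one page at a time would need several thousand recomputations with the same cost function, although it would end on a larger page count.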
> +static inline int ext4_journal_max_transaction_buffers(struct inode *inode)
> +{
> + /*
> + * max transaction buffers
> + * calculation based on
> + * journal->j_max_transaction_buffers = journal->j_maxlen / 4;
> + */
> + return (EXT4_JOURNAL(inode))->j_maxlen / 4;
Why does this not use "j_max_transaction_buffers" directly? That is what
start_this_handle() checks against. Also, this function should probably
be in fs/jbd2 instead of in fs/ext4.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Hi Mingming,
This is the patch I was working on. The patch is still not complete,
but I am sending it early so that I can get some feedback.
commit a4546f0034fcfb0a20991378fe4a3cf6c157ad72
Author: Aneesh Kumar K.V <[email protected]>
Date: Wed Jul 30 17:34:25 2008 +0530
ext4: Rework the ext4_da_writepages
With the below changes we reserve the credits needed to insert only one extent,
resulting from a single get_block call. That makes sure we don't take
too many journal credits during writeout. We also don't limit the pages
to write. That means we loop through the dirty pages, building the largest
possible contiguous block request. Then we issue a single get_block request.
We may get fewer blocks than we requested. If so, we end up not mapping
some of the buffer_heads. That means those buffer_heads are still marked delay.
Later, in the writepage callback via __mpage_writepage, we redirty those pages.
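The control flow described above can be modelled in a few lines of standalone C. This is a toy model under stated assumptions, not the kernel code: blocks are entries in an int array, fake_get_block() is a hypothetical allocator that may return fewer blocks than requested, and each call to write_one_extent() stands for one journal handle carrying credits for a single extent insert.

```c
#include <assert.h>
#include <stddef.h>

enum { BLK_CLEAN, BLK_DELAY, BLK_MAPPED };

/* Hypothetical allocator: maps at most `limit` blocks per request. */
static size_t fake_get_block(size_t want, size_t limit)
{
	return want < limit ? want : limit;
}

/*
 * One pass: collect the first run of contiguous delayed blocks,
 * issue a single allocation request for it, and mark what was
 * actually mapped. Unmapped blocks stay delayed for a later pass.
 * Returns the number of blocks mapped (0 when nothing is left).
 */
static size_t write_one_extent(int *state, size_t n, size_t alloc_limit)
{
	size_t start = 0, end, got, i;

	assert(state != NULL);

	while (start < n && state[start] != BLK_DELAY)
		start++;
	if (start == n)
		return 0;

	end = start;
	while (end < n && state[end] == BLK_DELAY)
		end++;

	got = fake_get_block(end - start, alloc_limit);
	for (i = start; i < start + got; i++)
		state[i] = BLK_MAPPED;
	return got;
}

/* Repeat single-extent passes until no delayed block gets mapped. */
static size_t write_all(int *state, size_t n, size_t alloc_limit)
{
	size_t passes = 0;

	while (write_one_extent(state, n, alloc_limit) > 0)
		passes++;
	return passes;
}
```

With the states DELAY, DELAY, DELAY, CLEAN, DELAY, DELAY and an allocator limited to 2 blocks per request, the first run takes two passes (2 blocks, then the leftover 1) and the second run takes one, so write_all() reports 3 passes.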
TODO:
a) Some pages are leaked without unlock, so fsstress deadlocks on lock_page
b) range_start is not handled correctly in loop
Signed-off-by: Aneesh Kumar K.V <[email protected]>
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5a394c8..23f55cf 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -41,6 +41,8 @@
#include "acl.h"
#include "ext4_extents.h"
+#define MPAGE_DA_EXTENT_TAIL 0x01
+
static inline int ext4_begin_ordered_truncate(struct inode *inode,
loff_t new_size)
{
@@ -1604,9 +1606,10 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd)
.get_block = mpd->get_block,
.use_writepage = 1,
};
- int ret = 0, err, nr_pages, i;
+ int err, nr_pages, i;
unsigned long index, end;
struct pagevec pvec;
+ int written_pages = 0;
BUG_ON(mpd->next_page <= mpd->first_page);
@@ -1628,21 +1631,20 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd)
index++;
err = __mpage_writepage(page, mpd->wbc, &mpd_pp);
-
+ if (!err)
+ written_pages++;
/*
* In error case, we have to continue because
* remaining pages are still locked
* XXX: unlock and re-dirty them?
*/
- if (ret == 0)
- ret = err;
}
pagevec_release(&pvec);
}
if (mpd_pp.bio)
mpage_bio_submit(WRITE, mpd_pp.bio);
- return ret;
+ return written_pages;
}
/*
@@ -1745,10 +1747,10 @@ static inline void __unmap_underlying_blocks(struct inode *inode,
* The function ignores errors ->get_block() returns, thus real
* error handling is postponed to __mpage_writepage()
*/
-static void mpage_da_map_blocks(struct mpage_da_data *mpd)
+static int mpage_da_map_blocks(struct mpage_da_data *mpd)
{
+ int err = 0;
struct buffer_head *lbh = &mpd->lbh;
- int err = 0, remain = lbh->b_size;
sector_t next = lbh->b_blocknr;
struct buffer_head new;
@@ -1756,37 +1758,33 @@ static void mpage_da_map_blocks(struct mpage_da_data *mpd)
* We consider only non-mapped and non-allocated blocks
*/
if (buffer_mapped(lbh) && !buffer_delay(lbh))
- return;
-
- while (remain) {
- new.b_state = lbh->b_state;
- new.b_blocknr = 0;
- new.b_size = remain;
- err = mpd->get_block(mpd->inode, next, &new, 1);
- if (err) {
- /*
- * Rather than implement own error handling
- * here, we just leave remaining blocks
- * unallocated and try again with ->writepage()
- */
- break;
- }
- BUG_ON(new.b_size == 0);
-
- if (buffer_new(&new))
- __unmap_underlying_blocks(mpd->inode, &new);
+ return 0;
+ new.b_state = lbh->b_state;
+ new.b_blocknr = 0;
+ new.b_size = lbh->b_size;
+ err = mpd->get_block(mpd->inode, next, &new, 1);
+ if (err) {
/*
- * If blocks are delayed marked, we need to
- * put actual blocknr and drop delayed bit
+ * We failed to do block allocation. All
+ * the pages will be redirtied later in
+ * ext4_da_writepage
*/
- if (buffer_delay(lbh))
- mpage_put_bnr_to_bhs(mpd, next, &new);
-
- /* go for the remaining blocks */
- next += new.b_size >> mpd->inode->i_blkbits;
- remain -= new.b_size;
+ return 0;
}
+ BUG_ON(new.b_size == 0);
+
+ if (buffer_new(&new))
+ __unmap_underlying_blocks(mpd->inode, &new);
+
+ /*
+ * If blocks are delayed marked, we need to
+ * put actual blocknr and drop delayed bit
+ */
+ if (buffer_delay(lbh))
+ mpage_put_bnr_to_bhs(mpd, next, &new);
+
+ return new.b_size;
}
#define BH_FLAGS ((1 << BH_Uptodate) | (1 << BH_Mapped) | (1 << BH_Delay))
@@ -1800,7 +1798,7 @@ static void mpage_da_map_blocks(struct mpage_da_data *mpd)
*
* the function is used to collect contig. blocks in same state
*/
-static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
+static int mpage_add_bh_to_extent(struct mpage_da_data *mpd,
sector_t logical, struct buffer_head *bh)
{
struct buffer_head *lbh = &mpd->lbh;
@@ -1815,7 +1813,7 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
lbh->b_blocknr = logical;
lbh->b_size = bh->b_size;
lbh->b_state = bh->b_state & BH_FLAGS;
- return;
+ return 0;
}
/*
@@ -1823,21 +1821,14 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
*/
if (logical == next && (bh->b_state & BH_FLAGS) == lbh->b_state) {
lbh->b_size += bh->b_size;
- return;
+ return 0;
}
/*
* We couldn't merge the block to our extent, so we
* need to flush current extent and start new one
*/
- mpage_da_map_blocks(mpd);
-
- /*
- * Now start a new extent
- */
- lbh->b_size = bh->b_size;
- lbh->b_state = bh->b_state & BH_FLAGS;
- lbh->b_blocknr = logical;
+ return MPAGE_DA_EXTENT_TAIL;
}
/*
@@ -1852,6 +1843,7 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
static int __mpage_da_writepage(struct page *page,
struct writeback_control *wbc, void *data)
{
+ int ret = 0;
struct mpage_da_data *mpd = data;
struct inode *inode = mpd->inode;
struct buffer_head *bh, *head, fake;
@@ -1866,12 +1858,13 @@ static int __mpage_da_writepage(struct page *page,
* and start IO on them using __mpage_writepage()
*/
if (mpd->next_page != mpd->first_page) {
- mpage_da_map_blocks(mpd);
- mpage_da_submit_io(mpd);
+ unlock_page(page);
+ ret = MPAGE_DA_EXTENT_TAIL;
+ goto finish_extent;
}
/*
- * Start next extent of pages ...
+ * Start extent of pages ...
*/
mpd->first_page = page->index;
@@ -1897,7 +1890,7 @@ static int __mpage_da_writepage(struct page *page,
bh->b_state = 0;
set_buffer_dirty(bh);
set_buffer_uptodate(bh);
- mpage_add_bh_to_extent(mpd, logical, bh);
+ ret = mpage_add_bh_to_extent(mpd, logical, bh);
} else {
/*
* Page with regular buffer heads, just add all dirty ones
@@ -1906,13 +1899,17 @@ static int __mpage_da_writepage(struct page *page,
bh = head;
do {
BUG_ON(buffer_locked(bh));
- if (buffer_dirty(bh))
- mpage_add_bh_to_extent(mpd, logical, bh);
+ if (buffer_dirty(bh)) {
+ ret = mpage_add_bh_to_extent(mpd, logical, bh);
+ if (ret == MPAGE_DA_EXTENT_TAIL)
+ goto finish_extent;
+ }
logical++;
} while ((bh = bh->b_this_page) != head);
}
- return 0;
+finish_extent:
+ return ret;
}
/*
@@ -1941,8 +1938,8 @@ static int mpage_da_writepages(struct address_space *mapping,
struct writeback_control *wbc,
get_block_t get_block)
{
+ int ret, to_write, written_pages;
struct mpage_da_data mpd;
- int ret;
if (!get_block)
return generic_writepages(mapping, wbc);
@@ -1956,16 +1953,20 @@ static int mpage_da_writepages(struct address_space *mapping,
mpd.next_page = 0;
mpd.get_block = get_block;
+ to_write = wbc->nr_to_write;
+
ret = write_cache_pages(mapping, wbc, __mpage_da_writepage, &mpd);
/*
- * Handle last extent of pages
+ * Allocate blocks for the dirty delayed
+ * pages found
*/
if (mpd.next_page != mpd.first_page) {
mpage_da_map_blocks(&mpd);
- mpage_da_submit_io(&mpd);
+ written_pages = mpage_da_submit_io(&mpd);
+ /* update nr_to_write correctly */
+ wbc->nr_to_write = to_write - written_pages;
}
-
return ret;
}
@@ -2118,9 +2119,11 @@ static int ext4_da_writepage(struct page *page,
* If we don't have mapping block we just ignore
* them. We can also reach here via shrink_page_list
*/
+ start_pdflush = 1;
redirty_page_for_writepage(wbc, page);
unlock_page(page);
- return 0;
+ ret = 0;
+ goto finish_ret;
}
} else {
/*
@@ -2143,9 +2146,11 @@ static int ext4_da_writepage(struct page *page,
/* check whether all are mapped and non delay */
if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
ext4_bh_unmapped_or_delay)) {
+ start_pdflush = 1;
redirty_page_for_writepage(wbc, page);
unlock_page(page);
- return 0;
+ ret = 0;
+ goto finish_ret;
}
} else {
/*
@@ -2153,9 +2158,11 @@ static int ext4_da_writepage(struct page *page,
* so just redity the page and unlock
* and return
*/
+ start_pdflush = 1;
redirty_page_for_writepage(wbc, page);
unlock_page(page);
- return 0;
+ ret = 0;
+ goto finish_ret;
}
}
@@ -2165,7 +2172,7 @@ static int ext4_da_writepage(struct page *page,
ret = block_write_full_page(page,
ext4_normal_get_block_write,
wbc);
-
+finish_ret:
return ret;
}
@@ -2201,19 +2208,14 @@ static int ext4_da_writepages(struct address_space *mapping,
}
while (!ret && to_write) {
- /*
- * set the max dirty pages could be write at a time
- * to fit into the reserved transaction credits
- */
- if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
- wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
/*
- * Estimate the worse case needed credits to write out
- * to_write pages
+ * We write only a single extent in a loop.
+ * So allocate credit needed to write a single
+ * extent. journalled mode is not supported.
*/
- needed_blocks = ext4_writepages_trans_blocks(inode,
- wbc->nr_to_write);
+ BUG_ON(ext4_should_journal_data(inode));
+ needed_blocks = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
/* start a new transaction*/
handle = ext4_journal_start(inode, needed_blocks);
@@ -2239,7 +2241,19 @@ static int ext4_da_writepages(struct address_space *mapping,
ret = mpage_da_writepages(mapping, wbc,
ext4_da_get_block_write);
ext4_journal_stop(handle);
- if (wbc->nr_to_write) {
+ if (ret == MPAGE_DA_EXTENT_TAIL) {
+ /*
+ * got one extent now try with
+ * rest of the pages
+ */
+ to_write += wbc->nr_to_write;
+ /*
+ * Try to write from the start
+ * The pages already written will
+ * not be tagged dirty
+ */
+ //wbc->range_start = range_start;
+ } else if (wbc->nr_to_write) {
/*
* There is no more writeout needed
* or we requested for a noblocking writeout
On Wed, 2008-07-30 at 13:29 +0200, Frédéric Bohé wrote:
> While doing some perf test on flex bg, I tried to run bonnie++ on
> 2.6.27-rc1 + patch queue including your journal credit fix but I had a
> very similar crash. Here are the details, I hope this help :
>
> kernel 2.6.27-rc1
> patch queue snapshot :
> ext4-patch-queue-25fb9834f3814b3aa567c5af090fba688a86eea9
>
> With latest e2fsprogs :
> mkfs.ext4 -t ext4dev -b1024 -G256 /dev/sdb1 4G
Looks like a 1k blocksize ext4. I have tested 1k briefly and it seems okay
for a single test. I will try bonnie myself. The stack shows there aren't
enough credits to delete a file. But the journal credit fix mostly fixes
the code path in writepages(), so it should not affect the unlink case.
Is this a regression with this patch, or is it an existing issue that this
patch did not fix?
There is one bug Aneesh pointed out today. I will update the patch, but
I don't think this matters for this issue.
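For readers less familiar with journal credits, here is a toy model of the failure mode read from the stack above. It assumes, as this reply does, that the BUG in jbd2_journal_dirty_metadata() fires because the handle has run out of buffer credits. The struct and function names are hypothetical and are not the jbd2 API. A buffer costs one credit the first time it is dirtied under a handle, and dirtying it again is free.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a journal handle. */
struct toy_handle {
	int buffer_credits;
};

/*
 * Dirty one metadata buffer under the handle.
 * Returns 0 on success and -1 where the kernel would hit the BUG,
 * i.e. a new buffer is dirtied with no credits left.
 */
static int toy_dirty_metadata(struct toy_handle *h, int *already_modified)
{
	assert(h != NULL && already_modified != NULL);

	if (*already_modified)
		return 0;	/* already part of the transaction */
	if (h->buffer_credits <= 0)
		return -1;	/* reservation was too small */

	*already_modified = 1;
	h->buffer_credits--;
	return 0;
}
```

A handle started with 2 credits can dirty two distinct buffers any number of times, but the first attempt to dirty a third buffer fails. That is the situation the delete path appears to be in here.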
> mount -t ext4dev /dev/sdb1 /mnt/test
> bonnie++ -u root -s 2g:256 -r 1024 -n 200 -d /mnt/test/
>
> after a while, it ends up with :
>
> kernel BUG at fs/jbd2/transaction.c:984!
> invalid opcode: 0000 [#1] SMP
> Modules linked in: ext4dev jbd2 crc16 kvm_intel kvm megaraid_mbox
> megaraid_mm
>
> Pid: 13965, comm: bonnie++ Not tainted (2.6.27-rc1 #3)
> EIP: 0060:[<f8b186a6>] EFLAGS: 00010246 CPU: 4
> EIP is at jbd2_journal_dirty_metadata+0xc6/0xd0 [jbd2]
> EAX: 00000000 EBX: f0acc380 ECX: f0acc380 EDX: f0069f80
> ESI: f3964700 EDI: f5daa1b0 EBP: f6dd7e00 ESP: f5949ebc
> DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Process bonnie++ (pid: 13965, ti=f5948000 task=f5404ba0
> task.ti=f5948000)
> Stack: f7cb0100 f5daa1b0 f0acc380 f8b8ca12 f8b7ef62 f7cb0000 f68a5d00
> f7cb0100
> 00000000 f7183e00 f5daa1b0 f8b6a06e 00000040 f8b736db f7cb2134
> f2c94238
> 0000000b 00000000 00008000 00000000 f0acc380 f7cb0000 f08b2ac0
> f2c942c8
> Call Trace:
> [<f8b7ef62>] __ext4_journal_dirty_metadata+0x22/0x60 [ext4dev]
> [<f8b6a06e>] ext4_free_inode+0x26e/0x2f0 [ext4dev]
> [<f8b736db>] ext4_orphan_del+0xcb/0x180 [ext4dev]
> [<f8b6fb3c>] ext4_delete_inode+0x11c/0x140 [ext4dev]
> [<f8b6fa20>] ext4_delete_inode+0x0/0x140 [ext4dev]
> [<c018fe6a>] generic_delete_inode+0x5a/0xc0
> [<c018f4a4>] iput+0x44/0x50
> [<c0186271>] do_unlinkat+0xd1/0x150
> [<c017cdd6>] vfs_write+0x106/0x140
> [<c02aa7b0>] tty_write+0x0/0x1e0
> [<c017d2d1>] sys_write+0x41/0x70
> [<c0102fc9>] sysenter_do_call+0x12/0x25
> =======================
> Code: 55 2c 8d 76 00 74 aa 0f 0b eb fe 0f 0b eb fe 8d b6 00 00 00 00 0f
> 0b eb fe f6 43 02 20 0f 84 5d ff ff ff f3 90 eb f2 0f 0b eb fe <0f> 0b
> eb fe 8d b6 00 00 00 00 55 57 56 53 89 d3 83 ec 10 89 44
> EIP: [<f8b186a6>] jbd2_journal_dirty_metadata+0xc6/0xd0 [jbd2] SS:ESP
> 0068:f5949ebc
>
>
> Fred
>
>
>
> On Tuesday 29 July 2008 at 18:58 -0700, Mingming Cao wrote:
> > Ext4: journal credits reservation fixes for DIO, fallocate and delalloc writepages
> >
> > From: Mingming Cao <[email protected]>
> >
> > With delalloc, at writepages() time, we need to reserve enough credits to start
> > a new handle, to allow possibly multiple segments of block allocations under a
> > single call to mpage_da_writepages(), to fit metadata updates into the single
> > transaction. This patch fixes this by calculating the needed credits for
> > write-out given the number of dirty pages, taking discontiguous
> > block allocations into account. It fixes both extent files and non-extent files.
> >
> > This patch also fixes the journal credit reservation for DIO. Currently the
> > estimated credits for DIO are based only on the non-extent file format. That credit
> > is not enough for mballoc to allocate a single extent on an extent-based file.
> > This patch fixes that.
> >
> > fallocate double-books the credits for modifying the super block etc.; this patch
> > fixes that.
> >
> > This also fixes credit reservation in the migration and defrag code.
> >
> >
> > Changes since v2:
> >
> > 1) fix writepages() inefficiency issue. sync() will invoke writepages()
> > twice (not sure exactly why); the second time all the pages are clean, so
> > it wastes CPU time walking through all pages to find they are not
> > dirty. But it's simple to work around by skipping writepages() if there are
> > no dirty pages pointed to by the mapping.
> >
> >
> > 2) extent-based credit calculation is quite conservative. It always uses the
> > max possible depth to estimate the needed credits to support extent
> > insert/tree split. In fact the depth info for each inode is quite easy
> > to get, so we could use more accurate info for the calculation.
> >
> > 3) Limit the max number of pages that can be flushed at once from
> > ext4_da_writepages(), so that the max possible transaction credits
> > fit under the allowed credits for starting a new transaction. Reduce
> > the number of pages to flush if necessary. Currently with 4K page size
> > and 4K block size, with an extent file, it's possible to flush about 1K
> > pages under a single transaction.
> >
> >
> > Verified with memory pressure case and umount case,
> >
> > Signed-off-by: Mingming Cao <[email protected]>
> > ---
> > fs/ext4/ext4.h | 4 -
> > fs/ext4/ext4_extents.h | 3 -
> > fs/ext4/ext4_jbd2.h | 10 ++++
> > fs/ext4/extents.c | 78 ++++++++++++++++++-------------
> > fs/ext4/inode.c | 120 ++++++++++++++++++++++++++-----------------------
> > fs/ext4/migrate.c | 6 +-
> > 6 files changed, 129 insertions(+), 92 deletions(-)
> >
> > Index: linux-2.6.26git6/fs/ext4/ext4.h
> > ===================================================================
> > --- linux-2.6.26git6.orig/fs/ext4/ext4.h 2008-07-28 22:47:22.000000000 -0700
> > +++ linux-2.6.26git6/fs/ext4/ext4.h 2008-07-29 17:40:40.000000000 -0700
> > @@ -1072,7 +1072,7 @@ extern void ext4_truncate (struct inode
> > extern void ext4_set_inode_flags(struct inode *);
> > extern void ext4_get_inode_flags(struct ext4_inode_info *);
> > extern void ext4_set_aops(struct inode *inode);
> > -extern int ext4_writepage_trans_blocks(struct inode *);
> > +extern int ext4_writepages_trans_blocks(struct inode *, int nrpages);
> > extern int ext4_block_truncate_page(handle_t *handle,
> > struct address_space *mapping, loff_t from);
> > extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page);
> > @@ -1227,7 +1227,7 @@ extern const struct inode_operations ext
> >
> > /* extents.c */
> > extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
> > -extern int ext4_ext_writepage_trans_blocks(struct inode *, int);
> > +extern int ext4_ext_writeblocks_trans_credits(struct inode *inode, int);
> > extern int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
> > ext4_lblk_t iblock,
> > unsigned long max_blocks, struct buffer_head *bh_result,
> > Index: linux-2.6.26git6/fs/ext4/extents.c
> > ===================================================================
> > --- linux-2.6.26git6.orig/fs/ext4/extents.c 2008-07-28 22:53:20.000000000 -0700
> > +++ linux-2.6.26git6/fs/ext4/extents.c 2008-07-29 17:40:50.000000000 -0700
> > @@ -1747,34 +1747,43 @@ static int ext4_ext_rm_idx(handle_t *han
> > }
> >
> > /*
> > - * ext4_ext_calc_credits_for_insert:
> > - * This routine returns max. credits that the extent tree can consume.
> > + * ext4_ext_calc_credits_for_single_extent:
> > + * This routine returns max. credits that needed to insert an extent
> > + * to the extent tree.
> > * It should be OK for low-performance paths like ->writepage()
> > * To allow many writing processes to fit into a single transaction,
> > - * the caller should calculate credits under i_data_sem and
> > - * pass the actual path.
> > + * When pass the actual path, the caller should calculate credits
> > + * under i_data_sem.
> > + *
> > + * For inserting a single extent, in the worse case extent tree depth is 5
> > + * for old tree and new tree, for every level we need to reserve
> > + * credits to log the bitmap and block group descriptors
> > + *
> > + * credit needed for the update of super block + inode block + quota files
> > + * are not included here. The caller of this function need to take care of this.
> > */
> > -int ext4_ext_calc_credits_for_insert(struct inode *inode,
> > +int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
> > struct ext4_ext_path *path)
> > {
> > int depth, needed;
> >
> > + depth = ext_depth(inode);
> > +
> > if (path) {
> > /* probably there is space in leaf? */
> > - depth = ext_depth(inode);
> > if (le16_to_cpu(path[depth].p_hdr->eh_entries)
> > < le16_to_cpu(path[depth].p_hdr->eh_max))
> > - return 1;
> > + /* 1 for block bitmap, 1 for group descriptor */
> > + return 2;
> > }
> >
> > - /*
> > - * given 32-bit logical block (4294967296 blocks), max. tree
> > - * can be 4 levels in depth -- 4 * 340^4 == 53453440000.
> > - * Let's also add one more level for imbalance.
> > - */
> > - depth = 5;
> > + /* add one more level in case of tree increase when insert a extent */
> > + depth += 1;
> >
> > - /* allocation of new data block(s) */
> > + /*
> > + * bitmap blocks and group descriptor block for
> > + * allocation of new extent
> > + */
> > needed = 2;
> >
> > /*
> > @@ -1791,9 +1800,6 @@ int ext4_ext_calc_credits_for_insert(str
> > */
> > needed += (depth * 2) + (depth * 2);
> >
> > - /* any allocation modifies superblock */
> > - needed += 1;
> > -
> > return needed;
> > }
> >
> > @@ -1917,9 +1923,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
> > correct_index = 1;
> > credits += (ext_depth(inode)) + 1;
> > }
> > -#ifdef CONFIG_QUOTA
> > credits += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
> > -#endif
> >
> > err = ext4_ext_journal_restart(handle, credits);
> > if (err)
> > @@ -2801,8 +2805,8 @@ void ext4_ext_truncate(struct inode *ino
> > /*
> > * probably first extent we're gonna free will be last in block
> > */
> > - err = ext4_writepage_trans_blocks(inode) + 3;
> > - handle = ext4_journal_start(inode, err);
> > + handle = ext4_journal_start(inode,
> > + ext4_writepages_trans_blocks(inode, 1) + 3);
> > if (IS_ERR(handle))
> > return;
> >
> > @@ -2855,22 +2859,32 @@ out_stop:
> > }
> >
> > /*
> > - * ext4_ext_writepage_trans_blocks:
> > + * ext4_ext_writeblocks_trans_credits:
> > * calculate max number of blocks we could modify
> > - * in order to allocate new block for an inode
> > + * in order to allocate the required number of new blocks
> > + *
> > + * In the worse case, one block per extent.
> > + *
> > */
> > -int ext4_ext_writepage_trans_blocks(struct inode *inode, int num)
> > +int ext4_ext_writeblocks_trans_credits(struct inode *inode, int nrblocks)
> > {
> > int needed;
> >
> > - needed = ext4_ext_calc_credits_for_insert(inode, NULL);
> > -
> > - /* caller wants to allocate num blocks, but note it includes sb */
> > - needed = needed * num - (num - 1);
> > + /* cost of adding a single extent:
> > + * index blocks, leafs, bitmaps,
> > + * groupdescp
> > + */
> > + needed = ext4_ext_calc_credits_for_single_extent(inode, NULL);
> > + /*
> > + * For data=journalled mode need to account for the data blocks
> > + * Also need to add super block and inode block
> > + */
> > + if (ext4_should_journal_data(inode))
> > + needed = nrblocks * (needed + 1) + 2;
> > + else
> > + needed = nrblocks * needed + 2;
> >
> > -#ifdef CONFIG_QUOTA
> > needed += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
> > -#endif
> >
> > return needed;
> > }
> > @@ -2935,10 +2949,9 @@ long ext4_fallocate(struct inode *inode,
> > max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > - block;
> > /*
> > - * credits to insert 1 extent into extent tree + buffers to be able to
> > - * modify 1 super block, 1 block bitmap and 1 group descriptor.
> > + * credits to insert 1 extent into extent tree
> > */
> > - credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
> > + credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
> > mutex_lock(&inode->i_mutex);
> > retry:
> > while (ret >= 0 && ret < max_blocks) {
> > Index: linux-2.6.26git6/fs/ext4/inode.c
> > ===================================================================
> > --- linux-2.6.26git6.orig/fs/ext4/inode.c 2008-07-28 22:53:21.000000000 -0700
> > +++ linux-2.6.26git6/fs/ext4/inode.c 2008-07-29 17:45:43.000000000 -0700
> > @@ -1,5 +1,5 @@
> > /*
> > - * linux/fs/ext4/inode.c
> > + * linux/fs/ext4/inode.c
> > *
> > * Copyright (C) 1992, 1993, 1994, 1995
> > * Remy Card ([email protected])
> > @@ -954,15 +954,6 @@ out:
> >
> > /* Maximum number of blocks we map for direct IO at once. */
> > #define DIO_MAX_BLOCKS 4096
> > -/*
> > - * Number of credits we need for writing DIO_MAX_BLOCKS:
> > - * We need sb + group descriptor + bitmap + inode -> 4
> > - * For B blocks with A block pointers per block we need:
> > - * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect).
> > - * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25.
> > - */
> > -#define DIO_CREDITS 25
> > -
> >
> > /*
> > *
> > @@ -1082,13 +1073,13 @@ static int ext4_get_block(struct inode *
> > handle_t *handle = ext4_journal_current_handle();
> > int ret = 0, started = 0;
> > unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> > + int dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
> >
> > if (create && !handle) {
> > /* Direct IO write... */
> > if (max_blocks > DIO_MAX_BLOCKS)
> > max_blocks = DIO_MAX_BLOCKS;
> > - handle = ext4_journal_start(inode, DIO_CREDITS +
> > - 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
> > + handle = ext4_journal_start(inode, dio_credits);
> > if (IS_ERR(handle)) {
> > ret = PTR_ERR(handle);
> > goto out;
> > @@ -1267,7 +1258,7 @@ static int ext4_write_begin(struct file
> > struct page **pagep, void **fsdata)
> > {
> > struct inode *inode = mapping->host;
> > - int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
> > + int ret, needed_blocks = ext4_writepages_trans_blocks(inode, 1);
> > handle_t *handle;
> > int retries = 0;
> > struct page *page;
> > @@ -2153,20 +2144,6 @@ static int ext4_da_writepage(struct page
> >
> > return ret;
> > }
> > -
> > -/*
> > - * For now just follow the DIO way to estimate the max credits
> > - * needed to write out EXT4_MAX_WRITEBACK_PAGES.
> > - * todo: need to calculate the max credits need for
> > - * extent based files, currently the DIO credits is based on
> > - * indirect-blocks mapping way.
> > - *
> > - * Probably should have a generic way to calculate credits
> > - * for DIO, writepages, and truncate
> > - */
> > -#define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
> > -#define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS
> > -
> > static int ext4_da_writepages(struct address_space *mapping,
> > struct writeback_control *wbc)
> > {
> > @@ -2176,22 +2153,24 @@ static int ext4_da_writepages(struct add
> > int ret = 0;
> > long to_write;
> > loff_t range_start = 0;
> > + int blocks_per_page = PAGE_CACHE_SIZE >> inode->i_blkbits;
> > + int max_credit_blocks = ext4_journal_max_transaction_buffers(inode);
> > + int need_credits_per_page = ext4_writepages_trans_blocks(inode, 1);
> > + int max_writeback_pages = (max_credit_blocks / blocks_per_page) / need_credits_per_page;
> >
> > /*
> > * No pages to write? This is mainly a kludge to avoid starting
> > * a transaction for special inodes like journal inode on last iput()
> > * because that could violate lock ordering on umount
> > */
> > - if (!mapping->nrpages)
> > + if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> > return 0;
> >
> > - /*
> > - * Estimate the worse case needed credits to write out
> > - * EXT4_MAX_BUF_BLOCKS pages
> > - */
> > - needed_blocks = EXT4_MAX_WRITEBACK_CREDITS;
> > + if (wbc->nr_to_write > mapping->nrpages)
> > + wbc->nr_to_write = mapping->nrpages;
> >
> > to_write = wbc->nr_to_write;
> > +
> > if (!wbc->range_cyclic) {
> > /*
> > * If range_cyclic is not set force range_cont
> > @@ -2202,10 +2181,31 @@ static int ext4_da_writepages(struct add
> > }
> >
> > while (!ret && to_write) {
> > + /*
> > + * set the max dirty pages could be write at a time
> > + * to fit into the reserved transaction credits
> > + */
> > + if (wbc->nr_to_write > max_writeback_pages)
> > + wbc->nr_to_write = max_writeback_pages;
> > +
> > + /*
> > + * Estimate the worse case needed credits to write out
> > + * to_write pages
> > + */
> > + needed_blocks = ext4_writepages_trans_blocks(inode,
> > + wbc->nr_to_write);
> > + while (needed_blocks > max_credit_blocks) {
> > + wbc->nr_to_write --;
> > + needed_blocks = ext4_writepages_trans_blocks(inode,
> > + wbc->nr_to_write);
> > + }
> > /* start a new transaction*/
> > handle = ext4_journal_start(inode, needed_blocks);
> > if (IS_ERR(handle)) {
> > ret = PTR_ERR(handle);
> > + printk(KERN_EMERG "%s: Not enough credits to flush %ld pages\n", __func__,
> > + wbc->nr_to_write);
> > + dump_stack();
> > goto out_writepages;
> > }
> > if (ext4_should_order_data(inode)) {
> > @@ -2221,12 +2221,6 @@ static int ext4_da_writepages(struct add
> > }
> >
> > }
> > - /*
> > - * set the max dirty pages could be write at a time
> > - * to fit into the reserved transaction credits
> > - */
> > - if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
> > - wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
> >
> > to_write -= wbc->nr_to_write;
> > ret = mpage_da_writepages(mapping, wbc,
> > @@ -2587,7 +2581,8 @@ static int __ext4_journalled_writepage(s
> > * references to buffers so we are safe */
> > unlock_page(page);
> >
> > - handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
> > + handle = ext4_journal_start(inode,
> > + ext4_writepages_trans_blocks(inode, 1));
> > if (IS_ERR(handle)) {
> > ret = PTR_ERR(handle);
> > goto out;
> > @@ -4271,20 +4266,20 @@ int ext4_getattr(struct vfsmount *mnt, s
> > /*
> > * How many blocks doth make a writepage()?
> > *
> > - * With N blocks per page, it may be:
> > - * N data blocks
> > + * With N blocks per page, and P pages, it may be:
> > + * N*P data blocks
> > * 2 indirect block
> > * 2 dindirect
> > * 1 tindirect
> > - * N+5 bitmap blocks (from the above)
> > - * N+5 group descriptor summary blocks
> > + * N*P+5 bitmap blocks (from the above)
> > + * N*P+5 group descriptor summary blocks
> > * 1 inode block
> > * 1 superblock.
> > * 2 * EXT4_SINGLEDATA_TRANS_BLOCKS for the quote files
> > *
> > - * 3 * (N + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
> > + * 3 * (N*P + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
> > *
> > - * With ordered or writeback data it's the same, less the N data blocks.
> > + * With ordered or writeback data it's the same, less the N*P data blocks.
> > *
> > * If the inode's direct blocks can hold an integral number of pages then a
> > * page cannot straddle two indirect blocks, and we can only touch one indirect
> > @@ -4295,30 +4290,49 @@ int ext4_getattr(struct vfsmount *mnt, s
> > * block and work out the exact number of indirects which are touched. Pah.
> > */
> >
> > -int ext4_writepage_trans_blocks(struct inode *inode)
> > +static int ext4_writeblocks_trans_credits_old(struct inode *inode, int nrblocks)
> > {
> > - int bpp = ext4_journal_blocks_per_page(inode);
> > - int indirects = (EXT4_NDIR_BLOCKS % bpp) ? 5 : 3;
> > + int indirects = (EXT4_NDIR_BLOCKS % nrblocks) ? 5 : 3;
> > int ret;
> >
> > - if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
> > - return ext4_ext_writepage_trans_blocks(inode, bpp);
> > -
> > if (ext4_should_journal_data(inode))
> > - ret = 3 * (bpp + indirects) + 2;
> > + ret = 3 * (nrblocks + indirects) + 2;
> > else
> > - ret = 2 * (bpp + indirects) + 2;
> > + ret = 2 * nrblocks + 3* indirects + 2;
> >
> > -#ifdef CONFIG_QUOTA
> > /* We know that structure was already allocated during DQUOT_INIT so
> > * we will be updating only the data blocks + inodes */
> > ret += 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
> > -#endif
> >
> > return ret;
> > }
> >
> > /*
> > + * Calulate the total number of credits to reserve to fit
> > + * the modification of @num pages into a single transaction
> > + *
> > + * This could be called via ext4_write_begin() or later
> > + * ext4_da_writepages() in delalyed allocation case.
> > + *
> > + * In both case it's possible that we could allocating multiple
> > + * chunks of blocks. We need to consider the worse case, when
> > + * one new block per extent.
> > + *
> > + * For Direct IO and fallocate, the journal credits reservation
> > + * is based on one single extent allocation, so they could use
> > + * EXT4_DATA_TRANS_BLOCKS to get the needed credit to log a single
> > + * chunk of allocation needs.
> > + */
> > +int ext4_writepages_trans_blocks(struct inode *inode, int nrpages)
> > +{
> > + int bpp = ext4_journal_blocks_per_page(inode);
> > + int nrblocks = nrpages * bpp;
> > +
> > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > + return ext4_writeblocks_trans_credits_old(inode, nrblocks);
> > + return ext4_ext_writeblocks_trans_credits(inode, nrblocks);
> > +}
> > +/*
> > * The caller must have previously called ext4_reserve_inode_write().
> > * Give this, we know that the caller already has write access to iloc->bh.
> > */
> > Index: linux-2.6.26git6/fs/ext4/migrate.c
> > ===================================================================
> > --- linux-2.6.26git6.orig/fs/ext4/migrate.c 2008-07-13 14:51:29.000000000 -0700
> > +++ linux-2.6.26git6/fs/ext4/migrate.c 2008-07-28 22:53:21.000000000 -0700
> > @@ -52,9 +52,11 @@ static int finish_range(handle_t *handle
> > * Since we are doing this in loop we may accumalate extra
> > * credit. But below we try to not accumalate too much
> > * of them by restarting the journal.
> > + *
> > + * extra 4 credits for: 1 superblock, 1 inode block, 2 quotas
> > */
> > - needed = ext4_ext_calc_credits_for_insert(inode, path);
> > -
> > + needed = ext4_ext_calc_credits_for_single_extent(inode, path) + 2
> > + + 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
> > /*
> > * Make sure the credit we accumalated is not really high
> > */
> > Index: linux-2.6.26git6/fs/ext4/ext4_extents.h
> > ===================================================================
> > --- linux-2.6.26git6.orig/fs/ext4/ext4_extents.h 2008-07-28 22:47:22.000000000 -0700
> > +++ linux-2.6.26git6/fs/ext4/ext4_extents.h 2008-07-28 22:55:40.000000000 -0700
> > @@ -216,7 +216,8 @@ extern int ext4_ext_calc_metadata_amount
> > extern ext4_fsblk_t idx_pblock(struct ext4_extent_idx *);
> > extern void ext4_ext_store_pblock(struct ext4_extent *, ext4_fsblk_t);
> > extern int ext4_extent_tree_init(handle_t *, struct inode *);
> > -extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
> > +extern int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
> > + struct ext4_ext_path *path);
> > extern int ext4_ext_try_to_merge(struct inode *inode,
> > struct ext4_ext_path *path,
> > struct ext4_extent *);
> > Index: linux-2.6.26git6/fs/ext4/ext4_jbd2.h
> > ===================================================================
> > --- linux-2.6.26git6.orig/fs/ext4/ext4_jbd2.h 2008-07-28 22:47:22.000000000 -0700
> > +++ linux-2.6.26git6/fs/ext4/ext4_jbd2.h 2008-07-28 22:53:21.000000000 -0700
> > @@ -231,4 +231,14 @@ static inline int ext4_should_writeback_
> > return 0;
> > }
> >
> > +static inline int ext4_journal_max_transaction_buffers(struct inode *inode)
> > +{
> > + /*
> > + * max transaction buffers
> > + * calculation based on
> > + * journal->j_max_transaction_buffers = journal->j_maxlen / 4;
> > + */
> > + return (EXT4_JOURNAL(inode))->j_maxlen / 4;
> > +}
> > +
> > #endif /* _EXT4_JBD2_H */
> >
> >
> >
>
On Thu, Jul 31, 2008 at 11:07:11AM -0700, Mingming Cao wrote:
>
> Looks like a 1k blocksize ext4, I have tested 1k briefly it seems okay
> for single test. I will try bonnie myself. The stack shows there isn't
> enought credit to delete an file. But the journal credit fix mostly fix
> the code path on writepages(), so it should not affact the unlink case.
Yep, different bug. I think this patch should fix things.
There's a larger question here, which is whether the extents code
should really be requesting only a tiny amount of transaction credits
at a time. The advantage is that doing so reduces the chance of
provoking a transaction commit before its time. On the other hand, for
a very fragmented file with lots of extents, this will cause lots of
extra calls to jbd2_journal_extend(), which does end up taking a bit
more CPU time as well as grabbing both the journal and the transaction
spin locks.
The original non-extents truncate code massively overestimates the
number of credits needed to complete the truncate (to the point where
it is probably needlessly causing transactions to close early), but it
means many fewer calls to jbd2_journal_extend().
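To make the trade-off concrete, here is a rough sketch that counts how often a handle would have to be topped up while removing extent leaves. Everything in it is hypothetical: toy_count_extends(), the per-leaf cost and the chunk size are illustration values, not numbers taken from the ext4 code.

```c
/*
 * Toy model, not kernel code: count how many "extend" calls are needed
 * to remove `leaves` extent leaves when each leaf costs `cost` credits,
 * the handle starts with `initial` credits, and every top-up asks for
 * `chunk` more credits.
 */
int toy_count_extends(int leaves, int cost, int initial, int chunk)
{
	int credits = initial;
	int extends = 0;

	if (chunk <= 0 || cost < 0)
		return -1;	/* nonsensical input for this model */

	for (int i = 0; i < leaves; i++) {
		/* Top up until this leaf can be paid for. */
		while (credits < cost) {
			credits += chunk;
			extends++;
		}
		credits -= cost;
	}
	return extends;
}
```

With 100 leaves at 3 credits each, asking for exactly 3 credits per top-up gives 100 extend calls, while asking for 64 at a time gives 5, at the price of holding more unused credits on the handle.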
- Ted
commit 0e71ff5fc4cf98c44014a1d3c8ccffed846e7ee1
Author: Theodore Ts'o <[email protected]>
Date: Fri Aug 1 01:40:08 2008 -0400
ext4: Fix lack of credits BUG() when deleting a badly fragmented inode
The extents codepath for ext4_truncate() requests journal transaction
credits in very small chunks, requesting only what is needed. This
means there may not be enough credits left on the transaction handle
after ext4_truncate() returns, and then when ext4_delete_inode() tries
to finish up its work, it may not have enough transaction credits,
causing a BUG() oops in the jbd2 core.
Signed-off-by: "Theodore Ts'o" <[email protected]>
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c7fb647..6d27e78 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -215,6 +215,18 @@ void ext4_delete_inode (struct inode * inode)
inode->i_size = 0;
if (inode->i_blocks)
ext4_truncate(inode);
+
+ /*
+ * ext4_ext_truncate() doesn't reserve any slop when it
+ * restarts journal transactions; therefore there may not be
+ * enough credits left in the handle to remove the inode from
+ * the orphan list and set the dtime field.
+ */
+ if (ext4_ext_journal_restart(handle, 3)) {
+ ext4_journal_stop(handle);
+ goto no_delete;
+ }
+
/*
* Kill off the orphan record which ext4_truncate created.
* AKPM: I think this can be inside the above `if'.
On Friday, 01 August 2008 at 01:49 -0400, Theodore Tso wrote:
> On Thu, Jul 31, 2008 at 11:07:11AM -0700, Mingming Cao wrote:
> >
> > Looks like a 1k blocksize ext4, I have tested 1k briefly it seems okay
> > for single test. I will try bonnie myself. The stack shows there isn't
> > enought credit to delete an file. But the journal credit fix mostly fix
> > the code path on writepages(), so it should not affact the unlink case.
>
> Yep, different bug. I think this patch should fix things.
>
> There's a larger question here which is should the extents code really
> be requesting only a tiny amount of transaction credits at a time; the
> advantage is that by doing so, it reduces the chance of provoking a
> transaction commit before its time. On the other hand, for a very
> fragmented file with lots of extents, this will cause lots of extra
> calls jbd2_journal_extend(), which does end up taking a bit more cpu
> time as well as grabbing both the journal and the transaction spin
> locks.
>
> The original non-extents truncate code massively overestimates the
> number of credits needed to complete the truncate (to the point where
> it is probably needlessly causing transactions to close early) but it
> means many fewer calls to jbd2_journal_extend().
>
> - Ted
>
> commit 0e71ff5fc4cf98c44014a1d3c8ccffed846e7ee1
> Author: Theodore Ts'o <[email protected]>
> Date: Fri Aug 1 01:40:08 2008 -0400
>
> ext4: Fix lack of credits BUG() when deleting a badly fragmented inode
>
> The extents codepath for ext4_truncate() requests journal transaction
> credits in very small chunks, requesting only what is needed. This
> means there may not be enough credits left on the transaction handle
> after ext4_truncate() returns and then when ext4_delete_inode() tries
> finish up its work, it may not have enough transaction credits,
> causing a BUG() oops in the jbd2 core.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c7fb647..6d27e78 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -215,6 +215,18 @@ void ext4_delete_inode (struct inode * inode)
> inode->i_size = 0;
> if (inode->i_blocks)
> ext4_truncate(inode);
> +
> + /*
> + * ext4_ext_truncate() doesn't reserve any slop when it
> + * restarts journal transactions; therefore there may not be
> + * enough credits left in the handle to remove the inode from
> + * the orphan list and set the dtime field.
> + */
> + if (ext4_ext_journal_restart(handle, 3)) {
> + ext4_journal_stop(handle);
> + goto no_delete;
> + }
> +
> /*
> * Kill off the orphan record which ext4_truncate created.
> * AKPM: I think this can be inside the above `if'.
>
Thanks Ted. With this patch I no longer get a crash, but I do get
some messages:
EXT4 Inode eedb4228: orphan list check failed!
eedb4228: 0001f30a 00010004 00000000 00000000 ................
eedb4238: 0002b001 00010000 00001800 00008000 ................
eedb4248: 00012801 00009800 00008000 0001a801 .(..............
eedb4258: 00011800 00008000 00022801 00080000 .........(......
eedb4268: 00000000 00000000 00000000 00000000 ................
eedb4278: 00000000 00000000 00000000 f7d62134 ............4!..
eedb4288: f7d62134 00000000 00000000 0000001c 4!..............
eedb4298: 00000000 00004747 eedb42a0 eedb42a0 ....GG...B...B..
eedb42a8: 00000000 00000000 eedb42b0 eedb42b0 .........B...B..
eedb42b8: eedb42b8 eedb42b8 eedb42c0 eedb42c0 .B...B...B...B..
eedb42c8: 0000000c 00000000 00000000 00000000 ................
eedb42d8: 00000000 00000000 00000001 00000000 ................
eedb42e8: 00000000 00000000 00200000 4892c5f9 .......... ....H
eedb42f8: 35f4b4f0 4892c63c 03562cad 4892c63c ...5<..H.,V.<..H
eedb4308: 03562cad 0000000a 0009f002 00000000 .,V.............
eedb4318: 81a40000 00000202 00000001 00000000 ................
eedb4328: eedb4328 eedb4328 00000000 00000000 (C..(C..........
eedb4338: eedb4338 eedb4338 f8b85fe0 f8b85f60 8C..8C..._..`_..
eedb4348: f70b6a00 00000000 eedb4354 eedb42a8 .j......TC...B..
eedb4358: 00000000 00000020 00000000 00001616 .... ...........
eedb4368: 00000000 00000000 00010001 eedb4374 ............tC..
eedb4378: eedb4374 00000000 00000000 00000000 tC..............
eedb4388: 00040000 f8b861a0 001200d2 f55108ec .....a........Q.
eedb4398: 00000000 eedb439c eedb439c 00000000 .....C...C......
eedb43a8: eedb43a8 eedb43a8 00000000 00000000 .C...C..........
eedb43b8: aa93f4c8 00000000 00000000 eedb43c4 .............C..
eedb43c8: eedb43c4 00000001 00000000 eedb43d4 .C...........C..
eedb43d8: eedb43d4 00000040 00001050 00000000 .C..@...P.......
eedb43e8: 00000000 00000000 00000000 00000000 ................
eedb43f8: 00100100 00200200 eedb42a8 00000000 ...... ..B......
eedb4408: c15c754b 00000000 00000000 000fffff Ku\.............
eedb4418: fff00000 00000000 4892c5f9 35f4b4f0 ...........H...5
eedb4428: eedb4428 eedb4428 00000202 00000000 (D..(D..........
eedb4438: 00000000 00000000 00000000 0000ecec ................
Pid: 7092, comm: bonnie++ Not tainted 2.6.27-rc1 #6
[<f8b719ee>] ext4_destroy_inode+0x5e/0x70 [ext4dev]
[<c018f820>] destroy_inode+0x20/0x40
[<c018f4a4>] iput+0x44/0x50
[<c0186271>] do_unlinkat+0xd1/0x150
[<c017cdd6>] vfs_write+0x106/0x140
[<c02aa7b0>] tty_write+0x0/0x1e0
[<c017d2d1>] sys_write+0x41/0x70
[<c0102fc9>] sysenter_do_call+0x12/0x25
=======================
EXT4 Inode eed70228: orphan list check failed!
eed70228: 0001f30a 00010004 00000000 00000000 ................
eed70238: 00012112 00010000 00000094 00000002 .!..............
eed70248: 00012295 00000098 00000003 00012299 ."..........."..
eed70258: 0000009c 00000002 0001229d 00080000 ........."......
eed70268: 00000000 00000000 0000000c 00000000 ................
eed70278: 00000000 00000000 00000000 eedb4284 .............B..
eed70288: f7d62134 00000000 00000000 0000001c 4!..............
eed70298: 00000000 00000707 eed702a0 eed702a0 ................
eed702a8: 00000000 00000000 eed702b0 eed702b0 ................
eed702b8: eed702b8 eed702b8 eed702c0 eed702c0 ................
eed702c8: 0000000d 00000000 00000000 00000000 ................
eed702d8: 00000000 00000000 00000001 00000000 ................
eed702e8: 00000000 00000000 00200000 4892c5f9 .......... ....H
eed702f8: 35f4b4f0 4892c63f 0a441733 4892c63f ...5?..H3.D.?..H
eed70308: 0a441733 0000000a 000752f0 00000000 3.D......R......
eed70318: 81a40000 0000a2a2 00000001 00000000 ................
eed70328: eed70328 eed70328 00000000 00000000 (...(...........
eed70338: eed70338 eed70338 f8b85fe0 f8b85f60 8...8...._..`_..
eed70348: f70b6a00 00000000 eed70354 eed702a8 .j......T.......
eed70358: 00000000 00000020 00000000 00006464 .... .......dd..
eed70368: 00000000 00000000 00010001 eed70374 ............t...
eed70378: eed70374 00000000 00000000 00000000 t...............
eed70388: 0003cea2 f8b861a0 001200d2 f55108ec .....a........Q.
eed70398: 00000000 eed7039c eed7039c 00000000 ................
eed703a8: eed703a8 eed703a8 00000000 00000000 ................
eed703b8: aa93f4c9 00000000 00000000 eed703c4 ................
eed703c8: eed703c4 00000001 00000000 eed703d4 ................
eed703d8: eed703d4 00000040 fffff1db 00000000 ....@...........
eed703e8: 00000000 00000000 00000000 00000000 ................
eed703f8: 00100100 00200200 eed702a8 00000000 ...... .........
eed70408: c15c51c1 002101a9 00000000 000f39a8 .Q\...!......9..
eed70418: 000000e0 00000000 4892c5f9 35f4b4f0 ...........H...5
eed70428: eed70428 eed70428 0000d3d3 00000000 (...(...........
eed70438: 00000000 00000000 00000000 00003e3e ............>>..
Pid: 7092, comm: bonnie++ Not tainted 2.6.27-rc1 #6
[<f8b719ee>] ext4_destroy_inode+0x5e/0x70 [ext4dev]
[<c018f820>] destroy_inode+0x20/0x40
[<c018f4a4>] iput+0x44/0x50
[<c0186271>] do_unlinkat+0xd1/0x150
[<c017ab4c>] do_sys_open+0xbc/0xe0
[<c0102fc9>] sysenter_do_call+0x12/0x25
=======================
EXT4 Inode eed70450: orphan list check failed!
eed70450: 0001f30a 00010004 00000000 00000000 ................
eed70460: 00083001 00060000 00008000 00008000 .0..............
eed70470: 0006a801 00010000 00008000 00072801 .............(..
eed70480: 00018000 00008000 0007a801 00080000 ................
eed70490: 00000000 00000000 0000000d 00000000 ................
eed704a0: 00000000 00000000 00000000 eed70284 ................
eed704b0: f7d62134 00000000 00000000 0000001c 4!..............
eed704c0: 00000000 0000d8d8 eed704c8 eed704c8 ................
eed704d0: 00000000 00000000 eed704d8 eed704d8 ................
eed704e0: eed704e0 eed704e0 eed704e8 eed704e8 ................
eed704f0: 0000000e 00000000 00000000 00000000 ................
eed70500: 00000000 00000000 00000001 00000000 ................
eed70510: 00000000 00000000 00800000 4892c6b8 ...............H
eed70520: 36d5deb5 4892c6b9 12b32098 4892c6b9 ...6...H. .....H
eed70530: 12b32098 0000000a 000ae002 00000000 . ..............
eed70540: 81800000 0000ffff 00000001 0000c6c6 ................
eed70550: eed70550 eed70550 00000000 00000000 P...P...........
eed70560: eed70560 eed70560 f8b85fe0 f8b85f60 `...`...._..`_..
eed70570: f70b6a00 00000000 eed7057c eed704d0 .j......|.......
eed70580: 00000000 00000020 00000000 00005353 .... .......SS..
eed70590: 00000000 00000000 00010001 eed7059c ................
eed705a0: eed7059c 00000000 00000000 00000000 ................
eed705b0: 00040000 f8b861a0 001200d2 f55108ec .....a........Q.
eed705c0: 00000000 eed705c4 eed705c4 00000000 ................
eed705d0: eed705d0 eed705d0 00000000 00000000 ................
eed705e0: aa93f4ca 00000000 00000000 eed705ec ................
eed705f0: eed705ec 00000001 00000000 eed705fc ................
eed70600: eed705fc 00000040 00008200 00000000 ....@...........
eed70610: 00000000 00000000 00000000 00000000 ................
eed70620: 00100100 00200200 eed704d0 00000000 ...... .........
eed70630: 000013d3 001a6f39 00000000 000fff38 ....9o......8...
eed70640: 000000c8 00000000 4892c63e 3120bd63 ........>..Hc. 1
eed70650: eed70650 eed70650 00000707 00000000 P...P...........
eed70660: 00000000 00000000 eed70000 00000606 ................
Pid: 7092, comm: bonnie++ Not tainted 2.6.27-rc1 #6
[<f8b719ee>] ext4_destroy_inode+0x5e/0x70 [ext4dev]
[<c018f820>] destroy_inode+0x20/0x40
[<c018f4a4>] iput+0x44/0x50
[<c0186271>] do_unlinkat+0xd1/0x150
[<c017cdd6>] vfs_write+0x106/0x140
[<c017d2d1>] sys_write+0x41/0x70
[<c0102fc9>] sysenter_do_call+0x12/0x25
=======================
EXT4 Inode eb1f4000: orphan list check failed!
eb1f4000: 0001f30a 00010004 00000000 00000000 ................
eb1f4010: 00012113 00010000 00000058 00000028 .!......X...(...
eb1f4020: 000121d9 00000080 000001cd 00012301 .!...........#..
eb1f4030: 00000250 000000f1 000124d1 00080000 P........$......
eb1f4040: 00000000 00000000 0000000e 00000000 ................
eb1f4050: 00000000 00000000 00000000 eed704ac ................
eb1f4060: f7d62134 00000000 00000000 eb1f001c 4!..............
eb1f4070: 00000000 00003c3c eb1f4078 eb1f4078 ....<<..x@..x@..
eb1f4080: 00000000 00000000 eb1f4088 eb1f4088 .........@...@..
eb1f4090: eb1f4090 eb1f4090 eb1f4098 eb1f4098 .@...@...@...@..
eb1f40a0: 0000000f 00000000 00000000 00000000 ................
eb1f40b0: 00000000 00000000 00000001 00000000 ................
eb1f40c0: 00000000 00000000 00800000 4892c6b8 ...............H
eb1f40d0: 36d5deb5 4892c6bc 15cc967e 4892c6bc ...6...H~......H
eb1f40e0: 15cc967e 0000000a 000a0002 00000000 ~...............
eb1f40f0: 81800000 00007070 00000001 0000aeae ....pp..........
eb1f4100: eb1f4100 eb1f4100 00000000 00000000 .A...A..........
eb1f4110: eb1f4110 eb1f4110 f8b85fe0 f8b85f60 .A...A..._..`_..
eb1f4120: f70b6a00 00000000 eb1f412c eb1f4080 .j......,A...@..
eb1f4130: 00000000 00000020 00000000 00001414 .... ...........
eb1f4140: 00000000 00000000 00010001 eb1f414c ............LA..
eb1f4150: eb1f414c 00000000 00000000 00000000 LA..............
eb1f4160: 0003dc9f f8b861a0 001200d2 f55108ec .....a........Q.
eb1f4170: 0000ecec eb1f4174 eb1f4174 00000000 ....tA..tA......
eb1f4180: eb1f4180 eb1f4180 00000000 00000000 .A...A..........
eb1f4190: aa93f4cb 00000000 00000000 eb1f419c .............A..
eb1f41a0: eb1f419c 00000001 00000000 eb1f41ac .A...........A..
eb1f41b0: eb1f41ac 00000040 00006f91 00000000 .A..@....o......
eb1f41c0: 00000000 00000000 00000000 00000000 ................
eb1f41d0: 00100100 00200200 eb1f4080 00000000 ...... ..@......
eb1f41e0: 00001450 00238801 00000000 00091800 P.....#.........
eb1f41f0: 00008000 00000000 4892c641 33832bba ........A..H.+.3
eb1f4200: eb1f4200 eb1f4200 00000c0c 00000000 .B...B..........
eb1f4210: 00000000 00000000 00000000 00004747 ............GG..
Pid: 7092, comm: bonnie++ Not tainted 2.6.27-rc1 #6
[<f8b719ee>] ext4_destroy_inode+0x5e/0x70 [ext4dev]
[<c018f820>] destroy_inode+0x20/0x40
[<c018f4a4>] iput+0x44/0x50
[<c0186271>] do_unlinkat+0xd1/0x150
[<c017d2d1>] sys_write+0x41/0x70
[<c0102fc9>] sysenter_do_call+0x12/0x25
=======================
Fred
Ted, thanks for going through this,
在 2008-08-01五的 01:49 -0400,Theodore Tso写道:
> ext4: Fix lack of credits BUG() when deleting a badly fragmented inode
>
> The extents codepath for ext4_truncate() requests journal transaction
> credits in very small chunks, requesting only what is needed. This
> means there may not be enough credits left on the transaction handle
> after ext4_truncate() returns and then when ext4_delete_inode() tries
> finish up its work, it may not have enough transaction credits,
> causing a BUG() oops in the jbd2 core.
But ext4_delete_inode()'s transaction and ext4_truncate()'s transaction
are different; the ext4_truncate() transaction is nested inside
ext4_delete_inode().
Inside ext4_delete_inode(), the transaction logs the changes made by
ext4_orphan_del() and ext4_free_inode(). If I understand correctly,
ext4_orphan_del() needs credits to modify the superblock and the inode
block, and ext4_free_inode() needs credits to modify quota, xattr, the
inode bitmap, the group descriptor, the superblock, and the inode block.
Currently ext4_delete_inode() uses blocks_for_truncate(inode) to
calculate credits. By the time ext4_delete_inode() is called, i_blocks
seems to be 0 (I need to double check this), so the credits are just
EXT4_DATA_TRANS_BLOCKS, which covers the inode, superblock, xattr and
quota but misses the inode bitmap... I will double check before I post
a patch...
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c7fb647..6d27e78 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -215,6 +215,18 @@ void ext4_delete_inode (struct inode * inode)
> inode->i_size = 0;
> if (inode->i_blocks)
> ext4_truncate(inode);
> +
> + /*
> + * ext4_ext_truncate() doesn't reserve any slop when it
> + * restarts journal transactions; therefore there may not be
> + * enough credits left in the handle to remove the inode from
> + * the orphan list and set the dtime field.
> + */
> + if (ext4_ext_journal_restart(handle, 3)) {
> + ext4_journal_stop(handle);
> + goto no_delete;
> + }
> +
> /*
> * Kill off the orphan record which ext4_truncate created.
> * AKPM: I think this can be inside the above `if'.
On Fri, 2008-08-01 at 12:51 +0200, Frédéric Bohé wrote:
> Thanks Ted. With this patch, I don't have a crash anymore, but I have
> some messages :
>
> EXT4 Inode eedb4228: orphan list check failed!
> Pid: 7092, comm: bonnie++ Not tainted 2.6.27-rc1 #6
> [<f8b719ee>] ext4_destroy_inode+0x5e/0x70 [ext4dev]
> [<c018f820>] destroy_inode+0x20/0x40
> [<c018f4a4>] iput+0x44/0x50
> [<c0186271>] do_unlinkat+0xd1/0x150
> [<c017cdd6>] vfs_write+0x106/0x140
> [<c02aa7b0>] tty_write+0x0/0x1e0
> [<c017d2d1>] sys_write+0x41/0x70
> [<c0102fc9>] sysenter_do_call+0x12/0x25
The warning means the inode had not been removed from the orphan list
by the time destroy_inode() was called. I suspect a failure is being
returned from ext4_delete_inode()->ext4_orphan_del(), but I'm not sure
how Ted's change triggered this. Still looking.
Mingming
On Fri, Aug 01, 2008 at 11:03:15AM -0700, Mingming Cao wrote:
>
> 在 2008-08-01五的 01:49 -0400,Theodore Tso写道:
Playing with the Chinese I18N support, I see? :-)
Hmm....
> 在 2008-08-01五的 01:49 -0400,曹子德 写道:
> But, ext4_delete_inode()'s transaction and ext4_truncate()'s transaction
> are different, ext4_truncate() transaction is nested inside
> ext4_delete_inode.
Yes, the way transaction nesting works is that the amount of credits
requested in the nested transactions is completely ignored. The only
thing that matters is the amount of credits requested in the top-level
transaction handle, which in the case of ext4_delete_inode() is the
start_transaction(inode) call in ext4_delete_inode(). This amount is
blocks_for_truncate(inode).
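As a rough illustration of the nesting rule described above, the sketch below models a per-task "current handle" that a nested start simply reuses. The toy_* names and the struct layout are invented for this example; it is a simplified model, not the jbd2 implementation.

```c
#include <stddef.h>

/* Toy stand-in for a journal handle; NOT the kernel's handle_t. */
struct toy_handle {
	int credits;	/* credits reserved by the outermost start */
	int refcount;	/* number of nested starts sharing the handle */
};

/* Stand-in for the per-task pointer to the running handle. */
static struct toy_handle *toy_current;
static struct toy_handle toy_storage;

/* A nested start reuses the outer handle and ignores its own nblocks. */
struct toy_handle *toy_journal_start(int nblocks)
{
	if (toy_current) {
		toy_current->refcount++;
		return toy_current;
	}
	toy_storage.credits = nblocks;
	toy_storage.refcount = 1;
	toy_current = &toy_storage;
	return toy_current;
}

/* Only the outermost stop actually releases the handle. */
void toy_journal_stop(struct toy_handle *h)
{
	if (--h->refcount == 0)
		toy_current = NULL;
}
```

In this model, starting an outer handle with 10 credits and then an inner one asking for 500 leaves the shared handle at 10 credits, which is why only the top-level request matters.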
> Currently ext4_delete_inode() uses blocks_for_truncate(inode) to
> calculate credits, by the time ext4_delete_inode() is called, i_blocks
> seems to 0 (I need to double check this).
No, i_blocks will not be 0, because the inode will not have been
truncated yet. That happens later; ext4_delete_inode() calls
ext4_truncate(). The problem is that blocks_for_truncate() caps the
number of credits at EXT4_MAX_TRANS_DATA, which is 64 credits. So for
a very big, fragmented file, the number of credits will be
EXT4_MAX_TRANS_DATA + EXT4_DATA_TRANS_BLOCKS(sb).
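A minimal sketch of the capping behaviour, for illustration only. The cap of 64 comes from the paragraph above; the base value of 31, the lower bound of 2 and the conversion of i_blocks from 512-byte sectors to filesystem blocks are assumptions made for this example, and the toy_ names are hypothetical rather than the kernel's own.

```c
/* Illustration values; the real ones depend on the kernel configuration. */
#define TOY_MAX_TRANS_DATA	64	/* cap mentioned in the text above */
#define TOY_DATA_TRANS_BLOCKS	31	/* assumed base reservation */

/*
 * Toy version of a capped truncate estimate: i_blocks is counted in
 * 512-byte sectors, blocksize_bits is log2 of the filesystem block size.
 */
unsigned long toy_blocks_for_truncate(unsigned long long i_blocks,
				      unsigned int blocksize_bits)
{
	unsigned long long needed = i_blocks >> (blocksize_bits - 9);

	if (needed < 2)
		needed = 2;			/* assumed lower bound */
	if (needed > TOY_MAX_TRANS_DATA)
		needed = TOY_MAX_TRANS_DATA;	/* the cap */

	return TOY_DATA_TRANS_BLOCKS + needed;
}
```

Under these assumptions an empty file on a 1k-block filesystem reserves 33 credits, and any file larger than 64 blocks reserves 95 no matter how fragmented it is.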
If more credits are needed the truncate code is supposed to request
more. This is true both for the extents and non-extents case. The
difference is that in the non-extents case, the amount which is
requested to extend the transaction is determined by another call to
blocks_for_truncate(), which will be smaller because i_blocks will
decrease as part of the truncate operation. In the extents case, the
amount needed is calculated precisely each time a leaf is removed.
That, I think, is the problem. In practice it doesn't happen because
most of the time, the massive over-estimation of blocks_for_truncate()
of 64 credits is more than enough. But bonnie++ creates some very
highly fragmented files, apparently. (We later may want to see if we
can improve on its block layout). At least, that's what I think is
happening...
- Ted
I'm noticing that when I reboot, presumably either just before or just
after remounting the filesystem read-only, I'm getting a whole series
of
ext4_da_writepages: Not enough credits to flush N pages
.. where N is mostly 1, but occasionally will be 2.
I'm not sure why this is happening; I did run sync before rebooting,
and I haven't noticed any files written just before the reboot getting
corrupted, but there is something strange happening here.
- Ted
On Fri, Aug 01, 2008 at 03:10:15PM -0400, Theodore Tso wrote:
> > Currently ext4_delete_inode() uses blocks_for_truncate(inode) to
> > calculate credits, by the time ext4_delete_inode() is called, i_blocks
> > seems to 0 (I need to double check this).
You were right; I didn't realize that bonnie's create/delete test was
working on zero-length files. So yes, there is no need to call
truncate. I'm still confused why we're running out of credits, given
that we allocate EXT4_DATA_TRANS_BLOCKS(sb) credits, which is:
#define EXT4_DATA_TRANS_BLOCKS(sb) (EXT4_SINGLEDATA_TRANS_BLOCKS(sb) + \
EXT4_XATTR_TRANS_BLOCKS - 2 + \
2*EXT4_QUOTA_TRANS_BLOCKS(sb))
EXT4_SINGLEDATA_TRANS_BLOCKS is 27 credits when extents are enabled,
and XATTR_TRANS_BLOCKS is another 6, so we are reserving 31 credits
with quotas disabled. Given that bonnie++ doesn't create extended
attributes, and the 27 credits were for 5 levels of extent tree, why
did we run out?
We can make ext4_delete_inode() request an extra 3 blocks (one for the
inode bitmap, and two to update the orphaned inode linked list), but
I'm not sure why the 31 credits wasn't enough.
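The arithmetic above can be checked with a small sketch. The TOY_ macros mirror the shape of the EXT4_DATA_TRANS_BLOCKS definition quoted in this message, with the values stated in the text (27, 6, and 0 for quotas disabled) plugged in as plain constants; toy_delete_inode_credits() is a hypothetical helper that adds the three extra blocks suggested above.

```c
/* Values taken from the discussion above; quotas assumed disabled. */
#define TOY_SINGLEDATA_TRANS_BLOCKS	27
#define TOY_XATTR_TRANS_BLOCKS		6
#define TOY_QUOTA_TRANS_BLOCKS		0

/* Same shape as the EXT4_DATA_TRANS_BLOCKS(sb) macro quoted above. */
#define TOY_DATA_TRANS_BLOCKS (TOY_SINGLEDATA_TRANS_BLOCKS + \
			       TOY_XATTR_TRANS_BLOCKS - 2 + \
			       2 * TOY_QUOTA_TRANS_BLOCKS)

/*
 * Hypothetical helper: base reservation plus one block for the inode
 * bitmap and two for the orphaned inode linked list.
 */
int toy_delete_inode_credits(void)
{
	return TOY_DATA_TRANS_BLOCKS + 3;
}
```

With these inputs the base reservation works out to 27 + 6 - 2 = 31, and the padded request to 34.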
In any case, I figured out why my patch wasn't enough. There was a bug
in ext4_ext_journal_restart:
int ext4_ext_journal_restart(handle_t *handle, int needed)
{
	int err;

	if (handle->h_buffer_credits > needed)
		return 0;
	err = ext4_journal_extend(handle, needed);
	if (err)
		return err;
	return ext4_journal_restart(handle, needed);
}
This is buggy; ext4_journal_extend returns < 0 on an error, 0 if the
handle was successfully extended without needing a journal restart,
and > 0 if the ext4_journal_restart() needs to be called. So the
current code returns a failure and doesn't restart the journal when it
is needed, and calls ext4_journal_restart() needlessly when it is not
needed and the handle could be extended without closing the current
transaction.
The fix is a simple one-liner:
int ext4_ext_journal_restart(handle_t *handle, int needed)
{
	int ret;

	if (handle->h_buffer_credits > needed)
		return 0;
	ret = ext4_journal_extend(handle, needed);
	if (ret <= 0)
		return ret;
	return ext4_journal_restart(handle, needed);
}
This seems to indicate ext4_ext_journal_restart() has never been
called in anger by the ext4_ext_truncate() code. We may want to
double check it with a really big, mongo extent tree and make sure it
does the right thing one of these days.
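For anyone who wants to exercise the three-way return convention outside the kernel, here is a self-contained sketch with mock extend and restart functions. The toy_ names, the struct fields and the use of -EIO for the error case are all assumptions made for the example; only the "< 0 error, 0 extended, > 0 restart needed" convention is taken from the description above.

```c
#include <errno.h>

/* Mock handle; the fields are invented for this illustration. */
struct toy_handle {
	int credits;	/* credits left on the handle */
	int aborted;	/* nonzero makes extend fail (hypothetical) */
	int txn_room;	/* credits the running transaction can still grant */
	int restarts;	/* how many restarts were performed */
};

/* < 0: error, 0: extended in place, > 0: caller must restart. */
int toy_extend(struct toy_handle *h, int nblocks)
{
	if (h->aborted)
		return -EIO;
	if (nblocks > h->txn_room)
		return 1;
	h->txn_room -= nblocks;
	h->credits += nblocks;
	return 0;
}

/* Mock restart: begin a fresh transaction with nblocks credits. */
int toy_restart(struct toy_handle *h, int nblocks)
{
	h->credits = nblocks;
	h->restarts++;
	return 0;
}

/* Same control flow as the corrected function above. */
int toy_ext_journal_restart(struct toy_handle *h, int needed)
{
	int ret;

	if (h->credits > needed)
		return 0;
	ret = toy_extend(h, needed);
	if (ret <= 0)
		return ret;	/* error, or extended without a restart */
	return toy_restart(h, needed);
}
```

In this model a handle only restarts when the running transaction has no room left, and an error from the extend step is passed straight back to the caller.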
- Ted
On Fri, Aug 01, 2008 at 03:29:47PM -0400, Theodore Tso wrote:
> I'm noticing that when I reboot, presumably either just before or just
> after remounting the filesystem read-only, I'm getting a whole series
> of
>
> ext4_da_writepages: Not enough credits to flush N pages
>
> .. where N is mostly 1, but occasionally will be 2.
>
> I'm not sure why this is happening; I did run sync before rebooting,
> and I haven't noticed any files written just before the reboot getting
> corrupted, but there is something strange happening here.
I instrumented the check, and it is returning an error -30 --- EROFS.
So it's not the fault of the journal credits patch, but I'm a bit concerned
about why ext4_da_writepages is getting called right before the system is
rebooted, and presumably after the filesystem is remounted read/only.
- Ted
On Friday, 01 August 2008 at 20:03 -0400, Theodore Tso wrote:
> In any case, I figured out why my patch wasn't enough. There was a bug
> in ext4_ext_journal_restart:
>
> int ext4_ext_journal_restart(handle_t *handle, int needed)
> {
> int err;
>
> if (handle->h_buffer_credits > needed)
> return 0;
> err = ext4_journal_extend(handle, needed);
> if (err)
> return err;
> return ext4_journal_restart(handle, needed);
> }
>
> This is buggy; ext4_journal_extend returns < 0 on an error, 0 if the
> handle was successfully extended without needing a journal restart,
> and > 0 if the ext4_journal_restart() needs to be called. So the
> current code returns a failure and doesn't restart the journal when it
> is needed, and calls ext4_journal_restart() needlessly when it is not
> needed and the handle could be extended without closing the current
> transaction.
>
> The fix is a simple one-liner:
>
> int ext4_ext_journal_restart(handle_t *handle, int needed)
> {
> int ret;
>
> if (handle->h_buffer_credits > needed)
> return 0;
> ret = ext4_journal_extend(handle, needed);
> if (ret <= 0)
> return ret;
> return ext4_journal_restart(handle, needed);
> }
>
> This seems to indicate ext4_ext_journal_restart() has never been
> called in anger by the ext4_ext_truncate() code. We may want to
> double check it with a really big, mongo extent tree and make sure it
> does the right thing one of these days.
>
> - Ted
>
FYI, I have just tried this fix, and I don't see the "orphan list check
failed" messages any more.
Fred
On Mon, Aug 04, 2008 at 01:23:53PM +0200, Frédéric Bohé wrote:
>
> FYI, I have just tried this fix, and i don't see the "orphan list check
> failed" messages any more.
>
Thanks for checking! I was able to reproduce the problem myself, and
confirm that it made it go away, so the patch is already in the patch
queue, and in fact has already been pushed to Linus.
- Ted
This is a rework of the journal credits fix patches in the ext4 patch queue.
The patch series contains
- patch 1: helper functions for journal credits calculation, and a fix for
writepage/write_begin on nonextent files
- patch 2: journal credit fix for writepage/write_begin on extent files,
and migration
- patch 3: credit fix for dio, fallocate
- patch 4: rebase ext4_da_writepages_rework patch
- patch 5: credit fix for delalloc writepages
- patch 6: credit fix for defrag
Mingming
Ext4: journal credits calculation cleanup and fix that for nonextent writepage
From: Mingming Cao <[email protected]>
When calculating how many journal credits are needed to modify a chunk of
data, we need to account for the superblock, inode block, quota blocks and
xattr block, the indirect/index blocks, and also the group bitmap and group
descriptor blocks for new allocations (covering both data and
indirect/index blocks). Many places in ext4 do this calculation on their
own and often miss one or two metadata blocks; they also often assume a
single block allocation and do not consider the case of multiple chunks
of allocation.
This patch cleans up the current journal credit code and
provides common helper functions to calculate the journal credits, to
be used for writepage, writepages, DIO, fallocate, migration, defrag, and for
both nonextent and extent files.
This patch modifies the writepage/write_begin credit calculation for
nonextent files to use the new helper function. It also fixes the problem
that writepage on nonextent files did not consider the case blocksize < pagesize,
which could require multiple block allocations in a single transaction.
Signed-off-by: Mingming Cao <[email protected]>
---
---
fs/ext4/ext4.h | 1
fs/ext4/ext4_jbd2.h | 8 +++
fs/ext4/inode.c | 136 ++++++++++++++++++++++++++++++++++++++--------------
3 files changed, 110 insertions(+), 35 deletions(-)
Index: linux-2.6.27-rc1/fs/ext4/inode.c
===================================================================
--- linux-2.6.27-rc1.orig/fs/ext4/inode.c 2008-08-11 15:11:51.000000000 -0700
+++ linux-2.6.27-rc1/fs/ext4/inode.c 2008-08-11 15:11:55.000000000 -0700
@@ -4333,56 +4333,125 @@ int ext4_getattr(struct vfsmount *mnt, s
}
/*
- * How many blocks doth make a writepage()?
+ * Account for block groups bitmaps and block group descriptor blocks
+ * if modify datablocks and indexing blocks
+ * worse case, the nrblocks and indexs blocks spread
+ * over different block groups
*
- * With N blocks per page, it may be:
- * N data blocks
- * 2 indirect block
- * 2 dindirect
- * 1 tindirect
- * N+5 bitmap blocks (from the above)
- * N+5 group descriptor summary blocks
- * 1 inode block
- * 1 superblock.
- * 2 * EXT4_SINGLEDATA_TRANS_BLOCKS for the quote files
- *
- * 3 * (N + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
- *
- * With ordered or writeback data it's the same, less the N data blocks.
- *
- * If the inode's direct blocks can hold an integral number of pages then a
- * page cannot straddle two indirect blocks, and we can only touch one indirect
- * and dindirect block, and the "5" above becomes "3".
+ * Also account for superblock, inode, quota and xattr blocks
+ */
+int ext4_meta_trans_blocks(struct inode* inode, int nrblocks, int idxblocks)
+{
+ int groups, gdpblocks;
+ int ret = 0;
+
+ groups = nrblocks + idxblocks;
+ gdpblocks = groups;
+ if (groups > EXT4_SB(inode->i_sb)->s_groups_count)
+ groups = EXT4_SB(inode->i_sb)->s_groups_count;
+ if (groups > EXT4_SB(inode->i_sb)->s_gdb_count)
+ gdpblocks = EXT4_SB(inode->i_sb)->s_gdb_count;
+
+ /* bitmaps and block group descriptor blocks */
+ ret += groups + gdpblocks;
+
+ ret += idxblocks;
+
+ /* journalled mode, include buffer to modify data blocks */
+ if (ext4_should_journal_data(inode))
+ ret += nrblocks;
+
+ /* Blocks for super block, inode, quota and xattr blocks */
+ ret += EXT4_META_TRANS_BLOCKS(inode->i_sb);
+
+ return ret;
+}
+
+static int ext4_indirect_trans_blocks(struct inode *inode, int nrblocks,
+ int chunk)
+{
+ int indirects;
+
+ /* if nrblocks are contigous */
+ if (chunk) {
+ /*
+ * With N contigous data blocks, it need at most
+ * N/EXT4_ADDR_PER_BLOCK(inode->i_sb) indirect blocks
+ * 2 dindirect blocks
+ * 1 tindirect block
+ */
+ indirects = nrblocks / EXT4_ADDR_PER_BLOCK(inode->i_sb);
+ return indirects + 3;
+ }
+ /*
+ * if nrblocks are not contigous, worse case, each block touch
+ * a indirect block, and each indirect block touch a double indirect
+ * block, plus a triple indirect block
+ */
+ indirects = nrblocks * 2 + 1;
+ return indirects;
+}
+/*
+ * How many journal blocks are need to modify N blocks contigous data()?
+ *
+ * It need to account indirect blocks, data blocks, and
+ * bitmap blocks and group descriptor blocks.
*
* This still overestimates under most circumstances. If we were to pass the
* start and end offsets in here as well we could do block_to_path() on each
* block and work out the exact number of indirects which are touched. Pah.
*/
-int ext4_writepage_trans_blocks(struct inode *inode)
+static int ext4_writeblocks_trans_credits_old(struct inode *inode, int nrblocks,
+ int chunk)
{
- int bpp = ext4_journal_blocks_per_page(inode);
- int indirects = (EXT4_NDIR_BLOCKS % bpp) ? 5 : 3;
+ int indirects;
int ret;
- if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
- return ext4_ext_writepage_trans_blocks(inode, bpp);
-
- if (ext4_should_journal_data(inode))
- ret = 3 * (bpp + indirects) + 2;
- else
- ret = 2 * (bpp + indirects) + 2;
+ /*
+ * How many index blocks need to touch to modify nrblocks?
+ * The "Chunk" flag indicating whether the nrblocks is
+ * physically contigous on disk
+ *
+ * For Direct IO and fallocate, they calls get_block to allocate
+ * one single extent at a time, so they could set the "Chunk" flag
+ */
+ indirects = ext4_indirect_trans_blocks(inode, nrblocks, chunk);
-#ifdef CONFIG_QUOTA
- /* We know that structure was already allocated during DQUOT_INIT so
- * we will be updating only the data blocks + inodes */
- ret += 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
-#endif
+ /* Account for block group bitmaps and block groups
+ * descriptors.Worse case, the nrblocks+indirects blocks spread
+ * over different block groups
+ */
+ ret = ext4_meta_trans_blocks(inode, nrblocks, indirects);
return ret;
}
/*
+ * Calulate the total number of credits to reserve to fit
+ * the modification of a single pages into a single transaction
+ *
+ * This could be called via ext4_write_begin() or later
+ * ext4_da_writepages() in delalyed allocation case.
+ *
+ * In both case it's possible that we could allocating multiple
+ * chunks of blocks. We need to consider the worse case, when
+ * one new block per extent.
+ *
+ * For Direct IO and fallocate, the journal credits reservation
+ * is based on one single extent allocation, so they could use
+ * EXT4_DATA_TRANS_BLOCKS to get the needed credit to log a single
+ * chunk of allocation needs.
+ */
+int ext4_writepage_trans_blocks(struct inode *inode)
+{
+ int bpp = ext4_journal_blocks_per_page(inode);
+
+ if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+ return ext4_writeblocks_trans_credits_old(inode, bpp, 0);
+ return ext4_ext_writepage_trans_blocks(inode, bpp);
+}
+/*
* The caller must have previously called ext4_reserve_inode_write().
* Give this, we know that the caller already has write access to iloc->bh.
*/
Index: linux-2.6.27-rc1/fs/ext4/ext4.h
===================================================================
--- linux-2.6.27-rc1.orig/fs/ext4/ext4.h 2008-08-11 15:11:51.000000000 -0700
+++ linux-2.6.27-rc1/fs/ext4/ext4.h 2008-08-11 15:11:55.000000000 -0700
@@ -1072,6 +1072,7 @@ extern void ext4_set_inode_flags(struct
extern void ext4_get_inode_flags(struct ext4_inode_info *);
extern void ext4_set_aops(struct inode *inode);
extern int ext4_writepage_trans_blocks(struct inode *);
+extern int ext4_meta_trans_blocks(struct inode *, int nrblocks, int idxblocks);
extern int ext4_block_truncate_page(handle_t *handle,
struct address_space *mapping, loff_t from);
extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page);
Index: linux-2.6.27-rc1/fs/ext4/ext4_jbd2.h
===================================================================
--- linux-2.6.27-rc1.orig/fs/ext4/ext4_jbd2.h 2008-08-11 15:11:51.000000000 -0700
+++ linux-2.6.27-rc1/fs/ext4/ext4_jbd2.h 2008-08-11 15:11:55.000000000 -0700
@@ -51,6 +51,14 @@
EXT4_XATTR_TRANS_BLOCKS - 2 + \
2*EXT4_QUOTA_TRANS_BLOCKS(sb))
+/*
+ * Define the number of metadata blocks we need to account to modify data.
+ *
+ * This include super block, inode block, quota blocks and xattr blocks
+ */
+#define EXT4_META_TRANS_BLOCKS(sb) (EXT4_XATTR_TRANS_BLOCKS + \
+ 2*EXT4_QUOTA_TRANS_BLOCKS(sb))
+
/* Delete operations potentially hit one directory's namespace plus an
* entire inode, plus arbitrary amounts of bitmap/indirection data. Be
* generous. We can grow the delete transaction later if necessary. */
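To make the arithmetic of the two helpers added by the patch above easier to follow, here is a userspace sketch that mirrors ext4_indirect_trans_blocks() and ext4_meta_trans_blocks(). It is an illustration only: struct toy_sb and all toy_* names are hypothetical, the journal_data argument stands in for ext4_should_journal_data(), and meta_blocks stands in for whatever EXT4_META_TRANS_BLOCKS(sb) evaluates to on a given filesystem.

```c
#include <assert.h>

/* Hypothetical stand-in for the superblock fields the helpers read. */
struct toy_sb {
	int groups_count;	/* models s_groups_count */
	int gdb_count;		/* models s_gdb_count */
	int meta_blocks;	/* models EXT4_META_TRANS_BLOCKS(sb) */
	int addr_per_block;	/* models EXT4_ADDR_PER_BLOCK(sb) */
};

/* Mirrors ext4_indirect_trans_blocks(): indirect blocks for nrblocks. */
static int toy_indirect_trans_blocks(const struct toy_sb *sb, int nrblocks,
				     int chunk)
{
	assert(nrblocks >= 0 && sb->addr_per_block > 0);
	if (chunk)	/* contiguous: N/addrs indirect + 2 dind + 1 tind */
		return nrblocks / sb->addr_per_block + 3;
	/* scattered: one indirect and one dindirect per block, + 1 tind */
	return nrblocks * 2 + 1;
}

/* Mirrors ext4_meta_trans_blocks() as posted, including its clamping. */
static int toy_meta_trans_blocks(const struct toy_sb *sb, int nrblocks,
				 int idxblocks, int journal_data)
{
	int groups = nrblocks + idxblocks;
	int gdpblocks = groups;
	int ret = 0;

	if (groups > sb->groups_count)
		groups = sb->groups_count;
	if (groups > sb->gdb_count)
		gdpblocks = sb->gdb_count;

	ret += groups + gdpblocks;	/* bitmaps + group descriptors */
	ret += idxblocks;		/* the indirect/index blocks */
	if (journal_data)
		ret += nrblocks;	/* data blocks are journalled too */
	ret += sb->meta_blocks;		/* sb, inode, quota, xattr */
	return ret;
}
```

For example, with 64 groups, 2 group descriptor blocks, meta_blocks = 8 and 1024 addresses per block, 4 scattered blocks need 9 indirect blocks and 13 + 2 + 9 + 8 = 32 credits, while one contiguous chunk of 4096 blocks needs 7 indirect blocks and 64 + 2 + 7 + 8 = 81 credits.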
Ext4: journal credits reservation fixes for extent file writepage
From: Mingming Cao <[email protected]>
This patch modifies the writepage/write_begin credit calculation for
extent files to use the credits calculation helper function.
The current calculation of how many index/leaf blocks should be
accounted for is too conservative: it always considers the worst case, where
the tree depth is 5, and in the case of multiple chunk allocation it
always multiplies the needed credits. This patch uses the actual depth of
the inode's extent tree, with some extras, to calculate the index blocks,
and is also less conservative when accounting for multiple allocations.
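To put numbers on "too conservative", here is a userspace sketch that compares the estimate this patch removes with the new index-block estimate. The function names are hypothetical; the old formula is transcribed from the removed ext4_ext_calc_credits_for_insert() with a NULL path (quota credits left out), and the new one from ext4_ext_index_trans_blocks().

```c
#include <assert.h>

/*
 * Old estimate for num allocations, as in the removed code:
 * a fixed depth of 5, then multiplied by num (the superblock is
 * counted only once, hence the "- (num - 1)").
 */
static int toy_old_ext_credits(int num)
{
	int depth = 5;
	int needed;

	assert(num >= 1);
	needed = 2;				/* new data block(s) */
	needed += 1;				/* old root if tree grows */
	needed += (depth * 2) + (depth * 2);	/* index split accounting */
	needed += 1;				/* superblock */
	return needed * num - (num - 1);
}

/*
 * New index/leaf estimate: based on the real tree depth, and on
 * whether the blocks form one extent (chunk) or several.
 */
static int toy_ext_index_trans_blocks(int depth, int chunk)
{
	assert(depth >= 0);
	return chunk ? depth * 2 : depth * 3;
}
```

The old code reserves 24 credits for one allocation and 93 for four, regardless of the tree. The new index estimate for a depth-2 tree is 4 blocks for a single extent and 6 for scattered blocks; bitmaps, group descriptors and the other metadata are then added on top by ext4_meta_trans_blocks().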
Signed-off-by: Mingming Cao <[email protected]>
---
---
fs/ext4/ext4.h | 2
fs/ext4/ext4_extents.h | 3 -
fs/ext4/extents.c | 100 +++++++++++++++++++++++++++----------------------
fs/ext4/inode.c | 2
fs/ext4/migrate.c | 3 -
5 files changed, 62 insertions(+), 48 deletions(-)
Index: linux-2.6.27-rc1/fs/ext4/ext4_extents.h
===================================================================
--- linux-2.6.27-rc1.orig/fs/ext4/ext4_extents.h 2008-08-12 07:26:48.000000000 -0700
+++ linux-2.6.27-rc1/fs/ext4/ext4_extents.h 2008-08-12 07:46:29.000000000 -0700
@@ -216,7 +216,8 @@ extern int ext4_ext_calc_metadata_amount
extern ext4_fsblk_t idx_pblock(struct ext4_extent_idx *);
extern void ext4_ext_store_pblock(struct ext4_extent *, ext4_fsblk_t);
extern int ext4_extent_tree_init(handle_t *, struct inode *);
-extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern int ext4_ext_calc_credits_for_single_extent(struct inode *inode, int num,
+ struct ext4_ext_path *path);
extern int ext4_ext_try_to_merge(struct inode *inode,
struct ext4_ext_path *path,
struct ext4_extent *);
Index: linux-2.6.27-rc1/fs/ext4/extents.c
===================================================================
--- linux-2.6.27-rc1.orig/fs/ext4/extents.c 2008-08-12 07:26:48.000000000 -0700
+++ linux-2.6.27-rc1/fs/ext4/extents.c 2008-08-12 07:50:27.000000000 -0700
@@ -1747,54 +1747,50 @@ static int ext4_ext_rm_idx(handle_t *han
}
/*
- * ext4_ext_calc_credits_for_insert:
- * This routine returns max. credits that the extent tree can consume.
- * It should be OK for low-performance paths like ->writepage()
- * To allow many writing processes to fit into a single transaction,
- * the caller should calculate credits under i_data_sem and
- * pass the actual path.
+ * ext4_ext_calc_credits_for_single_extent:
+ * This routine returns max. credits that needed to insert an extent
+ * to the extent tree.
+ * When pass the actual path, the caller should calculate credits
+ * under i_data_sem.
*/
-int ext4_ext_calc_credits_for_insert(struct inode *inode,
+int ext4_ext_calc_credits_for_single_extent(struct inode *inode, int num,
struct ext4_ext_path *path)
{
- int depth, needed;
+ int depth = ext_depth(inode);
if (path) {
/* probably there is space in leaf? */
- depth = ext_depth(inode);
if (le16_to_cpu(path[depth].p_hdr->eh_entries)
< le16_to_cpu(path[depth].p_hdr->eh_max))
- return 1;
+ return 2 + EXT4_META_TRANS_BLOCKS(inode->i_sb);
}
- /*
- * given 32-bit logical block (4294967296 blocks), max. tree
- * can be 4 levels in depth -- 4 * 340^4 == 53453440000.
- * Let's also add one more level for imbalance.
- */
- depth = 5;
-
- /* allocation of new data block(s) */
- needed = 2;
+ return ext4_ext_writepage_trans_blocks(inode, num, 1);
+}
- /*
- * tree can be full, so it would need to grow in depth:
- * we need one credit to modify old root, credits for
- * new root will be added in split accounting
- */
- needed += 1;
+/*
+ * How many index/leaf blocks need to change/allocate to modify nrblocks?
+ *
+ * if nrblocks are fit in a single extent (chunk flag is 1), then
+ * in the worse case, each tree level index/leaf need to be changed
+ * if the tree split due to insert a new extent, then the old tree
+ * index/leaf need to be updated too
+ *
+ * If the nrblocks are discontigous, they could cause
+ * the whole tree split more than once, but this is really rare.
+ */
+static int ext4_ext_index_trans_blocks(struct inode *inode, int num, int chunk)
+{
+ int index;
+ int depth = ext_depth(inode);
- /*
- * Index split can happen, we would need:
- * allocate intermediate indexes (bitmap + group)
- * + change two blocks at each level, but root (already included)
- */
- needed += (depth * 2) + (depth * 2);
- /* any allocation modifies superblock */
- needed += 1;
+ if (chunk)
+ index = depth * 2;
+ else
+ index = depth * 3;
- return needed;
+ return index;
}
static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
@@ -1921,9 +1917,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
correct_index = 1;
credits += (ext_depth(inode)) + 1;
}
-#ifdef CONFIG_QUOTA
credits += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
-#endif
err = ext4_ext_journal_restart(handle, credits);
if (err)
@@ -2861,20 +2855,38 @@ out_stop:
/*
* ext4_ext_writepage_trans_blocks:
* calculate max number of blocks we could modify
- * in order to allocate new block for an inode
+ * in order to allocate nrblocks of blocks.
+ *
+ * The chunk flag indicating whether the nrblocks are a single extent
+ * or discountigous on disk, that is used to determine how many index/leaf
+ * blocks needs credit for logging.
+ *
+ * Based on the index blocks and the nrblocks data blocks, we need to
+ * see how many bitmapblocks and block group descriptor groups need to accounted
+ * At last adds up the superblock, inode, quotao and xattr blocks. These
+ * all take care of in ext4_meta_trans_blocks()
*/
-int ext4_ext_writepage_trans_blocks(struct inode *inode, int num)
+int ext4_ext_writepage_trans_blocks(struct inode *inode, int num, int chunk)
{
int needed;
+ int index_blocks;
- needed = ext4_ext_calc_credits_for_insert(inode, NULL);
+ /*
+ * How many index/leaf blocks need to modify/allocate to
+ * insert a single extent with num blocks(chunk == 1)
+ * or @num extents (chunk ==0)
+ */
+ index_blocks = ext4_ext_index_trans_blocks(inode, num, chunk);
- /* caller wants to allocate num blocks, but note it includes sb */
- needed = needed * num - (num - 1);
+ /* How many metadat blocks need to modify to modify the @num
+ * of data blocks and index_blocks? Include, index/leaf blocks,
+ * bitmaps,block group descriptor block for modifying both data
+ * and index/leaf blocks, superblock, inode, quota and xattrs
+ */
+ needed = ext4_meta_trans_blocks(inode, num, index_blocks);
-#ifdef CONFIG_QUOTA
- needed += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
-#endif
+ if (ext4_should_journal_data(inode))
+ needed += num;
return needed;
}
Index: linux-2.6.27-rc1/fs/ext4/migrate.c
===================================================================
--- linux-2.6.27-rc1.orig/fs/ext4/migrate.c 2008-08-12 07:26:48.000000000 -0700
+++ linux-2.6.27-rc1/fs/ext4/migrate.c 2008-08-12 07:46:29.000000000 -0700
@@ -53,7 +53,8 @@ static int finish_range(handle_t *handle
* credit. But below we try to not accumalate too much
* of them by restarting the journal.
*/
- needed = ext4_ext_calc_credits_for_insert(inode, path);
+ needed = ext4_ext_calc_credits_for_single_extent(inode,
+ lb->last_block - lb->first_block + 1, path);
/*
* Make sure the credit we accumalated is not really high
Index: linux-2.6.27-rc1/fs/ext4/ext4.h
===================================================================
--- linux-2.6.27-rc1.orig/fs/ext4/ext4.h 2008-08-12 07:26:48.000000000 -0700
+++ linux-2.6.27-rc1/fs/ext4/ext4.h 2008-08-12 07:46:29.000000000 -0700
@@ -1227,7 +1227,7 @@ extern const struct inode_operations ext
/* extents.c */
extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
-extern int ext4_ext_writepage_trans_blocks(struct inode *, int);
+extern int ext4_ext_writepage_trans_blocks(struct inode *, int num, int chunk);
extern int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ext4_lblk_t iblock,
unsigned long max_blocks, struct buffer_head *bh_result,
Index: linux-2.6.27-rc1/fs/ext4/inode.c
===================================================================
--- linux-2.6.27-rc1.orig/fs/ext4/inode.c 2008-08-12 07:26:48.000000000 -0700
+++ linux-2.6.27-rc1/fs/ext4/inode.c 2008-08-12 07:52:30.000000000 -0700
@@ -4449,7 +4449,7 @@ int ext4_writepage_trans_blocks(struct i
if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
return ext4_writeblocks_trans_credits_old(inode, bpp, 0);
- return ext4_ext_writepage_trans_blocks(inode, bpp);
+ return ext4_ext_writepage_trans_blocks(inode, bpp, 0);
}
/*
* The caller must have previously called ext4_reserve_inode_write().
Ext4: journal credits reservation fixes for DIO, fallocate
From: Mingming Cao <[email protected]>
DIO and fallocate credit calculation is different from writepage, as
they start a new journal handle for each call to ext4_get_blocks_wrap().
This patch uses the helper function in the DIO and fallocate cases, passing
a flag indicating that the modified data blocks are contiguous and can
therefore account for fewer indirect/index blocks.
This patch also fixes the journal credit reservation for direct I/O
(DIO). Previously the estimated credits for DIO were calculated only for
non-extent files, which was not enough if the file is extent-based.
It also fixes fallocate double-counting credits for modifying the
superblock.
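For reference, the fixed DIO_CREDITS value removed by this patch came from arithmetic for indirect-mapped files only. The sketch below reproduces that arithmetic in userspace, following the comment deleted in the diff; the function name and parameter names are hypothetical.

```c
#include <assert.h>

/*
 * Old DIO estimate for b blocks with a block pointers per block:
 *   4                superblock + group descriptor + bitmap + inode
 *   1                triple indirect block
 *   b / a / a + 2    double indirect blocks
 *   b / a + 2        indirect blocks
 * Integer division is intended, as in the original comment.
 */
static int toy_old_dio_credits(int b, int a)
{
	assert(b >= 0 && a > 0);
	return 4 + 1 + (b / a / a + 2) + (b / a + 2);
}
```

Plugging in 4096 blocks and 256 pointers per block (1KB block size) gives 4 + 1 + 2 + 18 = 25, the old constant. Nothing in this formula accounts for extent tree index or leaf blocks, which is why it was not a safe bound for extent-based files.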
Signed-off-by: Mingming Cao <[email protected]>
---
---
fs/ext4/ext4.h | 1 +
fs/ext4/extents.c | 7 +++----
fs/ext4/inode.c | 49 ++++++++++++++++++++++++++++---------------------
3 files changed, 32 insertions(+), 25 deletions(-)
===================================================================
Index: linux-2.6.27-rc1/fs/ext4/extents.c
===================================================================
--- linux-2.6.27-rc1.orig/fs/ext4/extents.c 2008-08-11 22:25:39.000000000 -0700
+++ linux-2.6.27-rc1/fs/ext4/extents.c 2008-08-11 22:25:55.000000000 -0700
@@ -2799,7 +2799,7 @@ void ext4_ext_truncate(struct inode *ino
/*
* probably first extent we're gonna free will be last in block
*/
- err = ext4_writepage_trans_blocks(inode) + 3;
+ err = ext4_writepage_trans_blocks(inode);
handle = ext4_journal_start(inode, err);
if (IS_ERR(handle))
return;
@@ -2951,10 +2951,9 @@ long ext4_fallocate(struct inode *inode,
max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
- block;
/*
- * credits to insert 1 extent into extent tree + buffers to be able to
- * modify 1 super block, 1 block bitmap and 1 group descriptor.
+ * credits to insert 1 extent into extent tree
*/
- credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
+ credits = ext4_data_trans_blocks(inode, max_blocks);
mutex_lock(&inode->i_mutex);
retry:
while (ret >= 0 && ret < max_blocks) {
Index: linux-2.6.27-rc1/fs/ext4/inode.c
===================================================================
--- linux-2.6.27-rc1.orig/fs/ext4/inode.c 2008-08-11 22:18:31.000000000 -0700
+++ linux-2.6.27-rc1/fs/ext4/inode.c 2008-08-11 22:25:55.000000000 -0700
@@ -1041,18 +1041,6 @@ static void ext4_da_update_reserve_space
spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
}
-/* Maximum number of blocks we map for direct IO at once. */
-#define DIO_MAX_BLOCKS 4096
-/*
- * Number of credits we need for writing DIO_MAX_BLOCKS:
- * We need sb + group descriptor + bitmap + inode -> 4
- * For B blocks with A block pointers per block we need:
- * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect).
- * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25.
- */
-#define DIO_CREDITS 25
-
-
/*
* The ext4_get_blocks_wrap() function try to look up the requested blocks,
* and returns if the blocks are already mapped.
@@ -1164,19 +1152,23 @@ int ext4_get_blocks_wrap(handle_t *handl
return retval;
}
+/* Maximum number of blocks we map for direct IO at once. */
+#define DIO_MAX_BLOCKS 4096
+
static int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
handle_t *handle = ext4_journal_current_handle();
int ret = 0, started = 0;
unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
+ int dio_credits;
if (create && !handle) {
/* Direct IO write... */
if (max_blocks > DIO_MAX_BLOCKS)
max_blocks = DIO_MAX_BLOCKS;
- handle = ext4_journal_start(inode, DIO_CREDITS +
- 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
+ dio_credits = ext4_data_trans_blocks(inode, max_blocks);
+ handle = ext4_journal_start(inode, dio_credits);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
goto out;
@@ -2222,7 +2214,7 @@ static int ext4_da_writepage(struct page
* for DIO, writepages, and truncate
*/
#define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
-#define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS
+#define EXT4_MAX_WRITEBACK_CREDITS 25
static int ext4_da_writepages(struct address_space *mapping,
struct writeback_control *wbc)
@@ -4429,7 +4421,8 @@ static int ext4_writeblocks_trans_credit
/*
* Calulate the total number of credits to reserve to fit
- * the modification of a single pages into a single transaction
+ * the modification of a single pages into a single transaction,
+ * which may include multile chunk of block allocations.
*
* This could be called via ext4_write_begin() or later
* ext4_da_writepages() in delalyed allocation case.
@@ -4437,11 +4430,6 @@ static int ext4_writeblocks_trans_credit
* In both case it's possible that we could allocating multiple
* chunks of blocks. We need to consider the worse case, when
* one new block per extent.
- *
- * For Direct IO and fallocate, the journal credits reservation
- * is based on one single extent allocation, so they could use
- * EXT4_DATA_TRANS_BLOCKS to get the needed credit to log a single
- * chunk of allocation needs.
*/
int ext4_writepage_trans_blocks(struct inode *inode)
{
@@ -4451,6 +4439,25 @@ int ext4_writepage_trans_blocks(struct i
return ext4_writeblocks_trans_credits_old(inode, bpp, 0);
return ext4_ext_writepage_trans_blocks(inode, bpp, 0);
}
+
+/*
+ * Calculate the journal credits for a chunk of data modification.
+ *
+ * For Direct IO and fallocate, the journal credits reservation
+ * is based on one single extent allocation, so they could use
+ * this function to get the needed credit to log a single
+ * chunk of allocation needs.
+ *
+ * This is called from DIO, fallocate or whoever calling
+ * ext4_get_blocks_wrap() to map/allocate a chunk of contigous disk blocks
+ */
+int ext4_data_trans_blocks(struct inode *inode, int nrblocks)
+{
+ if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+ return ext4_writeblocks_trans_credits_old(inode, nrblocks, 1);
+ return ext4_ext_writepage_trans_blocks(inode, nrblocks, 1);
+}
+
/*
* The caller must have previously called ext4_reserve_inode_write().
* Give this, we know that the caller already has write access to iloc->bh.
Index: linux-2.6.27-rc1/fs/ext4/ext4.h
===================================================================
--- linux-2.6.27-rc1.orig/fs/ext4/ext4.h 2008-08-11 22:18:31.000000000 -0700
+++ linux-2.6.27-rc1/fs/ext4/ext4.h 2008-08-11 22:25:55.000000000 -0700
@@ -1073,6 +1073,7 @@ extern void ext4_get_inode_flags(struct
extern void ext4_set_aops(struct inode *inode);
extern int ext4_writepage_trans_blocks(struct inode *);
extern int ext4_meta_trans_blocks(struct inode *, int nrblocks, int idxblocks);
+extern int ext4_data_trans_blocks(struct inode *, int nrblocks);
extern int ext4_block_truncate_page(handle_t *handle,
struct address_space *mapping, loff_t from);
extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page);
Updated delalloc writepages rework patch for the patch queue, as some of the old journal credits patches have been reworked.
ext4: Rework the ext4_da_writepages
From: "Aneesh Kumar K.V" <[email protected]>
With the changes below we reserve only the credits needed to insert one extent
resulting from a single call to get_block. That makes sure we don't take
too many journal credits during writeout. We also don't limit the pages
to write. That means we loop through the dirty pages, building the largest
possible contiguous block request. Then we issue a single get_block request.
We may get fewer blocks than we requested. If so, we end up not mapping
some of the buffer_heads. That means those buffer_heads are still marked delay.
Later, in the writepage callback via __mpage_writepage, we redirty those pages.
We should also not limit/throttle wbc->nr_to_write in the filesystem writepages
callback. That causes wrong behaviour in generic_sync_sb_inodes, caused by
wbc->nr_to_write being <= 0.
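The "build the largest contiguous request, issue one get_block, leave the rest delayed" idea can be modelled in a few lines of userspace C. This is a toy under stated assumptions, not the patch's code: struct toy_extent and both functions are hypothetical, dirty pages are represented as a sorted array of page indexes, and one block per page is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one contiguous request. */
struct toy_extent {
	unsigned long first;	/* first page index in the run */
	unsigned long count;	/* number of consecutive pages */
};

/* Return the leading run of consecutive indexes in a sorted array. */
static struct toy_extent toy_first_contiguous_run(const unsigned long *idx,
						  size_t n)
{
	struct toy_extent e = {0, 0};
	size_t i;

	if (n == 0)
		return e;
	e.first = idx[0];
	e.count = 1;
	for (i = 1; i < n && idx[i] == idx[i - 1] + 1; i++)
		e.count++;
	return e;
}

/*
 * If the single allocation request mapped fewer blocks than asked
 * for, the remainder stays delayed and is redirtied for a later pass.
 */
static unsigned long toy_left_delayed(struct toy_extent e,
				      unsigned long mapped)
{
	assert(mapped <= e.count);
	return e.count - mapped;
}
```

For dirty pages 3, 4, 5, 9, 10 the first request covers pages 3 to 5; if the allocator maps only two of those blocks, one page stays delayed and is picked up again later, as are pages 9 and 10.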
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/inode.c | 201 +++++++++++++++++++++++++++++++-------------------------
1 file changed, 113 insertions(+), 88 deletions(-)
Index: linux-2.6.27-rc1/fs/ext4/inode.c
===================================================================
--- linux-2.6.27-rc1.orig/fs/ext4/inode.c 2008-08-11 15:12:11.000000000 -0700
+++ linux-2.6.27-rc1/fs/ext4/inode.c 2008-08-11 15:38:09.000000000 -0700
@@ -41,6 +41,8 @@
#include "acl.h"
#include "ext4_extents.h"
+#define MPAGE_DA_EXTENT_TAIL 0x01
+
static inline int ext4_begin_ordered_truncate(struct inode *inode,
loff_t new_size)
{
@@ -1605,11 +1607,13 @@ struct mpage_da_data {
unsigned long first_page, next_page; /* extent of pages */
get_block_t *get_block;
struct writeback_control *wbc;
+ int io_done;
+ long pages_written;
};
/*
* mpage_da_submit_io - walks through extent of pages and try to write
- * them with __mpage_writepage()
+ * them with writepage() call back
*
* @mpd->inode: inode
* @mpd->first_page: first page of the extent
@@ -1624,18 +1628,11 @@ struct mpage_da_data {
static int mpage_da_submit_io(struct mpage_da_data *mpd)
{
struct address_space *mapping = mpd->inode->i_mapping;
- struct mpage_data mpd_pp = {
- .bio = NULL,
- .last_block_in_bio = 0,
- .get_block = mpd->get_block,
- .use_writepage = 1,
- };
int ret = 0, err, nr_pages, i;
unsigned long index, end;
struct pagevec pvec;
BUG_ON(mpd->next_page <= mpd->first_page);
-
pagevec_init(&pvec, 0);
index = mpd->first_page;
end = mpd->next_page - 1;
@@ -1653,8 +1650,9 @@ static int mpage_da_submit_io(struct mpa
break;
index++;
- err = __mpage_writepage(page, mpd->wbc, &mpd_pp);
-
+ err = mapping->a_ops->writepage(page, mpd->wbc);
+ if (!err)
+ mpd->pages_written++;
/*
* In error case, we have to continue because
* remaining pages are still locked
@@ -1665,9 +1663,6 @@ static int mpage_da_submit_io(struct mpa
}
pagevec_release(&pvec);
}
- if (mpd_pp.bio)
- mpage_bio_submit(WRITE, mpd_pp.bio);
-
return ret;
}
@@ -1690,7 +1685,7 @@ static void mpage_put_bnr_to_bhs(struct
int blocks = exbh->b_size >> inode->i_blkbits;
sector_t pblock = exbh->b_blocknr, cur_logical;
struct buffer_head *head, *bh;
- unsigned long index, end;
+ pgoff_t index, end;
struct pagevec pvec;
int nr_pages, i;
@@ -1775,13 +1770,11 @@ static inline void __unmap_underlying_bl
*
* The function skips space we know is already mapped to disk blocks.
*
- * The function ignores errors ->get_block() returns, thus real
- * error handling is postponed to __mpage_writepage()
*/
static void mpage_da_map_blocks(struct mpage_da_data *mpd)
{
+ int err = 0;
struct buffer_head *lbh = &mpd->lbh;
- int err = 0, remain = lbh->b_size;
sector_t next = lbh->b_blocknr;
struct buffer_head new;
@@ -1791,35 +1784,32 @@ static void mpage_da_map_blocks(struct m
if (buffer_mapped(lbh) && !buffer_delay(lbh))
return;
- while (remain) {
- new.b_state = lbh->b_state;
- new.b_blocknr = 0;
- new.b_size = remain;
- err = mpd->get_block(mpd->inode, next, &new, 1);
- if (err) {
- /*
- * Rather than implement own error handling
- * here, we just leave remaining blocks
- * unallocated and try again with ->writepage()
- */
- break;
- }
- BUG_ON(new.b_size == 0);
+ new.b_state = lbh->b_state;
+ new.b_blocknr = 0;
+ new.b_size = lbh->b_size;
+
+ /*
+ * If we didn't accumulate anything
+ * to write simply return
+ */
+ if (!new.b_size)
+ return;
+ err = mpd->get_block(mpd->inode, next, &new, 1);
+ if (err)
+ return;
+ BUG_ON(new.b_size == 0);
- if (buffer_new(&new))
- __unmap_underlying_blocks(mpd->inode, &new);
+ if (buffer_new(&new))
+ __unmap_underlying_blocks(mpd->inode, &new);
- /*
- * If blocks are delayed marked, we need to
- * put actual blocknr and drop delayed bit
- */
- if (buffer_delay(lbh) || buffer_unwritten(lbh))
- mpage_put_bnr_to_bhs(mpd, next, &new);
+ /*
+ * If blocks are delayed marked, we need to
+ * put actual blocknr and drop delayed bit
+ */
+ if (buffer_delay(lbh) || buffer_unwritten(lbh))
+ mpage_put_bnr_to_bhs(mpd, next, &new);
- /* go for the remaining blocks */
- next += new.b_size >> mpd->inode->i_blkbits;
- remain -= new.b_size;
- }
+ return;
}
#define BH_FLAGS ((1 << BH_Uptodate) | (1 << BH_Mapped) | \
@@ -1865,13 +1855,9 @@ static void mpage_add_bh_to_extent(struc
* need to flush current extent and start new one
*/
mpage_da_map_blocks(mpd);
-
- /*
- * Now start a new extent
- */
- lbh->b_size = bh->b_size;
- lbh->b_state = bh->b_state & BH_FLAGS;
- lbh->b_blocknr = logical;
+ mpage_da_submit_io(mpd);
+ mpd->io_done = 1;
+ return;
}
/*
@@ -1891,17 +1877,35 @@ static int __mpage_da_writepage(struct p
struct buffer_head *bh, *head, fake;
sector_t logical;
+ if (mpd->io_done) {
+ /*
+ * Redirty and skip the rest of the pages
+ * in the page_vec. We will try to write
+ * them again after starting a new
+ * transaction
+ */
+ redirty_page_for_writepage(wbc, page);
+ unlock_page(page);
+ return MPAGE_DA_EXTENT_TAIL;
+ }
/*
* Can we merge this page to current extent?
*/
if (mpd->next_page != page->index) {
/*
* Nope, we can't. So, we map non-allocated blocks
- * and start IO on them using __mpage_writepage()
+ * and start IO on them using writepage()
*/
if (mpd->next_page != mpd->first_page) {
mpage_da_map_blocks(mpd);
mpage_da_submit_io(mpd);
+ /*
+ * skip rest of the page in the page_vec
+ */
+ mpd->io_done = 1;
+ redirty_page_for_writepage(wbc, page);
+ unlock_page(page);
+ return MPAGE_DA_EXTENT_TAIL;
}
/*
@@ -1932,6 +1936,8 @@ static int __mpage_da_writepage(struct p
set_buffer_dirty(bh);
set_buffer_uptodate(bh);
mpage_add_bh_to_extent(mpd, logical, bh);
+ if (mpd->io_done)
+ return MPAGE_DA_EXTENT_TAIL;
} else {
/*
* Page with regular buffer heads, just add all dirty ones
@@ -1940,8 +1946,12 @@ static int __mpage_da_writepage(struct p
bh = head;
do {
BUG_ON(buffer_locked(bh));
- if (buffer_dirty(bh))
+ if (buffer_dirty(bh) &&
+ (!buffer_mapped(bh) || buffer_delay(bh))) {
mpage_add_bh_to_extent(mpd, logical, bh);
+ if (mpd->io_done)
+ return MPAGE_DA_EXTENT_TAIL;
+ }
logical++;
} while ((bh = bh->b_this_page) != head);
}
@@ -1960,22 +1970,13 @@ static int __mpage_da_writepage(struct p
*
* This is a library function, which implements the writepages()
* address_space_operation.
- *
- * In order to avoid duplication of logic that deals with partial pages,
- * multiple bio per page, etc, we find non-allocated blocks, allocate
- * them with minimal calls to ->get_block() and re-use __mpage_writepage()
- *
- * It's important that we call __mpage_writepage() only once for each
- * involved page, otherwise we'd have to implement more complicated logic
- * to deal with pages w/o PG_lock or w/ PG_writeback and so on.
- *
- * See comments to mpage_writepages()
*/
static int mpage_da_writepages(struct address_space *mapping,
struct writeback_control *wbc,
get_block_t get_block)
{
struct mpage_da_data mpd;
+ long to_write;
int ret;
if (!get_block)
@@ -1989,17 +1990,22 @@ static int mpage_da_writepages(struct ad
mpd.first_page = 0;
mpd.next_page = 0;
mpd.get_block = get_block;
+ mpd.io_done = 0;
+ mpd.pages_written = 0;
+
+ to_write = wbc->nr_to_write;
ret = write_cache_pages(mapping, wbc, __mpage_da_writepage, &mpd);
/*
* Handle last extent of pages
*/
- if (mpd.next_page != mpd.first_page) {
+ if (!mpd.io_done && mpd.next_page != mpd.first_page) {
mpage_da_map_blocks(&mpd);
mpage_da_submit_io(&mpd);
}
+ wbc->nr_to_write = to_write - mpd.pages_written;
return ret;
}
@@ -2217,7 +2223,7 @@ static int ext4_da_writepage(struct page
#define EXT4_MAX_WRITEBACK_CREDITS 25
static int ext4_da_writepages(struct address_space *mapping,
- struct writeback_control *wbc)
+ struct writeback_control *wbc)
{
struct inode *inode = mapping->host;
handle_t *handle = NULL;
@@ -2225,42 +2231,53 @@ static int ext4_da_writepages(struct add
int ret = 0;
long to_write;
loff_t range_start = 0;
+ long pages_skipped = 0;
/*
* No pages to write? This is mainly a kludge to avoid starting
* a transaction for special inodes like journal inode on last iput()
* because that could violate lock ordering on umount
*/
- if (!mapping->nrpages)
+ if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
return 0;
- /*
- * Estimate the worse case needed credits to write out
- * EXT4_MAX_BUF_BLOCKS pages
- */
- needed_blocks = EXT4_MAX_WRITEBACK_CREDITS;
-
- to_write = wbc->nr_to_write;
- if (!wbc->range_cyclic) {
+ if (!wbc->range_cyclic)
/*
* If range_cyclic is not set force range_cont
* and save the old writeback_index
*/
wbc->range_cont = 1;
- range_start = wbc->range_start;
- }
- while (!ret && to_write) {
+ range_start = wbc->range_start;
+ pages_skipped = wbc->pages_skipped;
+
+restart_loop:
+ to_write = wbc->nr_to_write;
+ while (!ret && to_write > 0) {
+
+ /*
+ * we insert one extent at a time. So we need
+ * credit needed for single extent allocation.
+ * journalled mode is currently not supported
+ * by delalloc
+ */
+ BUG_ON(ext4_should_journal_data(inode));
+ needed_blocks = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
+
/* start a new transaction*/
handle = ext4_journal_start(inode, needed_blocks);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
+ printk(KERN_EMERG "%s: jbd2_start: "
+ "%ld pages, ino %lu; err %d\n", __func__,
+ wbc->nr_to_write, inode->i_ino, ret);
+ dump_stack();
goto out_writepages;
}
if (ext4_should_order_data(inode)) {
/*
* With ordered mode we need to add
- * the inode to the journal handle
+ * the inode to the journal handl
* when we do block allocation.
*/
ret = ext4_jbd2_file_inode(handle, inode);
@@ -2268,20 +2285,20 @@ static int ext4_da_writepages(struct add
ext4_journal_stop(handle);
goto out_writepages;
}
-
}
- /*
- * set the max dirty pages could be write at a time
- * to fit into the reserved transaction credits
- */
- if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
- wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
to_write -= wbc->nr_to_write;
ret = mpage_da_writepages(mapping, wbc,
- ext4_da_get_block_write);
+ ext4_da_get_block_write);
ext4_journal_stop(handle);
- if (wbc->nr_to_write) {
+ if (ret == MPAGE_DA_EXTENT_TAIL) {
+ /*
+ * got one extent now try with
+ * rest of the pages
+ */
+ to_write += wbc->nr_to_write;
+ ret = 0;
+ } else if (wbc->nr_to_write) {
/*
* There is no more writeout needed
* or we requested for a noblocking writeout
@@ -2293,10 +2310,18 @@ static int ext4_da_writepages(struct add
wbc->nr_to_write = to_write;
}
+ if (wbc->range_cont && (pages_skipped != wbc->pages_skipped)) {
+ /* We skipped pages in this loop */
+ wbc->range_start = range_start;
+ wbc->nr_to_write = to_write +
+ wbc->pages_skipped - pages_skipped;
+ wbc->pages_skipped = pages_skipped;
+ goto restart_loop;
+ }
+
out_writepages:
wbc->nr_to_write = to_write;
- if (range_start)
- wbc->range_start = range_start;
+ wbc->range_start = range_start;
return ret;
}
Ext4: journal credit fix for delalloc writepages
From: Mingming Cao <[email protected]>
The previous delalloc writepages implementation started a new transaction outside
a loop of get_block() calls that did the block allocation. Because it did not know
how many blocks would be allocated, the journal credit estimate was very
conservative, which caused many issues.
With the reworked delayed allocation, a new transaction is created for
each get_block() call, so we no longer need to guess the credits for multiple
chunks of allocation. Starting every transaction with the credits needed to insert
a single extent is enough. But we still need to consider journalled mode, which
has to account for the number of data blocks, so we estimate the maximum number of
data blocks for each allocation. Because the current VFS implementation of
writepages() can only flush PAGEVEC_SIZE pages at a time, the maximum block
allocation is bounded by that, or by the total number
of reserved delalloc data blocks, whichever is smaller.
Signed-off-by: Mingming Cao <[email protected]>
---
fs/ext4/inode.c | 39 ++++++++++++++++++++++++---------------
1 file changed, 24 insertions(+), 15 deletions(-)
Index: linux-2.6.27-rc1/fs/ext4/inode.c
===================================================================
--- linux-2.6.27-rc1.orig/fs/ext4/inode.c 2008-08-12 08:15:59.000000000 -0700
+++ linux-2.6.27-rc1/fs/ext4/inode.c 2008-08-12 08:30:41.000000000 -0700
@@ -2210,17 +2210,28 @@ static int ext4_da_writepage(struct page
}
/*
- * For now just follow the DIO way to estimate the max credits
- * needed to write out EXT4_MAX_WRITEBACK_PAGES.
- * todo: need to calculate the max credits need for
- * extent based files, currently the DIO credits is based on
- * indirect-blocks mapping way.
- *
- * Probably should have a generic way to calculate credits
- * for DIO, writepages, and truncate
+ * This is called via ext4_da_writepages() to
+ * calculate the total number of credits to reserve to fit
+ * a single extent allocation into a single transaction.
+ * ext4_da_writepages() will loop calling this before
+ * the block allocation.
+ *
+ * The page vector size limits the max number of pages that can
+ * be written out at a time. Based on this, the max blocks to pass to
+ * get_block is calculated.
*/
-#define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
-#define EXT4_MAX_WRITEBACK_CREDITS 25
+
+#define EXT4_MAX_WRITEPAGES_SIZE PAGEVEC_SIZE
+static int ext4_writepages_trans_blocks(struct inode *inode)
+{
+ int bpp = ext4_journal_blocks_per_page(inode);
+ int max_blocks = EXT4_MAX_WRITEPAGES_SIZE * bpp;
+
+ if (max_blocks > EXT4_I(inode)->i_reserved_data_blocks)
+ max_blocks = EXT4_I(inode)->i_reserved_data_blocks;
+
+ return ext4_data_trans_blocks(inode, max_blocks);
+}
static int ext4_da_writepages(struct address_space *mapping,
struct writeback_control *wbc)
@@ -2262,7 +2273,7 @@ restart_loop:
* by delalloc
*/
BUG_ON(ext4_should_journal_data(inode));
- needed_blocks = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
+ needed_blocks = ext4_writepages_trans_blocks(inode);
/* start a new transaction*/
handle = ext4_journal_start(inode, needed_blocks);
@@ -4449,11 +4460,9 @@ static int ext4_writeblocks_trans_credit
* the modification of a single pages into a single transaction,
* which may include multile chunk of block allocations.
*
- * This could be called via ext4_write_begin() or later
- * ext4_da_writepages() in delalyed allocation case.
+ * This could be called via ext4_write_begin()
*
- * In both case it's possible that we could allocating multiple
- * chunks of blocks. We need to consider the worse case, when
+ * We need to consider the worse case, when
* one new block per extent.
*/
int ext4_writepage_trans_blocks(struct inode *inode)
defrag part change. This patch should be merged with
ext4-online-defrag-alloc-contiguous-blks.patch
Cc: Akira Fujita <[email protected]>
Cc: Takashi Sato <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
Index: linux-2.6.27-rc1/fs/ext4/defrag.c
===================================================================
--- linux-2.6.27-rc1.orig/fs/ext4/defrag.c 2008-08-11 17:15:49.000000000 -0700
+++ linux-2.6.27-rc1/fs/ext4/defrag.c 2008-08-11 19:46:59.000000000 -0700
@@ -186,9 +186,9 @@ ext4_defrag_alloc_blocks(handle_t *handl
struct buffer_head *bh = NULL;
int err, i, credits = 0;
- credits = ext4_ext_calc_credits_for_insert(dest_inode, dest_path);
- err = ext4_ext_journal_restart(handle,
- credits + EXT4_TRANS_META_BLOCKS);
+ credits = ext4_ext_calc_credits_for_single_extent(dest_inode,
+ ar->len, dest_path);
+ err = ext4_ext_journal_restart(handle, credits);
if (err)
return err;
On Tue, Aug 12, 2008 at 09:25:54AM -0700, Mingming Cao wrote:
....
> - * and dindirect block, and the "5" above becomes "3".
> + * Also account for superblock, inode, quota and xattr blocks
> + */
> +int ext4_meta_trans_blocks(struct inode* inode, int nrblocks, int idxblocks)
> +{
> + int groups, gdpblocks;
> + int ret = 0;
> +
> + groups = nrblocks + idxblocks;
> + gdpblocks = groups;
> + if (groups > EXT4_SB(inode->i_sb)->s_groups_count)
> + groups = EXT4_SB(inode->i_sb)->s_groups_count;
> + if (groups > EXT4_SB(inode->i_sb)->s_gdb_count)
> + gdpblocks = EXT4_SB(inode->i_sb)->s_gdb_count;
> +
> + /* bitmaps and block group descriptor blocks */
> + ret += groups + gdpblocks;
ret = groups + gdpblocks;
> +
> + ret += idxblocks;
> +
> + /* journalled mode, include buffer to modify data blocks */
> + if (ext4_should_journal_data(inode))
> + ret += nrblocks;
> +
> + /* Blocks for super block, inode, quota and xattr blocks */
> + ret += EXT4_META_TRANS_BLOCKS(inode->i_sb);
> +
> + return ret;
> +}
> +
> +static int ext4_indirect_trans_blocks(struct inode *inode, int nrblocks,
> + int chunk)
> +{
.....
.....
>
> +/*
> + * Define the number of metadata blocks we need to account to modify data.
> + *
> + * This include super block, inode block, quota blocks and xattr blocks
> + */
> +#define EXT4_META_TRANS_BLOCKS(sb) (EXT4_XATTR_TRANS_BLOCKS + \
> + 2*EXT4_QUOTA_TRANS_BLOCKS(sb))
> +
Does EXT4_XATTR_TRANS_BLOCKS really account for the super block
and inode block? Looking at EXT4_DATA_TRANS_BLOCKS, which also uses
EXT4_XATTR_TRANS_BLOCKS, I think it doesn't. But the comment above
EXT4_XATTR_TRANS_BLOCKS is confusing. I guess we would need a +2 here.
-aneesh
On Tue, Aug 12, 2008 at 09:27:53AM -0700, Mingming Cao wrote:
....
....
>
> /*
> - * ext4_ext_calc_credits_for_insert:
> - * This routine returns max. credits that the extent tree can consume.
> - * It should be OK for low-performance paths like ->writepage()
> - * To allow many writing processes to fit into a single transaction,
> - * the caller should calculate credits under i_data_sem and
> - * pass the actual path.
> + * ext4_ext_calc_credits_for_single_extent:
> + * This routine returns max. credits that needed to insert an extent
> + * to the extent tree.
> + * When pass the actual path, the caller should calculate credits
> + * under i_data_sem.
> */
> -int ext4_ext_calc_credits_for_insert(struct inode *inode,
> +int ext4_ext_calc_credits_for_single_extent(struct inode *inode, int num,
> struct ext4_ext_path *path)
> {
> - int depth, needed;
> + int depth = ext_depth(inode);
>
> if (path) {
> /* probably there is space in leaf? */
> - depth = ext_depth(inode);
> if (le16_to_cpu(path[depth].p_hdr->eh_entries)
> < le16_to_cpu(path[depth].p_hdr->eh_max))
> - return 1;
> + return 2 + EXT4_META_TRANS_BLOCKS(inode->i_sb);
> }
>
> - /*
> - * given 32-bit logical block (4294967296 blocks), max. tree
> - * can be 4 levels in depth -- 4 * 340^4 == 53453440000.
> - * Let's also add one more level for imbalance.
> - */
> - depth = 5;
> -
> - /* allocation of new data block(s) */
> - needed = 2;
> + return ext4_ext_writepage_trans_blocks(inode, num, 1);
A single extent insert should not depend on the journaling mode.
In the branch above, where the path is specified, we don't look at the
journaling mode, but ext4_ext_writepage_trans_blocks will.
> +}
>
> - /*
> - * tree can be full, so it would need to grow in depth:
> - * we need one credit to modify old root, credits for
> - * new root will be added in split accounting
> - */
> - needed += 1;
> +/*
> + * How many index/leaf blocks need to change/allocate to modify nrblocks?
> + *
> + * if nrblocks are fit in a single extent (chunk flag is 1), then
> + * in the worse case, each tree level index/leaf need to be changed
> + * if the tree split due to insert a new extent, then the old tree
> + * index/leaf need to be updated too
> + *
> + * If the nrblocks are discontigous, they could cause
> + * the whole tree split more than once, but this is really rare.
> + */
> +static int ext4_ext_index_trans_blocks(struct inode *inode, int num, int chunk)
> +{
> + int index;
> + int depth = ext_depth(inode);
>
> - /*
> - * Index split can happen, we would need:
> - * allocate intermediate indexes (bitmap + group)
> - * + change two blocks at each level, but root (already included)
> - */
> - needed += (depth * 2) + (depth * 2);
>
> - /* any allocation modifies superblock */
> - needed += 1;
> + if (chunk)
> + index = depth * 2;
> + else
> + index = depth * 3;
>
> - return needed;
> + return index;
> }
>
....
-aneesh
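A minimal sketch of the two estimates discussed in this mail, written as
plain userspace C. The function names, the meta_blocks parameter (standing in
for EXT4_META_TRANS_BLOCKS(sb)) and the -1 "no fast path" return value are
hypothetical; the arithmetic is taken from the quoted patch.

```c
#include <assert.h>

/*
 * Index/leaf blocks that may change when inserting nrblocks.
 * chunk != 0: the blocks fit in one extent, so each tree level may be
 * modified and, on a split, the old block updated too (depth * 2).
 * chunk == 0: discontiguous blocks, the patch budgets depth * 3.
 */
int ext_index_trans_blocks(int depth, int chunk)
{
	return chunk ? depth * 2 : depth * 3;
}

/*
 * Fast path of the single-extent estimate: if the leaf still has a free
 * slot, the patch returns 2 + EXT4_META_TRANS_BLOCKS(sb).
 * Returns -1 (hypothetical marker) when the leaf is full and the caller
 * would fall back to the full estimate.
 */
int single_extent_fast_path(int leaf_entries, int leaf_max, int meta_blocks)
{
	if (leaf_entries < leaf_max)
		return 2 + meta_blocks;
	return -1;
}
```

Note that the fast path never consults the journaling mode, which is the
asymmetry pointed out above.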
On Tue, Aug 12, 2008 at 09:29:50AM -0700, Mingming Cao wrote:
>
......
....
>
> ===================================================================
> Index: linux-2.6.27-rc1/fs/ext4/extents.c
> ===================================================================
> --- linux-2.6.27-rc1.orig/fs/ext4/extents.c 2008-08-11 22:25:39.000000000 -0700
> +++ linux-2.6.27-rc1/fs/ext4/extents.c 2008-08-11 22:25:55.000000000 -0700
> @@ -2799,7 +2799,7 @@ void ext4_ext_truncate(struct inode *ino
> /*
> * probably first extent we're gonna free will be last in block
> */
> - err = ext4_writepage_trans_blocks(inode) + 3;
> + err = ext4_writepage_trans_blocks(inode);
> handle = ext4_journal_start(inode, err);
> if (IS_ERR(handle))
> return;
> @@ -2951,10 +2951,9 @@ long ext4_fallocate(struct inode *inode,
> max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> - block;
> /*
> - * credits to insert 1 extent into extent tree + buffers to be able to
> - * modify 1 super block, 1 block bitmap and 1 group descriptor.
> + * credits to insert 1 extent into extent tree
> */
> - credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
> + credits = ext4_data_trans_blocks(inode, max_blocks);
Why do we need to consider data=journal mode here? We are not writing
any data here; we are just inserting an extent.
> mutex_lock(&inode->i_mutex);
> retry:
> while (ret >= 0 && ret < max_blocks) {
> Index: linux-2.6.27-rc1/fs/ext4/inode.c
> ===================================================================
> --- linux-2.6.27-rc1.orig/fs/ext4/inode.c 2008-08-11 22:18:31.000000000 -0700
> +++ linux-2.6.27-rc1/fs/ext4/inode.c 2008-08-11 22:25:55.000000000 -0700
> @@ -1041,18 +1041,6 @@ static void ext4_da_update_reserve_space
> spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> }
>
> -/* Maximum number of blocks we map for direct IO at once. */
> -#define DIO_MAX_BLOCKS 4096
> -/*
> - * Number of credits we need for writing DIO_MAX_BLOCKS:
> - * We need sb + group descriptor + bitmap + inode -> 4
> - * For B blocks with A block pointers per block we need:
> - * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect).
> - * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25.
> - */
> -#define DIO_CREDITS 25
> -
> -
> /*
> * The ext4_get_blocks_wrap() function try to look up the requested blocks,
> * and returns if the blocks are already mapped.
> @@ -1164,19 +1152,23 @@ int ext4_get_blocks_wrap(handle_t *handl
> return retval;
> }
>
> +/* Maximum number of blocks we map for direct IO at once. */
> +#define DIO_MAX_BLOCKS 4096
> +
> static int ext4_get_block(struct inode *inode, sector_t iblock,
> struct buffer_head *bh_result, int create)
> {
> handle_t *handle = ext4_journal_current_handle();
> int ret = 0, started = 0;
> unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> + int dio_credits;
>
> if (create && !handle) {
> /* Direct IO write... */
> if (max_blocks > DIO_MAX_BLOCKS)
> max_blocks = DIO_MAX_BLOCKS;
> - handle = ext4_journal_start(inode, DIO_CREDITS +
> - 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
> + dio_credits = ext4_data_trans_blocks(inode, max_blocks);
> + handle = ext4_journal_start(inode, dio_credits);
Even in data=journal mode, direct IO will put the buffer_heads to the journal,
right? So should we use ext4_data_trans_blocks here?
> if (IS_ERR(handle)) {
> ret = PTR_ERR(handle);
> goto out;
> @@ -2222,7 +2214,7 @@ static int ext4_da_writepage(struct page
> * for DIO, writepages, and truncate
> */
> #define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
> -#define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS
> +#define EXT4_MAX_WRITEBACK_CREDITS 25
>
> static int ext4_da_writepages(struct address_space *mapping,
> struct writeback_control *wbc)
> @@ -4429,7 +4421,8 @@ static int ext4_writeblocks_trans_credit
>
> /*
>
....
....
-aneesh
On Tue, Aug 12, 2008 at 09:35:38AM -0700, Mingming Cao wrote:
> Ext4: journal credit fix the delalloc writepages
>
> From: Mingming Cao <[email protected]>
>
> Previous delalloc writepages implementation start a new transaction outside
> a loop call of get_block() to do the block allocation. Due to lack of information
> of how many blocks to be allocated, the estimate of the journal credits is very
> Conservative and caused many issues.
>
> With the reworked delayed allocation, a new transaction is created for
> each get_block(), thus we don't need to guess how many credits for the multiple
> chunk of allocation. Start every transaction with credits for insert a single
> extent is enough. But we still need to consider the journalled mode, where
> it need to account for the number of data blocks. So we guess max number of
> data blocks for each allocation.
But we don't currently support data=journal with delalloc.
> Due to the current VFS implementation
> writepages() could only flush PAGEVEC of pages at a time, the max block
> allocation is limited and calculated based on that, an the total number
> of reserved delalloc datablocks, whichever is smaller.
That is not correct. Currently write_cache_pages does
while (!done && (index <= end) &&
(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
PAGECACHE_TAG_DIRTY,
min(end - index,
(pgoff_t)PAGEVEC_SIZE-1) + 1)))
{
and mpage_da_submit_io does
while (index <= end) {
/* XXX: optimize tail */
nr_pages = pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE);
i.e. we iterate until index > end. So we can very well have more than
PAGEVEC_SIZE pages in a single transaction.
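A small userspace model of the loop shape quoted above, as a sketch only. It
assumes every page in [index, end] is dirty and that PAGEVEC_SIZE is 14 (an
assumption about kernels of this era); walk_dirty_range and struct
walk_result are hypothetical names, not kernel APIs.

```c
#include <assert.h>

/* Assumed value, not taken from this thread. */
#define PAGEVEC_SIZE 14UL

struct walk_result {
	unsigned long pages;	/* total pages visited */
	unsigned long batches;	/* number of pagevec lookups */
};

/*
 * Model of "while (index <= end) { lookup up to PAGEVEC_SIZE pages }".
 * Each lookup is capped at PAGEVEC_SIZE, but the loop keeps going until
 * the whole range has been covered.
 */
struct walk_result walk_dirty_range(unsigned long index, unsigned long end)
{
	struct walk_result r = { 0, 0 };

	while (index <= end) {
		unsigned long left = end - index;
		unsigned long nr = (left < PAGEVEC_SIZE - 1 ?
				    left : PAGEVEC_SIZE - 1) + 1;

		r.pages += nr;
		r.batches++;
		index += nr;
	}
	return r;
}
```

For a 100-page dirty range this model visits all 100 pages in 8 lookups, so
a per-transaction bound of PAGEVEC_SIZE pages does not follow from the
lookup size alone.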
>
> Signed-off-by: Mingming Cao <[email protected]>
> ---
> fs/ext4/inode.c | 39 ++++++++++++++++++++++++---------------
> 1 file changed, 24 insertions(+), 15 deletions(-)
>
> Index: linux-2.6.27-rc1/fs/ext4/inode.c
> ===================================================================
> --- linux-2.6.27-rc1.orig/fs/ext4/inode.c 2008-08-12 08:15:59.000000000 -0700
> +++ linux-2.6.27-rc1/fs/ext4/inode.c 2008-08-12 08:30:41.000000000 -0700
> @@ -2210,17 +2210,28 @@ static int ext4_da_writepage(struct page
> }
>
> /*
> - * For now just follow the DIO way to estimate the max credits
> - * needed to write out EXT4_MAX_WRITEBACK_PAGES.
> - * todo: need to calculate the max credits need for
> - * extent based files, currently the DIO credits is based on
> - * indirect-blocks mapping way.
> - *
> - * Probably should have a generic way to calculate credits
> - * for DIO, writepages, and truncate
> + * This is called via ext4_da_writepages() to
> + * calulate the total number of credits to reserve to fit
> + * a single extent allocation into a single transaction,
> + * ext4_da_writpeages() will loop calling this before
> + * the block allocation.
> + *
> + * The page vector size limited the max number of pages could
> + * be writeout at a time. Based on this, the max blocks to pass to
> + * get_block is calculated
> */
> -#define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
> -#define EXT4_MAX_WRITEBACK_CREDITS 25
> +
> +#define EXT4_MAX_WRITEPAGES_SIZE PAGEVEC_SIZE
> +static int ext4_writepages_trans_blocks(struct inode *inode)
> +{
> + int bpp = ext4_journal_blocks_per_page(inode);
> + int max_blocks = EXT4_MAX_WRITEPAGES_SIZE * bpp;
> +
> + if (max_blocks > EXT4_I(inode)->i_reserved_data_blocks)
> + max_blocks = EXT4_I(inode)->i_reserved_data_blocks;
Why are we limiting max_blocks to i_reserved_data_blocks ?
> +
> + return ext4_data_trans_blocks(inode, max_blocks);
> +}
>
> static int ext4_da_writepages(struct address_space *mapping,
> struct writeback_control *wbc)
> @@ -2262,7 +2273,7 @@ restart_loop:
> * by delalloc
> */
> BUG_ON(ext4_should_journal_data(inode));
> - needed_blocks = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
> + needed_blocks = ext4_writepages_trans_blocks(inode);
>
The BUG_ON above was added to make sure we update this when we start
supporting data=journal mode with delalloc.
> /* start a new transaction*/
> handle = ext4_journal_start(inode, needed_blocks);
> @@ -4449,11 +4460,9 @@ static int ext4_writeblocks_trans_credit
> * the modification of a single pages into a single transaction,
> * which may include multile chunk of block allocations.
> *
> - * This could be called via ext4_write_begin() or later
> - * ext4_da_writepages() in delalyed allocation case.
> + * This could be called via ext4_write_begin()
> *
> - * In both case it's possible that we could allocating multiple
> - * chunks of blocks. We need to consider the worse case, when
> + * We need to consider the worse case, when
> * one new block per extent.
> */
> int ext4_writepage_trans_blocks(struct inode *inode)
>
>
On Wed, Aug 13, 2008 at 02:23:07PM +0530, Aneesh Kumar K.V wrote:
> >
> > if (create && !handle) {
> > /* Direct IO write... */
> > if (max_blocks > DIO_MAX_BLOCKS)
> > max_blocks = DIO_MAX_BLOCKS;
> > - handle = ext4_journal_start(inode, DIO_CREDITS +
> > - 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
> > + dio_credits = ext4_data_trans_blocks(inode, max_blocks);
> > + handle = ext4_journal_start(inode, dio_credits);
>
> Even in data=journal mode directIO will put the buffer_heads to journal
> right ? . So should we use ext4_data_trans_blocks here ?
>
That should read:
Even in data=journal mode, direct IO will NOT put the buffer_heads to the journal.
>
>
> > if (IS_ERR(handle)) {
> > ret = PTR_ERR(handle);
> > goto out;
> > @@ -2222,7 +2214,7 @@ static int ext4_da_writepage(struct page
> > * for DIO, writepages, and truncate
> > */
> > #define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
> > -#define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS
> > +#define EXT4_MAX_WRITEBACK_CREDITS 25
> >
> > static int ext4_da_writepages(struct address_space *mapping,
> > struct writeback_control *wbc)
> > @@ -4429,7 +4421,8 @@ static int ext4_writeblocks_trans_credit
-aneesh
On Tue, Aug 12, 2008 at 09:25:54AM -0700, Mingming Cao wrote:
> +int ext4_meta_trans_blocks(struct inode* inode, int nrblocks, int idxblocks)
> +{
> + int groups, gdpblocks;
> + int ret = 0;
> +
> + groups = nrblocks + idxblocks;
> + gdpblocks = groups;
> + if (groups > EXT4_SB(inode->i_sb)->s_groups_count)
> + groups = EXT4_SB(inode->i_sb)->s_groups_count;
> + if (groups > EXT4_SB(inode->i_sb)->s_gdb_count)
> + gdpblocks = EXT4_SB(inode->i_sb)->s_gdb_count;
I guess we should pass chunk to ext4_meta_trans_blocks as well.
If chunk is set, we should update only one group descriptor and one
bitmap block, i.e.
if (chunk)
groups = 1 + idxblocks;
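A minimal userspace sketch of how the suggestion above could look, assuming
a chunk flag is threaded through. struct fs_params and meta_trans_blocks are
hypothetical stand-ins for the superblock fields and for
ext4_meta_trans_blocks(); the remaining arithmetic follows the quoted patch,
except that gdpblocks is clamped directly.

```c
#include <assert.h>

/* Hypothetical stand-in for the values the patch reads from the sb/inode. */
struct fs_params {
	int groups_count;	/* s_groups_count */
	int gdb_count;		/* s_gdb_count */
	int journal_data;	/* nonzero in data=journal mode */
	int meta_blocks;	/* EXT4_META_TRANS_BLOCKS(sb) */
};

int meta_trans_blocks(const struct fs_params *fs, int nrblocks,
		      int idxblocks, int chunk)
{
	/* One contiguous chunk touches one group for the data blocks. */
	int groups = chunk ? 1 + idxblocks : nrblocks + idxblocks;
	int gdpblocks = groups;
	int ret;

	if (groups > fs->groups_count)
		groups = fs->groups_count;
	if (gdpblocks > fs->gdb_count)
		gdpblocks = fs->gdb_count;

	/* bitmaps + group descriptor blocks + index blocks */
	ret = groups + gdpblocks + idxblocks;

	/* journalled mode also logs the data blocks themselves */
	if (fs->journal_data)
		ret += nrblocks;

	return ret + fs->meta_blocks;
}
```

With illustrative values (64 groups, 2 descriptor blocks, 6 meta blocks,
100 data blocks, 4 index blocks), the chunk case needs 17 credits and the
non-chunk case 76.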
> +
> + /* bitmaps and block group descriptor blocks */
> + ret += groups + gdpblocks;
> +
> + ret += idxblocks;
> +
> + /* journalled mode, include buffer to modify data blocks */
> + if (ext4_should_journal_data(inode))
> + ret += nrblocks;
> +
> + /* Blocks for super block, inode, quota and xattr blocks */
> + ret += EXT4_META_TRANS_BLOCKS(inode->i_sb);
> +
> + return ret;
> +}
> +
....
...
> +static int ext4_writeblocks_trans_credits_old(struct inode *inode, int nrblocks,
> + int chunk)
> {
> - int bpp = ext4_journal_blocks_per_page(inode);
> - int indirects = (EXT4_NDIR_BLOCKS % bpp) ? 5 : 3;
> + int indirects;
> int ret;
>
> - if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
> - return ext4_ext_writepage_trans_blocks(inode, bpp);
> -
> - if (ext4_should_journal_data(inode))
> - ret = 3 * (bpp + indirects) + 2;
> - else
> - ret = 2 * (bpp + indirects) + 2;
> + /*
> + * How many index blocks need to touch to modify nrblocks?
> + * The "Chunk" flag indicating whether the nrblocks is
> + * physically contigous on disk
> + *
> + * For Direct IO and fallocate, they calls get_block to allocate
> + * one single extent at a time, so they could set the "Chunk" flag
> + */
> + indirects = ext4_indirect_trans_blocks(inode, nrblocks, chunk);
>
> -#ifdef CONFIG_QUOTA
> - /* We know that structure was already allocated during DQUOT_INIT so
> - * we will be updating only the data blocks + inodes */
> - ret += 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
> -#endif
> + /* Account for block group bitmaps and block groups
> + * descriptors.Worse case, the nrblocks+indirects blocks spread
> + * over different block groups
> + */
> + ret = ext4_meta_trans_blocks(inode, nrblocks, indirects);
>
> return ret;
> }
-aneesh
On Wed, 2008-08-13 at 14:07 +0530, Aneesh Kumar K.V wrote:
> On Tue, Aug 12, 2008 at 09:27:53AM -0700, Mingming Cao wrote:
>
> ....
> ....
>
> >
> > /*
> > - * ext4_ext_calc_credits_for_insert:
> > - * This routine returns max. credits that the extent tree can consume.
> > - * It should be OK for low-performance paths like ->writepage()
> > - * To allow many writing processes to fit into a single transaction,
> > - * the caller should calculate credits under i_data_sem and
> > - * pass the actual path.
> > + * ext4_ext_calc_credits_for_single_extent:
> > + * This routine returns max. credits that needed to insert an extent
> > + * to the extent tree.
> > + * When pass the actual path, the caller should calculate credits
> > + * under i_data_sem.
> > */
> > -int ext4_ext_calc_credits_for_insert(struct inode *inode,
> > +int ext4_ext_calc_credits_for_single_extent(struct inode *inode, int num,
> > struct ext4_ext_path *path)
> > {
> > - int depth, needed;
> > + int depth = ext_depth(inode);
> >
> > if (path) {
> > /* probably there is space in leaf? */
> > - depth = ext_depth(inode);
> > if (le16_to_cpu(path[depth].p_hdr->eh_entries)
> > < le16_to_cpu(path[depth].p_hdr->eh_max))
> > - return 1;
> > + return 2 + EXT4_META_TRANS_BLOCKS(inode->i_sb);
> > }
> >
> > - /*
> > - * given 32-bit logical block (4294967296 blocks), max. tree
> > - * can be 4 levels in depth -- 4 * 340^4 == 53453440000.
> > - * Let's also add one more level for imbalance.
> > - */
> > - depth = 5;
> > -
> > - /* allocation of new data block(s) */
> > - needed = 2;
> > + return ext4_ext_writepage_trans_blocks(inode, num, 1);
>
>
> single extent insert should not look at journaling mode or not
> In the above with path specified we don't look at journaling mode
> but ext4_ext_writepage_trans_blocks will do
Are you sure about this?
ext4_ext_calc_credits_for_single_extent is actually only called from
migration and defrag with this set of changes now. In
data=journal mode, since we move data around (for migration and
defrag), I think we still need to log the data buffers. I might be
overcautious.
Mingming
On Wed, 2008-08-13 at 14:01 +0530, Aneesh Kumar K.V wrote:
> On Tue, Aug 12, 2008 at 09:25:54AM -0700, Mingming Cao wrote:
>
> ....
>
>
> > - * and dindirect block, and the "5" above becomes "3".
> > + * Also account for superblock, inode, quota and xattr blocks
> > + */
> > +int ext4_meta_trans_blocks(struct inode* inode, int nrblocks, int idxblocks)
> > +{
> > + int groups, gdpblocks;
> > + int ret = 0;
> > +
> > + groups = nrblocks + idxblocks;
> > + gdpblocks = groups;
> > + if (groups > EXT4_SB(inode->i_sb)->s_groups_count)
> > + groups = EXT4_SB(inode->i_sb)->s_groups_count;
> > + if (groups > EXT4_SB(inode->i_sb)->s_gdb_count)
> > + gdpblocks = EXT4_SB(inode->i_sb)->s_gdb_count;
> > +
> > + /* bitmaps and block group descriptor blocks */
> > + ret += groups + gdpblocks;
>
>
> ret = groups + gdpblocks;
>
> > +
> > + ret += idxblocks;
> > +
> > + /* journalled mode, include buffer to modify data blocks */
> > + if (ext4_should_journal_data(inode))
> > + ret += nrblocks;
> > +
> > + /* Blocks for super block, inode, quota and xattr blocks */
> > + ret += EXT4_META_TRANS_BLOCKS(inode->i_sb);
> > +
> > + return ret;
> > +}
> > +
> > +static int ext4_indirect_trans_blocks(struct inode *inode, int nrblocks,
> > + int chunk)
> > +{
>
> .....
> .....
>
> >
> > +/*
> > + * Define the number of metadata blocks we need to account to modify data.
> > + *
> > + * This include super block, inode block, quota blocks and xattr blocks
> > + */
> > +#define EXT4_META_TRANS_BLOCKS(sb) (EXT4_XATTR_TRANS_BLOCKS + \
> > + 2*EXT4_QUOTA_TRANS_BLOCKS(sb))
> > +
>
> Do EXT4_XATTR_TRANS_BLOCKS blocks really account for super block
> and inode block ?
Yes, here is the definition:
/* Extended attribute operations touch at most two data buffers,
* two bitmap buffers, and two group summaries, in addition to the inode
* and the superblock, which are already accounted for. */
#define EXT4_XATTR_TRANS_BLOCKS 6U
> Looking at EXT4_DATA_TRANS_BLOCKS which also use
> EXT4_XATTR_TRANS_BLOCKS I think it doesn't.
It's pretty clear from the comment.
/* Define the minimum size for a transaction which modifies data. This
* needs to take into account the fact that we may end up modifying two
* quota files too (one for the group, one for the user quota). The
* superblock only gets updated once, of course, so don't bother
* counting that again for the quota updates. */
#define EXT4_DATA_TRANS_BLOCKS(sb) (EXT4_SINGLEDATA_TRANS_BLOCKS(sb) + \
EXT4_XATTR_TRANS_BLOCKS - 2 + \
2*EXT4_QUOTA_TRANS_BLOCKS(sb))
EXT4_DATA_TRANS_BLOCKS has the "- 2" precisely to remove the
double-accounted superblock and inode block.
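
As a worked example of that accounting, here is a small stand-alone sketch. Only the value 6 for EXT4_XATTR_TRANS_BLOCKS comes from the header quoted above; the single-data and quota terms are passed in as parameters because their values are assumptions here, and the function names are hypothetical, not kernel symbols.

```c
#include <assert.h>

/* Value taken from the EXT4_XATTR_TRANS_BLOCKS define quoted above. */
#define XATTR_TRANS_BLOCKS 6

/*
 * Mirrors the shape of EXT4_DATA_TRANS_BLOCKS(sb). The "- 2" removes the
 * superblock and inode block so they are not counted twice.
 * singledata and quota are assumed inputs, not values read from a real sb.
 */
static int data_trans_blocks(int singledata, int quota)
{
	assert(singledata >= 0 && quota >= 0);
	return singledata + XATTR_TRANS_BLOCKS - 2 + 2 * quota;
}

/* Mirrors the shape of the proposed EXT4_META_TRANS_BLOCKS(sb). */
static int meta_trans_blocks(int quota)
{
	assert(quota >= 0);
	return XATTR_TRANS_BLOCKS + 2 * quota;
}
```

With illustrative inputs singledata = 8 and quota = 2, data_trans_blocks() gives 8 + 6 - 2 + 4 = 16 and meta_trans_blocks() gives 6 + 4 = 10.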
Mingming
On Wed, 2008-08-13 at 14:23 +0530, Aneesh Kumar K.V wrote:
> On Tue, Aug 12, 2008 at 09:29:50AM -0700, Mingming Cao wrote:
> >
> ......
> ....
>
> >
> > ===================================================================
> > Index: linux-2.6.27-rc1/fs/ext4/extents.c
> > ===================================================================
> > --- linux-2.6.27-rc1.orig/fs/ext4/extents.c 2008-08-11 22:25:39.000000000 -0700
> > +++ linux-2.6.27-rc1/fs/ext4/extents.c 2008-08-11 22:25:55.000000000 -0700
> > @@ -2799,7 +2799,7 @@ void ext4_ext_truncate(struct inode *ino
> > /*
> > * probably first extent we're gonna free will be last in block
> > */
> > - err = ext4_writepage_trans_blocks(inode) + 3;
> > + err = ext4_writepage_trans_blocks(inode);
> > handle = ext4_journal_start(inode, err);
> > if (IS_ERR(handle))
> > return;
> > @@ -2951,10 +2951,9 @@ long ext4_fallocate(struct inode *inode,
> > max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > - block;
> > /*
> > - * credits to insert 1 extent into extent tree + buffers to be able to
> > - * modify 1 super block, 1 block bitmap and 1 group descriptor.
> > + * credits to insert 1 extent into extent tree
> > */
> > - credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
> > + credits = ext4_data_trans_blocks(inode, max_blocks);
>
>
> Why do we need to consider data=journaled mode here. We are not writing
> any data here. Instead we are just inserting an extent.
>
Actually, the change here is not meant to support data=journalled.
ext4_data_trans_blocks() is intended to calculate the credits for one
chunk of allocation, as used by DIO and fallocate, regardless of whether
delalloc is enabled.
I agree that we should remove the data-journalling consideration from
ext4_data_trans_blocks().
I now realize that the data=journalled code doesn't work with delalloc
(that is, delalloc da_writepages doesn't support journalled mode, because
of the lock ordering issue). I am not sure whether there is a plan to
support it, or whether journalled mode on delalloc is needed at all. We
still need to keep the data=journalled consideration for
writepage/write_begin, I guess just to help users move from ext3 to ext4.
>
> > mutex_lock(&inode->i_mutex);
> > retry:
> > while (ret >= 0 && ret < max_blocks) {
> > Index: linux-2.6.27-rc1/fs/ext4/inode.c
> > ===================================================================
> > --- linux-2.6.27-rc1.orig/fs/ext4/inode.c 2008-08-11 22:18:31.000000000 -0700
> > +++ linux-2.6.27-rc1/fs/ext4/inode.c 2008-08-11 22:25:55.000000000 -0700
> > @@ -1041,18 +1041,6 @@ static void ext4_da_update_reserve_space
> > spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> > }
> >
> > -/* Maximum number of blocks we map for direct IO at once. */
> > -#define DIO_MAX_BLOCKS 4096
> > -/*
> > - * Number of credits we need for writing DIO_MAX_BLOCKS:
> > - * We need sb + group descriptor + bitmap + inode -> 4
> > - * For B blocks with A block pointers per block we need:
> > - * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect).
> > - * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25.
> > - */
> > -#define DIO_CREDITS 25
> > -
> > -
> > /*
> > * The ext4_get_blocks_wrap() function try to look up the requested blocks,
> > * and returns if the blocks are already mapped.
> > @@ -1164,19 +1152,23 @@ int ext4_get_blocks_wrap(handle_t *handl
> > return retval;
> > }
> >
> > +/* Maximum number of blocks we map for direct IO at once. */
> > +#define DIO_MAX_BLOCKS 4096
> > +
> > static int ext4_get_block(struct inode *inode, sector_t iblock,
> > struct buffer_head *bh_result, int create)
> > {
> > handle_t *handle = ext4_journal_current_handle();
> > int ret = 0, started = 0;
> > unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> > + int dio_credits;
> >
> > if (create && !handle) {
> > /* Direct IO write... */
> > if (max_blocks > DIO_MAX_BLOCKS)
> > max_blocks = DIO_MAX_BLOCKS;
> > - handle = ext4_journal_start(inode, DIO_CREDITS +
> > - 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
> > + dio_credits = ext4_data_trans_blocks(inode, max_blocks);
> > + handle = ext4_journal_start(inode, dio_credits);
>
> Even in data=journal mode directIO will put the buffer_heads to journal
> right ? . So should we use ext4_data_trans_blocks here ?
>
>
>
> > if (IS_ERR(handle)) {
> > ret = PTR_ERR(handle);
> > goto out;
> > @@ -2222,7 +2214,7 @@ static int ext4_da_writepage(struct page
> > * for DIO, writepages, and truncate
> > */
> > #define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
> > -#define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS
> > +#define EXT4_MAX_WRITEBACK_CREDITS 25
> >
> > static int ext4_da_writepages(struct address_space *mapping,
> > struct writeback_control *wbc)
> > @@ -4429,7 +4421,8 @@ static int ext4_writeblocks_trans_credit
> >
> > /*
> >
>
> ....
> ....
>
> -aneesh
On Wed, 2008-08-13 at 15:16 +0530, Aneesh Kumar K.V wrote:
> On Tue, Aug 12, 2008 at 09:35:38AM -0700, Mingming Cao wrote:
> > Ext4: journal credit fix the delalloc writepages
> >
> > From: Mingming Cao <[email protected]>
> >
> > Previous delalloc writepages implementation start a new transaction outside
> > a loop call of get_block() to do the block allocation. Due to lack of information
> > of how many blocks to be allocated, the estimate of the journal credits is very
> > Conservative and caused many issues.
> >
> > With the reworked delayed allocation, a new transaction is created for
> > each get_block(), thus we don't need to guess how many credits for the multiple
> > chunk of allocation. Start every transaction with credits for insert a single
> > extent is enough. But we still need to consider the journalled mode, where
> > it need to account for the number of data blocks. So we guess max number of
> > data blocks for each allocation.
>
>
> But we don't currently support data=journal with delalloc.
>
OK, I realize that.
But even if we only want a single chunk of allocation, we still need to know
how many data blocks to allocate in order to estimate how many credits
are needed for the indirect/index blocks. :(
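
To illustrate why the block count matters, here is a minimal sketch of the indirect-block estimate for non-extent files, following the comments in ext4_indirect_trans_blocks() from this series. The function name and the addr_per_block parameter (standing in for EXT4_ADDR_PER_BLOCK(inode->i_sb)) are hypothetical, and the non-contiguous formula is my reading of the patch comment rather than code copied from it.

```c
#include <assert.h>

/*
 * Estimate how many indirect blocks may be touched when mapping nrblocks.
 * chunk != 0 means the blocks are physically contiguous.
 */
static int indirect_trans_blocks(int nrblocks, int addr_per_block, int chunk)
{
	assert(nrblocks >= 0 && addr_per_block > 0);

	if (chunk) {
		/*
		 * Contiguous: at most nrblocks / addr_per_block indirect
		 * blocks, plus 2 double-indirect and 1 triple-indirect.
		 */
		return nrblocks / addr_per_block + 3;
	}

	/*
	 * Not contiguous, worst case: each block touches its own indirect
	 * block, each of those touches a double-indirect block, plus one
	 * triple-indirect block.
	 */
	return nrblocks * 2 + 1;
}
```

For example, 4096 contiguous blocks with 1024 block pointers per block (4 KB blocks, 4-byte pointers) give 4096 / 1024 + 3 = 7, while just 4 scattered blocks already give 4 * 2 + 1 = 9.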
>
> > Due to the current VFS implementation
> > writepages() could only flush PAGEVEC of pages at a time, the max block
> > allocation is limited and calculated based on that, an the total number
> > of reserved delalloc datablocks, whichever is smaller.
>
> That is not correct. Currently write_cache_pages do
> while (!done && (index <= end) &&
> (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> PAGECACHE_TAG_DIRTY,
> min(end - index,
> (pgoff_t)PAGEVEC_SIZE-1) + 1)))
> {
>
> and mpage_da_submit_io does
> while (index <= end) {
> /* XXX: optimize tail */
> nr_pages = pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE);
>
>
>
> ie we iterate till index > end. So we can very well have more than
> PAGEVEC number of pages in a single transaction.
>
OK, I am glad to see this is not a limit.
> >
> > Signed-off-by: Mingming Cao <[email protected]>
> > ---
> > fs/ext4/inode.c | 39 ++++++++++++++++++++++++---------------
> > 1 file changed, 24 insertions(+), 15 deletions(-)
> >
> > Index: linux-2.6.27-rc1/fs/ext4/inode.c
> > ===================================================================
> > --- linux-2.6.27-rc1.orig/fs/ext4/inode.c 2008-08-12 08:15:59.000000000 -0700
> > +++ linux-2.6.27-rc1/fs/ext4/inode.c 2008-08-12 08:30:41.000000000 -0700
> > @@ -2210,17 +2210,28 @@ static int ext4_da_writepage(struct page
> > }
> >
> > /*
> > - * For now just follow the DIO way to estimate the max credits
> > - * needed to write out EXT4_MAX_WRITEBACK_PAGES.
> > - * todo: need to calculate the max credits need for
> > - * extent based files, currently the DIO credits is based on
> > - * indirect-blocks mapping way.
> > - *
> > - * Probably should have a generic way to calculate credits
> > - * for DIO, writepages, and truncate
> > + * This is called via ext4_da_writepages() to
> > + * calulate the total number of credits to reserve to fit
> > + * a single extent allocation into a single transaction,
> > + * ext4_da_writpeages() will loop calling this before
> > + * the block allocation.
> > + *
> > + * The page vector size limited the max number of pages could
> > + * be writeout at a time. Based on this, the max blocks to pass to
> > + * get_block is calculated
> > */
> > -#define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
> > -#define EXT4_MAX_WRITEBACK_CREDITS 25
> > +
> > +#define EXT4_MAX_WRITEPAGES_SIZE PAGEVEC_SIZE
> > +static int ext4_writepages_trans_blocks(struct inode *inode)
> > +{
> > + int bpp = ext4_journal_blocks_per_page(inode);
> > + int max_blocks = EXT4_MAX_WRITEPAGES_SIZE * bpp;
> > +
> > + if (max_blocks > EXT4_I(inode)->i_reserved_data_blocks)
> > + max_blocks = EXT4_I(inode)->i_reserved_data_blocks;
>
>
>
> Why are we limiting max_blocks to i_reserved_data_blocks ?
>
i_reserved_data_blocks is the total number of "delayed" blocks that still
need block allocation. It is a counter that is incremented at each
write_begin() when block allocation is deferred. That makes it an accurate
indication of the maximum number of allocations needed to flush all dirty
pages of this inode to disk, so it fits well when calculating the credits
for da_writepages.
Now that we don't have the PAGEVEC limit, we could use this counter to bound
the total number of blocks to allocate when estimating the credits. But if
i_reserved_data_blocks gets so large that it can't fit into a single
transaction, a later get_block() will overflow the journal, so we still
need some way to limit the number of pages to flush. :(
>
> > +
> > + return ext4_data_trans_blocks(inode, max_blocks);
> > +}
> >
> > static int ext4_da_writepages(struct address_space *mapping,
> > struct writeback_control *wbc)
> > @@ -2262,7 +2273,7 @@ restart_loop:
> > * by delalloc
> > */
> > BUG_ON(ext4_should_journal_data(inode));
> > - needed_blocks = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
> > + needed_blocks = ext4_writepages_trans_blocks(inode);
> >
>
> The BUG_ON above is added to make sure we update this when start
> supporting data=journal mode with delalloc.
>
>
> > /* start a new transaction*/
> > handle = ext4_journal_start(inode, needed_blocks);
> > @@ -4449,11 +4460,9 @@ static int ext4_writeblocks_trans_credit
> > * the modification of a single pages into a single transaction,
> > * which may include multile chunk of block allocations.
> > *
> > - * This could be called via ext4_write_begin() or later
> > - * ext4_da_writepages() in delalyed allocation case.
> > + * This could be called via ext4_write_begin()
> > *
> > - * In both case it's possible that we could allocating multiple
> > - * chunks of blocks. We need to consider the worse case, when
> > + * We need to consider the worse case, when
> > * one new block per extent.
> > */
> > int ext4_writepage_trans_blocks(struct inode *inode)
> >
> >
On Wed, 2008-08-13 at 15:49 +0530, Aneesh Kumar K.V wrote:
> On Tue, Aug 12, 2008 at 09:25:54AM -0700, Mingming Cao wrote:
> > +int ext4_meta_trans_blocks(struct inode* inode, int nrblocks, int idxblocks)
> > +{
> > + int groups, gdpblocks;
> > + int ret = 0;
> > +
> > + groups = nrblocks + idxblocks;
> > + gdpblocks = groups;
> > + if (groups > EXT4_SB(inode->i_sb)->s_groups_count)
> > + groups = EXT4_SB(inode->i_sb)->s_groups_count;
> > + if (groups > EXT4_SB(inode->i_sb)->s_gdb_count)
> > + gdpblocks = EXT4_SB(inode->i_sb)->s_gdb_count;
>
>
> I guess we should pass chunk to ext4_meta_trans_blocks also.
> if (chunk) we should update only one group descriptor and one
> bitmap block ie,
> if(chunk)
> groups = 1 + idxblocks;
>
>
Agreed. I will update the patch to do that.
Mingming
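
A rough sketch of what the chunk-aware calculation could look like, written as a stand-alone function so the arithmetic is easy to check. The function name and the plain integer parameters (groups_count, gdb_count, meta_blocks standing in for s_groups_count, s_gdb_count and EXT4_META_TRANS_BLOCKS) are hypothetical. Treating a contiguous chunk as touching a single group is the suggestion above, not a settled rule. The sketch also clamps gdpblocks directly against gdb_count, whereas the patch compares the already-clamped groups value.

```c
#include <assert.h>

/*
 * Estimate metadata credits for modifying nrblocks data blocks and
 * idxblocks index/indirect blocks.
 */
static int meta_trans_credits(int nrblocks, int idxblocks, int chunk,
			      int groups_count, int gdb_count, int meta_blocks)
{
	int groups, gdpblocks;

	assert(nrblocks >= 0 && idxblocks >= 0);

	/* One group for a contiguous chunk, else one per data block. */
	groups = (chunk ? 1 : nrblocks) + idxblocks;
	gdpblocks = groups;

	/* Cannot touch more bitmaps than there are groups ... */
	if (groups > groups_count)
		groups = groups_count;
	/* ... or more descriptor blocks than the filesystem has. */
	if (gdpblocks > gdb_count)
		gdpblocks = gdb_count;

	/* bitmaps + group descriptors + index blocks + sb/inode/quota/xattr */
	return groups + gdpblocks + idxblocks + meta_blocks;
}
```

With 4096 blocks, 2 index blocks, 100 groups, 4 descriptor blocks and 10 fixed metadata blocks (all illustrative numbers), the chunk case needs 3 + 3 + 2 + 10 = 18 credits, while the non-chunk case is clamped to 100 + 4 + 2 + 10 = 116.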
>
> > +
> > + /* bitmaps and block group descriptor blocks */
> > + ret += groups + gdpblocks;
> > +
> > + ret += idxblocks;
> > +
> > + /* journalled mode, include buffer to modify data blocks */
> > + if (ext4_should_journal_data(inode))
> > + ret += nrblocks;
> > +
> > + /* Blocks for super block, inode, quota and xattr blocks */
> > + ret += EXT4_META_TRANS_BLOCKS(inode->i_sb);
> > +
> > + return ret;
> > +}
> > +
> ....
> ...
>
> > +static int ext4_writeblocks_trans_credits_old(struct inode *inode, int nrblocks,
> > + int chunk)
> > {
> > - int bpp = ext4_journal_blocks_per_page(inode);
> > - int indirects = (EXT4_NDIR_BLOCKS % bpp) ? 5 : 3;
> > + int indirects;
> > int ret;
> >
> > - if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
> > - return ext4_ext_writepage_trans_blocks(inode, bpp);
> > -
> > - if (ext4_should_journal_data(inode))
> > - ret = 3 * (bpp + indirects) + 2;
> > - else
> > - ret = 2 * (bpp + indirects) + 2;
> > + /*
> > + * How many index blocks need to touch to modify nrblocks?
> > + * The "Chunk" flag indicating whether the nrblocks is
> > + * physically contigous on disk
> > + *
> > + * For Direct IO and fallocate, they calls get_block to allocate
> > + * one single extent at a time, so they could set the "Chunk" flag
> > + */
> > + indirects = ext4_indirect_trans_blocks(inode, nrblocks, chunk);
> >
> > -#ifdef CONFIG_QUOTA
> > - /* We know that structure was already allocated during DQUOT_INIT so
> > - * we will be updating only the data blocks + inodes */
> > - ret += 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
> > -#endif
> > + /* Account for block group bitmaps and block groups
> > + * descriptors.Worse case, the nrblocks+indirects blocks spread
> > + * over different block groups
> > + */
> > + ret = ext4_meta_trans_blocks(inode, nrblocks, indirects);
> >
> > return ret;
> > }
>
> -aneesh
On Wed, Aug 13, 2008 at 05:26:58PM -0700, Mingming Cao wrote:
>
> On Wed, 2008-08-13 at 14:07 +0530, Aneesh Kumar K.V wrote:
> > On Tue, Aug 12, 2008 at 09:27:53AM -0700, Mingming Cao wrote:
> >
> > ....
> > ....
> >
> > >
> > > /*
> > > - * ext4_ext_calc_credits_for_insert:
> > > - * This routine returns max. credits that the extent tree can consume.
> > > - * It should be OK for low-performance paths like ->writepage()
> > > - * To allow many writing processes to fit into a single transaction,
> > > - * the caller should calculate credits under i_data_sem and
> > > - * pass the actual path.
> > > + * ext4_ext_calc_credits_for_single_extent:
> > > + * This routine returns max. credits that needed to insert an extent
> > > + * to the extent tree.
> > > + * When pass the actual path, the caller should calculate credits
> > > + * under i_data_sem.
> > > */
> > > -int ext4_ext_calc_credits_for_insert(struct inode *inode,
> > > +int ext4_ext_calc_credits_for_single_extent(struct inode *inode, int num,
> > > struct ext4_ext_path *path)
> > > {
> > > - int depth, needed;
> > > + int depth = ext_depth(inode);
> > >
> > > if (path) {
> > > /* probably there is space in leaf? */
> > > - depth = ext_depth(inode);
> > > if (le16_to_cpu(path[depth].p_hdr->eh_entries)
> > > < le16_to_cpu(path[depth].p_hdr->eh_max))
> > > - return 1;
> > > + return 2 + EXT4_META_TRANS_BLOCKS(inode->i_sb);
> > > }
> > >
> > > - /*
> > > - * given 32-bit logical block (4294967296 blocks), max. tree
> > > - * can be 4 levels in depth -- 4 * 340^4 == 53453440000.
> > > - * Let's also add one more level for imbalance.
> > > - */
> > > - depth = 5;
> > > -
> > > - /* allocation of new data block(s) */
> > > - needed = 2;
> > > + return ext4_ext_writepage_trans_blocks(inode, num, 1);
> >
> >
> > single extent insert should not look at journaling mode or not
> > In the above with path specified we don't look at journaling mode
> > but ext4_ext_writepage_trans_blocks will do
>
> Are you sure about this?
>
> ext4_ext_calc_credits_for_single_extent is actually only called from
> migration and defrag with this set of changes now. In the
> data=journalling mode, since we move data around (for migration and
> defrag), I think we still need to log the data buffers. I might be
> over-causous.
>
Migrate doesn't touch the data buffers at all. It builds new extents
and inserts them into the tree, so we don't need to reserve credits for
data blocks with migrate.
But the code is still wrong, because we do
if (path) {
.....
.....
return 2 + EXT4_META_TRANS_BLOCKS(inode->i_sb);
where we don't account for the data blocks needed in journal mode,
and
return ext4_ext_writepage_trans_blocks(inode, num, 1);
where we do account for data blocks.
Also, ext4_ext_calc_credits_for_single_extent was not accounting for
data blocks earlier, so I would assume the change in behavior is not
needed.
-aneesh
On Wed, Aug 13, 2008 at 06:01:10PM -0700, Mingming Cao wrote:
....
....
> > > +static int ext4_writepages_trans_blocks(struct inode *inode)
> > > +{
> > > + int bpp = ext4_journal_blocks_per_page(inode);
> > > + int max_blocks = EXT4_MAX_WRITEPAGES_SIZE * bpp;
> > > +
> > > + if (max_blocks > EXT4_I(inode)->i_reserved_data_blocks)
> > > + max_blocks = EXT4_I(inode)->i_reserved_data_blocks;
> >
> >
> >
> > Why are we limiting max_blocks to i_reserved_data_blocks ?
> >
>
> i_reserved_data_blocks is the total number of "delayed" blocks that need
> block allocation. That's a counter being adds up at each write_begin()
> when the block allocation is defered. That's a accurate counter to
> indicate the max number of allocation we need to flush all dirty pages
> to disk for this inode, fits well when we need to calculate the credits
> for da_writepages.
>
> Now that we don't have PAGEVEC limit, we could use this to limit the
> total number of blocks to allocate when estimate the credit. But if
> this i_reserved_data_blocks gets too large, that can't fit into one
> single transaction, later get_block() will overflow the journal, we need
> someway to limit the number of pages to flush still:(
We should request the credits needed for
EXT4_I(inode)->i_reserved_data_blocks with chunk = 1. If we can't get
that many credits, we can limit max_blocks to EXT4_BLOCKS_PER_GROUP:
if (EXT4_I(inode)->i_reserved_data_blocks > EXT4_BLOCKS_PER_GROUP(inode->i_sb))
max_blocks = EXT4_BLOCKS_PER_GROUP(inode->i_sb);
else
max_blocks = EXT4_I(inode)->i_reserved_data_blocks;
>
> >
> > > +
-aneesh
On Tue, Aug 12, 2008 at 09:23:10AM -0700, Mingming Cao wrote:
> This is a rework of journal credits fix patch in the ext4 patch queue.
> The patch series contains
>
> - patch 1: helper funtions for journal credits calculation and fix the
> writepage/write_begin on nonextent files
> - patch 2: journal credit fix wirtepahe/write_begin for extents files,
> and migration
> - patch 3: credit fix for dio, fallocate
> -patch 4: rebase ext4_da_writepages_rework patch
> - patch 5: credit fix for delalloc writepages
> -patch 6: credit fix for defrag
>
>
I see this patch has been pushed to the patch queue. Below is my review in
patch form.
commit 679a6fa1de0bc67c0a444b748696a2f2c22428c7
Author: Aneesh Kumar K.V <[email protected]>
Date: Fri Aug 15 22:13:07 2008 +0530
cleanup
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 258cd1a..164c988 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1150,7 +1150,8 @@ extern void ext4_set_inode_flags(struct inode *);
extern void ext4_get_inode_flags(struct ext4_inode_info *);
extern void ext4_set_aops(struct inode *inode);
extern int ext4_writepage_trans_blocks(struct inode *);
-extern int ext4_meta_trans_blocks(struct inode *, int nrblocks, int idxblocks);
+extern int ext4_meta_trans_blocks(struct inode *, int nrblocks,
+ int idxblocks, bool chunk);
extern int ext4_data_trans_blocks(struct inode *, int nrblocks);
extern int ext4_block_truncate_page(handle_t *handle,
struct address_space *mapping, loff_t from);
@@ -1316,7 +1317,7 @@ extern const struct inode_operations ext4_fast_symlink_inode_operations;
/* extents.c */
extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
-extern int ext4_ext_writepage_trans_blocks(struct inode *, int num, int chunk);
+extern int ext4_ext_writeblock_trans_credits(struct inode *, int num, bool chunk);
extern int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ext4_lblk_t iblock,
unsigned long max_blocks, struct buffer_head *bh_result,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 2fbca0f..b38cc56 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1893,7 +1893,7 @@ static int ext4_ext_rm_idx(handle_t *handle, struct inode *inode,
* When pass the actual path, the caller should calculate credits
* under i_data_sem.
*/
-int ext4_ext_calc_credits_for_single_extent(struct inode *inode, int num,
+int ext4_ext_calc_credits_for_single_extent(struct inode *inode, int nrblocks,
struct ext4_ext_path *path)
{
int depth = ext_depth(inode);
@@ -1905,7 +1905,7 @@ int ext4_ext_calc_credits_for_single_extent(struct inode *inode, int num,
return 2 + EXT4_META_TRANS_BLOCKS(inode->i_sb);
}
- return ext4_ext_writepage_trans_blocks(inode, num, 1);
+ return ext4_ext_writeblock_trans_credits(inode, nrblocks, 1);
}
/*
@@ -1919,7 +1919,7 @@ int ext4_ext_calc_credits_for_single_extent(struct inode *inode, int num,
* If the nrblocks are discontigous, they could cause
* the whole tree split more than once, but this is really rare.
*/
-static int ext4_ext_index_trans_blocks(struct inode *inode, int num, int chunk)
+static int ext4_ext_index_trans_blocks(struct inode *inode, int num, bool chunk)
{
int index;
int depth = ext_depth(inode);
@@ -2993,7 +2993,7 @@ void ext4_ext_truncate(struct inode *inode)
}
/*
- * ext4_ext_writepage_trans_blocks:
+ * ext4_ext_writeblock_trans_credits:
* calculate max number of blocks we could modify
* in order to allocate nrblocks of blocks.
*
@@ -3005,8 +3005,11 @@ void ext4_ext_truncate(struct inode *inode)
* see how many bitmapblocks and block group descriptor groups need to accounted
* At last adds up the superblock, inode, quotao and xattr blocks. These
* all take care of in ext4_meta_trans_blocks()
+ *
+ * This doesn't account for the blocks needed for journaled
+ * data blocks.
*/
-int ext4_ext_writepage_trans_blocks(struct inode *inode, int num, int chunk)
+int ext4_ext_writeblock_trans_credits(struct inode *inode, int nrblocks, bool chunk)
{
int needed;
int index_blocks;
@@ -3016,17 +3019,14 @@ int ext4_ext_writepage_trans_blocks(struct inode *inode, int num, int chunk)
* insert a single extent with num blocks(chunk == 1)
* or @num extents (chunk ==0)
*/
- index_blocks = ext4_ext_index_trans_blocks(inode, num, chunk);
+ index_blocks = ext4_ext_index_trans_blocks(inode, nrblocks, chunk);
/* How many metadat blocks need to modify to modify the @num
* of data blocks and index_blocks? Include, index/leaf blocks,
* bitmaps,block group descriptor block for modifying both data
* and index/leaf blocks, superblock, inode, quota and xattrs
*/
- needed = ext4_meta_trans_blocks(inode, num, index_blocks);
-
- if (ext4_should_journal_data(inode))
- needed += num;
+ needed = ext4_meta_trans_blocks(inode, nrblocks, index_blocks, chunk);
return needed;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0b34998..a843cd3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1568,9 +1568,9 @@ static void ext4_da_release_space(struct inode *inode, int to_free)
* but since this function is called from invalidate
* page, it's harmless to return without any action
*/
- printk(KERN_INFO "ext4 delalloc try to release %d reserved"
- "blocks for inode %lu, but there is no reserved"
- "data blocks\n", inode->i_ino, to_free);
+ printk(KERN_INFO "ext4 delalloc try to release %d reserved "
+ "blocks for inode %lu, but there is no reserved "
+ "data blocks\n", to_free, inode->i_ino);
spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
return;
}
@@ -2303,13 +2303,13 @@ static int ext4_da_writepage(struct page *page,
* get_block is calculated
*/
-#define EXT4_MAX_WRITEPAGES_SIZE PAGEVEC_SIZE
-static int ext4_writepages_trans_blocks(struct inode *inode)
+static int ext4_da_writepages_trans_blocks(struct inode *inode)
{
- int bpp = ext4_journal_blocks_per_page(inode);
- int max_blocks = EXT4_MAX_WRITEPAGES_SIZE * bpp;
-
- if (max_blocks > EXT4_I(inode)->i_reserved_data_blocks)
+ int max_blocks;
+ if (EXT4_I(inode)->i_reserved_data_blocks >
+ EXT4_BLOCKS_PER_GROUP(inode->i_sb))
+ max_blocks = EXT4_BLOCKS_PER_GROUP(inode->i_sb);
+ else
max_blocks = EXT4_I(inode)->i_reserved_data_blocks;
return ext4_data_trans_blocks(inode, max_blocks);
@@ -2364,7 +2364,7 @@ static int ext4_da_writepages(struct address_space *mapping,
* by delalloc
*/
BUG_ON(ext4_should_journal_data(inode));
- needed_blocks = ext4_writepages_trans_blocks(inode);
+ needed_blocks = ext4_da_writepages_trans_blocks(inode);
/* start a new transaction*/
handle = ext4_journal_start(inode, needed_blocks);
@@ -4459,13 +4459,19 @@ int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
* over different block groups
*
* Also account for superblock, inode, quota and xattr blocks
+ * This doesn't account for the blocks needed for journaled
+ * data blocks.
*/
-int ext4_meta_trans_blocks(struct inode* inode, int nrblocks, int idxblocks)
+int ext4_meta_trans_blocks(struct inode* inode,
+ int nrblocks, int idxblocks, bool chunk)
{
int groups, gdpblocks;
int ret = 0;
- groups = nrblocks + idxblocks;
+ if (chunk)
+ groups = 1 + idxblocks;
+ else
+ groups = nrblocks + idxblocks;
gdpblocks = groups;
if (groups > EXT4_SB(inode->i_sb)->s_groups_count)
groups = EXT4_SB(inode->i_sb)->s_groups_count;
@@ -4473,14 +4479,10 @@ int ext4_meta_trans_blocks(struct inode* inode, int nrblocks, int idxblocks)
gdpblocks = EXT4_SB(inode->i_sb)->s_gdb_count;
/* bitmaps and block group descriptor blocks */
- ret += groups + gdpblocks;
+ ret = groups + gdpblocks;
ret += idxblocks;
- /* journalled mode, include buffer to modify data blocks */
- if (ext4_should_journal_data(inode))
- ret += nrblocks;
-
/* Blocks for super block, inode, quota and xattr blocks */
ret += EXT4_META_TRANS_BLOCKS(inode->i_sb);
@@ -4488,14 +4490,14 @@ int ext4_meta_trans_blocks(struct inode* inode, int nrblocks, int idxblocks)
}
static int ext4_indirect_trans_blocks(struct inode *inode, int nrblocks,
- int chunk)
+ bool chunk)
{
int indirects;
- /* if nrblocks are contigous */
+ /* if nrblocks are contiguous */
if (chunk) {
/*
- * With N contigous data blocks, it need at most
+ * With N contiguous data blocks, it need at most
* N/EXT4_ADDR_PER_BLOCK(inode->i_sb) indirect blocks
* 2 dindirect blocks
* 1 tindirect block
@@ -4504,7 +4506,7 @@ static int ext4_indirect_trans_blocks(struct inode *inode, int nrblocks,
return indirects + 3;
}
/*
- * if nrblocks are not contigous, worse case, each block touch
+ * if nrblocks are not contiguous, worse case, each block touch
* a indirect block, and each indirect block touch a double indirect
* block, plus a triple indirect block
*/
@@ -4512,18 +4514,21 @@ static int ext4_indirect_trans_blocks(struct inode *inode, int nrblocks,
return indirects;
}
/*
- * How many journal blocks are need to modify N blocks contigous data()?
+ * How many journal blocks are need to modify N blocks contiguous data()?
*
* It need to account indirect blocks, data blocks, and
* bitmap blocks and group descriptor blocks.
*
* This still overestimates under most circumstances. If we were to pass the
* start and end offsets in here as well we could do block_to_path() on each
- * block and work out the exact number of indirects which are touched. Pah.
+ * block and work out the exact number of indirects which are touched.
+ *
+ * This doesn't account for the blocks needed for journaled
+ * data blocks.
*/
static int ext4_writeblocks_trans_credits_old(struct inode *inode, int nrblocks,
- int chunk)
+ bool chunk)
{
int indirects;
int ret;
@@ -4531,7 +4536,7 @@ static int ext4_writeblocks_trans_credits_old(struct inode *inode, int nrblocks,
/*
* How many index blocks need to touch to modify nrblocks?
* The "Chunk" flag indicating whether the nrblocks is
- * physically contigous on disk
+ * physically contiguous on disk
*
* For Direct IO and fallocate, they calls get_block to allocate
* one single extent at a time, so they could set the "Chunk" flag
@@ -4542,7 +4547,7 @@ static int ext4_writeblocks_trans_credits_old(struct inode *inode, int nrblocks,
* descriptors.Worse case, the nrblocks+indirects blocks spread
* over different block groups
*/
- ret = ext4_meta_trans_blocks(inode, nrblocks, indirects);
+ ret = ext4_meta_trans_blocks(inode, nrblocks, indirects, chunk);
return ret;
}
@@ -4559,11 +4564,18 @@ static int ext4_writeblocks_trans_credits_old(struct inode *inode, int nrblocks,
*/
int ext4_writepage_trans_blocks(struct inode *inode)
{
+ int ret;
int bpp = ext4_journal_blocks_per_page(inode);
if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
- return ext4_writeblocks_trans_credits_old(inode, bpp, 0);
- return ext4_ext_writepage_trans_blocks(inode, bpp, 0);
+ ret = ext4_writeblocks_trans_credits_old(inode, bpp, 0);
+ else
+ ret = ext4_ext_writeblock_trans_credits(inode, bpp, 0);
+
+ /* journalled mode, include buffer to modify data blocks */
+ if (ext4_should_journal_data(inode))
+ ret += bpp;
+ return ret;
}
/*
@@ -4575,13 +4586,16 @@ int ext4_writepage_trans_blocks(struct inode *inode)
* chunk of allocation needs.
*
* This is called from DIO, fallocate or whoever calling
- * ext4_get_blocks_wrap() to map/allocate a chunk of contigous disk blocks
+ * ext4_get_blocks_wrap() to map/allocate a chunk of contiguous disk blocks
+ *
+ * This doesn't account for the blocks needed for journaled
+ * data blocks.
*/
int ext4_data_trans_blocks(struct inode *inode, int nrblocks)
{
if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
return ext4_writeblocks_trans_credits_old(inode, nrblocks, 1);
- return ext4_ext_writepage_trans_blocks(inode, nrblocks, 1);
+ return ext4_ext_writeblock_trans_credits(inode, nrblocks, 1);
}
/*
On Fri, 2008-08-15 at 23:03 +0530, Aneesh Kumar K.V wrote:
> On Tue, Aug 12, 2008 at 09:23:10AM -0700, Mingming Cao wrote:
> > This is a rework of journal credits fix patch in the ext4 patch queue.
> > The patch series contains
> >
> > - patch 1: helper funtions for journal credits calculation and fix the
> > writepage/write_begin on nonextent files
> > - patch 2: journal credit fix wirtepahe/write_begin for extents files,
> > and migration
> > - patch 3: credit fix for dio, fallocate
> > -patch 4: rebase ext4_da_writepages_rework patch
> > - patch 5: credit fix for delalloc writepages
> > -patch 6: credit fix for defrag
> >
> >
>
> I see this patch is pushed to the patch queue. Below is the review in
> patch form.
>
> commit 679a6fa1de0bc67c0a444b748696a2f2c22428c7
> Author: Aneesh Kumar K.V <[email protected]>
> Date: Fri Aug 15 22:13:07 2008 +0530
>
> cleanup
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 258cd1a..164c988 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1150,7 +1150,8 @@ extern void ext4_set_inode_flags(struct inode *);
> extern void ext4_get_inode_flags(struct ext4_inode_info *);
> extern void ext4_set_aops(struct inode *inode);
> extern int ext4_writepage_trans_blocks(struct inode *);
> -extern int ext4_meta_trans_blocks(struct inode *, int nrblocks, int idxblocks);
> +extern int ext4_meta_trans_blocks(struct inode *, int nrblocks,
> + int idxblocks, bool chunk);
This might be the right thing to do, but I don't see other places in
ext3/4 that use the "bool" type for a flag. To keep the code consistent
I would stay with "int", which is the common type for flag variables
here. Not a big deal anyway.
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 0b34998..a843cd3 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1568,9 +1568,9 @@ static void ext4_da_release_space(struct inode *inode, int to_free)
> * but since this function is called from invalidate
> * page, it's harmless to return without any action
> */
> - printk(KERN_INFO "ext4 delalloc try to release %d reserved"
> - "blocks for inode %lu, but there is no reserved"
> - "data blocks\n", inode->i_ino, to_free);
> + printk(KERN_INFO "ext4 delalloc try to release %d reserved "
> + "blocks for inode %lu, but there is no reserved "
> + "data blocks\n", to_free, inode->i_ino);
> spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> return;
> }
The compile warning has already been fixed in the patch queue. I wasn't
sure what you meant in your last email about adding a space at the end
of each line?
> @@ -2303,13 +2303,13 @@ static int ext4_da_writepage(struct page *page,
> * get_block is calculated
> */
>
> -#define EXT4_MAX_WRITEPAGES_SIZE PAGEVEC_SIZE
> -static int ext4_writepages_trans_blocks(struct inode *inode)
> +static int ext4_da_writepages_trans_blocks(struct inode *inode)
> {
> - int bpp = ext4_journal_blocks_per_page(inode);
> - int max_blocks = EXT4_MAX_WRITEPAGES_SIZE * bpp;
> -
> - if (max_blocks > EXT4_I(inode)->i_reserved_data_blocks)
> + int max_blocks;
> + if (EXT4_I(inode)->i_reserved_data_blocks >
> + EXT4_BLOCKS_PER_GROUP(inode->i_sb))
> + max_blocks = EXT4_BLOCKS_PER_GROUP(inode->i_sb);
> + else
> max_blocks = EXT4_I(inode)->i_reserved_data_blocks;
>
Hmm. EXT4_BLOCKS_PER_GROUP could still be too big for a single
transaction. The default EXT4_BLOCKS_PER_GROUP is 32K blocks, which could
still overflow the maximum capacity of a single transaction (by default
1/4 of the journal, whose default size is 128M).
In fact, even if EXT4_BLOCKS_PER_GROUP is the right value and you
only limit the credit reservation here, we still need to limit the
logical page extent to flush in mpage_da_writepages() later, so that it
does not exceed the previously reserved credits.
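To put rough numbers on that, here is a minimal sketch. It assumes a 4K
block size, a 128M journal, and that one transaction may use at most a
quarter of the journal. The function names are hypothetical and are not
kernel functions.

```c
#include <assert.h>

/* Hypothetical helper: journal size expressed in filesystem blocks. */
unsigned long journal_blocks(unsigned long journal_bytes,
			     unsigned long block_size)
{
	assert(block_size > 0);
	return journal_bytes / block_size;
}

/* Assumption: a single transaction may use at most 1/4 of the journal. */
unsigned long max_transaction_blocks(unsigned long journal_blks)
{
	return journal_blks / 4;
}

/* Returns 1 if nrblocks cannot fit into one transaction, 0 otherwise. */
int exceeds_transaction(unsigned long nrblocks,
			unsigned long journal_bytes,
			unsigned long block_size)
{
	unsigned long limit;

	limit = max_transaction_blocks(journal_blocks(journal_bytes,
						      block_size));
	return nrblocks > limit;
}
```

Under these assumptions a full block group is 32768 blocks, while the
per-transaction limit is only 8192 blocks, so capping max_blocks at
EXT4_BLOCKS_PER_GROUP alone would not be enough.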
> @@ -4459,13 +4459,19 @@ int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
> * over different block groups
> *
> * Also account for superblock, inode, quota and xattr blocks
> + * This doesn't account for the blocks needed for journaled
> + * data blocks.
> */
> -int ext4_meta_trans_blocks(struct inode* inode, int nrblocks, int idxblocks)
> +int ext4_meta_trans_blocks(struct inode* inode,
> + int nrblocks, int idxblocks, bool chunk)
> {
> int groups, gdpblocks;
> int ret = 0;
>
> - groups = nrblocks + idxblocks;
> + if (chunk)
> + groups = 1 + idxblocks;
Not exactly right, the datablocks could spread over multiple block
groups with flex_bg.
if (chunk)
groups = (nrblocks + 1 )/EXT4_BLOCKS_PER_GROUP + idxblocks;
> + else
> + groups = nrblocks + idxblocks;
> gdpblocks = groups;
> if (groups > EXT4_SB(inode->i_sb)->s_groups_count)
> groups = EXT4_SB(inode->i_sb)->s_groups_count;
I will fix the spelling errors and update the patches.
On Fri, 2008-08-15 at 12:02 -0700, Mingming Cao wrote:
> On Fri, 2008-08-15 at 23:03 +0530, Aneesh Kumar K.V wrote:
> > On Tue, Aug 12, 2008 at 09:23:10AM -0700, Mingming Cao wrote:
> > > This is a rework of journal credits fix patch in the ext4 patch queue.
> > > The patch series contains
> > >
> > > - patch 1: helper functions for journal credits calculation and fix the
> > > writepage/write_begin on nonextent files
> > > - patch 2: journal credit fix for writepage/write_begin on extent files,
> > > and migration
> > > - patch 3: credit fix for dio, fallocate
> > > - patch 4: rebase ext4_da_writepages_rework patch
> > > - patch 5: credit fix for delalloc writepages
> > > - patch 6: credit fix for defrag
> > >
> > >
> >
> > I see this patch is pushed to the patch queue. Below is the review in
> > patch form.
> >
I have updated the patches, mostly:
1) Remove unnecessary journalled mode handling
2) More common code sharing, less code than before
3) Updated da_writepages credits calculation
I will reply to each individual patch.
Thanks,
Mingming
Ext4: journal credits calculation cleanup and fix for nonextent writepage
From: Mingming Cao <[email protected]>
When considering how many journal credits are needed for modifying a chunk of
data, we need to account for the super block, inode block, quota blocks and
xattr block, indirect/index blocks, and also the group bitmap and group
descriptor blocks for new allocations (including data and indirect/index
blocks). Many places in ext4 do the calculation on their own and often miss
one or two meta blocks; they also often assume single block allocation and
do not consider the multiple-chunk allocation case.
This patch cleans up the current journal credit code and provides common
helper functions to calculate the journal credits, to be used for writepage,
writepages, DIO, fallocate, migration, defrag, and for both nonextent and
extent files.
This patch modifies the writepage/write_begin credit calculation for
nonextent files to use the new helper function. It also fixes the problem
that writepage on nonextent files did not consider the case
blocksize < pagesize, which could possibly need multiple block allocations
in a single transaction.
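As a rough illustration of the accounting described above, here is a
standalone model of the helper. It is a simplified sketch, not the kernel
code: struct fs_geom and meta_credits() are hypothetical names, and the
fixed meta_blocks value stands in for EXT4_META_TRANS_BLOCKS().

```c
#include <assert.h>

/* Hypothetical stand-in for the superblock fields the helper reads. */
struct fs_geom {
	int blocks_per_group;	/* models EXT4_BLOCKS_PER_GROUP() */
	int groups_count;	/* models s_groups_count */
	int gdb_count;		/* models s_gdb_count */
	int meta_blocks;	/* sb + inode + quota + xattr, assumed fixed */
};

static inline int min_int(int a, int b)
{
	return a < b ? a : b;
}

/*
 * Credits to modify nrblocks data blocks plus idxblocks index blocks.
 * chunk != 0 means the data blocks are physically contiguous, so they
 * touch only as many groups as the chunk spans.
 */
int meta_credits(const struct fs_geom *g, int nrblocks, int idxblocks,
		 int chunk)
{
	int groups, gdpblocks;

	assert(g->blocks_per_group > 0);

	/* worst case: every index block lives in a different group */
	groups = idxblocks;
	if (chunk)
		groups += (nrblocks + g->blocks_per_group - 1) /
			  g->blocks_per_group;
	else
		groups += nrblocks;

	/* cannot touch more bitmaps or descriptor blocks than exist */
	gdpblocks = min_int(groups, g->gdb_count);
	groups = min_int(groups, g->groups_count);

	return idxblocks + groups + gdpblocks + g->meta_blocks;
}
```

With a made-up geometry of 32768 blocks per group, 16 groups, one
descriptor block and 6 fixed meta blocks, a contiguous 4096-block chunk
with 7 index blocks costs 22 credits, while 4 scattered blocks with 9
index blocks cost 29.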
Signed-off-by: Mingming Cao <[email protected]>
---
---
fs/ext4/ext4.h | 1
fs/ext4/ext4_jbd2.h | 8 +++
fs/ext4/inode.c | 133 ++++++++++++++++++++++++++++++++++++++--------------
3 files changed, 107 insertions(+), 35 deletions(-)
Index: linux-2.6.27-rc3/fs/ext4/inode.c
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/inode.c 2008-08-15 12:37:21.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/inode.c 2008-08-15 14:18:30.000000000 -0700
@@ -4354,56 +4354,120 @@ int ext4_getattr(struct vfsmount *mnt, s
return 0;
}
+static int ext4_indirect_trans_blocks(struct inode *inode, int nrblocks,
+ int chunk)
+{
+ int indirects;
+
+ /* if nrblocks are contiguous */
+ if (chunk) {
+ /*
+ * With N contiguous data blocks, it need at most
+ * N/EXT4_ADDR_PER_BLOCK(inode->i_sb) indirect blocks
+ * 2 dindirect blocks
+ * 1 tindirect block
+ */
+ indirects = nrblocks / EXT4_ADDR_PER_BLOCK(inode->i_sb);
+ return indirects + 3;
+ }
+ /*
+ * if nrblocks are not contiguous, worse case, each block touch
+ * a indirect block, and each indirect block touch a double indirect
+ * block, plus a triple indirect block
+ */
+ indirects = nrblocks * 2 + 1;
+ return indirects;
+}
+
+static int ext4_index_trans_blocks(struct inode *inode, int nrblocks, int chunk)
+{
+ if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+ return ext4_indirect_trans_blocks(inode, nrblocks, chunk);
+ return ext4_ext_index_trans_blocks(inode, nrblocks, chunk);
+}
/*
- * How many blocks doth make a writepage()?
+ * Account for index blocks, block groups bitmaps and block group
+ * descriptor blocks if modify datablocks and index blocks
+ * worse case, the indexs blocks spread over different block groups
*
- * With N blocks per page, it may be:
- * N data blocks
- * 2 indirect block
- * 2 dindirect
- * 1 tindirect
- * N+5 bitmap blocks (from the above)
- * N+5 group descriptor summary blocks
- * 1 inode block
- * 1 superblock.
- * 2 * EXT4_SINGLEDATA_TRANS_BLOCKS for the quote files
+ * If datablocks are discontiguous, they are possible to spread over
+ * different block groups too. If they are contiugous, with flexbg,
+ * they could still across block group boundary.
*
- * 3 * (N + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
+ * Also account for superblock, inode, quota and xattr blocks
+ */
+int ext4_meta_trans_blocks(struct inode* inode, int nrblocks, int chunk)
+{
+ int groups, gdpblocks;
+ int idxblocks;
+ int ret = 0;
+
+ /*
+ * How many index blocks need to touch to modify nrblocks?
+ * The "Chunk" flag indicating whether the nrblocks is
+ * physically contiguous on disk
+ *
+ * For Direct IO and fallocate, they calls get_block to allocate
+ * one single extent at a time, so they could set the "Chunk" flag
+ */
+ idxblocks = ext4_index_trans_blocks(inode, nrblocks, chunk);
+
+ ret = idxblocks;
+
+ /*
+ * Now let's see how many group bitmaps and group descriptors need
+ * to account
+ */
+ groups = idxblocks;
+ if (chunk)
+ groups += (nrblocks + EXT4_BLOCKS_PER_GROUP(inode->i_sb)
+ -1 )/EXT4_BLOCKS_PER_GROUP(inode->i_sb);
+ else
+ groups += nrblocks;
+
+ gdpblocks = groups;
+ if (groups > EXT4_SB(inode->i_sb)->s_groups_count)
+ groups = EXT4_SB(inode->i_sb)->s_groups_count;
+ if (groups > EXT4_SB(inode->i_sb)->s_gdb_count)
+ gdpblocks = EXT4_SB(inode->i_sb)->s_gdb_count;
+
+ /* bitmaps and block group descriptor blocks */
+ ret += groups + gdpblocks;
+
+ /* Blocks for super block, inode, quota and xattr blocks */
+ ret += EXT4_META_TRANS_BLOCKS(inode->i_sb);
+
+ return ret;
+}
+
+/*
+ * Calulate the total number of credits to reserve to fit
+ * the modification of a single pages into a single transaction
*
- * With ordered or writeback data it's the same, less the N data blocks.
+ * This could be called via ext4_write_begin() or later
+ * ext4_da_writepages() in delalyed allocation case.
*
- * If the inode's direct blocks can hold an integral number of pages then a
- * page cannot straddle two indirect blocks, and we can only touch one indirect
- * and dindirect block, and the "5" above becomes "3".
+ * In both case it's possible that we could allocating multiple
+ * chunks of blocks. We need to consider the worse case, when
+ * one new block per extent.
*
- * This still overestimates under most circumstances. If we were to pass the
- * start and end offsets in here as well we could do block_to_path() on each
- * block and work out the exact number of indirects which are touched. Pah.
+ * For Direct IO and fallocate, the journal credits reservation
+ * is based on one single extent allocation, so they could use
+ * EXT4_DATA_TRANS_BLOCKS to get the needed credit to log a single
+ * chunk of allocation needs.
*/
-
int ext4_writepage_trans_blocks(struct inode *inode)
{
int bpp = ext4_journal_blocks_per_page(inode);
- int indirects = (EXT4_NDIR_BLOCKS % bpp) ? 5 : 3;
int ret;
- if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
- return ext4_ext_writepage_trans_blocks(inode, bpp);
+ ret = ext4_meta_trans_blocks(inode, bpp, 0);
+ /* Account for data blocks for journalled mode */
if (ext4_should_journal_data(inode))
- ret = 3 * (bpp + indirects) + 2;
- else
- ret = 2 * (bpp + indirects) + 2;
-
-#ifdef CONFIG_QUOTA
- /* We know that structure was already allocated during DQUOT_INIT so
- * we will be updating only the data blocks + inodes */
- ret += 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
-#endif
-
+ ret += bpp;
return ret;
}
-
/*
* The caller must have previously called ext4_reserve_inode_write().
* Give this, we know that the caller already has write access to iloc->bh.
Index: linux-2.6.27-rc3/fs/ext4/ext4.h
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/ext4.h 2008-08-14 07:34:44.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/ext4.h 2008-08-15 14:17:10.000000000 -0700
@@ -1072,6 +1072,7 @@ extern void ext4_set_inode_flags(struct
extern void ext4_get_inode_flags(struct ext4_inode_info *);
extern void ext4_set_aops(struct inode *inode);
extern int ext4_writepage_trans_blocks(struct inode *);
+extern int ext4_meta_trans_blocks(struct inode *, int nrblocks, int idxblocks);
extern int ext4_block_truncate_page(handle_t *handle,
struct address_space *mapping, loff_t from);
extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page);
@@ -1227,6 +1228,8 @@ extern const struct inode_operations ext
/* extents.c */
extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
extern int ext4_ext_writepage_trans_blocks(struct inode *, int);
+extern int ext4_ext_index_trans_blocks(struct inode *inode, int nrblocks,
+ int chunk);
extern int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ext4_lblk_t iblock,
unsigned long max_blocks, struct buffer_head *bh_result,
Index: linux-2.6.27-rc3/fs/ext4/ext4_jbd2.h
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/ext4_jbd2.h 2008-08-14 07:34:44.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/ext4_jbd2.h 2008-08-15 12:37:25.000000000 -0700
@@ -51,6 +51,14 @@
EXT4_XATTR_TRANS_BLOCKS - 2 + \
2*EXT4_QUOTA_TRANS_BLOCKS(sb))
+/*
+ * Define the number of metadata blocks we need to account to modify data.
+ *
+ * This include super block, inode block, quota blocks and xattr blocks
+ */
+#define EXT4_META_TRANS_BLOCKS(sb) (EXT4_XATTR_TRANS_BLOCKS + \
+ 2*EXT4_QUOTA_TRANS_BLOCKS(sb))
+
/* Delete operations potentially hit one directory's namespace plus an
* entire inode, plus arbitrary amounts of bitmap/indirection data. Be
* generous. We can grow the delete transaction later if necessary. */
Ext4: journal credits reservation fixes for extent file writepage
From: Mingming Cao <[email protected]>
This patch modifies the writepage/write_begin credit calculation for
extent files to use the credit calculation helper function.
The current calculation of how many index/leaf blocks should be
accounted for is too conservative: it always considers the worst case,
where the tree depth is 5, and in the case of multiple-chunk allocation it
always multiplies the needed credits. This patch uses the actual depth of
the inode's extent tree, with some extras, to calculate the index blocks,
and is also less conservative when accounting for multiple allocations.
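For comparison, the old and new estimates can be sketched side by side.
This is only a model of the arithmetic in the code being removed and
added; the function names are hypothetical and quota credits are left out.

```c
#include <assert.h>

/* Old estimate for one insert: always assumes a tree depth of 5. */
int old_insert_credits(void)
{
	int depth = 5;
	int needed = 2;			/* allocation of new data block(s) */

	needed += 1;			/* modify the old root */
	needed += depth * 2 + depth * 2;	/* index split at each level */
	needed += 1;			/* superblock */
	return needed;
}

/* Old writepage estimate: multiply by num, count the superblock once. */
int old_writepage_credits(int num)
{
	return old_insert_credits() * num - (num - 1);
}

/* New estimate of index/leaf blocks, based on the actual tree depth. */
int new_index_credits(int depth, int chunk)
{
	assert(depth >= 0);
	return chunk ? depth * 2 : depth * 3;
}
```

For a page of 4 blocks the old code asks for 93 credits before quota,
while the new index estimate for a depth-2 tree is 4 (contiguous) or 6
(discontiguous) blocks, to which the bitmap, descriptor and fixed meta
blocks are then added.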
Signed-off-by: Mingming Cao <[email protected]>
---
---
fs/ext4/ext4_extents.h | 3 +
fs/ext4/extents.c | 88 ++++++++++++++++---------------------------------
fs/ext4/migrate.c | 3 +
3 files changed, 34 insertions(+), 60 deletions(-)
Index: linux-2.6.27-rc3/fs/ext4/ext4_extents.h
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/ext4_extents.h 2008-08-15 14:43:20.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/ext4_extents.h 2008-08-15 14:44:15.000000000 -0700
@@ -216,7 +216,9 @@ extern int ext4_ext_calc_metadata_amount
extern ext4_fsblk_t idx_pblock(struct ext4_extent_idx *);
extern void ext4_ext_store_pblock(struct ext4_extent *, ext4_fsblk_t);
extern int ext4_extent_tree_init(handle_t *, struct inode *);
-extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
+ int num,
+ struct ext4_ext_path *path);
extern int ext4_ext_try_to_merge(struct inode *inode,
struct ext4_ext_path *path,
struct ext4_extent *);
Index: linux-2.6.27-rc3/fs/ext4/extents.c
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/extents.c 2008-08-15 14:43:20.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/extents.c 2008-08-15 14:51:07.000000000 -0700
@@ -1747,54 +1747,62 @@ static int ext4_ext_rm_idx(handle_t *han
}
/*
- * ext4_ext_calc_credits_for_insert:
- * This routine returns max. credits that the extent tree can consume.
- * It should be OK for low-performance paths like ->writepage()
- * To allow many writing processes to fit into a single transaction,
- * the caller should calculate credits under i_data_sem and
- * pass the actual path.
+ * ext4_ext_calc_credits_for_single_extent:
+ * This routine returns max. credits that needed to insert an extent
+ * to the extent tree.
+ * When pass the actual path, the caller should calculate credits
+ * under i_data_sem.
*/
-int ext4_ext_calc_credits_for_insert(struct inode *inode,
+int ext4_ext_calc_credits_for_single_extent(struct inode *inode, int num,
struct ext4_ext_path *path)
{
- int depth, needed;
-
if (path) {
+ int depth = ext_depth(inode);
+ int ret;
+
/* probably there is space in leaf? */
- depth = ext_depth(inode);
if (le16_to_cpu(path[depth].p_hdr->eh_entries)
- < le16_to_cpu(path[depth].p_hdr->eh_max))
- return 1;
- }
-
- /*
- * given 32-bit logical block (4294967296 blocks), max. tree
- * can be 4 levels in depth -- 4 * 340^4 == 53453440000.
- * Let's also add one more level for imbalance.
- */
- depth = 5;
+ < le16_to_cpu(path[depth].p_hdr->eh_max)) {
- /* allocation of new data block(s) */
- needed = 2;
+ /*
+ * There are some space in the leaf tree, no
+ * need to account for leaf block credit
+ *
+ * bitmaps and block group descriptor blocks
+ * and other metadat blocks still need to be
+ * accounted.
+ ret = (num + EXT4_BLOCKS_PER_GROUP(inode->i_sb) -1)
+ /EXT4_BLOCKS_PER_GROUP(inode->i_sb);
+ /* 1 one bitmap, 1 block group descriptor */
+ ret = ret*2 + EXT4_META_TRANS_BLOCKS(inode->i_sb);
+ }
+ }
- /*
- * tree can be full, so it would need to grow in depth:
- * we need one credit to modify old root, credits for
- * new root will be added in split accounting
- */
- needed += 1;
+ return ext4_meta_trans_blocks(inode, num, 1);
+}
- /*
- * Index split can happen, we would need:
- * allocate intermediate indexes (bitmap + group)
- * + change two blocks at each level, but root (already included)
- */
- needed += (depth * 2) + (depth * 2);
+/*
+ * How many index/leaf blocks need to change/allocate to modify nrblocks?
+ *
+ * if nrblocks are fit in a single extent (chunk flag is 1), then
+ * in the worse case, each tree level index/leaf need to be changed
+ * if the tree split due to insert a new extent, then the old tree
+ * index/leaf need to be updated too
+ *
+ * If the nrblocks are discontiguous, they could cause
+ * the whole tree split more than once, but this is really rare.
+ */
+int ext4_ext_index_trans_blocks(struct inode *inode, int num, int chunk)
+{
+ int index;
+ int depth = ext_depth(inode);
- /* any allocation modifies superblock */
- needed += 1;
+ if (chunk)
+ index = depth * 2;
+ else
+ index = depth * 3;
- return needed;
+ return index;
}
static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
@@ -1921,9 +1929,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
correct_index = 1;
credits += (ext_depth(inode)) + 1;
}
-#ifdef CONFIG_QUOTA
credits += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
-#endif
err = ext4_ext_journal_restart(handle, credits);
if (err)
@@ -2858,27 +2864,6 @@ out_stop:
ext4_journal_stop(handle);
}
-/*
- * ext4_ext_writepage_trans_blocks:
- * calculate max number of blocks we could modify
- * in order to allocate new block for an inode
- */
-int ext4_ext_writepage_trans_blocks(struct inode *inode, int num)
-{
- int needed;
-
- needed = ext4_ext_calc_credits_for_insert(inode, NULL);
-
- /* caller wants to allocate num blocks, but note it includes sb */
- needed = needed * num - (num - 1);
-
-#ifdef CONFIG_QUOTA
- needed += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
-#endif
-
- return needed;
-}
Ext4: journal credits reservation fixes for DIO, fallocate
From: Mingming Cao <[email protected]>
DIO and fallocate credit calculation is different from writepage, as
they start a new journal handle for each call to ext4_get_blocks_wrap().
This patch uses the helper function in the DIO and fallocate cases, passing
a flag indicating that the modified data are contiguous and thus need
fewer indirect/index blocks to be accounted for.
This patch also fixes the journal credit reservation for direct I/O
(DIO). Previously the estimated credits for DIO were calculated only for
non-extent files, which was not enough if the file is extent-based.
Also fixed: fallocate double-counted the credits for modifying the
superblock.
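The fixed DIO_CREDITS value being removed can be reproduced from its own
comment, and compared with the contiguous-chunk estimate used for
indirect-mapped files. A minimal sketch, with hypothetical function names:

```c
#include <assert.h>

/*
 * Old fixed estimate, following the removed DIO_CREDITS comment:
 * b = blocks mapped at once, a = block pointers per block.
 */
int old_dio_credits(int b, int a)
{
	assert(a > 0);
	return 4			/* sb + group desc + bitmap + inode */
	     + 1			/* triple indirect */
	     + (b / a / a + 2)		/* double indirect */
	     + (b / a + 2);		/* indirect */
}

/* New estimate of indirect blocks for nrblocks contiguous data blocks. */
int chunk_indirect_blocks(int nrblocks, int addr_per_block)
{
	assert(addr_per_block > 0);
	return nrblocks / addr_per_block + 3;
}
```

Plugging in 4096 blocks and 256 pointers per block (1KB block size) gives
the old constant 25. The new per-chunk indirect estimate for the same
input is 19, and drops to 7 with 1024 pointers per block (4KB block size).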
Signed-off-by: Mingming Cao <[email protected]>
---
---
fs/ext4/ext4.h | 1 +
fs/ext4/extents.c | 14 +++++++-------
fs/ext4/inode.c | 45 ++++++++++++++++++++++++---------------------
3 files changed, 32 insertions(+), 28 deletions(-)
===================================================================
Index: linux-2.6.27-rc3/fs/ext4/extents.c
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/extents.c 2008-08-15 14:51:07.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/extents.c 2008-08-15 17:22:02.000000000 -0700
@@ -1758,7 +1758,7 @@ int ext4_ext_calc_credits_for_single_ext
{
if (path) {
int depth = ext_depth(inode);
- int ret;
+ int ret = 0;
/* probably there is space in leaf? */
if (le16_to_cpu(path[depth].p_hdr->eh_entries)
@@ -1771,14 +1771,15 @@ int ext4_ext_calc_credits_for_single_ext
* bitmaps and block group descriptor blocks
* and other metadat blocks still need to be
* accounted.
+ */
ret = (num + EXT4_BLOCKS_PER_GROUP(inode->i_sb) -1)
/EXT4_BLOCKS_PER_GROUP(inode->i_sb);
/* 1 one bitmap, 1 block group descriptor */
- ret = ret*2 + EXT4_META_TRANS_BLOCKS(inode->i_sb);
+ ret = ret* 2 + EXT4_META_TRANS_BLOCKS(inode->i_sb);
}
}
- return ext4_meta_trans_blocks(inode, num, 1);
+ return ext4_chunk_trans_blocks(inode, num);
}
/*
@@ -2811,7 +2812,7 @@ void ext4_ext_truncate(struct inode *ino
/*
* probably first extent we're gonna free will be last in block
*/
- err = ext4_writepage_trans_blocks(inode) + 3;
+ err = ext4_writepage_trans_blocks(inode);
handle = ext4_journal_start(inode, err);
if (IS_ERR(handle))
return;
@@ -2924,10 +2925,9 @@ long ext4_fallocate(struct inode *inode,
max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
- block;
/*
- * credits to insert 1 extent into extent tree + buffers to be able to
- * modify 1 super block, 1 block bitmap and 1 group descriptor.
+ * credits to insert 1 extent into extent tree
*/
- credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
+ credits = ext4_chunk_trans_blocks(inode, max_blocks);
mutex_lock(&inode->i_mutex);
retry:
while (ret >= 0 && ret < max_blocks) {
Index: linux-2.6.27-rc3/fs/ext4/inode.c
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/inode.c 2008-08-15 14:43:20.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/inode.c 2008-08-15 17:21:04.000000000 -0700
@@ -1044,18 +1044,6 @@ static void ext4_da_update_reserve_space
spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
}
-/* Maximum number of blocks we map for direct IO at once. */
-#define DIO_MAX_BLOCKS 4096
-/*
- * Number of credits we need for writing DIO_MAX_BLOCKS:
- * We need sb + group descriptor + bitmap + inode -> 4
- * For B blocks with A block pointers per block we need:
- * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect).
- * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25.
- */
-#define DIO_CREDITS 25
-
-
/*
* The ext4_get_blocks_wrap() function try to look up the requested blocks,
* and returns if the blocks are already mapped.
@@ -1167,19 +1155,23 @@ int ext4_get_blocks_wrap(handle_t *handl
return retval;
}
+/* Maximum number of blocks we map for direct IO at once. */
+#define DIO_MAX_BLOCKS 4096
+
static int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
handle_t *handle = ext4_journal_current_handle();
int ret = 0, started = 0;
unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
+ int dio_credits;
if (create && !handle) {
/* Direct IO write... */
if (max_blocks > DIO_MAX_BLOCKS)
max_blocks = DIO_MAX_BLOCKS;
- handle = ext4_journal_start(inode, DIO_CREDITS +
- 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
+ dio_credits = ext4_chunk_trans_blocks(inode, max_blocks);
+ handle = ext4_journal_start(inode, dio_credits);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
goto out;
@@ -2243,7 +2235,7 @@ static int ext4_da_writepage(struct page
* for DIO, writepages, and truncate
*/
#define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
-#define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS
+#define EXT4_MAX_WRITEBACK_CREDITS 25
static int ext4_da_writepages(struct address_space *mapping,
struct writeback_control *wbc)
@@ -4442,7 +4434,8 @@ int ext4_meta_trans_blocks(struct inode*
/*
* Calulate the total number of credits to reserve to fit
- * the modification of a single pages into a single transaction
+ * the modification of a single pages into a single transaction,
+ * which may include multile chunk of block allocations.
*
* This could be called via ext4_write_begin() or later
* ext4_da_writepages() in delalyed allocation case.
@@ -4450,11 +4443,6 @@ int ext4_meta_trans_blocks(struct inode*
* In both case it's possible that we could allocating multiple
* chunks of blocks. We need to consider the worse case, when
* one new block per extent.
- *
- * For Direct IO and fallocate, the journal credits reservation
- * is based on one single extent allocation, so they could use
- * EXT4_DATA_TRANS_BLOCKS to get the needed credit to log a single
- * chunk of allocation needs.
*/
int ext4_writepage_trans_blocks(struct inode *inode)
{
@@ -4468,6 +4456,21 @@ int ext4_writepage_trans_blocks(struct i
ret += bpp;
return ret;
}
+
+/*
+ * Calculate the journal credits for a chunk of data modification.
+ *
+ * This is called from DIO, fallocate or whoever calling
+ * ext4_get_blocks_wrap() to map/allocate a chunk of contigous disk blocks.
+ *
+ * journal buffers for data blocks are not included here, as DIO
+ * and fallocate do no need to journal data buffers.
+ */
+int ext4_chunk_trans_blocks(struct inode *inode, int nrblocks)
+{
+ return ext4_meta_trans_blocks(inode, nrblocks, 1);
+}
+
/*
* The caller must have previously called ext4_reserve_inode_write().
* Give this, we know that the caller already has write access to iloc->bh.
Index: linux-2.6.27-rc3/fs/ext4/ext4.h
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/ext4.h 2008-08-15 14:43:20.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/ext4.h 2008-08-15 14:51:20.000000000 -0700
@@ -1073,6 +1073,7 @@ extern void ext4_get_inode_flags(struct
extern void ext4_set_aops(struct inode *inode);
extern int ext4_writepage_trans_blocks(struct inode *);
extern int ext4_meta_trans_blocks(struct inode *, int nrblocks, int idxblocks);
+extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
extern int ext4_block_truncate_page(handle_t *handle,
struct address_space *mapping, loff_t from);
extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page);
Ext4: journal credit fix for the delalloc writepages
From: Mingming Cao <[email protected]>
The previous delalloc writepages implementation started a new transaction
outside a loop of get_block() calls that do the block allocation. Due to the
lack of information about how many blocks would be allocated, the estimate of
the journal credits was very conservative and caused many issues.
With the reworked delayed allocation, a new transaction is created for
each get_block(), so we don't need to guess how many credits are needed for
multiple chunks of allocation. Starting every transaction with enough credits
to insert a single extent is enough. But we still need to consider journalled
mode, where we need to account for the number of data blocks. So we guess the
max number of data blocks for each allocation. Due to the current VFS
implementation, writepages() can only flush PAGEVEC length of pages at a
time, so the max block allocation is limited and calculated based on
that or the total number of reserved delalloc data blocks, whichever
is smaller.
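The "whichever is smaller" rule and the matching flush check can be
modelled as below. This is a sketch only: MAX_TRANS_DATA is assumed to
mirror EXT4_MAX_TRANS_DATA with a value of 64, and the function names are
hypothetical.

```c
#include <assert.h>

#define MAX_TRANS_DATA 64U	/* assumed value of EXT4_MAX_TRANS_DATA */

/* Upper bound on data blocks to reserve credits for in one transaction. */
unsigned int writepages_max_blocks(unsigned int reserved_data_blocks)
{
	unsigned int max_blocks = MAX_TRANS_DATA;

	if (max_blocks > reserved_data_blocks)
		max_blocks = reserved_data_blocks;

	assert(max_blocks <= MAX_TRANS_DATA);
	return max_blocks;
}

/*
 * Mirrors the check added to mpage_add_bh_to_extent(): once the extent
 * being built is larger than the limit, it has to be flushed.
 */
int must_flush(unsigned int extent_blocks)
{
	return extent_blocks > MAX_TRANS_DATA;
}
```

The two halves have to agree: the reservation is capped at the same limit
that forces the extent to be flushed, so a single get_block() should never
be asked to map more blocks than credits were reserved for.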
Signed-off-by: Mingming Cao <[email protected]>
---
fs/ext4/inode.c | 42 +++++++++++++++++++++++++++---------------
1 file changed, 27 insertions(+), 15 deletions(-)
Index: linux-2.6.27-rc3/fs/ext4/inode.c
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/inode.c 2008-08-15 14:51:22.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/inode.c 2008-08-15 17:18:09.000000000 -0700
@@ -1850,8 +1850,11 @@ static void mpage_add_bh_to_extent(struc
{
struct buffer_head *lbh = &mpd->lbh;
sector_t next;
+ int nrblocks = lbh->b_size >> mpd->inode->i_blkbits;
- next = lbh->b_blocknr + (lbh->b_size >> mpd->inode->i_blkbits);
+ /* check if the reserved journal credits might overflow */
+ if (nrblocks >EXT4_MAX_TRANS_DATA)
+ goto flush_it;
/*
* First block in the extent
@@ -1863,6 +1866,7 @@ static void mpage_add_bh_to_extent(struc
return;
}
+ next = lbh->b_blocknr + nrblocks;
/*
* Can we merge the block to our big extent?
*/
@@ -1871,6 +1875,7 @@ static void mpage_add_bh_to_extent(struc
return;
}
+flush_it:
/*
* We couldn't merge the block to our extent, so we
* need to flush current extent and start new one
@@ -2231,17 +2236,26 @@ static int ext4_da_writepage(struct page
}
/*
- * For now just follow the DIO way to estimate the max credits
- * needed to write out EXT4_MAX_WRITEBACK_PAGES.
- * todo: need to calculate the max credits need for
- * extent based files, currently the DIO credits is based on
- * indirect-blocks mapping way.
+ * This is called via ext4_da_writepages() to
+ * calulate the total number of credits to reserve to fit
+ * a single extent allocation into a single transaction,
+ * ext4_da_writpeages() will loop calling this before
+ * the block allocation.
*
- * Probably should have a generic way to calculate credits
- * for DIO, writepages, and truncate
+ * The page vector size limited the max number of pages could
+ * be writeout at a time. Based on this, the max blocks to pass to
+ * get_block is calculated
*/
-#define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
-#define EXT4_MAX_WRITEBACK_CREDITS 25
+
+static int ext4_writepages_trans_blocks(struct inode *inode)
+{
+ int max_blocks = EXT4_MAX_TRANS_DATA;
+
+ if (max_blocks > EXT4_I(inode)->i_reserved_data_blocks)
+ max_blocks = EXT4_I(inode)->i_reserved_data_blocks;
+
+ return ext4_chunk_trans_blocks(inode, max_blocks);
+}
static int ext4_da_writepages(struct address_space *mapping,
struct writeback_control *wbc)
@@ -2283,7 +2297,7 @@ restart_loop:
* by delalloc
*/
BUG_ON(ext4_should_journal_data(inode));
- needed_blocks = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
+ needed_blocks = ext4_writepages_trans_blocks(inode);
/* start a new transaction*/
handle = ext4_journal_start(inode, needed_blocks);
@@ -4462,11 +4476,9 @@ int ext4_meta_trans_blocks(struct inode*
* the modification of a single pages into a single transaction,
* which may include multile chunk of block allocations.
*
- * This could be called via ext4_write_begin() or later
- * ext4_da_writepages() in delalyed allocation case.
+ * This could be called via ext4_write_begin()
*
- * In both case it's possible that we could allocating multiple
- * chunks of blocks. We need to consider the worse case, when
+ * We need to consider the worse case, when
* one new block per extent.
*/
int ext4_writepage_trans_blocks(struct inode *inode)
Same as version 1, resent just to complete the credit V2 series.
ext4: Rework the ext4_da_writepages
From: "Aneesh Kumar K.V" <[email protected]>
With the below changes we reserve the credits needed to insert only one extent
resulting from a single call to get_block. That makes sure we don't take
too many journal credits during writeout. We also don't limit the pages
to write. That means we loop through the dirty pages building the largest
possible contiguous block request. Then we issue a single get_block request.
We may get fewer blocks than we requested. If so, we would end up not mapping
some of the buffer_heads. That means those buffer_heads are still marked delay.
Later in the writepage callback via __mpage_writepage we redirty those pages.
We should also not limit/throttle wbc->nr_to_write in the filesystem writepages
callback. That causes wrong behaviour in generic_sync_sb_inodes when
wbc->nr_to_write is <= 0.
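The "largest contiguous request, single get_block, possibly short mapping"
flow described above can be modelled in user space as follows. This is only
an illustration: the functions are hypothetical, mock_get_block() stands in
for the real allocator, and the dirty blocks are given as a sorted array of
logical block numbers.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the contiguous run of logical blocks starting at blocks[0]. */
size_t first_contiguous_run(const unsigned long *blocks, size_t n)
{
	size_t len;

	if (n == 0)
		return 0;
	for (len = 1; len < n; len++)
		if (blocks[len] != blocks[len - 1] + 1)
			break;
	return len;
}

/* Hypothetical allocator: maps at most `avail` of the requested blocks. */
size_t mock_get_block(size_t requested, size_t avail)
{
	return requested < avail ? requested : avail;
}

/* Blocks that stay marked delayed after one get_block call. */
size_t still_delayed(size_t requested, size_t mapped)
{
	assert(mapped <= requested);
	return requested - mapped;
}
```

For dirty blocks 10, 11, 12, 20, 21 the first request covers 3 blocks. If
the allocator can only map 2 of them, 1 block stays delayed and its page is
redirtied for a later pass.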
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/inode.c | 201 +++++++++++++++++++++++++++++++-------------------------
1 file changed, 113 insertions(+), 88 deletions(-)
Index: linux-2.6.27-rc1/fs/ext4/inode.c
===================================================================
--- linux-2.6.27-rc1.orig/fs/ext4/inode.c 2008-08-11 15:12:11.000000000 -0700
+++ linux-2.6.27-rc1/fs/ext4/inode.c 2008-08-11 15:38:09.000000000 -0700
@@ -41,6 +41,8 @@
#include "acl.h"
#include "ext4_extents.h"
+#define MPAGE_DA_EXTENT_TAIL 0x01
+
static inline int ext4_begin_ordered_truncate(struct inode *inode,
loff_t new_size)
{
@@ -1605,11 +1607,13 @@ struct mpage_da_data {
unsigned long first_page, next_page; /* extent of pages */
get_block_t *get_block;
struct writeback_control *wbc;
+ int io_done;
+ long pages_written;
};
/*
* mpage_da_submit_io - walks through extent of pages and try to write
- * them with __mpage_writepage()
+ * them with writepage() call back
*
* @mpd->inode: inode
* @mpd->first_page: first page of the extent
@@ -1624,18 +1628,11 @@ struct mpage_da_data {
static int mpage_da_submit_io(struct mpage_da_data *mpd)
{
struct address_space *mapping = mpd->inode->i_mapping;
- struct mpage_data mpd_pp = {
- .bio = NULL,
- .last_block_in_bio = 0,
- .get_block = mpd->get_block,
- .use_writepage = 1,
- };
int ret = 0, err, nr_pages, i;
unsigned long index, end;
struct pagevec pvec;
BUG_ON(mpd->next_page <= mpd->first_page);
-
pagevec_init(&pvec, 0);
index = mpd->first_page;
end = mpd->next_page - 1;
@@ -1653,8 +1650,9 @@ static int mpage_da_submit_io(struct mpa
break;
index++;
- err = __mpage_writepage(page, mpd->wbc, &mpd_pp);
-
+ err = mapping->a_ops->writepage(page, mpd->wbc);
+ if (!err)
+ mpd->pages_written++;
/*
* In error case, we have to continue because
* remaining pages are still locked
@@ -1665,9 +1663,6 @@ static int mpage_da_submit_io(struct mpa
}
pagevec_release(&pvec);
}
- if (mpd_pp.bio)
- mpage_bio_submit(WRITE, mpd_pp.bio);
-
return ret;
}
@@ -1690,7 +1685,7 @@ static void mpage_put_bnr_to_bhs(struct
int blocks = exbh->b_size >> inode->i_blkbits;
sector_t pblock = exbh->b_blocknr, cur_logical;
struct buffer_head *head, *bh;
- unsigned long index, end;
+ pgoff_t index, end;
struct pagevec pvec;
int nr_pages, i;
@@ -1775,13 +1770,11 @@ static inline void __unmap_underlying_bl
*
* The function skips space we know is already mapped to disk blocks.
*
- * The function ignores errors ->get_block() returns, thus real
- * error handling is postponed to __mpage_writepage()
*/
static void mpage_da_map_blocks(struct mpage_da_data *mpd)
{
+ int err = 0;
struct buffer_head *lbh = &mpd->lbh;
- int err = 0, remain = lbh->b_size;
sector_t next = lbh->b_blocknr;
struct buffer_head new;
@@ -1791,35 +1784,32 @@ static void mpage_da_map_blocks(struct m
if (buffer_mapped(lbh) && !buffer_delay(lbh))
return;
- while (remain) {
- new.b_state = lbh->b_state;
- new.b_blocknr = 0;
- new.b_size = remain;
- err = mpd->get_block(mpd->inode, next, &new, 1);
- if (err) {
- /*
- * Rather than implement own error handling
- * here, we just leave remaining blocks
- * unallocated and try again with ->writepage()
- */
- break;
- }
- BUG_ON(new.b_size == 0);
+ new.b_state = lbh->b_state;
+ new.b_blocknr = 0;
+ new.b_size = lbh->b_size;
+
+ /*
+ * If we didn't accumulate anything
+ * to write simply return
+ */
+ if (!new.b_size)
+ return;
+ err = mpd->get_block(mpd->inode, next, &new, 1);
+ if (err)
+ return;
+ BUG_ON(new.b_size == 0);
- if (buffer_new(&new))
- __unmap_underlying_blocks(mpd->inode, &new);
+ if (buffer_new(&new))
+ __unmap_underlying_blocks(mpd->inode, &new);
- /*
- * If blocks are delayed marked, we need to
- * put actual blocknr and drop delayed bit
- */
- if (buffer_delay(lbh) || buffer_unwritten(lbh))
- mpage_put_bnr_to_bhs(mpd, next, &new);
+ /*
+ * If blocks are delayed marked, we need to
+ * put actual blocknr and drop delayed bit
+ */
+ if (buffer_delay(lbh) || buffer_unwritten(lbh))
+ mpage_put_bnr_to_bhs(mpd, next, &new);
- /* go for the remaining blocks */
- next += new.b_size >> mpd->inode->i_blkbits;
- remain -= new.b_size;
- }
+ return;
}
#define BH_FLAGS ((1 << BH_Uptodate) | (1 << BH_Mapped) | \
@@ -1865,13 +1855,9 @@ static void mpage_add_bh_to_extent(struc
* need to flush current extent and start new one
*/
mpage_da_map_blocks(mpd);
-
- /*
- * Now start a new extent
- */
- lbh->b_size = bh->b_size;
- lbh->b_state = bh->b_state & BH_FLAGS;
- lbh->b_blocknr = logical;
+ mpage_da_submit_io(mpd);
+ mpd->io_done = 1;
+ return;
}
/*
@@ -1891,17 +1877,35 @@ static int __mpage_da_writepage(struct p
struct buffer_head *bh, *head, fake;
sector_t logical;
+ if (mpd->io_done) {
+ /*
+ * The rest of the pages in the pagevec:
+ * redirty them and skip them. We will
+ * try to write them again after
+ * starting a new transaction
+ */
+ redirty_page_for_writepage(wbc, page);
+ unlock_page(page);
+ return MPAGE_DA_EXTENT_TAIL;
+ }
/*
* Can we merge this page to current extent?
*/
if (mpd->next_page != page->index) {
/*
* Nope, we can't. So, we map non-allocated blocks
- * and start IO on them using __mpage_writepage()
+ * and start IO on them using writepage()
*/
if (mpd->next_page != mpd->first_page) {
mpage_da_map_blocks(mpd);
mpage_da_submit_io(mpd);
+ /*
+ * skip the rest of the pages in the pagevec
+ */
+ mpd->io_done = 1;
+ redirty_page_for_writepage(wbc, page);
+ unlock_page(page);
+ return MPAGE_DA_EXTENT_TAIL;
}
/*
@@ -1932,6 +1936,8 @@ static int __mpage_da_writepage(struct p
set_buffer_dirty(bh);
set_buffer_uptodate(bh);
mpage_add_bh_to_extent(mpd, logical, bh);
+ if (mpd->io_done)
+ return MPAGE_DA_EXTENT_TAIL;
} else {
/*
* Page with regular buffer heads, just add all dirty ones
@@ -1940,8 +1946,12 @@ static int __mpage_da_writepage(struct p
bh = head;
do {
BUG_ON(buffer_locked(bh));
- if (buffer_dirty(bh))
+ if (buffer_dirty(bh) &&
+ (!buffer_mapped(bh) || buffer_delay(bh))) {
mpage_add_bh_to_extent(mpd, logical, bh);
+ if (mpd->io_done)
+ return MPAGE_DA_EXTENT_TAIL;
+ }
logical++;
} while ((bh = bh->b_this_page) != head);
}
@@ -1960,22 +1970,13 @@ static int __mpage_da_writepage(struct p
*
* This is a library function, which implements the writepages()
* address_space_operation.
- *
- * In order to avoid duplication of logic that deals with partial pages,
- * multiple bio per page, etc, we find non-allocated blocks, allocate
- * them with minimal calls to ->get_block() and re-use __mpage_writepage()
- *
- * It's important that we call __mpage_writepage() only once for each
- * involved page, otherwise we'd have to implement more complicated logic
- * to deal with pages w/o PG_lock or w/ PG_writeback and so on.
- *
- * See comments to mpage_writepages()
*/
static int mpage_da_writepages(struct address_space *mapping,
struct writeback_control *wbc,
get_block_t get_block)
{
struct mpage_da_data mpd;
+ long to_write;
int ret;
if (!get_block)
@@ -1989,17 +1990,22 @@ static int mpage_da_writepages(struct ad
mpd.first_page = 0;
mpd.next_page = 0;
mpd.get_block = get_block;
+ mpd.io_done = 0;
+ mpd.pages_written = 0;
+
+ to_write = wbc->nr_to_write;
ret = write_cache_pages(mapping, wbc, __mpage_da_writepage, &mpd);
/*
* Handle last extent of pages
*/
- if (mpd.next_page != mpd.first_page) {
+ if (!mpd.io_done && mpd.next_page != mpd.first_page) {
mpage_da_map_blocks(&mpd);
mpage_da_submit_io(&mpd);
}
+ wbc->nr_to_write = to_write - mpd.pages_written;
return ret;
}
@@ -2217,7 +2223,7 @@ static int ext4_da_writepage(struct page
#define EXT4_MAX_WRITEBACK_CREDITS 25
static int ext4_da_writepages(struct address_space *mapping,
- struct writeback_control *wbc)
+ struct writeback_control *wbc)
{
struct inode *inode = mapping->host;
handle_t *handle = NULL;
@@ -2225,42 +2231,53 @@ static int ext4_da_writepages(struct add
int ret = 0;
long to_write;
loff_t range_start = 0;
+ long pages_skipped = 0;
/*
* No pages to write? This is mainly a kludge to avoid starting
* a transaction for special inodes like journal inode on last iput()
* because that could violate lock ordering on umount
*/
- if (!mapping->nrpages)
+ if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
return 0;
- /*
- * Estimate the worse case needed credits to write out
- * EXT4_MAX_BUF_BLOCKS pages
- */
- needed_blocks = EXT4_MAX_WRITEBACK_CREDITS;
-
- to_write = wbc->nr_to_write;
- if (!wbc->range_cyclic) {
+ if (!wbc->range_cyclic)
/*
* If range_cyclic is not set force range_cont
* and save the old writeback_index
*/
wbc->range_cont = 1;
- range_start = wbc->range_start;
- }
- while (!ret && to_write) {
+ range_start = wbc->range_start;
+ pages_skipped = wbc->pages_skipped;
+
+restart_loop:
+ to_write = wbc->nr_to_write;
+ while (!ret && to_write > 0) {
+
+ /*
+ * we insert one extent at a time. So we need
+ * credit needed for single extent allocation.
+ * journalled mode is currently not supported
+ * by delalloc
+ */
+ BUG_ON(ext4_should_journal_data(inode));
+ needed_blocks = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
+
/* start a new transaction*/
handle = ext4_journal_start(inode, needed_blocks);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
+ printk(KERN_EMERG "%s: jbd2_start: "
+ "%ld pages, ino %lu; err %d\n", __func__,
+ wbc->nr_to_write, inode->i_ino, ret);
+ dump_stack();
goto out_writepages;
}
if (ext4_should_order_data(inode)) {
/*
* With ordered mode we need to add
- * the inode to the journal handle
+ * the inode to the journal handl
* when we do block allocation.
*/
ret = ext4_jbd2_file_inode(handle, inode);
@@ -2268,20 +2285,20 @@ static int ext4_da_writepages(struct add
ext4_journal_stop(handle);
goto out_writepages;
}
-
}
- /*
- * set the max dirty pages could be write at a time
- * to fit into the reserved transaction credits
- */
- if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
- wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
to_write -= wbc->nr_to_write;
ret = mpage_da_writepages(mapping, wbc,
- ext4_da_get_block_write);
+ ext4_da_get_block_write);
ext4_journal_stop(handle);
- if (wbc->nr_to_write) {
+ if (ret == MPAGE_DA_EXTENT_TAIL) {
+ /*
+ * got one extent now try with
+ * rest of the pages
+ */
+ to_write += wbc->nr_to_write;
+ ret = 0;
+ } else if (wbc->nr_to_write) {
/*
* There is no more writeout needed
* or we requested for a noblocking writeout
@@ -2293,10 +2310,18 @@ static int ext4_da_writepages(struct add
wbc->nr_to_write = to_write;
}
+ if (wbc->range_cont && (pages_skipped != wbc->pages_skipped)) {
+ /* We skipped pages in this loop */
+ wbc->range_start = range_start;
+ wbc->nr_to_write = to_write +
+ wbc->pages_skipped - pages_skipped;
+ wbc->pages_skipped = pages_skipped;
+ goto restart_loop;
+ }
+
out_writepages:
wbc->nr_to_write = to_write;
- if (range_start)
- wbc->range_start = range_start;
+ wbc->range_start = range_start;
return ret;
}
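The accounting in the patch above can be illustrated with a small userspace
model. This is only a sketch under stated assumptions, not kernel code: the
names toy_wbc, toy_write_one_extent and toy_writepages are hypothetical, and
a "pass" stands in for one journal transaction that writes a single
contiguous extent of dirty pages. As in the patch, the remaining budget is
computed as the old budget minus the pages actually written, and the pages
after the extent are left for the next pass.

```c
#include <stddef.h>

/* Toy stand-in for struct writeback_control (hypothetical, userspace only). */
struct toy_wbc {
	long nr_to_write;	/* remaining page budget, like wbc->nr_to_write */
};

/*
 * Write one contiguous run of dirty pages, as one transaction would.
 * pages[] holds sorted dirty page indexes; *pos is the next unwritten one.
 * Returns the number of pages written in this pass.
 */
static long toy_write_one_extent(const unsigned long *pages, size_t npages,
				 size_t *pos, struct toy_wbc *wbc)
{
	long to_write = wbc->nr_to_write;
	long written = 0;

	while (*pos < npages && written < to_write) {
		/* A gap ends the extent; the tail waits for the next pass. */
		if (written > 0 && pages[*pos] != pages[*pos - 1] + 1)
			break;
		(*pos)++;
		written++;
	}
	/* Same idea as: wbc->nr_to_write = to_write - mpd.pages_written */
	wbc->nr_to_write = to_write - written;
	return written;
}

/* Loop until the budget or the dirty pages run out; returns the pass count. */
static int toy_writepages(const unsigned long *pages, size_t npages,
			  struct toy_wbc *wbc)
{
	size_t pos = 0;
	int passes = 0;

	while (wbc->nr_to_write > 0 && pos < npages) {
		toy_write_one_extent(pages, npages, &pos, wbc);
		passes++;
	}
	return passes;
}
```

With dirty pages 0, 1, 2, 10, 11, 20 and a budget of 5, the model needs two
passes and leaves a budget of 0; page 20 stays dirty for a later call.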
Same as version 1; resent for completeness of version 2 of the credit fixes.
This patch should be merged with ext4-online-defrag-alloc-contiguous-blks.patch
It originally came from ext4_journal_credits_fix_for_writepages.patch,
but was split off when this patch was moved into the stable part of
the patch queue.
Cc: Akira Fujita <[email protected]>
Cc: Takashi Sato <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
Index: linux-2.6.27-rc1/fs/ext4/defrag.c
===================================================================
--- linux-2.6.27-rc1.orig/fs/ext4/defrag.c 2008-08-11 17:15:49.000000000 -0700
+++ linux-2.6.27-rc1/fs/ext4/defrag.c 2008-08-11 19:46:59.000000000 -0700
@@ -186,9 +186,9 @@ ext4_defrag_alloc_blocks(handle_t *handl
struct buffer_head *bh = NULL;
int err, i, credits = 0;
- credits = ext4_ext_calc_credits_for_insert(dest_inode, dest_path);
- err = ext4_ext_journal_restart(handle,
- credits + EXT4_TRANS_META_BLOCKS);
+ credits = ext4_ext_calc_credits_for_single_extent(dest_inode,
+ ar->len, dest_path);
+ err = ext4_ext_journal_restart(handle, credits);
if (err)
return err;
On Fri, Aug 15, 2008 at 05:40:58PM -0700, Mingming Cao wrote:
> Ext4: journal credit fix the delalloc writepages
>
> From: Mingming Cao <[email protected]>
>
> The previous delalloc writepages implementation started a new transaction
> outside a loop of get_block() calls that do the block allocation. Lacking
> information about how many blocks would be allocated, the estimate of the
> journal credits was very conservative and caused many issues.
>
> With the reworked delayed allocation, a new transaction is created for
> each get_block(), so we don't need to guess how many credits multiple
> chunks of allocation need. Starting every transaction with the credits to
> insert a single extent is enough. But we still need to consider the
> journalled mode, where it needs to account for the number of data blocks.
> So we guess the max number of data blocks for each allocation. Because the
> current VFS implementation of writepages() can only flush a PAGEVEC length
> of pages at a time, the max block allocation is limited and calculated based
> on that and the total number of reserved delalloc data blocks, whichever
> is smaller.
Need to update the comment.
>
> Signed-off-by: Mingming Cao <[email protected]>
> ---
> fs/ext4/inode.c | 42 +++++++++++++++++++++++++++---------------
> 1 file changed, 27 insertions(+), 15 deletions(-)
>
> Index: linux-2.6.27-rc3/fs/ext4/inode.c
> ===================================================================
> --- linux-2.6.27-rc3.orig/fs/ext4/inode.c 2008-08-15 14:51:22.000000000 -0700
> +++ linux-2.6.27-rc3/fs/ext4/inode.c 2008-08-15 17:18:09.000000000 -0700
> @@ -1850,8 +1850,11 @@ static void mpage_add_bh_to_extent(struc
> {
> struct buffer_head *lbh = &mpd->lbh;
> sector_t next;
> + int nrblocks = lbh->b_size >> mpd->inode->i_blkbits;
>
> - next = lbh->b_blocknr + (lbh->b_size >> mpd->inode->i_blkbits);
> + /* check if the reserved journal credits might overflow */
> + if (nrblocks > EXT4_MAX_TRANS_DATA)
> + goto flush_it;
Since we don't support data=journal, I am not sure whether we should
limit nrblocks. Also, limiting to EXT4_MAX_TRANS_DATA = 64 blocks
may give highly fragmented files. Maybe we can do this only
for non-extent files, because only for non-extent files do we
depend on the number of blocks when calculating credits, even
if we know that we are going to insert a contiguous chunk.
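To put a rough number on the fragmentation concern, here is a minimal sketch
of the arithmetic. The helper name toy_requests_for_run is hypothetical, and
the model simply assumes that a delalloc run is cut into one allocation
request per cap-sized piece; whether those pieces end up physically
contiguous depends on the allocator and is not modelled here.

```c
/*
 * Number of allocation requests needed to cover a run of nrblocks when
 * each request is capped at max_per_request blocks (ceiling division).
 * Returns 0 if the cap itself is 0, since no request could make progress.
 */
static unsigned long toy_requests_for_run(unsigned long nrblocks,
					  unsigned long max_per_request)
{
	if (max_per_request == 0)
		return 0;
	return (nrblocks + max_per_request - 1) / max_per_request;
}
```

For example, 128 MiB of delayed-allocated data with 4 KiB blocks is 32768
blocks; with a cap of 64 blocks that is 512 separate requests instead of one.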
>
> /*
> * First block in the extent
> @@ -1863,6 +1866,7 @@ static void mpage_add_bh_to_extent(struc
> return;
> }
>
> + next = lbh->b_blocknr + nrblocks;
> /*
> * Can we merge the block to our big extent?
> */
> @@ -1871,6 +1875,7 @@ static void mpage_add_bh_to_extent(struc
> return;
> }
>
> +flush_it:
> /*
> * We couldn't merge the block to our extent, so we
> * need to flush current extent and start new one
> @@ -2231,17 +2236,26 @@ static int ext4_da_writepage(struct page
> }
>
> /*
> - * For now just follow the DIO way to estimate the max credits
> - * needed to write out EXT4_MAX_WRITEBACK_PAGES.
> - * todo: need to calculate the max credits need for
> - * extent based files, currently the DIO credits is based on
> - * indirect-blocks mapping way.
> + * This is called via ext4_da_writepages() to
> + * calculate the total number of credits to reserve to fit
> + * a single extent allocation into a single transaction.
> + * ext4_da_writepages() will loop calling this before
> + * the block allocation.
> *
> - * Probably should have a generic way to calculate credits
> - * for DIO, writepages, and truncate
> + * The page vector size limits the max number of pages that can
> + * be written out at a time. Based on this, the max blocks to pass to
> + * get_block is calculated.
> */
> -#define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
> -#define EXT4_MAX_WRITEBACK_CREDITS 25
> +
> +static int ext4_writepages_trans_blocks(struct inode *inode)
> +{
> + int max_blocks = EXT4_MAX_TRANS_DATA;
> +
> + if (max_blocks > EXT4_I(inode)->i_reserved_data_blocks)
> + max_blocks = EXT4_I(inode)->i_reserved_data_blocks;
> +
> + return ext4_chunk_trans_blocks(inode, max_blocks);
> +}
>
> static int ext4_da_writepages(struct address_space *mapping,
> struct writeback_control *wbc)
> @@ -2283,7 +2297,7 @@ restart_loop:
> * by delalloc
> */
> BUG_ON(ext4_should_journal_data(inode));
> - needed_blocks = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
> + needed_blocks = ext4_writepages_trans_blocks(inode);
>
> /* start a new transaction*/
> handle = ext4_journal_start(inode, needed_blocks);
> @@ -4462,11 +4476,9 @@ int ext4_meta_trans_blocks(struct inode*
> * the modification of a single pages into a single transaction,
> * which may include multile chunk of block allocations.
> *
> - * This could be called via ext4_write_begin() or later
> - * ext4_da_writepages() in delalyed allocation case.
> + * This could be called via ext4_write_begin()
> *
> - * In both case it's possible that we could allocating multiple
> - * chunks of blocks. We need to consider the worse case, when
> + * We need to consider the worse case, when
> * one new block per extent.
> */
> int ext4_writepage_trans_blocks(struct inode *inode)
>
>
-aneesh
On Fri, Aug 15, 2008 at 05:38:33PM -0700, Mingming Cao wrote:
>
> Ext4: journal credits reservation fixes for extent file writepage
>
> From: Mingming Cao <[email protected]>
>
> This patch modifies the writepage/write_begin credit calculation for
> extent files to use the credit calculation helper function.
>
> The current calculation of how many index/leaf blocks should be
> accounted for is too conservative: it always considers the worst case, where
> the tree level is 5, and in the case of multiple chunk allocation, it
> always multiplies the needed credits. This patch uses the actual depth of
> the inode with some extras to calculate the index blocks, and is also less
> conservative in the case of multiple allocation accounting.
>
> Signed-off-by: Mingming Cao <[email protected]>
> ---
> ---
> fs/ext4/ext4_extents.h | 3 +
> fs/ext4/extents.c | 88 ++++++++++++++++---------------------------------
> fs/ext4/migrate.c | 3 +
> 3 files changed, 34 insertions(+), 60 deletions(-)
>
> Index: linux-2.6.27-rc3/fs/ext4/ext4_extents.h
> ===================================================================
> --- linux-2.6.27-rc3.orig/fs/ext4/ext4_extents.h 2008-08-15 14:43:20.000000000 -0700
> +++ linux-2.6.27-rc3/fs/ext4/ext4_extents.h 2008-08-15 14:44:15.000000000 -0700
> @@ -216,7 +216,9 @@ extern int ext4_ext_calc_metadata_amount
> extern ext4_fsblk_t idx_pblock(struct ext4_extent_idx *);
> extern void ext4_ext_store_pblock(struct ext4_extent *, ext4_fsblk_t);
> extern int ext4_extent_tree_init(handle_t *, struct inode *);
> -extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
> +extern int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
> + int num,
> + struct ext4_ext_path *path);
> extern int ext4_ext_try_to_merge(struct inode *inode,
> struct ext4_ext_path *path,
> struct ext4_extent *);
> Index: linux-2.6.27-rc3/fs/ext4/extents.c
> ===================================================================
> --- linux-2.6.27-rc3.orig/fs/ext4/extents.c 2008-08-15 14:43:20.000000000 -0700
> +++ linux-2.6.27-rc3/fs/ext4/extents.c 2008-08-15 14:51:07.000000000 -0700
> @@ -1747,54 +1747,62 @@ static int ext4_ext_rm_idx(handle_t *han
> }
>
> /*
> - * ext4_ext_calc_credits_for_insert:
> - * This routine returns max. credits that the extent tree can consume.
> - * It should be OK for low-performance paths like ->writepage()
> - * To allow many writing processes to fit into a single transaction,
> - * the caller should calculate credits under i_data_sem and
> - * pass the actual path.
> + * ext4_ext_calc_credits_for_single_extent:
> + * This routine returns the max. credits needed to insert an extent
> + * into the extent tree.
> + * When passing the actual path, the caller should calculate credits
> + * under i_data_sem.
> */
> -int ext4_ext_calc_credits_for_insert(struct inode *inode,
> +int ext4_ext_calc_credits_for_single_extent(struct inode *inode, int num,
> struct ext4_ext_path *path)
> {
s/num/nrblocks/
> - int depth, needed;
> -
> if (path) {
> + int depth = ext_depth(inode);
> + int ret;
> +
> /* probably there is space in leaf? */
> - depth = ext_depth(inode);
> if (le16_to_cpu(path[depth].p_hdr->eh_entries)
> - < le16_to_cpu(path[depth].p_hdr->eh_max))
> - return 1;
> - }
> -
> - /*
> - * given 32-bit logical block (4294967296 blocks), max. tree
> - * can be 4 levels in depth -- 4 * 340^4 == 53453440000.
> - * Let's also add one more level for imbalance.
> - */
> - depth = 5;
> + < le16_to_cpu(path[depth].p_hdr->eh_max)) {
>
> - /* allocation of new data block(s) */
> - needed = 2;
-aneesh
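The difference between "always assume depth 5" and "use the actual tree
depth" can be sketched with a toy linear model. The function
toy_insert_credits and its per_level and fixed parameters are hypothetical
and only illustrative; they are not the kernel's formula or values. The
point is just that a cost that grows with depth is overestimated for the
common shallow tree when the worst-case depth is hard-coded.

```c
/*
 * Toy estimate (hypothetical, not the kernel formula): credits grow
 * linearly with extent tree depth, plus a fixed overhead.
 *   depth     - tree depth used for the estimate
 *   per_level - assumed blocks touched per tree level
 *   fixed     - assumed depth-independent overhead
 */
static int toy_insert_credits(int depth, int per_level, int fixed)
{
	return depth * per_level + fixed;
}
```

With the illustrative values per_level = 4 and fixed = 4, a hard-coded depth
of 5 gives 24 credits, while an actual depth of 1 gives 8.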
I've integrated the V2 patches into the patch queue. I've also
renamed the patches in the patch queue to make it easier to update the
right ones.
In the future, please send a new version of the patches in a new mail
thread, so it's easier to separate patches and comments that apply to the
newer version from those that apply to the older version of the patches.
- Ted