2009-10-20 21:00:01

by Eric Sandeen

[permalink] [raw]
Subject: [PATCH, RFC] ext4: flush delalloc blocks when space is low

Creating many small files in rapid succession on a small
filesystem can lead to spurious ENOSPC; on a 104MB filesystem:

for i in `seq 1 22500`; do
echo -n > $SCRATCH_MNT/$i
echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i
done

leads to ENOSPC even though after a sync, 40% of the fs is free
again.

This is because we reserve worst-case metadata for delalloc writes,
and when data is allocated that worst-case reservation was not
needed.

I've added 2 flushers here:

* when free space is low compared to dirty blocks, do an async flush
* when we get a hard ENOSPC, do a sync flush before retry

This resolves the testcase for me.

Signed-off-by: Eric Sandeen <[email protected]>
---

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 1d04189..63519fc 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -605,6 +605,17 @@ int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
*/
int ext4_should_retry_alloc(struct super_block *sb, int *retries)
{
+ /* try a sync to flush delalloc space & free resvd metadata */
+ if (test_opt(sb, DELALLOC) &&
+ *retries == 0 &&
+ !ext4_has_free_blocks(EXT4_SB(sb), 1)) {
+ down_read(&sb->s_umount);
+ sync_inodes_sb(sb);
+ up_read(&sb->s_umount);
+ (*retries)++;
+ return 1;
+ }
+
if (!ext4_has_free_blocks(EXT4_SB(sb), 1) ||
(*retries)++ > 3 ||
!EXT4_SB(sb)->s_journal)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5c5bc5d..27c8b9b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3024,11 +3024,18 @@ static int ext4_nonda_switch(struct super_block *sb)
if (2 * free_blocks < 3 * dirty_blocks ||
free_blocks < (dirty_blocks + EXT4_FREEBLOCKS_WATERMARK)) {
/*
- * free block count is less that 150% of dirty blocks
- * or free blocks is less that watermark
+ * free block count is less than 150% of dirty blocks
+ * or free blocks is less than watermark
*/
return 1;
}
+ /*
+ * Even if we don't switch but are nearing capacity,
+ * start pushing delalloc when 1/2 of free blocks are dirty.
+ */
+ if (free_blocks < 2 * dirty_blocks)
+ writeback_inodes_sb(sb);
+
return 0;
}





2009-10-21 02:37:48

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH, RFC] ext4: flush delalloc blocks when space is low

Eric Sandeen wrote:
> Creating many small files in rapid succession on a small
> filesystem can lead to spurious ENOSPC; on a 104MB filesystem:
>
> for i in `seq 1 22500`; do
> echo -n > $SCRATCH_MNT/$i
> echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i
> done
>
> leads to ENOSPC even though after a sync, 40% of the fs is free
> again.
>
> This is because we reserve worst-case metadata for delalloc writes,
> and when data is allocated that worst-case reservation was not
> needed.
>
> I've added 2 flushers here:
>
> * when free space is low compared to dirty blocks, do an async flush
> * when we get a hard ENOSPC, do a sync flush before retry
>
> This resolves the testcase for me.

argh, but fails xfstests 083 at least:

# Exercise filesystem full behaviour - run numerous fsstress
# processes in write mode on a small filesystem. NB: delayed
# allocate flushing is quite deadlock prone at the filesystem
# full boundary due to the fact that we will retry allocation
# several times after flushing, before giving back ENOSPC.

and indeed I deadlock ;) so don't merge this yet!

-Eric

> Signed-off-by: Eric Sandeen <[email protected]>
> ---
>
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index 1d04189..63519fc 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -605,6 +605,17 @@ int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
> */
> int ext4_should_retry_alloc(struct super_block *sb, int *retries)
> {
> + /* try a sync to flush delalloc space & free resvd metadata */
> + if (test_opt(sb, DELALLOC) &&
> + *retries == 0 &&
> + !ext4_has_free_blocks(EXT4_SB(sb), 1)) {
> + down_read(&sb->s_umount);
> + sync_inodes_sb(sb);
> + up_read(&sb->s_umount);
> + (*retries)++;
> + return 1;
> + }
> +
> if (!ext4_has_free_blocks(EXT4_SB(sb), 1) ||
> (*retries)++ > 3 ||
> !EXT4_SB(sb)->s_journal)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5c5bc5d..27c8b9b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3024,11 +3024,18 @@ static int ext4_nonda_switch(struct super_block
> *sb)
> if (2 * free_blocks < 3 * dirty_blocks ||
> free_blocks < (dirty_blocks + EXT4_FREEBLOCKS_WATERMARK)) {
> /*
> - * free block count is less that 150% of dirty blocks
> - * or free blocks is less that watermark
> + * free block count is less than 150% of dirty blocks
> + * or free blocks is less than watermark
> */
> return 1;
> }
> + /*
> + * Even if we don't switch but are nearing capacity,
> + * start pushing delalloc when 1/2 of free blocks are dirty.
> + */
> + if (free_blocks < 2 * dirty_blocks)
> + writeback_inodes_sb(sb);
> +
> return 0;
> }
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2009-10-21 19:51:01

by Eric Sandeen

[permalink] [raw]
Subject: [PATCH, RFC V2] ext4: flush delalloc blocks when space is low

Creating many small files in rapid succession on a small
filesystem can lead to spurious ENOSPC; on a 104MB filesystem:

for i in `seq 1 22500`; do
echo -n > $SCRATCH_MNT/$i
echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i
done

leads to ENOSPC even though after a sync, 40% of the fs is free
again.

This is because we reserve worst-case metadata for delalloc writes,
and when data is allocated that worst-case reservation was not
needed.

I've added 2 flushers here:

* when free space is low compared to dirty blocks, do an async flush
* when we get a hard ENOSPC, do a sync flush before retry

This resolves the testcase for me, and survives all 4 generic
ENOSPC tests in xfstests.

V2: don't try to sync if we're still in a (probably nested) transaction.

Thanks to Josef for pointing out that possibility.

Signed-off-by: Eric Sandeen <[email protected]>
---

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 1d04189..28bde58 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -605,11 +605,27 @@ int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
*/
int ext4_should_retry_alloc(struct super_block *sb, int *retries)
{
- if (!ext4_has_free_blocks(EXT4_SB(sb), 1) ||
+ s64 dirtyblocks = 0;
+ struct percpu_counter *dbc = &EXT4_SB(sb)->s_dirtyblocks_counter;
+
+ if (test_opt(sb, DELALLOC))
+ dirtyblocks = percpu_counter_read_positive(dbc);
+
+ if ((!ext4_has_free_blocks(EXT4_SB(sb), 1) && !dirtyblocks) ||
(*retries)++ > 3 ||
!EXT4_SB(sb)->s_journal)
return 0;

+ /* try a sync to flush delalloc space & free resvd metadata */
+ if (!ext4_has_free_blocks(EXT4_SB(sb), 1) && dirtyblocks) {
+ if (!ext4_journal_current_handle()) {
+ down_read(&sb->s_umount);
+ sync_inodes_sb(sb);
+ up_read(&sb->s_umount);
+ return 1;
+ }
+ }
+
jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id);

return jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5c5bc5d..27c8b9b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3024,11 +3024,18 @@ static int ext4_nonda_switch(struct super_block *sb)
if (2 * free_blocks < 3 * dirty_blocks ||
free_blocks < (dirty_blocks + EXT4_FREEBLOCKS_WATERMARK)) {
/*
- * free block count is less that 150% of dirty blocks
- * or free blocks is less that watermark
+ * free block count is less than 150% of dirty blocks
+ * or free blocks is less than watermark
*/
return 1;
}
+ /*
+ * Even if we don't switch but are nearing capacity,
+ * start pushing delalloc when 1/2 of free blocks are dirty.
+ */
+ if (free_blocks < 2 * dirty_blocks)
+ writeback_inodes_sb(sb);
+
return 0;
}




2009-11-05 14:09:08

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH, RFC V2] ext4: flush delalloc blocks when space is low

> Creating many small files in rapid succession on a small
> filesystem can lead to spurious ENOSPC; on a 104MB filesystem:
>
> for i in `seq 1 22500`; do
> echo -n > $SCRATCH_MNT/$i
> echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i
> done
>
> leads to ENOSPC even though after a sync, 40% of the fs is free
> again.
>
> This is because we reserve worst-case metadata for delalloc writes,
> and when data is allocated that worst-case reservation was not
> needed.
>
> I've added 2 flushers here:
>
> * when free space is low compared to dirty blocks, do an async flush
> * when we get a hard ENOSPC, do a sync flush before retry
>
> This resolves the testcase for me, and survives all 4 generic
> ENOSPC tests in xfstests.
>
> V2: don't try to sync if we're still in a (probably nested) transaction.
>
> Thanks to Josef for pointing out that possibility.
I still think it's deadlockable... See below.

> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index 1d04189..28bde58 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -605,11 +605,27 @@ int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
> */
> int ext4_should_retry_alloc(struct super_block *sb, int *retries)
> {
> - if (!ext4_has_free_blocks(EXT4_SB(sb), 1) ||
> + s64 dirtyblocks = 0;
> + struct percpu_counter *dbc = &EXT4_SB(sb)->s_dirtyblocks_counter;
> +
> + if (test_opt(sb, DELALLOC))
> + dirtyblocks = percpu_counter_read_positive(dbc);
> +
> + if ((!ext4_has_free_blocks(EXT4_SB(sb), 1) && !dirtyblocks) ||
> (*retries)++ > 3 ||
> !EXT4_SB(sb)->s_journal)
> return 0;
>
> + /* try a sync to flush delalloc space & free resvd metadata */
> + if (!ext4_has_free_blocks(EXT4_SB(sb), 1) && dirtyblocks) {
> + if (!ext4_journal_current_handle()) {
> + down_read(&sb->s_umount);
> + sync_inodes_sb(sb);
> + up_read(&sb->s_umount);
ext4_should_retry_alloc() is called quite deep from the filesystem. In
particular we can hold i_mutex of some inodes etc. So I'd almost bet
that taking s_umount sem here violates lock ranking in some code paths
(an easy check would be to enable lockdep and stress the filesystem a
bit).
Also calling sync_inodes_sb() with i_mutex held just seems as a bad
thing to do although I don't see where it could deadlock and so it's
probably just a matter of taste...
If we start writeback from ext4_nonda_switch as you do below, I think
that we should get decent results even without synchronous writeback in
the allocation path (maybe we'd need to tweak a bit the logic in
ext4_nonda_switch to provide more time for writeback thread to catchup).

Honza

> + return 1;
> + }
> + }
> +
> jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id);
>
> return jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5c5bc5d..27c8b9b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3024,11 +3024,18 @@ static int ext4_nonda_switch(struct super_block *sb)
> if (2 * free_blocks < 3 * dirty_blocks ||
> free_blocks < (dirty_blocks + EXT4_FREEBLOCKS_WATERMARK)) {
> /*
> - * free block count is less that 150% of dirty blocks
> - * or free blocks is less that watermark
> + * free block count is less than 150% of dirty blocks
> + * or free blocks is less than watermark
> */
> return 1;
> }
> + /*
> + * Even if we don't switch but are nearing capacity,
> + * start pushing delalloc when 1/2 of free blocks are dirty.
> + */
> + if (free_blocks < 2 * dirty_blocks)
> + writeback_inodes_sb(sb);
> +
> return 0;
> }
--
Jan Kara <[email protected]>
SuSE CR Labs

2009-11-05 15:45:43

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH, RFC V2] ext4: flush delalloc blocks when space is low

Jan Kara wrote:
...

>> + /* try a sync to flush delalloc space & free resvd metadata */
>> + if (!ext4_has_free_blocks(EXT4_SB(sb), 1) && dirtyblocks) {
>> + if (!ext4_journal_current_handle()) {
>> + down_read(&sb->s_umount);
>> + sync_inodes_sb(sb);
>> + up_read(&sb->s_umount);
> ext4_should_retry_alloc() is called quite deep from the filesystem. In
> particular we can hold i_mutex of some inodes etc. So I'd almost bet
> that taking s_umount sem here violates lock ranking in some code paths
> (an easy check would be to enable lockdep and stress the filesystem a
> bit).
> Also calling sync_inodes_sb() with i_mutex held just seems as a bad
> thing to do although I don't see where it could deadlock and so it's
> probably just a matter of taste...

Well, to be honest I agree with you ;) It does still feel like a hack.

> If we start writeback from ext4_nonda_switch as you do below, I think
> that we should get decent results even without synchronous writeback in
> the allocation path (maybe we'd need to tweak a bit the logic in
> ext4_nonda_switch to provide more time for writeback thread to catchup).

I think starting writeback helps a lot, but it seems that in the end we
still need a synchronous attempt when we hit a real enocpc... after I
finish dealing with this corruption thing I'll come back and look at this.

Maybe we should put the writeback in for now, and worry about the
synchronous sync-up later?

Thanks for the review,

-Eric

> Honza

2009-11-05 16:05:34

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH, RFC V2] ext4: flush delalloc blocks when space is low

> Jan Kara wrote:
> ...
>
> >> + /* try a sync to flush delalloc space & free resvd metadata */
> >> + if (!ext4_has_free_blocks(EXT4_SB(sb), 1) && dirtyblocks) {
> >> + if (!ext4_journal_current_handle()) {
> >> + down_read(&sb->s_umount);
> >> + sync_inodes_sb(sb);
> >> + up_read(&sb->s_umount);
> > ext4_should_retry_alloc() is called quite deep from the filesystem. In
> > particular we can hold i_mutex of some inodes etc. So I'd almost bet
> > that taking s_umount sem here violates lock ranking in some code paths
> > (an easy check would be to enable lockdep and stress the filesystem a
> > bit).
> > Also calling sync_inodes_sb() with i_mutex held just seems as a bad
> > thing to do although I don't see where it could deadlock and so it's
> > probably just a matter of taste...
>
> Well, to be honest I agree with you ;) It does still feel like a hack.
>
> > If we start writeback from ext4_nonda_switch as you do below, I think
> > that we should get decent results even without synchronous writeback in
> > the allocation path (maybe we'd need to tweak a bit the logic in
> > ext4_nonda_switch to provide more time for writeback thread to catchup).
>
> I think starting writeback helps a lot, but it seems that in the end we
> still need a synchronous attempt when we hit a real enocpc... after I
> finish dealing with this corruption thing I'll come back and look at this.
Without the synchronous attempt, it will never be perfect, that is
correct. But it could be quite close to perfect...

> Maybe we should put the writeback in for now, and worry about the
> synchronous sync-up later?
Yes, I'd do that for now.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs