2009-12-31 15:50:32

by Theodore Ts'o

[permalink] [raw]
Subject: fsstress-induced corruption reproduced

One of the things which has been annoying me for a while now is a
hard-to-reproduce xfsqa failure in test #13 (fsstress), which causes the
a test failure because the file system found to be inconsistent:

Inode NNN, i_blocks is X, should be Y.

I finally reproduced it; the problem happens when we fallocate() a
region of the file which we had recently written, and which is still in
the page cache marked as delayed allocation blocks. When we finally
write those blocks out, since they are marked BH_Delay,
ext4_get_blocks() calls ext4_da_update_reserve_space(), which ends up
bumping i_blocks a second time and charging the blocks against the
user's quota a second time. Oops.

Fortunately the fsck problem is one that will be fixed with a preen (and
if quota is enabled, a quotacheck), so it's not super serious, but we
should fix it when we have a chance. If anyone has time to look at it,
please let me know. Otherwise, I'll put it on my todo list. I don't
consider seriously urgent since the case is highly unlikely to occur in
real life, and it doesn't have any security implications; the worst an
attacker could do is end up charging excesss quota to herself.

I've included a simple reproduction case below; if you run this program,
it will create a file "test-file" in the current working directory which
will appear to be 32k, even though it is really only 16k long, and if
you then unmount the test file system and run e2fsck -p on it, you will get
the error message:

Inode XXX, i_blocks is 64, should be 32. FIXED.

- Ted

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#include <fcntl.h>
#include <fcntl.h>

#define BUFSIZE 1024

int main(int argc, char **argv)
{
int i, fd, ret;
char buf[BUFSIZE];

fd = open("test-file", O_RDWR|O_CREAT|O_TRUNC, 0644);
if (fd < 0) {
perror("open");
exit(1);
}
memset(&buf, 0, BUFSIZE);
for (i=0; i < 16; i++) {
ret = write(fd, &buf, BUFSIZE);
if (ret < 0) {
perror("write");
exit(1);
}
if (ret != BUFSIZE) {
fprintf(stderr, "Write return expected %d, got %d\n",
BUFSIZE, ret);
exit(1);
}
}
ret = fallocate(fd, 0, 0, 16384);
if (ret < 0) {
perror("fallocate");
exit(1);
}
ret = fsync(fd);
if (ret < 0) {
perror("fsync");
exit(1);
}
ret = close(fd);
if (ret < 0) {
perror("close");
exit(1);
}
exit(0);
}


2010-01-04 20:13:27

by Eric Sandeen

[permalink] [raw]
Subject: Re: fsstress-induced corruption reproduced

Theodore Ts'o wrote:
> One of the things which has been annoying me for a while now is a
> hard-to-reproduce xfsqa failure in test #13 (fsstress), which causes the
> a test failure because the file system found to be inconsistent:
>
> Inode NNN, i_blocks is X, should be Y.

Interesting, this apparently has gotten much worse since 2.6.32.

I wrote an xfstests reproducer, and couldn't hit it on .32; hit it right
off on 2.6.33-rc2.

Probably should find out why ;) I'll go take a look.

-Eric

> I finally reproduced it; the problem happens when we fallocate() a
> region of the file which we had recently written, and which is still in
> the page cache marked as delayed allocation blocks. When we finally
> write those blocks out, since they are marked BH_Delay,
> ext4_get_blocks() calls ext4_da_update_reserve_space(), which ends up
> bumping i_blocks a second time and charging the blocks against the
> user's quota a second time. Oops.
>
> Fortunately the fsck problem is one that will be fixed with a preen (and
> if quota is enabled, a quotacheck), so it's not super serious, but we
> should fix it when we have a chance. If anyone has time to look at it,
> please let me know. Otherwise, I'll put it on my todo list. I don't
> consider seriously urgent since the case is highly unlikely to occur in
> real life, and it doesn't have any security implications; the worst an
> attacker could do is end up charging excesss quota to herself.
>
> I've included a simple reproduction case below; if you run this program,
> it will create a file "test-file" in the current working directory which
> will appear to be 32k, even though it is really only 16k long, and if
> you then unmount the test file system and run e2fsck -p on it, you will get
> the error message:
>
> Inode XXX, i_blocks is 64, should be 32. FIXED.
>
> - Ted
>
> #define _GNU_SOURCE
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <string.h>
> #include <sys/types.h>
> #include <fcntl.h>
> #include <fcntl.h>
>
> #define BUFSIZE 1024
>
> int main(int argc, char **argv)
> {
> int i, fd, ret;
> char buf[BUFSIZE];
>
> fd = open("test-file", O_RDWR|O_CREAT|O_TRUNC, 0644);
> if (fd < 0) {
> perror("open");
> exit(1);
> }
> memset(&buf, 0, BUFSIZE);
> for (i=0; i < 16; i++) {
> ret = write(fd, &buf, BUFSIZE);
> if (ret < 0) {
> perror("write");
> exit(1);
> }
> if (ret != BUFSIZE) {
> fprintf(stderr, "Write return expected %d, got %d\n",
> BUFSIZE, ret);
> exit(1);
> }
> }
> ret = fallocate(fd, 0, 0, 16384);
> if (ret < 0) {
> perror("fallocate");
> exit(1);
> }
> ret = fsync(fd);
> if (ret < 0) {
> perror("fsync");
> exit(1);
> }
> ret = close(fd);
> if (ret < 0) {
> perror("close");
> exit(1);
> }
> exit(0);
> }
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2010-01-04 23:08:57

by Eric Sandeen

[permalink] [raw]
Subject: Re: fsstress-induced corruption reproduced

Eric Sandeen wrote:
> Theodore Ts'o wrote:
>> One of the things which has been annoying me for a while now is a
>> hard-to-reproduce xfsqa failure in test #13 (fsstress), which causes the
>> a test failure because the file system found to be inconsistent:
>>
>> Inode NNN, i_blocks is X, should be Y.
>
> Interesting, this apparently has gotten much worse since 2.6.32.
>
> I wrote an xfstests reproducer, and couldn't hit it on .32; hit it right
> off on 2.6.33-rc2.
>
> Probably should find out why ;) I'll go take a look.

commit d21cd8f163ac44b15c465aab7306db931c606908
Author: Dmitry Monakhov <[email protected]>
Date: Thu Dec 10 03:31:45 2009 +0000

ext4: Fix potential quota deadlock

seems to be the culprit.

(unfortunately this means that the error we saw before is something
-else- to be fixed, yet) Anyway ...

This is because we used to do this in ext4_mb_mark_diskspace_used() :

/*
* Now reduce the dirty block count also. Should not go negative
*/
if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED))
/* release all the reserved blocks if non delalloc */
percpu_counter_sub(&sbi->s_dirtyblocks_counter,
reserv_blks);
else {
percpu_counter_sub(&sbi->s_dirtyblocks_counter,
ac->ac_b_ex.fe_len);
/* convert reserved quota blocks to real quota blocks */
vfs_dq_claim_block(ac->ac_inode, ac->ac_b_ex.fe_len);
}

i.e. the vfs_dq_claim_block was conditional based on
EXT4_MB_DELALLOC_RESERVED... and the testcase did not go that way,
because we had already preallocated the blocks.

But with the above quota deadlock commit it's not unconditional
anymore in ext4_da_update_reserve_space and we always call
vfs_dq_claim_block which over-accounts.

Of course with the above commit, we have no allocation context in
ext4_da_update_reserve_space... that's all long gone so we can't key
on that anymore.

However, I think the following change will fix it; I'll run it through
xfstests later on and be sure nothing else regresses.

-Eric

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5352db1..28cd8d8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1257,9 +1257,10 @@ int ext4_get_blocks(handle_t *handle, struct
inode *inode, sector_t block,
* if the caller is from delayed allocation writeout path
* we have already reserved fs blocks for allocation
* let the underlying get_block() function know to
- * avoid double accounting
+ * avoid double accounting. Ditto for prealloc blocks.
*/
- if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
+ if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE ||
+ flags & EXT4_GET_BLOCKS_UNINIT_EXT)
EXT4_I(inode)->i_delalloc_reserved_flag = 1;
/*
* We need to check for EXT4 here because migrate

-Eric

> -Eric
>
>> I finally reproduced it; the problem happens when we fallocate() a
>> region of the file which we had recently written, and which is still in
>> the page cache marked as delayed allocation blocks. When we finally
>> write those blocks out, since they are marked BH_Delay,
>> ext4_get_blocks() calls ext4_da_update_reserve_space(), which ends up
>> bumping i_blocks a second time and charging the blocks against the
>> user's quota a second time. Oops.
>>
>> Fortunately the fsck problem is one that will be fixed with a preen (and
>> if quota is enabled, a quotacheck), so it's not super serious, but we
>> should fix it when we have a chance. If anyone has time to look at it,
>> please let me know. Otherwise, I'll put it on my todo list. I don't
>> consider seriously urgent since the case is highly unlikely to occur in
>> real life, and it doesn't have any security implications; the worst an
>> attacker could do is end up charging excesss quota to herself.
>>
>> I've included a simple reproduction case below; if you run this program,
>> it will create a file "test-file" in the current working directory which
>> will appear to be 32k, even though it is really only 16k long, and if
>> you then unmount the test file system and run e2fsck -p on it, you will get
>> the error message:
>>
>> Inode XXX, i_blocks is 64, should be 32. FIXED.
>>
>> - Ted
>>
>> #define _GNU_SOURCE
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <string.h>
>> #include <sys/types.h>
>> #include <fcntl.h>
>> #include <fcntl.h>
>>
>> #define BUFSIZE 1024
>>
>> int main(int argc, char **argv)
>> {
>> int i, fd, ret;
>> char buf[BUFSIZE];
>>
>> fd = open("test-file", O_RDWR|O_CREAT|O_TRUNC, 0644);
>> if (fd < 0) {
>> perror("open");
>> exit(1);
>> }
>> memset(&buf, 0, BUFSIZE);
>> for (i=0; i < 16; i++) {
>> ret = write(fd, &buf, BUFSIZE);
>> if (ret < 0) {
>> perror("write");
>> exit(1);
>> }
>> if (ret != BUFSIZE) {
>> fprintf(stderr, "Write return expected %d, got %d\n",
>> BUFSIZE, ret);
>> exit(1);
>> }
>> }
>> ret = fallocate(fd, 0, 0, 16384);
>> if (ret < 0) {
>> perror("fallocate");
>> exit(1);
>> }
>> ret = fsync(fd);
>> if (ret < 0) {
>> perror("fsync");
>> exit(1);
>> }
>> ret = close(fd);
>> if (ret < 0) {
>> perror("close");
>> exit(1);
>> }
>> exit(0);
>> }
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2010-01-05 06:17:34

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: fsstress-induced corruption reproduced

On Mon, Jan 04, 2010 at 05:08:55PM -0600, Eric Sandeen wrote:
> Eric Sandeen wrote:
> > Theodore Ts'o wrote:
> >> One of the things which has been annoying me for a while now is a
> >> hard-to-reproduce xfsqa failure in test #13 (fsstress), which causes the
> >> a test failure because the file system found to be inconsistent:
> >>
> >> Inode NNN, i_blocks is X, should be Y.
> >
> > Interesting, this apparently has gotten much worse since 2.6.32.
> >
> > I wrote an xfstests reproducer, and couldn't hit it on .32; hit it right
> > off on 2.6.33-rc2.
> >
> > Probably should find out why ;) I'll go take a look.
>
> commit d21cd8f163ac44b15c465aab7306db931c606908
> Author: Dmitry Monakhov <[email protected]>
> Date: Thu Dec 10 03:31:45 2009 +0000
>
> ext4: Fix potential quota deadlock
>
> seems to be the culprit.
>
> (unfortunately this means that the error we saw before is something
> -else- to be fixed, yet) Anyway ...
>
> This is because we used to do this in ext4_mb_mark_diskspace_used() :
>
> /*
> * Now reduce the dirty block count also. Should not go negative
> */
> if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED))
> /* release all the reserved blocks if non delalloc */
> percpu_counter_sub(&sbi->s_dirtyblocks_counter,
> reserv_blks);
> else {
> percpu_counter_sub(&sbi->s_dirtyblocks_counter,
> ac->ac_b_ex.fe_len);
> /* convert reserved quota blocks to real quota blocks */
> vfs_dq_claim_block(ac->ac_inode, ac->ac_b_ex.fe_len);
> }
>
> i.e. the vfs_dq_claim_block was conditional based on
> EXT4_MB_DELALLOC_RESERVED... and the testcase did not go that way,
> because we had already preallocated the blocks.
>
> But with the above quota deadlock commit it's not unconditional
> anymore in ext4_da_update_reserve_space and we always call
> vfs_dq_claim_block which over-accounts.
>

It is still conditional right ? We call ext4_da_update_reserve_space
only if EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE is set . That will
happen only in case of delayed allocation. I guess the problem is
same as what Ted stated. But i am not sure why we are able to reproduce
it much easily on 2.6.33-rc2.


> Of course with the above commit, we have no allocation context in
> ext4_da_update_reserve_space... that's all long gone so we can't key
> on that anymore.
>
> However, I think the following change will fix it; I'll run it through
> xfstests later on and be sure nothing else regresses.
>
> -Eric
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5352db1..28cd8d8 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1257,9 +1257,10 @@ int ext4_get_blocks(handle_t *handle, struct
> inode *inode, sector_t block,
> * if the caller is from delayed allocation writeout path
> * we have already reserved fs blocks for allocation
> * let the underlying get_block() function know to
> - * avoid double accounting
> + * avoid double accounting. Ditto for prealloc blocks.
> */
> - if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
> + if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE ||
> + flags & EXT4_GET_BLOCKS_UNINIT_EXT)
> EXT4_I(inode)->i_delalloc_reserved_flag = 1;
> /*
> * We need to check for EXT4 here because migrate
>


But we need to do quota update during fallocate call. Doing the above
will result we not doing that. We would will also get block accounting
wrong because we now won't be doing ext4_claim_free_blocks for fallocate.

I guess what we need is to make sure that if we have any buffer_head mapping
the same block range allocated via fallocate and if they are marked BH_Delay
we need to clear the delay flag and update the block reservation. Later during
writepage we will find these buffer_heads mapped/non-delay and will do the right thing.

-aneesh

2010-01-05 14:40:58

by Eric Sandeen

[permalink] [raw]
Subject: Re: fsstress-induced corruption reproduced

Aneesh Kumar K.V wrote:
> On Mon, Jan 04, 2010 at 05:08:55PM -0600, Eric Sandeen wrote:
>> Eric Sandeen wrote:
>>> Theodore Ts'o wrote:
>>>> One of the things which has been annoying me for a while now is a
>>>> hard-to-reproduce xfsqa failure in test #13 (fsstress), which causes the
>>>> a test failure because the file system found to be inconsistent:
>>>>
>>>> Inode NNN, i_blocks is X, should be Y.
>>> Interesting, this apparently has gotten much worse since 2.6.32.
>>>
>>> I wrote an xfstests reproducer, and couldn't hit it on .32; hit it right
>>> off on 2.6.33-rc2.
>>>
>>> Probably should find out why ;) I'll go take a look.
>> commit d21cd8f163ac44b15c465aab7306db931c606908
>> Author: Dmitry Monakhov <[email protected]>
>> Date: Thu Dec 10 03:31:45 2009 +0000
>>
>> ext4: Fix potential quota deadlock
>>
>> seems to be the culprit.
>>
>> (unfortunately this means that the error we saw before is something
>> -else- to be fixed, yet) Anyway ...
>>
>> This is because we used to do this in ext4_mb_mark_diskspace_used() :
>>
>> /*
>> * Now reduce the dirty block count also. Should not go negative
>> */
>> if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED))
>> /* release all the reserved blocks if non delalloc */
>> percpu_counter_sub(&sbi->s_dirtyblocks_counter,
>> reserv_blks);
>> else {
>> percpu_counter_sub(&sbi->s_dirtyblocks_counter,
>> ac->ac_b_ex.fe_len);
>> /* convert reserved quota blocks to real quota blocks */
>> vfs_dq_claim_block(ac->ac_inode, ac->ac_b_ex.fe_len);
>> }
>>
>> i.e. the vfs_dq_claim_block was conditional based on
>> EXT4_MB_DELALLOC_RESERVED... and the testcase did not go that way,
>> because we had already preallocated the blocks.
>>
>> But with the above quota deadlock commit it's not unconditional
>> anymore in ext4_da_update_reserve_space and we always call
>> vfs_dq_claim_block which over-accounts.
>>
>
> It is still conditional right ? We call ext4_da_update_reserve_space
> only if EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE is set . That will
> happen only in case of delayed allocation. I guess the problem is
> same as what Ted stated. But i am not sure why we are able to reproduce
> it much easily on 2.6.33-rc2.
>

Well, I'll take another look. But back out the above commit and I think
you'll see that it changed things to make it 100% reproducible.

-Eric


2010-01-05 23:38:01

by Eric Sandeen

[permalink] [raw]
Subject: Re: fsstress-induced corruption reproduced

Aneesh Kumar K.V wrote:
> On Mon, Jan 04, 2010 at 05:08:55PM -0600, Eric Sandeen wrote:
>
>> Eric Sandeen wrote:
>>
>>> Theodore Ts'o wrote:
>>>
>>>> One of the things which has been annoying me for a while now is a
>>>> hard-to-reproduce xfsqa failure in test #13 (fsstress), which causes the
>>>> a test failure because the file system found to be inconsistent:
>>>>
>>>> Inode NNN, i_blocks is X, should be Y.
>>>>
>>> Interesting, this apparently has gotten much worse since 2.6.32.
>>>
>>> I wrote an xfstests reproducer, and couldn't hit it on .32; hit it right
>>> off on 2.6.33-rc2.
>>>
>>> Probably should find out why ;) I'll go take a look.
>>>
>> commit d21cd8f163ac44b15c465aab7306db931c606908
>> Author: Dmitry Monakhov <[email protected]>
>> Date: Thu Dec 10 03:31:45 2009 +0000
>>
>> ext4: Fix potential quota deadlock
>>
>> seems to be the culprit.
>>
>> (unfortunately this means that the error we saw before is something
>> -else- to be fixed, yet) Anyway ...
>>
>> This is because we used to do this in ext4_mb_mark_diskspace_used() :
>>
>> /*
>> * Now reduce the dirty block count also. Should not go negative
>> */
>> if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED))
>> /* release all the reserved blocks if non delalloc */
>> percpu_counter_sub(&sbi->s_dirtyblocks_counter,
>> reserv_blks);
>> else {
>> percpu_counter_sub(&sbi->s_dirtyblocks_counter,
>> ac->ac_b_ex.fe_len);
>> /* convert reserved quota blocks to real quota blocks */
>> vfs_dq_claim_block(ac->ac_inode, ac->ac_b_ex.fe_len);
>> }
>>
>> i.e. the vfs_dq_claim_block was conditional based on
>> EXT4_MB_DELALLOC_RESERVED... and the testcase did not go that way,
>> because we had already preallocated the blocks.
>>
>> But with the above quota deadlock commit it's not unconditional
>> anymore in ext4_da_update_reserve_space and we always call
>> vfs_dq_claim_block which over-accounts.
>>
>>
>
> It is still conditional right ? We call ext4_da_update_reserve_space
> only if EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE is set . That will
> happen only in case of delayed allocation. I guess the problem is
> same as what Ted stated. But i am not sure why we are able to reproduce
> it much easily on 2.6.33-rc2.
>
>
Maybe something like this works:

Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c
+++ linux-2.6/fs/ext4/inode.c
@@ -1203,6 +1203,7 @@ int ext4_get_blocks(handle_t *handle, st
int flags)
{
int retval;
+ int was_unwritten;

clear_buffer_mapped(bh);
clear_buffer_unwritten(bh);
@@ -1253,9 +1254,13 @@ int ext4_get_blocks(handle_t *handle, st
* part of the uninitialized extent to be an initialized
* extent. This is because we need to avoid the combination
* of BH_Unwritten and BH_Mapped flags being simultaneously
- * set on the buffer_head.
+ * set on the buffer_head. However, if it was unwritten we
+ * don't want to update reserved space later.
*/
- clear_buffer_unwritten(bh);
+ if (buffer_unwritten(bh)) {
+ was_unwritten = 1;
+ clear_buffer_unwritten(bh);
+ }

/*
* New blocks allocate and/or writing to uninitialized extent
@@ -1301,7 +1306,8 @@ int ext4_get_blocks(handle_t *handle, st
* Update reserved blocks/metadata blocks after successful
* block allocation which had been deferred till now.
*/
- if ((retval > 0) && (flags & EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE))
+ if ((retval > 0) && !was_unwritten &&
+ (flags & EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE))
ext4_da_update_reserve_space(inode, retval);

up_write((&EXT4_I(inode)->i_data_sem));

but that might leave the previous reservations hanging around from
prior to the fallocate ...

-Eric



2010-01-06 08:50:15

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: fsstress-induced corruption reproduced

> Maybe something like this works:
>
> Index: linux-2.6/fs/ext4/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/ext4/inode.c
> +++ linux-2.6/fs/ext4/inode.c
> @@ -1203,6 +1203,7 @@ int ext4_get_blocks(handle_t *handle, st
> int flags)
> {
> int retval;
> + int was_unwritten;
>
> clear_buffer_mapped(bh);
> clear_buffer_unwritten(bh);
> @@ -1253,9 +1254,13 @@ int ext4_get_blocks(handle_t *handle, st
> * part of the uninitialized extent to be an initialized
> * extent. This is because we need to avoid the combination
> * of BH_Unwritten and BH_Mapped flags being simultaneously
> - * set on the buffer_head.
> + * set on the buffer_head. However, if it was unwritten we
> + * don't want to update reserved space later.
> */
> - clear_buffer_unwritten(bh);
> + if (buffer_unwritten(bh)) {
> + was_unwritten = 1;
> + clear_buffer_unwritten(bh);
> + }


That won't work because we can do the fallocate after we did the ext4_ext_get_block
with create = 0 and before we do down_write below. So we would still have was_unwritten = 0


>
> /*
> * New blocks allocate and/or writing to uninitialized extent
> @@ -1301,7 +1306,8 @@ int ext4_get_blocks(handle_t *handle, st
> * Update reserved blocks/metadata blocks after successful
> * block allocation which had been deferred till now.
> */
> - if ((retval > 0) && (flags & EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE))
> + if ((retval > 0) && !was_unwritten &&
> + (flags & EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE))
> ext4_da_update_reserve_space(inode, retval);
>
> up_write((&EXT4_I(inode)->i_data_sem));
>
> but that might leave the previous reservations hanging around from
> prior to the fallocate ...
>

What commit d21cd8f163ac44b15c465aab7306db931c606908 did was to move
quota claim to ext4_get_block function. Earlier we didn't do a
quota claim if we happened to write to fallocate area because
ext4_ext_handle_uninitialized_extents didn't call ext4_mb_mark_diskspace_used
So that should explain why we are seeing this problem with
d21cd8f163ac44b15c465aab7306db931c606908.

How about the patch below ?

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index af7b626..b98de17 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1443,6 +1443,8 @@ extern int ext4_block_truncate_page(handle_t *handle,
extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
extern qsize_t *ext4_get_reserved_space(struct inode *inode);
extern int flush_aio_dio_completed_IO(struct inode *inode);
+extern void ext4_da_update_reserve_space(struct inode *inode,
+ int used, int quota_claim);
/* ioctl.c */
extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 7d7b74e..3b6ff72 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3132,7 +3132,19 @@ out:
unmap_underlying_metadata_blocks(inode->i_sb->s_bdev,
newblock + max_blocks,
allocated - max_blocks);
+ allocated = max_blocks;
}
+
+ /*
+ * If we have done fallocate with the offset that is already
+ * delayed allocated, we would have block reservation
+ * and quota reservation done in the delayed write path.
+ * But fallocate would have already updated quota and block
+ * count for this offset. So cancel these reservation
+ */
+ if (flags & EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE)
+ ext4_da_update_reserve_space(inode, allocated, 0);
+
map_out:
set_buffer_mapped(bh_result);
out1:
@@ -3368,9 +3380,18 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
/* previous routine could use block we allocated */
newblock = ext_pblock(&newex);
allocated = ext4_ext_get_actual_len(&newex);
+ if (allocated > max_blocks)
+ allocated = max_blocks;
set_buffer_new(bh_result);

/*
+ * Update reserved blocks/metadata blocks after successful
+ * block allocation which had been deferred till now.
+ */
+ if (flags & EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE)
+ ext4_da_update_reserve_space(inode, allocated, 1);
+
+ /*
* Cache the extent and update transaction to commit on fdatasync only
* when it is _not_ an uninitialized extent.
*/
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c818972..77ff941 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1053,7 +1053,7 @@ static int ext4_calc_metadata_amount(struct inode *inode, sector_t lblock)
* Called with i_data_sem down, which is important since we can call
* ext4_discard_preallocations() from here.
*/
-static void ext4_da_update_reserve_space(struct inode *inode, int used)
+void ext4_da_update_reserve_space(struct inode *inode, int used, int quota_claim)
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
struct ext4_inode_info *ei = EXT4_I(inode);
@@ -1090,9 +1090,17 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used)
spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);

/* Update quota subsystem */
- vfs_dq_claim_block(inode, used);
- if (mdb_free)
- vfs_dq_release_reservation_block(inode, mdb_free);
+ if (quota_claim) {
+ vfs_dq_claim_block(inode, used);
+ if (mdb_free)
+ vfs_dq_release_reservation_block(inode, mdb_free);
+ } else {
+ /*
+ * This is a request to cancel the reservation. So just
+ * update the resevation and cancel the quota reservation
+ */
+ vfs_dq_release_reservation_block(inode, mdb_free + used);
+ }

/*
* If we have done all the pending block allocations and if
@@ -1292,18 +1300,20 @@ int ext4_get_blocks(handle_t *handle, struct inode *inode, sector_t block,
*/
EXT4_I(inode)->i_state &= ~EXT4_STATE_EXT_MIGRATE;
}
- }

+ /*
+ * Update reserved blocks/metadata blocks after successful
+ * block allocation which had been deferred till now. We don't
+ * support fallocate for non extent files. So we can update
+ * reserve space here.
+ */
+ if ((retval > 0) &&
+ (flags & EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE))
+ ext4_da_update_reserve_space(inode, retval, 1);
+ }
if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
EXT4_I(inode)->i_delalloc_reserved_flag = 0;

- /*
- * Update reserved blocks/metadata blocks after successful
- * block allocation which had been deferred till now.
- */
- if ((retval > 0) && (flags & EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE))
- ext4_da_update_reserve_space(inode, retval);