When ext4_es_insert_delayed_block() returns error, e.g., ENOMEM,
previously reserved space is not released as the error handling,
in which case @s_dirtyclusters_counter is left over. Since this delayed
extent failes to be inserted into extent status tree, when inode is
written back, the extra @s_dirtyclusters_counter won't be subtracted and
remains there forever.
This can leads to /sys/fs/ext4/<dev>/delayed_allocation_blocks remains
non-zero even when syncfs is executed on the filesystem.
Fixes: 51865fda28e5 ("ext4: let ext4 maintain extent status tree")
Cc: <[email protected]>
Reported-by: Gao Xiang <[email protected]>
Signed-off-by: Jeffle Xu <[email protected]>
---
fs/ext4/inode.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 82087657860b..7f15da370281 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1650,6 +1650,7 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
int ret;
bool allocated = false;
+ bool reserved = false;
/*
* If the cluster containing lblk is shared with a delayed,
@@ -1666,6 +1667,7 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
ret = ext4_da_reserve_space(inode);
if (ret != 0) /* ENOSPC */
goto errout;
+ reserved = true;
} else { /* bigalloc */
if (!ext4_es_scan_clu(inode, &ext4_es_is_delonly, lblk)) {
if (!ext4_es_scan_clu(inode,
@@ -1678,6 +1680,7 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
ret = ext4_da_reserve_space(inode);
if (ret != 0) /* ENOSPC */
goto errout;
+ reserved = true;
} else {
allocated = true;
}
@@ -1688,6 +1691,8 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
}
ret = ext4_es_insert_delayed_block(inode, lblk, allocated);
+ if (ret && reserved)
+ ext4_da_release_space(inode, 1);
errout:
return ret;
--
2.27.0
* Jeffle Xu <[email protected]>:
> When ext4_es_insert_delayed_block() returns error, e.g., ENOMEM,
> previously reserved space is not released as the error handling,
> in which case @s_dirtyclusters_counter is left over. Since this delayed
> extent failes to be inserted into extent status tree, when inode is
> written back, the extra @s_dirtyclusters_counter won't be subtracted and
> remains there forever.
>
> This can leads to /sys/fs/ext4/<dev>/delayed_allocation_blocks remains
> non-zero even when syncfs is executed on the filesystem.
>
Hi:
I think the fix below looks fine. However, this comment doesn't look right
to me. Are you really seeing delayed_allocation_blocks values that remain
incorrectly elevated across last closes (or across file system unmounts and
remounts)? s_dirtyclusters_counter isn't written out to stable storage -
it's an in-memory only variable that's created when a file is first opened
and destroyed on last close.
Eric
> Fixes: 51865fda28e5 ("ext4: let ext4 maintain extent status tree")
> Cc: <[email protected]>
> Reported-by: Gao Xiang <[email protected]>
> Signed-off-by: Jeffle Xu <[email protected]>
> ---
> fs/ext4/inode.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 82087657860b..7f15da370281 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1650,6 +1650,7 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
> struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> int ret;
> bool allocated = false;
> + bool reserved = false;
>
> /*
> * If the cluster containing lblk is shared with a delayed,
> @@ -1666,6 +1667,7 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
> ret = ext4_da_reserve_space(inode);
> if (ret != 0) /* ENOSPC */
> goto errout;
> + reserved = true;
> } else { /* bigalloc */
> if (!ext4_es_scan_clu(inode, &ext4_es_is_delonly, lblk)) {
> if (!ext4_es_scan_clu(inode,
> @@ -1678,6 +1680,7 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
> ret = ext4_da_reserve_space(inode);
> if (ret != 0) /* ENOSPC */
> goto errout;
> + reserved = true;
> } else {
> allocated = true;
> }
> @@ -1688,6 +1691,8 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
> }
>
> ret = ext4_es_insert_delayed_block(inode, lblk, allocated);
> + if (ret && reserved)
> + ext4_da_release_space(inode, 1);
>
> errout:
> return ret;
> --
> 2.27.0
>
On Fri, Aug 20, 2021 at 12:45:56PM -0400, Eric Whitney wrote:
> * Jeffle Xu <[email protected]>:
> > When ext4_es_insert_delayed_block() returns error, e.g., ENOMEM,
> > previously reserved space is not released as the error handling,
> > in which case @s_dirtyclusters_counter is left over. Since this delayed
> > extent failes to be inserted into extent status tree, when inode is
> > written back, the extra @s_dirtyclusters_counter won't be subtracted and
> > remains there forever.
> >
> > This can leads to /sys/fs/ext4/<dev>/delayed_allocation_blocks remains
> > non-zero even when syncfs is executed on the filesystem.
> >
>
> Hi:
>
> I think the fix below looks fine. However, this comment doesn't look right
> to me. Are you really seeing delayed_allocation_blocks values that remain
> incorrectly elevated across last closes (or across file system unmounts and
> remounts)? s_dirtyclusters_counter isn't written out to stable storage -
> it's an in-memory only variable that's created when a file is first opened
> and destroyed on last close.
hmmm.... Let me explain a bit about this. It can be reproduced easily by fault
injection with the code modified below:
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 9a3a8996aacf..29dc0da5960c 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -794,6 +794,9 @@ static int __es_insert_extent(struct inode *inode, struct extent_status *newes)
}
}
+ if (!(ktime_get_ns() % 3)) {
+ return -ENOMEM;
+ }
es = ext4_es_alloc_extent(inode, newes->es_lblk, newes->es_len,
newes->es_pblk);
if (!es)
and then run a loop
while true; do dd if=/dev/zero of=aaa bs=8192 count=10000; sync; rm -rf aaa; done
After "Cannot allocate memory reported" is shown, s_dirtyclusters_counter
was already leaked. It can cause df and free space counting incorrect in
this mount.
If my understanging is correct, in priciple, we should also check with
"WARN_ON(ei->i_reserved_data_blocks)" in the inode evict path since it
should be considered as 0.
Thanks,
Gao Xiang
>
> Eric
>
> > Fixes: 51865fda28e5 ("ext4: let ext4 maintain extent status tree")
> > Cc: <[email protected]>
> > Reported-by: Gao Xiang <[email protected]>
> > Signed-off-by: Jeffle Xu <[email protected]>
> > ---
> > fs/ext4/inode.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 82087657860b..7f15da370281 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -1650,6 +1650,7 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
> > struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> > int ret;
> > bool allocated = false;
> > + bool reserved = false;
> >
> > /*
> > * If the cluster containing lblk is shared with a delayed,
> > @@ -1666,6 +1667,7 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
> > ret = ext4_da_reserve_space(inode);
> > if (ret != 0) /* ENOSPC */
> > goto errout;
> > + reserved = true;
> > } else { /* bigalloc */
> > if (!ext4_es_scan_clu(inode, &ext4_es_is_delonly, lblk)) {
> > if (!ext4_es_scan_clu(inode,
> > @@ -1678,6 +1680,7 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
> > ret = ext4_da_reserve_space(inode);
> > if (ret != 0) /* ENOSPC */
> > goto errout;
> > + reserved = true;
> > } else {
> > allocated = true;
> > }
> > @@ -1688,6 +1691,8 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
> > }
> >
> > ret = ext4_es_insert_delayed_block(inode, lblk, allocated);
> > + if (ret && reserved)
> > + ext4_da_release_space(inode, 1);
> >
> > errout:
> > return ret;
> > --
> > 2.27.0
> >
On 8/21/21 12:45 AM, Eric Whitney wrote:
> * Jeffle Xu <[email protected]>:
>> When ext4_es_insert_delayed_block() returns error, e.g., ENOMEM,
>> previously reserved space is not released as the error handling,
>> in which case @s_dirtyclusters_counter is left over. Since this delayed
>> extent failes to be inserted into extent status tree, when inode is
>> written back, the extra @s_dirtyclusters_counter won't be subtracted and
>> remains there forever.
>>
>> This can leads to /sys/fs/ext4/<dev>/delayed_allocation_blocks remains
>> non-zero even when syncfs is executed on the filesystem.
>>
>
> Hi:
>
> I think the fix below looks fine. However, this comment doesn't look right
> to me. Are you really seeing delayed_allocation_blocks values that remain
> incorrectly elevated across last closes (or across file system unmounts and
> remounts)? s_dirtyclusters_counter isn't written out to stable storage -
> it's an in-memory only variable that's created when a file is first opened
> and destroyed on last close.
>
Actually we've encountered a real case in our production environment,
which has about 20G space lost (df - du = ~20G).
After some investigation, we've confirmed that it cause by leaked
s_dirtyclusters_counter (~5M), and even we do manually sync, it remains.
Since there is no error messages, we've checked all logic around
s_dirtyclusters_counter and found this. Also we can manually inject
error and reproduce the leaked s_dirtyclusters_counter.
Thanks,
Joseph
On 8/22/21 9:06 PM, Joseph Qi wrote:
>
>
> On 8/21/21 12:45 AM, Eric Whitney wrote:
>> * Jeffle Xu <[email protected]>:
>>> When ext4_es_insert_delayed_block() returns error, e.g., ENOMEM,
>>> previously reserved space is not released as the error handling,
>>> in which case @s_dirtyclusters_counter is left over. Since this delayed
>>> extent failes to be inserted into extent status tree, when inode is
>>> written back, the extra @s_dirtyclusters_counter won't be subtracted and
>>> remains there forever.
>>>
>>> This can leads to /sys/fs/ext4/<dev>/delayed_allocation_blocks remains
>>> non-zero even when syncfs is executed on the filesystem.
>>>
>>
>> Hi:
>>
>> I think the fix below looks fine. However, this comment doesn't look right
>> to me. Are you really seeing delayed_allocation_blocks values that remain
>> incorrectly elevated across last closes (or across file system unmounts and
>> remounts)? s_dirtyclusters_counter isn't written out to stable storage -
>> it's an in-memory only variable that's created when a file is first opened
>> and destroyed on last close.
>>
>
> Actually we've encountered a real case in our production environment,
> which has about 20G space lost (df - du = ~20G).
> After some investigation, we've confirmed that it cause by leaked
> s_dirtyclusters_counter (~5M), and even we do manually sync, it remains.
> Since there is no error messages, we've checked all logic around
> s_dirtyclusters_counter and found this. Also we can manually inject
> error and reproduce the leaked s_dirtyclusters_counter.
>
BTW, it's a runtime lost, but not about on-disk.
If umount and then mount it again, it becomes normal. But
application also should be restarted...
Thanks,
Joseph
* Joseph Qi <[email protected]>:
>
>
> On 8/22/21 9:06 PM, Joseph Qi wrote:
> >
> >
> > On 8/21/21 12:45 AM, Eric Whitney wrote:
> >> * Jeffle Xu <[email protected]>:
> >>> When ext4_es_insert_delayed_block() returns error, e.g., ENOMEM,
> >>> previously reserved space is not released as the error handling,
> >>> in which case @s_dirtyclusters_counter is left over. Since this delayed
> >>> extent failes to be inserted into extent status tree, when inode is
> >>> written back, the extra @s_dirtyclusters_counter won't be subtracted and
> >>> remains there forever.
> >>>
> >>> This can leads to /sys/fs/ext4/<dev>/delayed_allocation_blocks remains
> >>> non-zero even when syncfs is executed on the filesystem.
> >>>
> >>
> >> Hi:
> >>
> >> I think the fix below looks fine. However, this comment doesn't look right
> >> to me. Are you really seeing delayed_allocation_blocks values that remain
> >> incorrectly elevated across last closes (or across file system unmounts and
> >> remounts)? s_dirtyclusters_counter isn't written out to stable storage -
> >> it's an in-memory only variable that's created when a file is first opened
> >> and destroyed on last close.
> >>
> >
> > Actually we've encountered a real case in our production environment,
> > which has about 20G space lost (df - du = ~20G).
> > After some investigation, we've confirmed that it cause by leaked
> > s_dirtyclusters_counter (~5M), and even we do manually sync, it remains.
> > Since there is no error messages, we've checked all logic around
> > s_dirtyclusters_counter and found this. Also we can manually inject
> > error and reproduce the leaked s_dirtyclusters_counter.
> >
Sure - as I noted, the fix looks good - I agree that you could see inaccurate
s_dirtyclusters_counter (and i_reserved_data_blocks) values. This is a good
catch and a good fix. It's the comment I find misleading / inaccurate, and
I'd like to see that improved for the sake of developers reading commit
histories in the future.
Also, Gao Xiang's idea of checking i_reserved_data_blocks in the inode evict
path sounds good to me - I'd considered doing that in the past but never
actually did it.
>
> BTW, it's a runtime lost, but not about on-disk.
> If umount and then mount it again, it becomes normal. But
> application also should be restarted...
And this is where the comment could use a little help. "when inode is
written back, the extra @s_dirtyclusters_counter won't be subtracted and
remains there forever" suggests to me that s_dirtyclusters_counter is
being persisted on stable storage. But as you note, simply umounting and
remounting the filesystem clears up the problem. (And in my rush to get
my feedback out earlier I incorrectly stated that s_dirtyclusters_counter
would get created and destroyed on first open and last close - that's
i_reserved_data_blocks, of course.)
So, in order to speed things along, please allow me to suggest some edits
for the commit comment:
When ext4_insert_delayed block receives and recovers from an error from
ext4_es_insert_delayed_block(), e.g., ENOMEM, it does not release the
space it has reserved for that block insertion as it should. One effect
of this bug is that s_dirtyclusters_counter is not decremented and remains
incorrectly elevated until the file system has been unmounted. This can
result in premature ENOSPC returns and apparent loss of free space.
Another effect of this bug is that /sys/fs/ext4/<dev>/delayed_allocation_blocks
can remain non-zero even after syncfs has been executed on the filesystem.
Does that sound right?
Regards,
Eric
>
> Thanks,
> Joseph
On 8/23/21 5:52 AM, Eric Whitney wrote:
> * Joseph Qi <[email protected]>:
>>
>>
>> On 8/22/21 9:06 PM, Joseph Qi wrote:
>>>
>>>
>>> On 8/21/21 12:45 AM, Eric Whitney wrote:
>>>> * Jeffle Xu <[email protected]>:
>>>>> When ext4_es_insert_delayed_block() returns error, e.g., ENOMEM,
>>>>> previously reserved space is not released as the error handling,
>>>>> in which case @s_dirtyclusters_counter is left over. Since this delayed
>>>>> extent failes to be inserted into extent status tree, when inode is
>>>>> written back, the extra @s_dirtyclusters_counter won't be subtracted and
>>>>> remains there forever.
>>>>>
>>>>> This can leads to /sys/fs/ext4/<dev>/delayed_allocation_blocks remains
>>>>> non-zero even when syncfs is executed on the filesystem.
>>>>>
>>>>
>>>> Hi:
>>>>
>>>> I think the fix below looks fine. However, this comment doesn't look right
>>>> to me. Are you really seeing delayed_allocation_blocks values that remain
>>>> incorrectly elevated across last closes (or across file system unmounts and
>>>> remounts)? s_dirtyclusters_counter isn't written out to stable storage -
>>>> it's an in-memory only variable that's created when a file is first opened
>>>> and destroyed on last close.
>>>>
>>>
>>> Actually we've encountered a real case in our production environment,
>>> which has about 20G space lost (df - du = ~20G).
>>> After some investigation, we've confirmed that it cause by leaked
>>> s_dirtyclusters_counter (~5M), and even we do manually sync, it remains.
>>> Since there is no error messages, we've checked all logic around
>>> s_dirtyclusters_counter and found this. Also we can manually inject
>>> error and reproduce the leaked s_dirtyclusters_counter.
>>>
>
> Sure - as I noted, the fix looks good - I agree that you could see inaccurate
> s_dirtyclusters_counter (and i_reserved_data_blocks) values. This is a good
> catch and a good fix. It's the comment I find misleading / inaccurate, and
> I'd like to see that improved for the sake of developers reading commit
> histories in the future.
>
> Also, Gao Xiang's idea of checking i_reserved_data_blocks in the inode evict
> path sounds good to me - I'd considered doing that in the past but never
> actually did it.
>
>>
>> BTW, it's a runtime lost, but not about on-disk.
>> If umount and then mount it again, it becomes normal. But
>> application also should be restarted...
>
> And this is where the comment could use a little help. "when inode is
> written back, the extra @s_dirtyclusters_counter won't be subtracted and
> remains there forever" suggests to me that s_dirtyclusters_counter is
> being persisted on stable storage. But as you note, simply umounting and
> remounting the filesystem clears up the problem. (And in my rush to get
> my feedback out earlier I incorrectly stated that s_dirtyclusters_counter
> would get created and destroyed on first open and last close - that's
> i_reserved_data_blocks, of course.)
>
> So, in order to speed things along, please allow me to suggest some edits
> for the commit comment:
>
> When ext4_insert_delayed block receives and recovers from an error from
> ext4_es_insert_delayed_block(), e.g., ENOMEM, it does not release the
> space it has reserved for that block insertion as it should. One effect
> of this bug is that s_dirtyclusters_counter is not decremented and remains
> incorrectly elevated until the file system has been unmounted. This can
> result in premature ENOSPC returns and apparent loss of free space.
>
> Another effect of this bug is that /sys/fs/ext4/<dev>/delayed_allocation_blocks
> can remain non-zero even after syncfs has been executed on the filesystem.
>
> Does that sound right?
>
Yes, looks better. Thanks for your comments.
We'll update the commit log in v2.
Thanks,
Joseph
On 8/23/21 5:52 AM, Eric Whitney wrote:
> * Joseph Qi <[email protected]>:
>>
>>
>> On 8/22/21 9:06 PM, Joseph Qi wrote:
>>>
>>>
>>> On 8/21/21 12:45 AM, Eric Whitney wrote:
>>>> * Jeffle Xu <[email protected]>:
>>>>> When ext4_es_insert_delayed_block() returns error, e.g., ENOMEM,
>>>>> previously reserved space is not released as the error handling,
>>>>> in which case @s_dirtyclusters_counter is left over. Since this delayed
>>>>> extent failes to be inserted into extent status tree, when inode is
>>>>> written back, the extra @s_dirtyclusters_counter won't be subtracted and
>>>>> remains there forever.
>>>>>
>>>>> This can leads to /sys/fs/ext4/<dev>/delayed_allocation_blocks remains
>>>>> non-zero even when syncfs is executed on the filesystem.
>>>>>
>>>>
>>>> Hi:
>>>>
>>>> I think the fix below looks fine. However, this comment doesn't look right
>>>> to me. Are you really seeing delayed_allocation_blocks values that remain
>>>> incorrectly elevated across last closes (or across file system unmounts and
>>>> remounts)? s_dirtyclusters_counter isn't written out to stable storage -
>>>> it's an in-memory only variable that's created when a file is first opened
>>>> and destroyed on last close.
>>>>
>>>
>>> Actually we've encountered a real case in our production environment,
>>> which has about 20G space lost (df - du = ~20G).
>>> After some investigation, we've confirmed that it cause by leaked
>>> s_dirtyclusters_counter (~5M), and even we do manually sync, it remains.
>>> Since there is no error messages, we've checked all logic around
>>> s_dirtyclusters_counter and found this. Also we can manually inject
>>> error and reproduce the leaked s_dirtyclusters_counter.
>>>
>
> Sure - as I noted, the fix looks good - I agree that you could see inaccurate
> s_dirtyclusters_counter (and i_reserved_data_blocks) values. This is a good
> catch and a good fix. It's the comment I find misleading / inaccurate, and
> I'd like to see that improved for the sake of developers reading commit
> histories in the future.
>
> Also, Gao Xiang's idea of checking i_reserved_data_blocks in the inode evict
> path sounds good to me - I'd considered doing that in the past but never
> actually did it.
Thanks, it will be added in v2.
>
>>
>> BTW, it's a runtime lost, but not about on-disk.
>> If umount and then mount it again, it becomes normal. But
>> application also should be restarted...
>
> And this is where the comment could use a little help. "when inode is
> written back, the extra @s_dirtyclusters_counter won't be subtracted and
> remains there forever" suggests to me that s_dirtyclusters_counter is
> being persisted on stable storage. But as you note, simply umounting and
> remounting the filesystem clears up the problem. (And in my rush to get
> my feedback out earlier I incorrectly stated that s_dirtyclusters_counter
> would get created and destroyed on first open and last close - that's
> i_reserved_data_blocks, of course.)
>
> So, in order to speed things along, please allow me to suggest some edits
> for the commit comment:
>
> When ext4_insert_delayed block receives and recovers from an error from
> ext4_es_insert_delayed_block(), e.g., ENOMEM, it does not release the
> space it has reserved for that block insertion as it should. One effect
> of this bug is that s_dirtyclusters_counter is not decremented and remains
> incorrectly elevated until the file system has been unmounted. This can
> result in premature ENOSPC returns and apparent loss of free space.
>
> Another effect of this bug is that /sys/fs/ext4/<dev>/delayed_allocation_blocks
> can remain non-zero even after syncfs has been executed on the filesystem.
>
> Does that sound right?
Thanks, I will update the commit log in v2.
--
Thanks,
Jeffle