During orphan cleanup while doing truncate, if we fail to obtain journal
handle, the inode for which truncate was called would not be removed from
both the on-disk and in-memory orphan lists as the call to ext4_orphan_del
would not be executed.
This would have following consequences:
a) As the inode is not removed from the on-disk list, truncate would be
called again for the same inode. Each call would add the inode to the
in-memory list. This operation would continue endlessly or until truncate
is succeed.
b) If somehow, after some iterations, truncate is succeed, ext4_orphan_del
will only remove the inode from in-memory list just 1 time. This will trigger
j_assert during put super.
This patch handles both the problems. If truncate fails, first in-memory
list is cleared and than the partition is mounted as read only.
Failure to obtain journal handle during mount may suggest that journal
device is corrupted.
Signed-off-by: Ashish Sangwan <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>
---
fs/ext4/super.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4e2aacb..859eccb 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2201,7 +2201,21 @@ static void ext4_orphan_cleanup(struct super_block *sb,
jbd_debug(2, "truncating inode %lu to %lld bytes\n",
inode->i_ino, inode->i_size);
ext4_truncate(inode);
- nr_truncates++;
+ if (list_empty(&EXT4_I(inode)->i_orphan)) {
+ nr_truncates++;
+ } else {
+ /* Remove inode from in-memory orphan list */
+ list_del_init(&EXT4_I(inode)->i_orphan);
+ ext4_msg(sb, KERN_ERR, "Truncate failed for "
+ "orphan inode = %lu. Running e2fsck"
+ " is recommended", inode->i_ino);
+ if (!(s_flags & MS_RDONLY)) {
+ ext4_msg(sb, KERN_INFO, "FS would be"
+ " mounted as readonly");
+ s_flags |= MS_RDONLY;
+ }
+ break;
+ }
} else {
ext4_msg(sb, KERN_DEBUG,
"%s: deleting unreferenced inode %lu",
--
1.7.11.4
Add Cc in mail loop.
Thanks.
2012/10/24, Ashish Sangwan <[email protected]>:
> During orphan cleanup while doing truncate, if we fail to obtain journal
> handle, the inode for which truncate was called would not be removed from
> both the on-disk and in-memory orphan lists as the call to ext4_orphan_del
> would not be executed.
>
> This would have following consequences:
> a) As the inode is not removed from the on-disk list, truncate would be
> called again for the same inode. Each call would add the inode to the
> in-memory list. This operation would continue endlessly or until truncate
> is succeed.
>
> b) If somehow, after some iterations, truncate is succeed, ext4_orphan_del
> will only remove the inode from in-memory list just 1 time. This will
> trigger
> j_assert during put super.
>
> This patch handles both the problems. If truncate fails, first in-memory
> list is cleared and than the partition is mounted as read only.
> Failure to obtain journal handle during mount may suggest that journal
> device is corrupted.
>
> Signed-off-by: Ashish Sangwan <[email protected]>
> Signed-off-by: Namjae Jeon <[email protected]>
> ---
> fs/ext4/super.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 4e2aacb..859eccb 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -2201,7 +2201,21 @@ static void ext4_orphan_cleanup(struct super_block
> *sb,
> jbd_debug(2, "truncating inode %lu to %lld bytes\n",
> inode->i_ino, inode->i_size);
> ext4_truncate(inode);
> - nr_truncates++;
> + if (list_empty(&EXT4_I(inode)->i_orphan)) {
> + nr_truncates++;
> + } else {
> + /* Remove inode from in-memory orphan list */
> + list_del_init(&EXT4_I(inode)->i_orphan);
> + ext4_msg(sb, KERN_ERR, "Truncate failed for "
> + "orphan inode = %lu. Running e2fsck"
> + " is recommended", inode->i_ino);
> + if (!(s_flags & MS_RDONLY)) {
> + ext4_msg(sb, KERN_INFO, "FS would be"
> + " mounted as readonly");
> + s_flags |= MS_RDONLY;
> + }
> + break;
> + }
> } else {
> ext4_msg(sb, KERN_DEBUG,
> "%s: deleting unreferenced inode %lu",
> --
> 1.7.11.4
>
>
Hi Ted,
Did you get any time to look into this patch?
This problem is with ext4 only as ext4_truncate does not clean the
orphan list unlike that of ext3_truncate.
Instead, in case of failure to obtain handle, orphan list cleanup is
done in ext4_setattr.
But during mount, ext4_truncate is not called via ext4_setattr and
hence the problem.
What do you think?
On Fri, Oct 26, 2012 at 5:11 PM, Namjae Jeon <[email protected]> wrote:
> Add Cc in mail loop.
>
> Thanks.
>
> 2012/10/24, Ashish Sangwan <[email protected]>:
>> During orphan cleanup while doing truncate, if we fail to obtain journal
>> handle, the inode for which truncate was called would not be removed from
>> both the on-disk and in-memory orphan lists as the call to ext4_orphan_del
>> would not be executed.
>>
>> This would have following consequences:
>> a) As the inode is not removed from the on-disk list, truncate would be
>> called again for the same inode. Each call would add the inode to the
>> in-memory list. This operation would continue endlessly or until truncate
>> is succeed.
>>
>> b) If somehow, after some iterations, truncate is succeed, ext4_orphan_del
>> will only remove the inode from in-memory list just 1 time. This will
>> trigger
>> j_assert during put super.
>>
>> This patch handles both the problems. If truncate fails, first in-memory
>> list is cleared and than the partition is mounted as read only.
>> Failure to obtain journal handle during mount may suggest that journal
>> device is corrupted.
>>
>> Signed-off-by: Ashish Sangwan <[email protected]>
>> Signed-off-by: Namjae Jeon <[email protected]>
>> ---
>> fs/ext4/super.c | 16 +++++++++++++++-
>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 4e2aacb..859eccb 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -2201,7 +2201,21 @@ static void ext4_orphan_cleanup(struct super_block
>> *sb,
>> jbd_debug(2, "truncating inode %lu to %lld bytes\n",
>> inode->i_ino, inode->i_size);
>> ext4_truncate(inode);
>> - nr_truncates++;
>> + if (list_empty(&EXT4_I(inode)->i_orphan)) {
>> + nr_truncates++;
>> + } else {
>> + /* Remove inode from in-memory orphan list */
>> + list_del_init(&EXT4_I(inode)->i_orphan);
>> + ext4_msg(sb, KERN_ERR, "Truncate failed for "
>> + "orphan inode = %lu. Running e2fsck"
>> + " is recommended", inode->i_ino);
>> + if (!(s_flags & MS_RDONLY)) {
>> + ext4_msg(sb, KERN_INFO, "FS would be"
>> + " mounted as readonly");
>> + s_flags |= MS_RDONLY;
>> + }
>> + break;
>> + }
>> } else {
>> ext4_msg(sb, KERN_DEBUG,
>> "%s: deleting unreferenced inode %lu",
>> --
>> 1.7.11.4
>>
>>
On Thu, Dec 06, 2012 at 04:59:43PM +0530, Ashish Sangwan wrote:
>
> Did you get any time to look into this patch?
> This problem is with ext4 only as ext4_truncate does not clean the
> orphan list unlike that of ext3_truncate.
> Instead, in case of failure to obtain handle, orphan list cleanup is
> done in ext4_setattr.
> But during mount, ext4_truncate is not called via ext4_setattr and
> hence the problem.
> What do you think?
In the patch description, you mentioned that this occurs when the
there is a failure to obtain a journal handle. Is this a hypothetical
thing that you exposed via some kind of tester which checks to see
what happens if kmalloc() randomly fails some number of allocation
requests? Or was it happening in real life? And if it is happening
in real life, do we understand why it's happening, and is there
something we should be doing to mitigate against the root cause of the
failure?
The alternative to your patch is to do something similar to what ext3
does. That is, if there are any inodes left on the orphan list, to
iterate through them all and then clean up the orphan list. Perhaps
we should then also call ext4_error() since technically the file
system may very well be inconsistent (there may be allocated inodes
holding blocks which are no longer connected the directory hierarchy,
which e2fsck would be able to clean up). But that could potentially
cause the system to panic or remount the file system read-only,
depending on what the errors= behavior is set to. Which is why I go
back to the original question; do we understand why ext4_truncate()
was failing during orphan cleanup in the first place?
- Ted
On Thu, Dec 6, 2012 at 10:39 PM, Theodore Ts'o <[email protected]> wrote:
> On Thu, Dec 06, 2012 at 04:59:43PM +0530, Ashish Sangwan wrote:
>>
>> Did you get any time to look into this patch?
>> This problem is with ext4 only as ext4_truncate does not clean the
>> orphan list unlike that of ext3_truncate.
>> Instead, in case of failure to obtain handle, orphan list cleanup is
>> done in ext4_setattr.
>> But during mount, ext4_truncate is not called via ext4_setattr and
>> hence the problem.
>> What do you think?
>
> In the patch description, you mentioned that this occurs when the
> there is a failure to obtain a journal handle. Is this a hypothetical
> thing that you exposed via some kind of tester which checks to see
> what happens if kmalloc() randomly fails some number of allocation
> requests? Or was it happening in real life? And if it is happening
> in real life, do we understand why it's happening, and is there
> something we should be doing to mitigate against the root cause of the
> failure?
>
Thanks for looking into the patch.
Yes, you are right. The situation is hypothetical.
We were checking about Ext4's behavior when it returns error at
different points in truncate/hole punch path and that's when we
stumbled on to this.
In real, we cannot re-produce this scenario without modifying Ext4's
code and hence an extra RFC added with the patch.
If you think that obtaining journal handle could never fail at mount
time, than please ignore this patch.
> The alternative to your patch is to do something similar to what ext3
> does. That is, if there are any inodes left on the orphan list, to
> iterate through them all and then clean up the orphan list. Perhaps
> we should then also call ext4_error() since technically the file
> system may very well be inconsistent (there may be allocated inodes
> holding blocks which are no longer connected the directory hierarchy,
> which e2fsck would be able to clean up). But that could potentially
> cause the system to panic or remount the file system read-only,
> depending on what the errors= behavior is set to. Which is why I go
> back to the original question; do we understand why ext4_truncate()
> was failing during orphan cleanup in the first place?
>
> - Ted
Right, using ext4_error() would have been a better choice than
explicitly setting RO flag.
I wonder why ext4_truncate does not clear the orphan_list in case of
failure to obtain journal handle unlike that of ext3_truncate?
We would not have to bother about explicitly clearing the orphan list
in the first place if we could just add a lable like out_notrans (as
in ext3)=>
if (IS_ERR(handle))
goto out_notrans;
out_notrans:
/*
* Delete the inode from orphan list so that it doesn't stay there
* forever and trigger assertion on umount.
*/
if (inode->i_nlink)
ext3_orphan_del(NULL, inode);
Regards,
Ashish