Initially, in the normal write path when the disk was almost full, we got
a hung task for ‘sync’ because the flusher (which was busy in
writepages) was not responding. Before the hung task, we also found
logs like:
EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1493, 0
clusters in bitmap, 58339 in gd
EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1000, 0
clusters in bitmap, 3 in gd
EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1425, 0
clusters in bitmap, 1 in gd
JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
a risk of filesystem corruption in case of system crash.
JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
a risk of filesystem corruption in case of system crash.
EXT4-fs (sda1): error count: 58
EXT4-fs (sda1): initial error at 607: ext4_mb_generate_buddy:742
EXT4-fs (sda1): last error at 58: ext4_mb_generate_buddy:742
When we analysed the problem, we found that it occurred in the
writepages path in ext4, because of the difference between the free
blocks reported by the cluster bitmap and the number of free blocks
reported by the group descriptor.
During ext4_fill_super, ext4 calculates the number of free blocks by
reading all the group descriptors in ext4_count_free_clusters and
stores it in the percpu counter s_freeclusters_counter.
ext4_count_free_clusters:
	desc_count = 0;
	for (i = 0; i < ngroups; i++) {
		gdp = ext4_get_group_desc(sb, i, NULL);
		if (!gdp)
			continue;
		desc_count += ext4_free_group_clusters(sb, gdp);
	}
	return desc_count;
During the write_begin call, ext4 checks this s_freeclusters_counter
to know whether any free blocks are present.
When the free blocks reported by the group descriptors are greater than the
actual free blocks reported by the bitmaps, a call to write_begin can
still succeed even if the free blocks represented by the bitmaps are 0.
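A minimal sketch of why that happens, assuming the 3.8-era code paths
(the real check lives in ext4_has_free_clusters(), reached via
ext4_claim_free_clusters() from the delalloc reservation path; the helper
below is illustrative, not the verbatim source). The point is that the
ENOSPC decision at write_begin time consults only the percpu counters
seeded from the group descriptors, never the bitmaps themselves:

	/* Illustrative sketch only, not the verbatim 3.8 source. */
	static int has_free_clusters_sketch(struct ext4_sb_info *sbi, s64 nclusters)
	{
		s64 free = percpu_counter_read_positive(&sbi->s_freeclusters_counter);
		s64 dirty = percpu_counter_read_positive(&sbi->s_dirtyclusters_counter);

		/*
		 * If the descriptors over-report free space, this passes even
		 * though the bitmaps have no allocatable clusters left.
		 */
		return free - dirty >= nclusters;
	}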
This situation was seen from our logs above:
EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1493, 0
clusters in bitmap, 58339 in gd
EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1000, 0
clusters in bitmap, 3 in gd
EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1425, 0
clusters in bitmap, 1 in gd
In this situation, during writeback, block allocation fails because
there are actually no free blocks available, and ext4 spins in an
unbreakable loop in ext4_da_writepages.
Because of this looping, the sync thread, which is waiting for the
flusher thread that initiated the ext4 writeback to complete, never
finishes, and a hung task is reported for the sync thread.
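For reference, this is roughly the shape of the loop being described
(paraphrased from the 3.8-era ext4_da_writepages(), not verbatim;
write_pages_step() is a hypothetical stand-in for the helper that writes
pages and allocates blocks). On ENOSPC it forces a journal commit and
retries, assuming the commit will release blocks; with a
bitmap/descriptor mismatch nothing is ever released, so it never exits:

	/* Paraphrased sketch of the retry logic, not the verbatim source. */
	static int da_writepages_retry_sketch(struct address_space *mapping,
					      struct writeback_control *wbc)
	{
		struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
		int ret = 0;

		while (!ret && wbc->nr_to_write > 0) {
			ret = write_pages_step(mapping, wbc);	/* hypothetical helper */
			if (ret == -ENOSPC && sbi->s_journal) {
				/*
				 * Commit the transaction and retry, hoping the
				 * commit releases free blocks.  With corrupt
				 * bitmaps the allocation keeps failing, so this
				 * loops forever and 'sync' hangs behind the
				 * flusher (see the traces below).
				 */
				jbd2_journal_force_commit_nested(sbi->s_journal);
				ret = 0;
				continue;
			}
		}
		return ret;
	}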
While searching for known problems in this path, we found the patch
set from Darrick which addresses this problem:
ext4: error out if verifying the block bitmap fails
ext4: fix type declaration of ext4_validate_block_bitmap
ext4: mark block group as corrupt on block bitmap error
ext4: mark block group as corrupt on inode bitmap error
ext4: mark group corrupt on group descriptor checksum
ext4: don't count free clusters from a corrupt block group
After adopting the patch set and verifying on a similar setup, we ran
‘fsstress’. But now it results in hangs at different points.
In the current logs we got:
EXT4-fs error (device sdb1): ext4_mb_generate_buddy:743: group 1,
20480 clusters in bitmap, 25443 in gd; block bitmap corrupt.
JBD2: Spotted dirty metadata buffer (dev = sdb1, blocknr = 0). There's
a risk of filesystem corruption in case of system crash.
These hint at block bitmap corruption in the filesystem. And then we
get logs like:
EXT4-fs (sdb1): delayed block allocation failed for inode 241 at
logical offset 17 with max blocks 22 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=7847
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=22
EXT4-fs (sdb1): i_reserved_meta_blocks=1
EXT4-fs (sdb1): delayed block allocation failed for inode 218 at
logical offset 82 with max blocks 17 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): delayed block allocation failed for inode 196 at
logical offset 527 with max blocks 7 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=7874
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=106
EXT4-fs (sdb1): i_reserved_meta_blocks=5
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=7874
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=17
EXT4-fs (sdb1): i_reserved_meta_blocks=1
EXT4-fs (sdb1): delayed block allocation failed for inode 251 at
logical offset 197 with max blocks 14 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=7874
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=44
EXT4-fs (sdb1): i_reserved_meta_blocks=2
EXT4-fs (sdb1): delayed block allocation failed for inode 251 at
logical offset 364 with max blocks 30 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=7874
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=44
EXT4-fs (sdb1): i_reserved_meta_blocks=2
EXT4-fs (sdb1): delayed block allocation failed for inode 243 at
logical offset 23 with max blocks 14 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=7874
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=14
EXT4-fs (sdb1): i_reserved_meta_blocks=1
EXT4-fs (sdb1): delayed block allocation failed for inode 249 at
logical offset 62 with max blocks 15 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=7874
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=48
EXT4-fs (sdb1): i_reserved_meta_blocks=3
EXT4-fs (sdb1): delayed block allocation failed for inode 249 at
logical offset 155 with max blocks 6 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=7874
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=48
EXT4-fs (sdb1): i_reserved_meta_blocks=3
EXT4-fs (sdb1): delayed block allocation failed for inode 249 at
logical offset 225 with max blocks 27 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 12
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25455
EXT4-fs (sdb1): dirty_blocks=7874
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=48
EXT4-fs (sdb1): i_reserved_meta_blocks=3
EXT4-fs (sdb1): delayed block allocation failed for inode 221 at
logical offset 1 with max blocks 2 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25455
EXT4-fs (sdb1): dirty_blocks=7901
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=2
EXT4-fs (sdb1): i_reserved_meta_blocks=1
EXT4-fs (sdb1): delayed block allocation failed for inode 250 at
logical offset 140 with max blocks 13 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25455
EXT4-fs (sdb1): dirty_blocks=7901
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=13
EXT4-fs (sdb1): i_reserved_meta_blocks=1
These are followed by the hung-task traces for:
Pid: 662, comm: jbd2/sdb1-8
jbd2/sdb1-8 [2] D [e5354840] c03e910c 0 662 0x00000000
2 663 635 (kernel thread)
Backtrace:
[<c03e8f00>] (__schedule+0x0/0x5f4) from [<c03e9590>] (schedule+0x40/0x80)
[<c03e9550>] (schedule+0x0/0x80) from [<c03e9644>] (io_schedule+0x74/0xa0)
[<c03e95d0>] (io_schedule+0x0/0xa0) from [<c01737c0>]
(sleep_on_buffer+0x18/0x20)
r6:e5921dd4 r5:002ecac6 r4:e5921dcc r3:00000002
[<c01737a8>] (sleep_on_buffer+0x0/0x20) from [<c03e7a54>]
(__wait_on_bit+0x8c/0xc4)
[<c03e79c8>] (__wait_on_bit+0x0/0xc4) from [<c03e7b04>]
(out_of_line_wait_on_bit+0x78/0x80)
[<c03e7a8c>] (out_of_line_wait_on_bit+0x0/0x80) from [<c01737a0>]
(__wait_on_buffer+0x30/0x38)
r8:e5343e68 r7:e5343c00 r6:00000000 r5:e5677680 r4:00000000
[<c0173770>] (__wait_on_buffer+0x0/0x38) from [<c01f449c>]
(jbd2_journal_commit_transaction+0xbe8/0x1830)
[<c01f38b4>] (jbd2_journal_commit_transaction+0x0/0x1830) from
[<c01f9768>] (kjournald2+0xb8/0x24c)
[<c01f96b0>] (kjournald2+0x0/0x24c) from [<c005c700>] (kthread+0xb4/0xc0)
[<c005c64c>] (kthread+0x0/0xc0) from [<c0013188>] (ret_from_fork+0x14/0x20)
r7:00000000 r6:00000000 r5:c005c64c r4:e4cf5c04
Pid: 635, comm: flush-8:16
flush-8:16 [1] D [e5357b40] c03e910c 0 635 0x00000000
2 662 628 (kernel thread)
Backtrace:
[<c03e8f00>] (__schedule+0x0/0x5f4) from [<c03e9590>] (schedule+0x40/0x80)
[<c03e9550>] (schedule+0x0/0x80) from [<c01f93e8>]
(jbd2_log_wait_commit+0xa8/0x120)
[<c01f9340>] (jbd2_log_wait_commit+0x0/0x120) from [<c01fb3c0>]
(jbd2_journal_force_commit_nested+0x70/0xb4)
[<c01fb350>] (jbd2_journal_force_commit_nested+0x0/0xb4) from
[<c01afc48>] (ext4_da_writepages+0x450/0x5ec)
r6:e588e000 r5:e5433660 r4:00000001 r3:ffffffe4
[<c01af7f8>] (ext4_da_writepages+0x0/0x5ec) from [<c0110178>]
(do_writepages+0x34/0x48)
[<c0110144>] (do_writepages+0x0/0x48) from [<c016b15c>]
(__writeback_single_inode+0x3c/0x17c)
[<c016b120>] (__writeback_single_inode+0x0/0x17c) from [<c016cc54>]
(writeback_sb_inodes+0x1fc/0x39c)
[<c016ca58>] (writeback_sb_inodes+0x0/0x39c) from [<c016d82c>]
(wb_writeback+0xf8/0x334)
[<c016d734>] (wb_writeback+0x0/0x334) from [<c016db18>]
(wb_do_writeback+0xb0/0x280)
[<c016da68>] (wb_do_writeback+0x0/0x280) from [<c016dd68>]
(bdi_writeback_thread+0x80/0x27c)
[<c016dce8>] (bdi_writeback_thread+0x0/0x27c) from [<c005c700>]
(kthread+0xb4/0xc0)
[<c005c64c>] (kthread+0x0/0xc0) from [<c0013188>] (ret_from_fork+0x14/0x20)
r7:00000000 r6:00000000 r5:c005c64c r4:e5923e9c
Kernel Version: 3.8
Test command:
fsstress -p 10 -n 100 -l 100 -d /mnt/test_dir
On Tue, Apr 15, 2014 at 01:24:26PM +0530, Amit Sahrawat wrote:
> Initially, in the normal write path when the disk was almost full, we got
> a hung task for ‘sync’ because the flusher (which was busy in
> writepages) was not responding. Before the hung task, we also found
> logs like:
>
> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1493, 0
> clusters in bitmap, 58339 in gd
> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1000, 0
> clusters in bitmap, 3 in gd
> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1425, 0
> clusters in bitmap, 1 in gd
These errors indicate that several block groups have corrupt block
bitmaps: block groups #1493, #1000, and #1425. The fact that there
are 0 free blocks/clusters in the bitmap indicates that the bitmap was
all zeros, which could be the potential cause of the
corruption.
The other thing which is funny is the number of free blocks/clusters
being greater than 32768 in block group #1493. Assuming a 4k page
size, that shouldn't be possible. Can you send the output of
"dumpe2fs -h /dev/sdXX" so we can take a look at the file system parameters?
How much before the hung task did you see these messages? I normally
recommend that the file system be set up to either panic the system,
or force the file system to be remounted read-only when EXT4-fs error
messages are reported, since that means that the file system is
corrupted, and further operation can cause more data to be lost.
> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
> a risk of filesystem corruption in case of system crash.
> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
> a risk of filesystem corruption in case of system crash.
>
> EXT4-fs (sda1): error count: 58
> EXT4-fs (sda1): initial error at 607: ext4_mb_generate_buddy:742
> EXT4-fs (sda1): last error at 58: ext4_mb_generate_buddy:742
The "607" and "58" in the "at 607" and "at 58" are normally supposed
to be a unix time_t value. That is, it's normally a number like:
1397564866, and it can be decoded via:
% date -d @1397564866
Tue Apr 15 08:27:46 EDT 2014
The fact that these numbers are numerically so small means that the
time wasn't set correctly on your system. Was this a test system
running under kvm without a proper real-time clock?
> When we analysed the problem, we found that it occurred in the
> writepages path in ext4, because of the difference between the free
> blocks reported by the cluster bitmap and the number of free blocks
> reported by the group descriptor.
Yes, indicating that the file system was corrupt.
> During ext4_fill_super, ext4 calculates the number of free blocks by
> reading all the group descriptors in ext4_count_free_clusters and
> stores it in the percpu counter s_freeclusters_counter.
> ext4_count_free_clusters:
> 	desc_count = 0;
> 	for (i = 0; i < ngroups; i++) {
> 		gdp = ext4_get_group_desc(sb, i, NULL);
> 		if (!gdp)
> 			continue;
> 		desc_count += ext4_free_group_clusters(sb, gdp);
> 	}
> 	return desc_count;
>
> During the write_begin call, ext4 checks this s_freeclusters_counter
> to know whether any free blocks are present.
> When the free blocks reported by the group descriptors are greater than the
> actual free blocks reported by the bitmaps, a call to write_begin can
> still succeed even if the free blocks represented by the bitmaps are 0.
Yes. We used to have code that would optionally read every single
bitmap, and verify that the descriptor counts match the values in the
bitmap. However, that was expensive, and wasn't a full check of all
possible file system inconsistencies that could lead to data loss. So
we ultimately removed this code. If the file system is potentially
corrupt, it is the system administrator's responsibility to force an
fsck run to make sure the file system data structures are consistent.
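(As an aside, the kind of mount-time cross-check being described would
look roughly like the sketch below. This is an illustration assembled
from helpers that do exist in this era, ext4_read_block_bitmap() and
ext4_count_free(); it is not the removed code itself, and the function
name here is made up.)

	static ext4_fsblk_t count_free_from_bitmaps(struct super_block *sb)
	{
		ext4_group_t i, ngroups = ext4_get_groups_count(sb);
		ext4_fsblk_t bitmap_count = 0;

		for (i = 0; i < ngroups; i++) {
			struct buffer_head *bh = ext4_read_block_bitmap(sb, i);

			if (!bh)
				continue;
			/* ext4_count_free() counts the zero (free) bits. */
			bitmap_count += ext4_count_free(bh->b_data,
					EXT4_CLUSTERS_PER_GROUP(sb) / 8);
			brelse(bh);
		}
		/*
		 * Comparing this sum against the descriptor-based total from
		 * ext4_count_free_clusters() (shown earlier) would catch the
		 * mismatch at mount time, at the cost of reading every bitmap.
		 */
		return bitmap_count;
	}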
> While searching for known problems in this path, we found the patch
> set from Darrick which addresses this problem:
> ext4: error out if verifying the block bitmap fails
> ext4: fix type declaration of ext4_validate_block_bitmap
> ext4: mark block group as corrupt on block bitmap error
> ext4: mark block group as corrupt on inode bitmap error
> ext4: mark group corrupt on group descriptor checksum
> ext4: don't count free clusters from a corrupt block group
>
> After adopting the patch set and verifying on a similar setup, we ran
> ‘fsstress’. But now it results in hangs at different points.
>
> In the current logs we got:
> EXT4-fs error (device sdb1): ext4_mb_generate_buddy:743: group 1,
> 20480 clusters in bitmap, 25443 in gd; block bitmap corrupt.
> JBD2: Spotted dirty metadata buffer (dev = sdb1, blocknr = 0). There's
> a risk of filesystem corruption in case of system crash.
OK, what version of the kernel are you using? The patches that you
reference above have been in the upstream kernel since 3.12, so I'm
assuming you're not using the latest upstream kernel, but rather an
older kernel with some patches applied. Hmmm, skipping ahead:
> Kernel Version: 3.8
> Test command:
> fsstress -p 10 -n 100 -l 100 -d /mnt/test_dir
There is clearly either some kernel bug or hardware problem which is
causing the file system corruption. Given that you are using a much
older kernel, it's quite likely that there is some bug that has been
fixed in a later version of the kernel (although we can't really rule
out a hardware problem without knowing much more about your setup).
Unfortunately, there has been a *large* number of changes since
version 3.8, and I can't remember all of the changes and bug fixes
that we might have made in the past year or more (v3.8 dates from
March 2013).
Something that might be helpful is for you to use xfstests. That's a
much more thorough set of tests which we've been using, so if you must
use an antique version of the kernel, it will probably be a much
better set of tests. It includes fsstress, and much more besides.
More importantly, commit logs sometimes identify a fix by the
xfstests failure that it fixed up. So that
might help you find the bug fix that you need to backport.
For your convenience, there is a simple test framework that makes it
relatively easy to build and run xfstests under KVM. You can find it
here:
git://git.kernel.org/pub/scm/fs/ext2/xfstests-bld.git
See the documentation found at:
https://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/README
for more details.
I hope this helps,
- Ted
Thanks Ted, for the detailed reply.
We could not retain the original HDD on which we hit the issue. In
order to replicate the problem, we did the following:
1) Make a 250MB partition
2) Format the partition with blocksize ‘4K’
3) Using dumpe2fs, get the block group information:
Group 0: (Blocks 0-32767)
Checksum 0x89ae, unused inodes 29109
Primary superblock at 0, Group descriptors at 1-1
Reserved GDT blocks at 2-15
Block bitmap at 16 (+16), Inode bitmap at 32 (+32)
Inode table at 48-957 (+48)
26826 free blocks, 29109 free inodes, 2 directories, 29109 unused inodes
Free blocks: 5942-32767
Free inodes: 12-29120
Group 1: (Blocks 32768-58226) [INODE_UNINIT]
Checksum 0xb43a, unused inodes 29120
Backup superblock at 32768, Group descriptors at 32769-32769
Reserved GDT blocks at 32770-32783
Block bitmap at 17 (bg #0 + 17), Inode bitmap at 33 (bg #0 + 33)
Inode table at 958-1867 (bg #0 + 958)
25443 free blocks, 29120 free inodes, 0 directories, 29120 unused inodes
Free blocks: 32784-58226
Free inodes: 29121-58240
4) Corrupt block group ‘1’ by writing all 1s; we had one file
containing all 1s, so using ‘dd’:
dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1
After this, mount the partition, create a few random-size files, and then
run ‘fsstress’:
fsstress -p 10 -n 100 -l 100 -d /mnt/test_dir
We get logs like these before the hang:
#fsstress -p 10 -n 100 -l 100 -d /mnt/test_dir
seed = 582332
EXT4-fs error (device sdb1): ext4_mb_generate_buddy:743: group 1,
20480 clusters in bitmap, 25443 in gd; block bitmap corrupt.
JBD2: Spotted dirty metadata buffer (dev = sdb1, blocknr = 0). There's
a risk of filesystem corruption in case of system crash.
EXT4-fs (sdb1): delayed block allocation failed for inode 26 at
logical offset 181 with max blocks 2 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=51
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=9
EXT4-fs (sdb1): i_reserved_meta_blocks=1
EXT4-fs (sdb1): delayed block allocation failed for inode 101 at
logical offset 198 with max blocks 1 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=43
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=1
EXT4-fs (sdb1): i_reserved_meta_blocks=1
EXT4-fs (sdb1): delayed block allocation failed for inode 94 at
logical offset 450 with max blocks 1 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost
EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=36
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=12
EXT4-fs (sdb1): i_reserved_meta_blocks=2
…
EXT4-fs (sdb1): error count: 3
EXT4-fs (sdb1): initial error at 545: ext4_mb_generate_buddy:743
EXT4-fs (sdb1): last error at 42: ext4_mb_generate_buddy:743
…
Yes, we are running this on an ARM target and we did not set the time
before running these operations. So, the times actually correspond to:
date -d @545
Thu Jan 1 05:39:05 IST 1970
date -d @42
Thu Jan 1 05:30:42 IST 1970
Yes, we are running kernel version 3.8 and applied the patches from
Darrick to fix the first problem of looping in ext4_da_writepages.
As you suggested, we also made a change to mark the FS read-only in
case of corruption. The changes are:
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c0fbd96..04f3a66 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1641,8 +1641,10 @@ static void mpage_da_map_and_submit(struct mpage_da_data *mpd)
 			 mpd->b_size >> mpd->inode->i_blkbits, err);
 		ext4_msg(sb, KERN_CRIT,
 			"This should not happen!! Data will be lost\n");
-		if (err == -ENOSPC)
+		if (err == -ENOSPC) {
 			ext4_print_free_blocks(mpd->inode);
+			mpd->inode->i_sb->s_flags |= MS_RDONLY;
+		}
 	}
 	/* invalidate all the pages */
 	ext4_da_block_invalidatepages(mpd);
@@ -2303,6 +2305,36 @@ out:
 	return ret;
 }
 
+static void ext4_invalidate_mapping(struct address_space *mapping)
+{
+	struct pagevec pvec;
+	unsigned int i;
+	pgoff_t index = 0;
+	struct inode *inode = mapping->host;
+
+	pagevec_init(&pvec, 0);
+	while (pagevec_lookup_tag(&pvec, mapping, &index, PAGECACHE_TAG_DIRTY,
+				  PAGEVEC_SIZE)) {
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *page = pvec.pages[i];
+
+			lock_page(page);
+			spin_lock_irq(&mapping->tree_lock);
+			if (test_bit(PG_dirty, &page->flags)) {
+				printk(KERN_ERR "INO[%lu], i(%d)\n",
+				       inode->i_ino, i);
+				radix_tree_tag_clear(&mapping->page_tree,
+						     page_index(page),
+						     PAGECACHE_TAG_DIRTY);
+				spin_unlock_irq(&mapping->tree_lock);
+				clear_page_dirty_for_io(page);
+				goto unlock_page;
+			}
+			spin_unlock_irq(&mapping->tree_lock);
+unlock_page:
+			unlock_page(page);
+		}
+		pagevec_release(&pvec);
+	}
+}
+
 static int ext4_da_writepages(struct address_space *mapping,
 			      struct writeback_control *wbc)
@@ -2419,6 +2451,14 @@ retry:
 			ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
 				 "%ld pages, ino %lu; err %d", __func__,
 				 wbc->nr_to_write, inode->i_ino, ret);
+
+			if (ret == -EROFS) {
+				ext4_msg(inode->i_sb, KERN_CRIT,
+					 "This is probably a result of corruption: "
+					 "the FS was marked RO in between writepages\n");
+				/* Invalidate mapping if RO encountered during writepages */
+				ext4_invalidate_mapping(mapping);
+				invalidate_inode_pages2(mapping);
+			}
 			blk_finish_plug(&plug);
 			goto out_writepages;
 		}
After making these changes, as soon as there is corruption the FS is
marked read-only and all pending dirty writes are cleared, so the hang
is not seen.
But we have doubts about the possible side effects of this code. Can
you please share your thoughts on it? It would be of immense help.
Thanks & Regards,
Amit Sahrawat
On Tue, Apr 15, 2014 at 6:17 PM, Theodore Ts'o <[email protected]> wrote:
> On Tue, Apr 15, 2014 at 01:24:26PM +0530, Amit Sahrawat wrote:
>> Initially, in the normal write path when the disk was almost full, we got
>> a hung task for ‘sync’ because the flusher (which was busy in
>> writepages) was not responding. Before the hung task, we also found
>> logs like:
>>
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1493, 0
>> clusters in bitmap, 58339 in gd
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1000, 0
>> clusters in bitmap, 3 in gd
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1425, 0
>> clusters in bitmap, 1 in gd
>
> These errors indicate that several block groups have corrupt block
> bitmaps: block groups #1493, #1000, and #1425. The fact that there
> are 0 free blocks/clusters in the bitmap indicates that the bitmap was
> all zeros, which could be the potential cause of the
> corruption.
>
> The other thing which is funny is the number of free blocks/clusters
> being greater than 32768 in block group #1493. Assuming a 4k page
> size, that shouldn't be possible. Can you send the output of
> "dumpe2fs -h /dev/sdXX" so we can take a look at the file system parameters?
>
> How much before the hung task did you see these messages? I normally
> recommend that the file system be set up to either panic the system,
> or force the file system to be remounted read-only when EXT4-fs error
> messages are reported, since that means that the file system is
> corrupted, and further operation can cause more data to be lost.
>
>> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>>
>> EXT4-fs (sda1): error count: 58
>> EXT4-fs (sda1): initial error at 607: ext4_mb_generate_buddy:742
>> EXT4-fs (sda1): last error at 58: ext4_mb_generate_buddy:742
>
> The "607" and "58" in the "at 607" and "at 58" are normally supposed
> to be a unix time_t value. That is, it's normally a number like:
> 1397564866, and it can be decoded via:
>
> % date -d @1397564866
> Tue Apr 15 08:27:46 EDT 2014
>
> The fact that these numbers are numerically so small means that the
> time wasn't set correctly on your system. Was this a test system
> running under kvm without a proper real-time clock?
>
>> When we analysed the problem, we found that it occurred in the
>> writepages path in ext4, because of the difference between the free
>> blocks reported by the cluster bitmap and the number of free blocks
>> reported by the group descriptor.
>
> Yes, indicating that the file system was corrupt.
>
>> During ext4_fill_super, ext4 calculates the number of free blocks by
>> reading all the group descriptors in ext4_count_free_clusters and
>> stores it in the percpu counter s_freeclusters_counter.
>> ext4_count_free_clusters:
>> 	desc_count = 0;
>> 	for (i = 0; i < ngroups; i++) {
>> 		gdp = ext4_get_group_desc(sb, i, NULL);
>> 		if (!gdp)
>> 			continue;
>> 		desc_count += ext4_free_group_clusters(sb, gdp);
>> 	}
>> 	return desc_count;
>>
>> During the write_begin call, ext4 checks this s_freeclusters_counter
>> to know whether any free blocks are present.
>> When the free blocks reported by the group descriptors are greater than the
>> actual free blocks reported by the bitmaps, a call to write_begin can
>> still succeed even if the free blocks represented by the bitmaps are 0.
>
> Yes. We used to have code that would optionally read every single
> bitmap, and verify that the descriptor counts match the values in the
> bitmap. However, that was expensive, and wasn't a full check of all
> possible file system inconsistencies that could lead to data loss. So
> we ultimately removed this code. If the file system is potentially
> corrupt, it is the system administrator's responsibility to force an
> fsck run to make sure the file system data structures are consistent.
>
>> While searching for known problems in this path, we found the patch
>> set from Darrick which addresses this problem:
>> ext4: error out if verifying the block bitmap fails
>> ext4: fix type declaration of ext4_validate_block_bitmap
>> ext4: mark block group as corrupt on block bitmap error
>> ext4: mark block group as corrupt on inode bitmap error
>> ext4: mark group corrupt on group descriptor checksum
>> ext4: don't count free clusters from a corrupt block group
>>
>> After adopting the patch set and verifying on a similar setup, we ran
>> ‘fsstress’. But now it results in hangs at different points.
>>
>> In the current logs we got:
>> EXT4-fs error (device sdb1): ext4_mb_generate_buddy:743: group 1,
>> 20480 clusters in bitmap, 25443 in gd; block bitmap corrupt.
>> JBD2: Spotted dirty metadata buffer (dev = sdb1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>
> OK, what version of the kernel are you using? The patches that you
> reference above have been in the upstream kernel since 3.12, so I'm
> assuming you're not using the latest upstream kernel, but rather an
> older kernel with some patches applied. Hmmm, skipping ahead:
>
>> Kernel Version: 3.8
>> Test command:
>> fsstress -p 10 -n 100 -l 100 -d /mnt/test_dir
>
> There is clearly either some kernel bug or hardware problem which is
> causing the file system corruption. Given that you are using a much
> older kernel, it's quite likely that there is some bug that has been
> fixed in a later version of the kernel (although we can't really rule
> out a hardware problem without knowing much more about your setup).
>
> Unfortunately, there has been a *large* number of changes since
> version 3.8, and I can't remember all of the changes and bug fixes
> that we might have made in the past year or more (v3.8 dates from
> March 2013).
>
> Something that might be helpful is for you to use xfstests. That's a
> much more thorough set of tests which we've been using, so if you must
> use an antique version of the kernel, it will probably be a much
> better set of tests. It includes fsstress, and much more besides.
> More importantly, commit logs sometimes identify a fix by the
> xfstests failure that it fixed up. So that
> might help you find the bug fix that you need to backport.
>
> For your convenience, there is a simple test framework that makes it
> relatively easy to build and run xfstests under KVM. You can find it
> here:
>
> git://git.kernel.org/pub/scm/fs/ext2/xfstests-bld.git
>
> See the documentation found at:
>
> https://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/README
>
> for more details.
>
> I hope this helps,
>
> - Ted
On Wed, Apr 16, 2014 at 10:30:10AM +0530, Amit Sahrawat wrote:
> 4) Corrupt block group ‘1’ by writing all 1s; we had one file
> containing all 1s, so using ‘dd’:
> dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1
> After this, mount the partition, create a few random-size files, and then
> run ‘fsstress’:
Um, sigh. You didn't say that you were deliberately corrupting the
file system. That wasn't in the subject line, or anywhere else in the
original message.
So the question isn't how the file system got corrupted, but that
you'd prefer that the system recovers without hanging after this
corruption.
I wish you had *said* that. It would have saved me a lot of time,
since I was trying to figure out how the system had gotten so
corrupted (not realizing you had deliberately corrupted the file
system).
So I think that if you had run "tune2fs -e remount-ro /dev/sdb1" before you
started the fsstress, the file system would have been remounted
read-only at the first EXT4-fs error message. This would have
avoided the hang that you saw, since the file system would hopefully
have "failed fast", before the user had the opportunity to put data into
the page cache that would be lost when the system discovered there was
no place to put the data.
Regards,
- Ted
On Apr 15, 2014, at 11:07 PM, Theodore Ts'o <[email protected]> wrote:
> On Wed, Apr 16, 2014 at 10:30:10AM +0530, Amit Sahrawat wrote:
>> 4) Corrupt block group ‘1’ by writing all 1s; we had one file
>> containing all 1s, so using ‘dd’:
>> dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1
>> After this, mount the partition, create a few random-size files, and then
>> run ‘fsstress’:
>
> Um, sigh. You didn't say that you were deliberately corrupting the
> file system. That wasn't in the subject line, or anywhere else in the
> original message.
>
> So the question isn't how the file system got corrupted, but that
> you'd prefer that the system recovers without hanging after this
> corruption.
>
> I wish you had *said* that. It would have saved me a lot of time,
> since I was trying to figure out how the system had gotten so
> corrupted (not realizing you had deliberately corrupted the file
> system).
Don't we check the bitmaps upon load to verify they are not bogus?
Looks like this is disabled completely if flex_bg is enabled, though
I guess that it is not impossible to do some checking even if flex_bg
is enabled.
For example, we have the background thread to do the inode table zeroing,
and it could load the block bitmaps and check the group descriptors for
itable and bitmap locations against the various bitmaps. With flex_bg,
this would probably only load a small subset of block bitmaps.
> So I think that if you had run "tune2fs -e remount-ro /dev/sdb1" before you
> started the fsstress, the file system would have been remounted
> read-only at the first EXT4-fs error message. This would have
> avoided the hang that you saw, since the file system would hopefully
> have "failed fast", before the user had the opportunity to put data into
> the page cache that would be lost when the system discovered there was
> no place to put the data.
Even without remount-ro, it would be possible to flag the block bitmaps
for a particular group as corrupt, so that no new allocations are done
from that group until the next e2fsck is run.
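(A minimal sketch of that idea, reusing the corrupt-group bit introduced
by Darrick's patch set; the helper name here is made up. The allocator
would consult such a predicate before scanning a group:)

	/* Sketch only: a group flagged corrupt is skipped by the allocator
	 * until the next e2fsck clears the problem. */
	static int ext4_group_usable_sketch(struct ext4_group_info *grp)
	{
		return !test_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT,
				 &grp->bb_state);
	}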
Cheers, Andreas
Sorry Ted, if it caused confusion.
There were actually two parts to the problem; the logs in the first
mail were from the original situation, where there were many block
groups, as the error prints also showed.
EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1493, 0
clusters in bitmap, 58339 in gd
EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1000, 0
clusters in bitmap, 3 in gd
EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1425, 0
clusters in bitmap, 1 in gd
JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
a risk of filesystem corruption in case of system crash.
JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
a risk of filesystem corruption in case of system crash.
1) Original case: the disk got corrupted and we only had the
logs and the hung task messages, but not the HDD on which the issue
was observed.
2) In order to reproduce the problem seen in the logs (which
highlighted the bitmap corruption), and to minimize the environment
and make a proper test case, we created a smaller partition with only
2 groups and intentionally corrupted group 1 (our intention was just
to replicate the error scenario).
3) After the corruption we used ‘fsstress’ and got a problem similar
to the one in the original logs. We shared our analysis of the
looping in writepages due to the free-blocks mismatch after this
point.
4) We came across Darrick's patches (which also mention how to
corrupt the filesystem to reproduce the problem) and applied them to
our environment. They solved the initial problem of looping in
writepages, but now we got hangs at other places.
Using ‘tune2fs’ is not a viable solution in our case; we can only
provide a solution via kernel changes. So, we made the changes
shared earlier.
> So the question isn't how the file system got corrupted, but that
> you'd prefer that the system recovers without hanging after this
> corruption.
Yes, our priority is to keep the system running.
Again, sorry for the confusion. The intention was just to show the
original problem and what we did in order to replicate it.
Thanks & Regards,
Amit Sahrawat
On Wed, Apr 16, 2014 at 10:37 AM, Theodore Ts'o <[email protected]> wrote:
> On Wed, Apr 16, 2014 at 10:30:10AM +0530, Amit Sahrawat wrote:
>> 4) Corrupt block group ‘1’ by writing all 1s; we had one file
>> containing all 1s, so using ‘dd’:
>> dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1
>> After this, mount the partition, create a few random-size files, and then
>> run ‘fsstress’:
>
> Um, sigh. You didn't say that you were deliberately corrupting the
> file system. That wasn't in the subject line, or anywhere else in the
> original message.
>
> So the question isn't how the file system got corrupted, but that
> you'd prefer that the system recovers without hanging after this
> corruption.
>
> I wish you had *said* that. It would have saved me a lot of time,
> since I was trying to figure out how the system had gotten so
> corrupted (not realizing you had deliberately corrupted the file
> system).
>
> So I think that if you had run "tune2fs -e remount-ro /dev/sdb1" before you
> started the fsstress, the file system would have been remounted
> read-only at the first EXT4-fs error message. This would have
> avoided the hang that you saw, since the file system would hopefully
> have "failed fast", before the user had the opportunity to put data into
> the page cache that would be lost when the system discovered there was
> no place to put the data.
>
> Regards,
>
> - Ted
On Wed, Apr 16, 2014 at 01:21:34PM +0530, Amit Sahrawat wrote:
> Sorry Ted, if it caused confusion.
>
> There were actually two parts to the problem; the logs in the first
> mail were from the original situation, where there were many block
> groups, as the error prints also showed.
>
> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1493, 0
> clusters in bitmap, 58339 in gd
> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1000, 0
> clusters in bitmap, 3 in gd
> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1425, 0
> clusters in bitmap, 1 in gd
> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
> a risk of filesystem corruption in case of system crash.
> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
> a risk of filesystem corruption in case of system crash.
>
> 1) Original case: the disk got corrupted and we only had the
> logs and the hung task messages, but not the HDD on which the issue
> was observed.
> 2) In order to reproduce the problem seen in the logs (which
> highlighted the bitmap corruption), and to minimize the environment
> and make a proper test case, we created a smaller partition with only
> 2 groups and intentionally corrupted group 1 (our intention was just
> to replicate the error scenario).
I'm assuming that the original broken fs simply had a corrupt block bitmap, and
that the dd thing was just to simulate that corruption in a testing
environment?
> 3) After the corruption we used ‘fsstress’ and got a problem similar
> to the one in the original logs. We shared our analysis of the
> looping in writepages due to the free-blocks mismatch after this point.
Hm. I tried it with 3.15-rc1 and didn't see any hangs. Corrupt bitmaps shut
down allocations from the block group and the FS continues, as expected.
> 4) We came across Darrick's patches (which also mention how to
> corrupt the filesystem to reproduce the problem) and applied them to
> our environment. They solved the initial problem of looping in
> writepages, but now we got hangs at other places.
There are hundreds of Darrick patches ... to which one are you referring? :)
(What was the subject line?)
> Using ‘tune2fs’ is not a viable solution in our case; we can only
> provide a solution via kernel changes. So, we made the changes
> shared earlier.
Would it help if you could set errors=remount-ro in mke2fs?
--D
> > So the question isn't how the file system got corrupted, but that
> > you'd prefer that the system recovers without hanging after this
> > corruption.
>
> Yes, our priority is to keep the system running.
>
> Again, sorry for the confusion. The intention was just to show the
> original problem and what we did in order to replicate it.
>
> Thanks & Regards,
> Amit Sahrawat
>
>
> On Wed, Apr 16, 2014 at 10:37 AM, Theodore Ts'o <[email protected]> wrote:
> > On Wed, Apr 16, 2014 at 10:30:10AM +0530, Amit Sahrawat wrote:
> >> 4) Corrupt block group ‘1’ by writing all 1s; we had one file
> >> containing all 1s, so using ‘dd’:
> >> dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1
> >> After this, mount the partition, create a few random-size files, and then
> >> run ‘fsstress’:
> >
> > Um, sigh. You didn't say that you were deliberately corrupting the
> > file system. That wasn't in the subject line, or anywhere else in the
> > original message.
> >
> > So the question isn't how the file system got corrupted, but that
> > you'd prefer that the system recovers without hanging after this
> > corruption.
> >
> > I wish you had *said* that. It would have saved me a lot of time,
> > since I was trying to figure out how the system had gotten so
> > corrupted (not realizing you had deliberately corrupted the file
> > system).
> >
> > So I think that if you had run "tune2fs -e remount-ro /dev/sdb1" before you
> > started the fsstress, the file system would have been remounted
> > read-only at the first EXT4-fs error message. This would have
> > avoided the hang that you saw, since the file system would hopefully
> > have "failed fast", before the user had the opportunity to put data into
> > the page cache that would be lost when the system discovered there was
> > no place to put the data.
> >
> > Regards,
> >
> > - Ted
Hi Darrick,
Thanks for the reply, sorry for responding late.
On Wed, Apr 16, 2014 at 11:16 PM, Darrick J. Wong
<[email protected]> wrote:
> On Wed, Apr 16, 2014 at 01:21:34PM +0530, Amit Sahrawat wrote:
>> Sorry Ted, if it caused confusion.
>>
>> There were actually two parts to the problem; the logs in the first
>> mail were from the original situation, where there were many block
>> groups, as the error prints also showed.
>>
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1493, 0
>> clusters in bitmap, 58339 in gd
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1000, 0
>> clusters in bitmap, 3 in gd
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1425, 0
>> clusters in bitmap, 1 in gd
>> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>>
>> 1) Original case: the disk got corrupted and we only had the
>> logs and the hung task messages, but not the HDD on which the issue
>> was observed.
>> 2) In order to reproduce the problem seen in the logs (which
>> highlighted the bitmap corruption), and to minimize the environment
>> and make a proper test case, we created a smaller partition with only
>> 2 groups and intentionally corrupted group 1 (our intention was just
>> to replicate the error scenario).
>
> I'm assuming that the original broken fs simply had a corrupt block bitmap, and
> that the dd thing was just to simulate that corruption in a testing
> environment?
Yes, we did so in order to replicate the error scenario.
>
>> 3) After the corruption we used ‘fsstress’ and got a problem similar
>> to the one in the original logs. We shared our analysis of the
>> looping in writepages due to the free-blocks mismatch after this point.
>
> Hm. I tried it with 3.15-rc1 and didn't see any hangs. Corrupt bitmaps shut
> down allocations from the block group and the FS continues, as expected.
>
We are using kernel version 3.8, so we cannot switch to 3.15-rc1. That
is a limitation currently.
>> 4) We came across Darrick's patches (which also mention how to
>> corrupt the filesystem to reproduce the problem) and applied them to
>> our environment. They solved the initial problem of looping in
>> writepages, but now we got hangs at other places.
>
> There are hundreds of Darrick patches ... to which one are you referring? :)
> (What was the subject line?)
>
ext4: error out if verifying the block bitmap fails
ext4: fix type declaration of ext4_validate_block_bitmap
ext4: mark block group as corrupt on block bitmap error
ext4: mark block group as corrupt on inode bitmap error
ext4: mark group corrupt on group descriptor checksum
ext4: don't count free clusters from a corrupt block group
So, the patches help in marking the block group as corrupt and avoid
further allocation from it. But consider the normal write path using
write_begin: since there is a mismatch between the free cluster count
from the group descriptor and the bitmap, write_begin copies the data
and marks the pages dirty, but writepages later gets ENOSPC when it
actually does the allocation.
So, our thinking is that if we are marking the block group as corrupt,
we should also subtract that group's free block count from
s_freeclusters_counter. This will make sure we have a valid free
cluster count, and ‘ENOSPC’ can be returned from write_begin instead
of propagating down such paths to writepages.
We made a change like this:
@@ -737,14 +737,18 @@ void ext4_mb_generate_buddy(struct super_block *sb,
 	grp->bb_fragments = fragments;
 
 	if (free != grp->bb_free) {
+		struct ext4_sb_info *sbi = EXT4_SB(sb);
 		ext4_grp_locked_error(sb, group, 0, 0,
 				      "%u clusters in bitmap, %u in gd; "
 				      "block bitmap corrupt.",
 				      free, grp->bb_free);
 		/*
 		 * If we intend to continue, we consider group descriptor
 		 * corrupt and update bb_free using bitmap value
 		 */
+		percpu_counter_sub(&sbi->s_freeclusters_counter, grp->bb_free);
 		grp->bb_free = free;
 		set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
 	}
 	mb_set_largest_free_order(sb, grp);
Is this the correct method, or are we missing something? Please
share your opinion.
>> Using ‘tune2fs’ is not a viable solution in our case; we can only
>> provide a solution via kernel changes. So, we made the changes
>> shared earlier.
>
> Would it help if you could set errors=remount-ro in mke2fs?
>
Sorry, we cannot reformat or use tune2fs to change the ‘errors’ value.
> --D
>> > So the question isn't how the file system got corrupted, but that
>> > you'd prefer that the system recovers without hanging after this
>> > corruption.
>>
>> Yes, our priority is to keep the system running.
>>
>> Again, sorry for the confusion. The intention was just to show the
>> original problem and what we did in order to replicate it.
>>
>> Thanks & Regards,
>> Amit Sahrawat
>>
>>
>> On Wed, Apr 16, 2014 at 10:37 AM, Theodore Ts'o <[email protected]> wrote:
>> > On Wed, Apr 16, 2014 at 10:30:10AM +0530, Amit Sahrawat wrote:
>> >> 4) Corrupt block group ‘1’ by writing all 1s; we had one file
>> >> containing all 1s, so using ‘dd’:
>> >> dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1
>> >> After this, mount the partition, create a few random-size files, and then
>> >> run ‘fsstress’:
>> >
>> > Um, sigh. You didn't say that you were deliberately corrupting the
>> > file system. That wasn't in the subject line, or anywhere else in the
>> > original message.
>> >
>> > So the question isn't how the file system got corrupted, but that
>> > you'd prefer that the system recovers without hanging after this
>> > corruption.
>> >
>> > I wish you had *said* that. It would have saved me a lot of time,
>> > since I was trying to figure out how the system had gotten so
>> > corrupted (not realizing you had deliberately corrupted the file
>> > system).
>> >
>> > So I think that if you had run "tune2fs -e remount-ro /dev/sdb1" before you
>> > started the fsstress, the file system would have been remounted
>> > read-only at the first EXT4-fs error message. This would have
>> > avoided the hang that you saw, since the file system would hopefully
>> > have "failed fast", before the user had the opportunity to put data into
>> > the page cache that would be lost when the system discovered there was
>> > no place to put the data.
>> >
>> > Regards,
>> >
>> > - Ted
On Tue, Apr 22, 2014 at 11:19:32AM +0530, Amit Sahrawat wrote:
> Hi Darrick,
>
> Thanks for the reply, sorry for responding late.
>
> On Wed, Apr 16, 2014 at 11:16 PM, Darrick J. Wong
> <[email protected]> wrote:
> > On Wed, Apr 16, 2014 at 01:21:34PM +0530, Amit Sahrawat wrote:
> >> Sorry Ted, if it caused confusion.
> >>
> >> There were actually two parts to the problem; the logs in the first
> >> mail were from the original situation, where there were many block
> >> groups, as the error prints also showed.
> >>
> >> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1493, 0
> >> clusters in bitmap, 58339 in gd
> >> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1000, 0
> >> clusters in bitmap, 3 in gd
> >> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1425, 0
> >> clusters in bitmap, 1 in gd
> >> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
> >> a risk of filesystem corruption in case of system crash.
> >> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
> >> a risk of filesystem corruption in case of system crash.
> >>
> >> 1) Original case: the disk got corrupted and we only had the
> >> logs and the hung task messages, but not the HDD on which the issue
> >> was observed.
> >> 2) In order to reproduce the problem seen in the logs (which
> >> highlighted the bitmap corruption), and to minimize the environment
> >> and make a proper test case, we created a smaller partition with only
> >> 2 groups and intentionally corrupted group 1 (our intention was just
> >> to replicate the error scenario).
> >
> > I'm assuming that the original broken fs simply had a corrupt block bitmap, and
> > that the dd thing was just to simulate that corruption in a testing
> > environment?
>
> Yes, we did so in order to replicate the error scenario.
> >
> >> 3) After the corruption we used ‘fsstress’ and got a problem similar
> >> to the one in the original logs. We shared our analysis of the
> >> looping in writepages due to the free-blocks mismatch after this point.
> >
> > Hm. I tried it with 3.15-rc1 and didn't see any hangs. Corrupt bitmaps shut
> > down allocations from the block group and the FS continues, as expected.
> >
> We are using kernel version 3.8, so we cannot switch to 3.15-rc1. That
> is a limitation currently.
>
> >> 4) We came across Darrick's patches (which also mention how to
> >> corrupt the filesystem to reproduce the problem) and applied them to
> >> our environment. They solved the initial problem of looping in
> >> writepages, but now we got hangs at other places.
> >
> > There are hundreds of Darrick patches ... to which one are you referring? :)
> > (What was the subject line?)
> >
> ext4: error out if verifying the block bitmap fails
> ext4: fix type declaration of ext4_validate_block_bitmap
> ext4: mark block group as corrupt on block bitmap error
> ext4: mark block group as corrupt on inode bitmap error
> ext4: mark group corrupt on group descriptor checksum
> ext4: don't count free clusters from a corrupt block group
Ok, thank you for clarifying. :)
> So, the patches help in marking the block group as corrupt and avoid
> further allocation from it. But consider the normal write path using
> write_begin: since there is a mismatch between the free cluster count
> from the group descriptor and the bitmap, write_begin copies the data
> and marks the pages dirty, but writepages later gets ENOSPC when it
> actually does the allocation.
>
> So, our thinking is that if we are marking the block group as corrupt,
> we should also subtract that group's free block count from
> s_freeclusters_counter. This will make sure we have a valid free
> cluster count, and ‘ENOSPC’ can be returned from write_begin instead
> of propagating down such paths to writepages.
>
> We made a change like this:
>
> @@ -737,14 +737,18 @@ void ext4_mb_generate_buddy(struct super_block *sb,
>  	grp->bb_fragments = fragments;
> 
>  	if (free != grp->bb_free) {
> +		struct ext4_sb_info *sbi = EXT4_SB(sb);
>  		ext4_grp_locked_error(sb, group, 0, 0,
>  				      "%u clusters in bitmap, %u in gd; "
>  				      "block bitmap corrupt.",
>  				      free, grp->bb_free);
>  		/*
>  		 * If we intend to continue, we consider group descriptor
>  		 * corrupt and update bb_free using bitmap value
>  		 */
> +		percpu_counter_sub(&sbi->s_freeclusters_counter, grp->bb_free);
>  		grp->bb_free = free;
>  		set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
>  	}
>  	mb_set_largest_free_order(sb, grp);
>
> Is this the correct method, or are we missing something? Please
> share your opinion.
I think this looks ok. If you send a proper patch doing this to the mailing
list, I'll officially review it.
> >> Using ‘tune2fs’ is not a viable solution in our case; we can only
> >> provide a solution via kernel changes. So, we made the changes
> >> shared earlier.
> >
> > Would it help if you could set errors=remount-ro in mke2fs?
> >
> Sorry, we cannot reformat or use tune2fs to change the ‘errors’ value.
I apologize, my question was unclear; what I meant to ask is, would it have
been helpful if you could have set errors=remount-ro back when you ran mke2fs?
Now that the format's been done, I suppose the only recourse is mount -o
remount,errors=remount-ro (online) or tune2fs (offline).
--D
>
> > --D
> >> > So the question isn't how the file system got corrupted, but that
> >> > you'd prefer that the system recovers without hanging after this
> >> > corruption.
> >>
> >> Yes, our priority is to keep the system running.
> >>
> >> Again, sorry for the confusion. The intention was just to show the
> >> original problem and what we did in order to replicate it.
> >>
> >> Thanks & Regards,
> >> Amit Sahrawat
> >>
> >>
> >> On Wed, Apr 16, 2014 at 10:37 AM, Theodore Ts'o <[email protected]> wrote:
> >> > On Wed, Apr 16, 2014 at 10:30:10AM +0530, Amit Sahrawat wrote:
> >> >> 4) Corrupt block group ‘1’ by writing all 1s; we had one file
> >> >> containing all 1s, so using ‘dd’:
> >> >> dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1
> >> >> After this, mount the partition, create a few random-size files, and then
> >> >> run ‘fsstress’:
> >> >
> >> > Um, sigh. You didn't say that you were deliberately corrupting the
> >> > file system. That wasn't in the subject line, or anywhere else in the
> >> > original message.
> >> >
> >> > So the question isn't how the file system got corrupted, but that
> >> > you'd prefer that the system recovers without hanging after this
> >> > corruption.
> >> >
> >> > I wish you had *said* that. It would have saved me a lot of time,
> >> > since I was trying to figure out how the system had gotten so
> >> > corrupted (not realizing you had deliberately corrupted the file
> >> > system).
> >> >
> >> > So I think that if you had run "tune2fs -e remount-ro /dev/sdb1" before you
> >> > started the fsstress, the file system would have been remounted
> >> > read-only at the first EXT4-fs error message. This would have
> >> > avoided the hang that you saw, since the file system would hopefully
> >> > have "failed fast", before the user had the opportunity to put data into
> >> > the page cache that would be lost when the system discovered there was
> >> > no place to put the data.
> >> >
> >> > Regards,
> >> >
> >> > - Ted