Hi All,
I have a question about timeout/error handling in ext4, and I'm hoping
for tips on how to improve the situation :)
The case:
Virtual machines (different Linux versions) with ext4 + LVM.
All the VMs are running on ESXi and using an NFS SAN.
There was a short disruption of the SAN, which caused it to be
unavailable for 20-25 minutes for the ESXi.
What was observed in the VM was the following:
- Task hangs:
[5877185.755481] RSP: 002b:00007fc1253bc9d0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[5877185.755481] RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00007fc138f3f4a7
[5877185.755482] RDX: 0000000000000102 RSI: 00007fc1253bcb80 RDI: 000000000000000a
[5877185.755482] RBP: 00007fc1253bcb80 R08: 0000000000000000 R09: 0000000000000101
[5877185.755483] R10: 0000560111be2660 R11: 0000000000000293 R12: 0000000000000102
[5877185.755483] R13: 00007fc1253bcb80 R14: 0000000000000000 R15: 00007fc1253bcb80
[5877185.755488] INFO: task kworker/u256:0:34652 blocked for more than 120 seconds.
[5877185.755505] Not tainted 4.19.0-6-amd64 #1 Debian 4.19.67-2+deb10u1
[5877185.755520] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[5877185.755538] kworker/u256:0 D 0 34652 2 0x80000000
[5877185.755541] Workqueue: writeback wb_workfn (flush-254:2)
[5877185.755542] Call Trace:
[5877185.755543] ? __schedule+0x2a2/0x870
[5877185.755544] ? bit_wait_timeout+0x90/0x90
[5877185.755545] schedule+0x28/0x80
[5877185.755546] io_schedule+0x12/0x40
[5877185.755547] bit_wait_io+0xd/0x50
[5877185.755548] __wait_on_bit_lock+0x5d/0xa0
[5877185.755549] out_of_line_wait_on_bit_lock+0x91/0xb0
[5877185.755550] ? init_wait_var_entry+0x40/0x40
[5877185.755552] do_get_write_access+0x1d8/0x430 [jbd2]
[5877185.755559] ? ext4_dirty_inode+0x46/0x60 [ext4]
[5877185.755561] jbd2_journal_get_write_access+0x37/0x50 [jbd2]
[5877185.755566] __ext4_journal_get_write_access+0x36/0x70 [ext4]
[5877185.755573] ext4_reserve_inode_write+0x96/0xc0 [ext4]
[5877185.755575] ? jbd2_journal_dirty_metadata+0x117/0x230 [jbd2]
[5877185.755581] ext4_mark_inode_dirty+0x51/0x1d0 [ext4]
[5877185.755587] ? __ext4_journal_start_sb+0x34/0x120 [ext4]
[5877185.755593] ext4_dirty_inode+0x46/0x60 [ext4]
[5877185.755594] __mark_inode_dirty+0x1ba/0x380
[5877185.755600] ext4_da_update_reserve_space+0xcf/0x180 [ext4]
[5877185.755610] ext4_ext_map_blocks+0xb46/0xda0 [ext4]
[5877185.755617] ext4_map_blocks+0xed/0x5f0 [ext4]
[5877185.755623] ext4_writepages+0x9ba/0xf00 [ext4]
[5877185.755626] ? do_writepages+0x41/0xd0
[5877185.755632] ? ext4_mark_inode_dirty+0x1d0/0x1d0 [ext4]
[5877185.755633] do_writepages+0x41/0xd0
[5877185.755639] ? blk_mq_sched_dispatch_requests+0x11e/0x170
[5877185.755641] __writeback_single_inode+0x3d/0x350
[5877185.755642] writeback_sb_inodes+0x1e3/0x450
[5877185.755644] __writeback_inodes_wb+0x5d/0xb0
[5877185.755645] wb_writeback+0x25f/0x2f0
[5877185.755646] ? get_nr_inodes+0x35/0x50
[5877185.755648] ? cpumask_next+0x16/0x20
- After 1080 seconds (SCSI timeout of 180s * (5 retries + 1)):
[5878128.028672] sd 0:0:1:0: timing out command, waited 1080s
[5878128.028701] sd 0:0:1:0: [sdb] tag#592 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[5878128.028703] sd 0:0:1:0: [sdb] tag#592 CDB: Write(10) 2a 00 06 0c b4 c8 00 00 08 00
[5878128.028704] print_req_error: I/O error, dev sdb, sector 101496008
[5878128.028736] EXT4-fs warning (device dm-2): ext4_end_bio:323: I/O error 10 writing to inode 3145791 (offset 0 size 0 starting block 12686745)
So you see the I/O is getting aborted.
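For reference, the 1080s is the per-device SCSI command timeout (180s
here) times the 6 attempts (1 original + 5 retries) before the command
is given up on. The per-device timeout can be read from sysfs, e.g. for
sdb from the log above:

# cat /sys/block/sdb/device/timeout
180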
- When the SAN came back, the filesystem went read-only:
[5878601.212415] EXT4-fs error (device dm-2): ext4_journal_check_start:61: Detected aborted journal
[5878601.212491] JBD2: Detected IO errors while flushing file data on dm-2-8
[5878601.212571] EXT4-fs error (device dm-0): ext4_journal_check_start:61: Detected aborted journal
[5878601.212879] EXT4-fs (dm-2): Remounting filesystem read-only
[5878601.213483] EXT4-fs error (device dm-0): ext4_journal_check_start:61: Detected aborted journal
[5878601.213512] EXT4-fs error (device dm-2): ext4_journal_check_start:61: Detected aborted journal
[5878601.213513] EXT4-fs (dm-2): Remounting filesystem read-only
[5878601.215667] EXT4-fs error (device dm-0): ext4_journal_check_start:61: Detected aborted journal
[5878601.216347] EXT4-fs error (device dm-2): ext4_journal_check_start:61: Detected aborted journal
[5878601.217089] EXT4-fs (dm-2): ext4_writepages: jbd2_start: 12288 pages, ino 786630; err -30
[5878601.217963] EXT4-fs error (device dm-0): ext4_journal_check_start:61: Detected aborted journal
[5878601.218253] EXT4-fs error (device dm-0): ext4_journal_check_start:61: Detected aborted journal
[5878601.219040] EXT4-fs error (device dm-2): ext4_journal_check_start:61: Detected aborted journal
[5878601.219049] EXT4-fs error (device dm-2): ext4_journal_check_start:61: Detected aborted journal
Now I did a hard reset of the machine, and a manual fsck was needed to
get it booting again.
Fsck was showing the following:
"Inodes that were part of a corrupted orphan linked list found."
Manual fsck started with the following:
Inodes that were part of a corrupted orphan linked list found. Fix<y>?
Inode 165708 was part of the orphaned inode list. FIXED
Block bitmap differences: -(863328--863355)
Fix<y>?
What worries me is that almost all of the VMs (out of 500) were showing
the same error. And even some (+-10) were completely corrupt.
If you do a hard reset of the VMs as a test (without a storage outage),
the ratio of VMs that are corrupt and need a manual fsck is WAY lower.
Is there, for example, a chance that the filesystem gets corrupted the
moment the SAN storage becomes accessible again?
I also have a snapshot of a corrupted disk available if additional
debugging info is required.
It would be great to gather some feedback on how to improve the
situation (next to, of course, not having a SAN outage :)).
On KVM, for example, there is an unlimited timeout (AFAIK) until the
storage is back, and the VM just continues running after storage recovery.
I tried to reproduce the issue in our lab (by cutting the network
between the ESXi host and the SAN), but we were unable to simulate the
exact same filesystem error.
Thanks
Jean-Louis
On Fri, Jan 24, 2020 at 11:57:10AM +0100, Jean-Louis Dupond wrote:
>
> There was a short disruption of the SAN, which caused it to be unavailable
> for 20-25 minutes for the ESXi.
20-25 minutes is "short"? I guess it depends on your definition / POV. :-)
> What was observed in the VM was the following:
> <hung task warning>
OK, to be expected.
>
> - After 1080 seconds (SCSi Timeout of 180 * 5 Retries + 1):
> [5878128.028672] sd 0:0:1:0: timing out command, waited 1080s
> [5878128.028701] sd 0:0:1:0: [sdb] tag#592 FAILED Result: hostbyte=DID_OK
> driverbyte=DRIVER_OK
> [5878128.028703] sd 0:0:1:0: [sdb] tag#592 CDB: Write(10) 2a 00 06 0c b4 c8
> 00 00 08 00
> [5878128.028704] print_req_error: I/O error, dev sdb, sector 101496008
> [5878128.028736] EXT4-fs warning (device dm-2): ext4_end_bio:323: I/O error
> 10 writing to inode 3145791 (offset 0 size 0 starting block 12686745)
>
> So you see the I/O is getting aborted.
Also expected.
> - When the SAN came back, then the filesystem went Read-Only:
> [5878601.212415] EXT4-fs error (device dm-2): ext4_journal_check_start:61:
> Detected aborted journal
Yep....
> Now I did a hard reset of the machine, and a manual fsck was needed to get
> it booting again.
> Fsck was showing the following:
> "Inodes that were part of a corrupted orphan linked list found."
>
> Manual fsck started with the following:
> Inodes that were part of a corrupted orphan linked list found. Fix<y>?
> Inode 165708 was part of the orphaned inode list. FIXED
>
> Block bitmap differences: -(863328--863355)
> Fix<y>?
>
> What worries me is that almost all of the VM's (out of 500) were showing the
> same error.
So that's a bit surprising...
> And even some (+-10) were completely corrupt.
What do you mean by "completely corrupt"? Can you send an e2fsck
transcript of file systems that were "completely corrupt"?
> Is there for example a chance that the filesystem gets corrupted the moment
> the SAN storage was back accessible?
Hmm... the one possibility I can think of off the top of my head is
that in order to mark the file system as containing an error, we need
to write to the superblock. The head of the linked list of orphan
inodes is also in the superblock. If that had gotten modified in the
intervening 20-25 minutes, it's possible that this would result in
orphaned inodes not on the linked list, causing that error.
It doesn't explain the more severe cases of corruption, though.
> I also have some snapshot available of a corrupted disk if some additional
> debugging info is required.
Before e2fsck was run? Can you send me a copy of the output of
dumpe2fs run on that disk, and then transcript of e2fsck -fy run on a
copy of that snapshot?
> It would be great to gather some feedback on how to improve the situation
> (next to of course have no SAN outage :)).
Something that you could consider is setting up your system to trigger
a panic/reboot on a hung task timeout, or when ext4 detects an error
(see the man page of tune2fs and mke2fs and the -e option for those
programs).
There are tradeoffs with this, but if you've lost the SAN for 15-30
minutes, the file systems are going to need to be checked anyway, and
the machine will certainly not be serving. So forcing a reboot might
be the best thing to do.
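Roughly, the knobs involved look something like this (the device name
and the values are just placeholder examples):

tune2fs -e panic /dev/mapper/vgXX-root    # or mount with errors=panic
sysctl -w kernel.hung_task_panic=1        # panic instead of just warning
sysctl -w kernel.panic=30                 # auto-reboot 30s after a panic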
> On KVM for example there is a unlimited timeout (afaik) until the storage is
> back, and the VM just continues running after storage recovery.
Well, you can adjust the SCSI timeout, if you want to give that a try....
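E.g., something like this (per device, value in seconds; it doesn't
persist across reboots):

echo 1800 > /sys/block/sdb/device/timeout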
Cheers,
- Ted
As the mail seems to have been trashed somewhere, I'll retry :)
Thanks
Jean-Louis
On 24/01/2020 21:37, Theodore Y. Ts'o wrote:
> On Fri, Jan 24, 2020 at 11:57:10AM +0100, Jean-Louis Dupond wrote:
>> There was a short disruption of the SAN, which caused it to be
>> unavailable
>> for 20-25 minutes for the ESXi.
> 20-25 minutes is "short"? I guess it depends on your definition / POV. :-)
Well, more downtime was caused by the recovery (due to the manual fsck)
than by the time the storage was down :)
>
>> What worries me is that almost all of the VM's (out of 500) were
>> showing the
>> same error.
> So that's a bit surprising...
Indeed, that's where I thought something went wrong!
I've tried to simulate it, and was able to reproduce the same error when
we let the SAN recover BEFORE the VM is shut down.
If I power off the VM and then recover the SAN, it does an automatic
fsck without problems.
So it really seems it breaks when the VM can write to the SAN again.
>
>> And even some (+-10) were completely corrupt.
> What do you mean by "completely corrupt"? Can you send an e2fsck
> transcript of file systems that were "completely corrupt"?
Well, it was moving tons of files to lost+found, etc. So that was really
broken.
I'll see if I can recover a backup of one in the broken state.
Anyway, this was only a very small percentage, so it worries me less
than the rest :)
>
>> Is there for example a chance that the filesystem gets corrupted the
>> moment
>> the SAN storage was back accessible?
> Hmm... the one possibility I can think of off the top of my head is
> that in order to mark the file system as containing an error, we need
> to write to the superblock. The head of the linked list of orphan
> inodes is also in the superblock. If that had gotten modified in the
> intervening 20-25 minutes, it's possible that this would result in
> orphaned inodes not on the linked list, causing that error.
>
> It doesn't explain the more severe cases of corruption, though.
If fixing that would have left us with only 10 corrupt disks instead of
500, that would be a big win :)
>
>> I also have some snapshot available of a corrupted disk if some
>> additional
>> debugging info is required.
> Before e2fsck was run? Can you send me a copy of the output of
> dumpe2fs run on that disk, and then transcript of e2fsck -fy run on a
> copy of that snapshot?
Sure:
dumpe2fs -> see attachment
Fsck:
# e2fsck -fy /dev/mapper/vg01-root
e2fsck 1.44.5 (15-Dec-2018)
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found. Fix? yes
Inode 165708 was part of the orphaned inode list. FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(863328--863355)
Fix? yes
Free blocks count wrong for group #26 (3485, counted=3513).
Fix? yes
Free blocks count wrong (1151169, counted=1151144).
Fix? yes
Inode bitmap differences: -4401 -165708
Fix? yes
Free inodes count wrong for group #0 (2489, counted=2490).
Fix? yes
Free inodes count wrong for group #20 (1298, counted=1299).
Fix? yes
Free inodes count wrong (395115, counted=395098).
Fix? yes
/dev/mapper/vg01-root: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/vg01-root: 113942/509040 files (0.2% non-contiguous),
882520/2033664 blocks
>
>> It would be great to gather some feedback on how to improve the situation
>> (next to of course have no SAN outage :)).
> Something that you could consider is setting up your system to trigger
> a panic/reboot on a hung task timeout, or when ext4 detects an error
> (see the man page of tune2fs and mke2fs and the -e option for those
> programs).
>
> There are tradeoffs with this, but if you've lost the SAN for 15-30
> minutes, the file systems are going to need to be checked anyway, and
> the machine will certainly not be serving. So forcing a reboot might
> be the best thing to do.
Going to look into that! Thanks for the info.
>> On KVM for example there is a unlimited timeout (afaik) until the
>> storage is
>> back, and the VM just continues running after storage recovery.
> Well, you can adjust the SCSI timeout, if you want to give that a try....
Does it have other disadvantages? Or is it quite safe to increase the
SCSI timeout?
>
> Cheers,
>
> - Ted
The dumpe2fs attachment seems to get blocked.
Uploaded it here: http://dupondje.be/dumpe2fs.txt
On 20/02/2020 10:08, Jean-Louis Dupond wrote:
> As the mail seems to have been trashed somewhere, I'll retry :)
>
> Thanks
> Jean-Louis
>
>
> On 24/01/2020 21:37, Theodore Y. Ts'o wrote:
>> On Fri, Jan 24, 2020 at 11:57:10AM +0100, Jean-Louis Dupond wrote:
>>> There was a short disruption of the SAN, which caused it to be
>>> unavailable
>>> for 20-25 minutes for the ESXi.
>> 20-25 minutes is "short"? I guess it depends on your definition /
>> POV. :-)
> Well more downtime was caused to recover (due to manual fsck) then the
> time the storage was down :)
>>
>>> What worries me is that almost all of the VM's (out of 500) were
>>> showing the
>>> same error.
>> So that's a bit surprising...
> Indeed, that's were I thought, something went wrong here!
> I've tried to simulate it, and were able to simulate the same error
> when we let the san recover BEFORE VM is shutdown.
> If I poweroff the VM and then recover the SAN, it does an automatic
> fsck without problems.
> So it really seems it breaks when the VM can write again to the SAN.
>>
>>> And even some (+-10) were completely corrupt.
>> What do you mean by "completely corrupt"? Can you send an e2fsck
>> transcript of file systems that were "completely corrupt"?
> Well it was moving a tons of files to lost+found etc. So that was
> really broken.
> I'll see if I can recover some backup of one in broken state.
> Anyway this was only a very small percentage, so worries me less then
> the rest :)
>>
>>> Is there for example a chance that the filesystem gets corrupted the
>>> moment
>>> the SAN storage was back accessible?
>> Hmm... the one possibility I can think of off the top of my head is
>> that in order to mark the file system as containing an error, we need
>> to write to the superblock. The head of the linked list of orphan
>> inodes is also in the superblock. If that had gotten modified in the
>> intervening 20-25 minutes, it's possible that this would result in
>> orphaned inodes not on the linked list, causing that error.
>>
>> It doesn't explain the more severe cases of corruption, though.
> If fixing that would have left us with only 10 corrupt disks instead
> of 500, would be a big win :)
>>
>>> I also have some snapshot available of a corrupted disk if some
>>> additional
>>> debugging info is required.
>> Before e2fsck was run? Can you send me a copy of the output of
>> dumpe2fs run on that disk, and then transcript of e2fsck -fy run on a
>> copy of that snapshot?
> Sure:
> dumpe2fs -> see attachment
>
> Fsck:
> # e2fsck -fy /dev/mapper/vg01-root
> e2fsck 1.44.5 (15-Dec-2018)
> Pass 1: Checking inodes, blocks, and sizes
> Inodes that were part of a corrupted orphan linked list found. Fix? yes
>
> Inode 165708 was part of the orphaned inode list. FIXED.
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Block bitmap differences: -(863328--863355)
> Fix? yes
>
> Free blocks count wrong for group #26 (3485, counted=3513).
> Fix? yes
>
> Free blocks count wrong (1151169, counted=1151144).
> Fix? yes
>
> Inode bitmap differences: -4401 -165708
> Fix? yes
>
> Free inodes count wrong for group #0 (2489, counted=2490).
> Fix? yes
>
> Free inodes count wrong for group #20 (1298, counted=1299).
> Fix? yes
>
> Free inodes count wrong (395115, counted=395098).
> Fix? yes
>
>
> /dev/mapper/vg01-root: ***** FILE SYSTEM WAS MODIFIED *****
> /dev/mapper/vg01-root: 113942/509040 files (0.2% non-contiguous),
> 882520/2033664 blocks
>
>>
>>> It would be great to gather some feedback on how to improve the
>>> situation
>>> (next to of course have no SAN outage :)).
>> Something that you could consider is setting up your system to trigger
>> a panic/reboot on a hung task timeout, or when ext4 detects an error
>> (see the man page of tune2fs and mke2fs and the -e option for those
>> programs).
>>
>> There are tradeoffs with this, but if you've lost the SAN for 15-30
>> minutes, the file systems are going to need to be checked anyway, and
>> the machine will certainly not be serving. So forcing a reboot might
>> be the best thing to do.
> Going to look into that! Thanks for the info.
>>> On KVM for example there is a unlimited timeout (afaik) until the
>>> storage is
>>> back, and the VM just continues running after storage recovery.
>> Well, you can adjust the SCSI timeout, if you want to give that a
>> try....
> It has some other disadvantages? Or is it quite safe to increment the
> SCSI timeout?
>>
>> Cheers,
>>
>> - Ted
>
>
On Thu, Feb 20, 2020 at 10:08:44AM +0100, Jean-Louis Dupond wrote:
> dumpe2fs -> see attachment
Looking at the dumpe2fs output, it's interesting that it was "clean
with errors", without any error information being logged in the
superblock. What version of the kernel are you using? I'm guessing
it's a fairly old one?
> Fsck:
> # e2fsck -fy /dev/mapper/vg01-root
> e2fsck 1.44.5 (15-Dec-2018)
And that's an old version of e2fsck as well. Is this some kind of
stable/enterprise Linux distro?
> Pass 1: Checking inodes, blocks, and sizes
> Inodes that were part of a corrupted orphan linked list found.  Fix? yes
>
> Inode 165708 was part of the orphaned inode list.  FIXED.
OK, this and the rest look like they're related to a file truncation or
deletion at the time of the disconnection.
> > > On KVM for example there is a unlimited timeout (afaik) until the
> > > storage is
> > > back, and the VM just continues running after storage recovery.
> > Well, you can adjust the SCSI timeout, if you want to give that a try....
> It has some other disadvantages? Or is it quite safe to increment the SCSI
> timeout?
It should be pretty safe.
Can you reliably reproduce the problem by disconnecting the machine
from the SAN?
- Ted
On 20/02/2020 16:50, Theodore Y. Ts'o wrote:
> On Thu, Feb 20, 2020 at 10:08:44AM +0100, Jean-Louis Dupond wrote:
>> dumpe2fs -> see attachment
> Looking at the dumpe2fs output, it's interesting that it was "clean
> with errors", without any error information being logged in the
> superblock. What version of the kernel are you using? I'm guessing
> it's a fairly old one?
Debian 10 (Buster), with kernel 4.19.67-2+deb10u1
>> Fsck:
>> # e2fsck -fy /dev/mapper/vg01-root
>> e2fsck 1.44.5 (15-Dec-2018)
> And that's a old version of e2fsck as well. Is this some kind of
> stable/enterprise linux distro?
Debian 10 indeed.
>> Pass 1: Checking inodes, blocks, and sizes
>> Inodes that were part of a corrupted orphan linked list found. Fix? yes
>>
>> Inode 165708 was part of the orphaned inode list. FIXED.
> OK, this and the rest looks like it's relating to a file truncation or
> deletion at the time of the disconnection.
>
> > > > On KVM for example there is a unlimited timeout (afaik) until the
>>>> storage is
>>>> back, and the VM just continues running after storage recovery.
>>> Well, you can adjust the SCSI timeout, if you want to give that a try....
>> It has some other disadvantages? Or is it quite safe to increment the SCSI
>> timeout?
> It should be pretty safe.
>
> Can you reliably reproduce the problem by disconnecting the machine
> from the SAN?
Yep, it can be reproduced by killing the connection to the SAN while the
VM is running and then, after the SCSI timeout has passed, re-enabling
the SAN connection.
Then reset the machine, and you need to run an fsck to get it back
online.
> - Ted
FYI,
Just did the same test with e2fsprogs 1.45.5 (from buster-backports) and
kernel 5.4.13-1~bpo10+1.
And I'm having exactly the same issue.
The VM needs a manual fsck after the storage outage.
Don't know if it's useful to test with 5.5 or 5.6?
But it seems like the issue still exists.
Thanks
Jean-Louis
On 20/02/2020 17:14, Jean-Louis Dupond wrote:
>
> On 20/02/2020 16:50, Theodore Y. Ts'o wrote:
>> On Thu, Feb 20, 2020 at 10:08:44AM +0100, Jean-Louis Dupond wrote:
>>> dumpe2fs -> see attachment
>> Looking at the dumpe2fs output, it's interesting that it was "clean
>> with errors", without any error information being logged in the
>> superblock. What version of the kernel are you using? I'm guessing
>> it's a fairly old one?
> Debian 10 (Buster), with kernel 4.19.67-2+deb10u1
>>> Fsck:
>>> # e2fsck -fy /dev/mapper/vg01-root
>>> e2fsck 1.44.5 (15-Dec-2018)
>> And that's a old version of e2fsck as well. Is this some kind of
>> stable/enterprise linux distro?
> Debian 10 indeed.
>>> Pass 1: Checking inodes, blocks, and sizes
>>> Inodes that were part of a corrupted orphan linked list found. Fix?
>>> yes
>>>
>>> Inode 165708 was part of the orphaned inode list. FIXED.
>> OK, this and the rest looks like it's relating to a file truncation or
>> deletion at the time of the disconnection.
>>
>> > > > On KVM for example there is a unlimited timeout (afaik) until
>> the
>>>>> storage is
>>>>> back, and the VM just continues running after storage recovery.
>>>> Well, you can adjust the SCSI timeout, if you want to give that a
>>>> try....
>>> It has some other disadvantages? Or is it quite safe to increment
>>> the SCSI
>>> timeout?
>> It should be pretty safe.
>>
>> Can you reliably reproduce the problem by disconnecting the machine
>> from the SAN?
> Yep, can be reproduced by killing the connection to the SAN while the
> VM is running, and then after the scsi timeout passed, re-enabled the
> SAN connection.
> Then reset the machine, and then you need to run an fsck to have it
> back online.
>> - Ted
On Tue, Feb 25, 2020 at 02:19:09PM +0100, Jean-Louis Dupond wrote:
> FYI,
>
> Just did same test with e2fsprogs 1.45.5 (from buster backports) and kernel
> 5.4.13-1~bpo10+1.
> And having exactly the same issue.
> The VM needs a manual fsck after storage outage.
>
> Don't know if its useful to test with 5.5 or 5.6?
> But it seems like the issue still exists.
This is going to be a long shot, but if you could try testing with
5.6-rc3, or with this commit cherry-picked into a 5.4 or later kernel:
commit 8eedabfd66b68a4623beec0789eac54b8c9d0fb6
Author: wangyan <[email protected]>
Date: Thu Feb 20 21:46:14 2020 +0800
jbd2: fix ocfs2 corrupt when clearing block group bits
I found a NULL pointer dereference in ocfs2_block_group_clear_bits().
The running environment:
kernel version: 4.19
A cluster with two nodes, 5 luns mounted on two nodes, and do some
file operations like dd/fallocate/truncate/rm on every lun with storage
network disconnection.
The fallocate operation on dm-23-45 caused an null pointer dereference.
...
... it would be interesting to see if it fixes things for you. I can't
guarantee that it will, but the trigger of the failure which wangyan
found is very similar indeed.
Thanks,
- Ted
On 25/02/2020 18:23, Theodore Y. Ts'o wrote:
> On Tue, Feb 25, 2020 at 02:19:09PM +0100, Jean-Louis Dupond wrote:
>> FYI,
>>
>> Just did same test with e2fsprogs 1.45.5 (from buster backports) and kernel
>> 5.4.13-1~bpo10+1.
>> And having exactly the same issue.
>> The VM needs a manual fsck after storage outage.
>>
>> Don't know if its useful to test with 5.5 or 5.6?
>> But it seems like the issue still exists.
> This is going to be a long shot, but if you could try testing with
> 5.6-rc3, or with this commit cherry-picked into a 5.4 or later kernel:
>
> commit 8eedabfd66b68a4623beec0789eac54b8c9d0fb6
> Author: wangyan <[email protected]>
> Date: Thu Feb 20 21:46:14 2020 +0800
>
> jbd2: fix ocfs2 corrupt when clearing block group bits
>
> I found a NULL pointer dereference in ocfs2_block_group_clear_bits().
> The running environment:
> kernel version: 4.19
> A cluster with two nodes, 5 luns mounted on two nodes, and do some
> file operations like dd/fallocate/truncate/rm on every lun with storage
> network disconnection.
>
> The fallocate operation on dm-23-45 caused an null pointer dereference.
> ...
>
> ... it would be interesting to see if fixes things for you. I can't
> guarantee that it will, but the trigger of the failure which wangyan
> found is very similar indeed.
>
> Thanks,
>
> - Ted
Unfortunately it was too long a shot :)
Tested with a 5.4 kernel with that patch included, and also with 5.6-rc3.
But both had the same issue.
- Filesystem goes read-only when the storage comes back
- Manual fsck needed on bootup to recover from it.
It would be great if we could make it not corrupt the filesystem on
storage recovery.
I'm happy to test some patches if they are available :)
Thanks
Jean-Louis
On 28/02/2020 12:06, Jean-Louis Dupond wrote:
> On 25/02/2020 18:23, Theodore Y. Ts'o wrote:
>> This is going to be a long shot, but if you could try testing with
>> 5.6-rc3, or with this commit cherry-picked into a 5.4 or later kernel:
>>
>> commit 8eedabfd66b68a4623beec0789eac54b8c9d0fb6
>> Author: wangyan <[email protected]>
>> Date: Thu Feb 20 21:46:14 2020 +0800
>>
>> jbd2: fix ocfs2 corrupt when clearing block group bits
>> I found a NULL pointer dereference in
>> ocfs2_block_group_clear_bits().
>> The running environment:
>> kernel version: 4.19
>> A cluster with two nodes, 5 luns mounted on two
>> nodes, and do some
>> file operations like dd/fallocate/truncate/rm on
>> every lun with storage
>> network disconnection.
>> The fallocate operation on dm-23-45 caused an null
>> pointer dereference.
>> ...
>>
>> ... it would be interesting to see if fixes things for you. I can't
>> guarantee that it will, but the trigger of the failure which wangyan
>> found is very similar indeed.
>>
>> Thanks,
>>
>> - Ted
> Unfortunately it was a too long shot :)
>
> Tested with a 5.4 kernel with that patch included, and also with 5.6-rc3.
> But both had the same issue.
>
> - Filesystem goes read-only when the storage comes back
> - Manual fsck needed on bootup to recover from it.
>
> It would be great if we could make it not corrupt the filesystem on
> storage recovery.
> I'm happy to test some patches if they are available :)
>
> Thanks
> Jean-Louis
Did some more tests today.
Setting the SCSI timeout higher seems to be the most reliable solution.
When the storage recovers, the VM just recovers and we can continue :)
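For reference, raising it boils down to something like this at boot time
(the value and the mechanism are just an example; a udev rule would also
catch hot-plugged disks):

for t in /sys/block/sd*/device/timeout; do echo 1800 > "$t"; done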
Also tested setting the filesystem option 'errors=panic'.
When the storage recovers, the VM freezes, so a hard reset is needed.
But on boot a manual fsck is also needed, just like in the default
situation.
So it seems like it still writes data to the FS before doing the panic?
You would expect it not to touch the fs anymore.
Would be nice if this situation could be a bit more error-proof :)
Thanks
Jean-Louis
On Mon, Mar 09, 2020 at 02:52:38PM +0100, Jean-Louis Dupond wrote:
> Did some more tests today.
>
> Setting the SCSi timeout higher seems to be the most reliable solution.
> When the storage recovers, the VM just recovers and we can continue :)
>
> Also did test setting the filesystem option 'error=panic'.
> When the storage recovers, the VM freezes. So a hard reset is needed. But on
> boot a manual fsck is also needed like in the default situation.
> So it seems like it still writes data to the FS before doing the panic?
> You would expect it to not touch the fs anymore.
>
> Would be nice if this situation could be a bit more error-proof :)
Did the panic happen immediately, or did things hang until the storage
recovered, and *then* it rebooted?  Or did the hard reset and reboot
happen before the storage network connection was restored?
Fundamentally I think what's going on is that even though an I/O error
is reported back to the OS, in some cases the outstanding I/O actually
happens.  So in the errors=panic case, we do update the superblock
saying that the file system contains inconsistencies.  And then we
reboot.  But it appears that even though the host rebooted, the storage
area network *did* manage to send the I/O to the device.
I'm not sure what we can really do here, other than simply making the
SCSI timeout infinite.  The problem is that storage area networks are
flaky.  Sometimes I/O's make it through even though we get an error ---
the error comes from the local SCSI layer, and it's possible that the
I/O will still make it through.  In other cases, even though the
storage area network was disconnected at the time we sent the I/O
saying the file system has problems, and we then rebooted, the I/O
actually makes it through.  Given that, if we're not sure, forcing a
full file system check is the better part of valor.
And if it hangs forever, and we do a hard reset reboot, I don't know
*what* to trust from the storage area network.  Ideally, there would
be some way to do a hard reset of the storage area network so that all
outstanding I/O's from the host that we are about to reset get
forgotten before we actually do the hard reset.
- Ted
On 9/03/2020 16:18, Theodore Y. Ts'o wrote:
> Did the panic happen immediately, or did things hang until the storage
> recovered, and *then* it rebooted. Or did the hard reset and reboot
> happened before the storage network connection was restored?
The panic (well, it was just frozen, no stacktrace or automatic reboot)
did happen *after* the storage came back online.
So nothing happens while the storage is offline, even if we wait until
the SCSI timeout is exceeded (180s * 6).
It's only when the storage returns that the filesystem goes read-only /
panics (depending on the errors= setting).
>
> Fundamentally I think what's going on is that even though there is an
> I/O error reported back to the OS, but in some cases, the outstanding
> I/O actually happens. So in the error=panic case, we do update the
> superblock saying that the file system contains inconsistencies. And
> then we reboot. But it appears that even though host rebooted, the
> storage area network *did* manage to send the I/O to the device.
It seems that by updating the superblock to state that the filesystem
contains errors, things are made worse.
At the moment it does this, the storage is already accessible again, so
it seems logical that the I/O is written.
>
> I'm not sure what we can really do here, other than simply making the
> SCSI timeout infinite. The problem is that storage area networks are
> flaky. Sometimes I/O's make it through, and even though we get an
> error, it's an error from the local SCSI layer --- and it's possible
> that I/O will make it through. In other cases, even though the
> storage area network was disconnected at the time we sent the I/O
> saying the file system has problems, and then rebooted, the I/O
> actually makes it through. Given that, assuming that if we're not
> sure, forcing an full file system check is better part of valor.
If we reset the VM before the storage is back, the filesystem check
just goes fine in automatic mode.
So I think we should (in some cases) not try to update the superblock
on I/O errors anymore, but just go read-only/panic, because it seems
like updating the superblock makes things worse.
Or changes could be made to e2fsck to allow automatic repair of this
kind of error, for example?
>
> And if it hangs forever, and we do a hard reset reboot, I don't know
> *what* to trust from the storage area network. Ideally, there would
> be some way to do a hard reset of the storage area network so that all
> outstanding I/O's from the host that we are about to reset will get
> forgotten before we do actually the hard reset.
>
> - Ted
On Mon, Mar 09, 2020 at 04:33:52PM +0100, Jean-Louis Dupond wrote:
> On 9/03/2020 16:18, Theodore Y. Ts'o wrote:
> > Did the panic happen immediately, or did things hang until the storage
> > recovered, and*then* it rebooted. Or did the hard reset and reboot
> > happened before the storage network connection was restored?
>
> The panic (well it was just frozen, no stacktrace or automatic reboot) did
> happen *after* storage came back online.
> So nothing happens while the storage is offline, even if we wait until the
> scsi timeout is exceeded (180s * 6).
> It's only when the storage returns that the filesystem goes read-only /
> panic (depending on the error setting).
So I wonder why the SCSI timeout isn't sufficient to keep the panic
from hanging.
> If we do reset the VM before storage is back, the filesystem check just goes
> fine in automatic mode.
> So I think we should (in some cases) not try to update the superblock
> anymore on I/O errors, but just go read-only/panic.
> Cause it seems like updating the superblock makes things worse.
The problem is that from the file system's perspective, we don't know
why the I/O error has happened. Is it because of timeout, or is it
because of a media error?
In the case where an SSD really was unable to write to a metadata
block, we *do* want to update the superblock.
There is a return status that the block device could send back,
BLK_STS_TIMEOUT, but it's not set by the SCSI layer.  It is set by the
network block device (nbd), but it looks like the SCSI layer just
returns BLK_STS_IOERR if I'm reading the code correctly.
> Or changes could be made to e2fsck to allow automatic repair of this kind of
> error for example?
The fundamental problem is we don't know what "kind of error" has
taken place. If we did, we could theoretically have some kind of
mount option which means "in case of timeout, reboot the system
without setting some kind of file system error". But we need to know
that the I/O error was caused by a timeout first.
- Ted