LinuxLists.cc - ext4 metadata corruption bug bites again

2015-08-21 10:22:17

Subject: ext4 metadata corruption bug bites again

Hi list,

On April 20th, 2014 there was a thread on this list with Nathaniel W
Filardo, Theodore Ts'o and myself about ext4 metadata corruption on large
ext4 volumes. See http://marc.info/?l=linux-ext4&m=139878494527370&w=2
and the follow-up messages in the archives.

A bit of debugging was done, but I fell out of that loop as my problems
seemed to have disappeared with further kernel updates from my distro.
Recently these issues seem to have reappeared (or they never went
away and I just didn't hit this specific trigger situation?).

Again I'm getting sporadic fs errors like this most recent one:
(I've pasted more of these at https://8n1.org/10745/cc34)
| [000d00h01m43s] EXT4-fs error (device vdb): ext4_mb_generate_buddy:757:
| group 79842, block bitmap and bg descriptor inconsistent: 10073 vs 10071
| free clusters
| [000d00h01m43s] Aborting journal on device vdb-8.

An e2fsck run shows:
| Pass 5: Checking group summary information
| Block bitmap differences: +(2616281446--2616281447)
| Fix? yes
|
| Free blocks count wrong (170942497, counted=129906218).
| Fix? yes
|
| Free inodes count wrong (670863012, counted=670860975).
| Fix? yes
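
The e2fsck range and the kernel message above appear to describe the same two blocks. A minimal Python sketch of that cross-check, assuming the mke2fs defaults for a 4 KiB-block filesystem (32768 blocks per group, first data block 0, no bigalloc so clusters equal blocks). None of these parameters are stated in this thread, so they should be confirmed with dumpe2fs -h before relying on the result:

```python
# Assumed geometry (not stated in the thread): mke2fs defaults for 4 KiB blocks.
BLOCKS_PER_GROUP = 32768
FIRST_DATA_BLOCK = 0


def block_to_group(block):
    """Return (group number, offset inside that group) for a physical block."""
    return divmod(block - FIRST_DATA_BLOCK, BLOCKS_PER_GROUP)


# Values copied from the kernel message and the e2fsck output above.
reported_group = 79842
# Free-cluster counts in the order the kernel message prints them.
bitmap_free, descriptor_free = 10073, 10071
range_start, range_end = 2616281446, 2616281447

group, offset = block_to_group(range_start)
range_len = range_end - range_start + 1

print(f"block {range_start} -> group {group}, offset {offset}")
print(f"e2fsck range length: {range_len}, "
      f"kernel delta: {bitmap_free - descriptor_free}")
```

Under those assumptions the first block of the e2fsck range falls in group 79842, the group named in the kernel error, and the range length equals the difference between the two free-cluster counts.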

My setup is largely the same, storage-wise. Kernels have been updated
here and there, and the storage device /dev/vdb has grown to 10TiB, which
I still use unpartitioned in this VM.

The VM is now running 3.19.0-25-generic #26-Ubuntu SMP Fri Jul 24
21:17:31 UTC 2015 x86_64

I'm able and willing to run patched kernels to trace this further.
Please advise.

With regards,
-Sander.
--
| > > Isn't this a stupid question?
| > Isn't this a stupid answer?
| No, it was another stupid question. This is the stupid answer.
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2

2015-08-24 13:52:18

by Sander Smeenk

Subject: Re: ext4 metadata corruption bug bites again

Quoting Sander Smeenk:

> | [000d00h01m43s] EXT4-fs error (device vdb): ext4_mb_generate_buddy:757:
> | group 79842, block bitmap and bg descriptor inconsistent: 10073 vs 10071
> | free clusters
> | [000d00h01m43s] Aborting journal on device vdb-8.

Today it went read-only again:

[000d05h26m50s] EXT4-fs error (device vdb): ext4_mb_generate_buddy:757: group 45706, block bitmap and bg descriptor inconsistent: 1769 vs 1768 free clusters
[000d05h26m50s] Aborting journal on device vdb-8.
[000d05h26m50s] EXT4-fs (vdb): Remounting filesystem read-only
[000d05h16m42s] EXT4-fs (vdb): pa ffff88003527fc30: logic 331144, phys. 1363269705, len 511
[000d05h16m42s] EXT4-fs error (device vdb): ext4_mb_release_inode_pa:3772: group 41603, free 511, pa_free 507
[000d05h16m42s] EXT4-fs (vdb): pa ffff88002ed668f0: logic 286160, phys. 1362907136, len 511
[000d05h16m42s] EXT4-fs error (device vdb): ext4_mb_release_inode_pa:3772: group 41592, free 511, pa_free 503

I'm now running 3.19.8-ckt4 with WARN_ON(1); inserted in the above
mentioned error situations.

-Sndr.
--
| Do pilots take crash-courses?
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2

2015-09-14 08:22:18

by Sander Smeenk

Subject: Re: ext4 metadata corruption bug bites again

Quoting Sander Smeenk:

> I'm now running 3.19.8-ckt4 with WARN_ON(1); inserted in the above
> mentioned error situations.

And today the system went RO again. Now I have the WARN_ON(1) output:
| EXT4-fs (vdb): pa ffff880016544888: logic 982168, phys. 2469410748, len 104
| EXT4-fs error (device vdb): ext4_mb_release_inode_pa:3773: group 75360, free 38, pa_free 36
| Aborting journal on device vdb-8.
| EXT4-fs (vdb): Remounting filesystem read-only
| ------------[ cut here ]------------
| WARNING: CPU: 1 PID: 1706 at fs/ext4/mballoc.c:3774 ext4_mb_release_inode_pa.isra.27+0x1cb/0x2c0()
| Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter xt_tcpudp ip6_tables
| nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables
| x_tables cirrus ttm drm_kms_helper drm kvm_intel kvm ppdev syscopyarea sysfillrect
| 8250_fintek serio_raw i2c_piix4 sysimgblt pvpanic parport_pc mac_hid nfsd auth_rpcgss nfs_acl
| lockd grace sunrpc lp parport autofs4 psmouse floppy pata_acpi
| CPU: 1 PID: 1706 Comm: deluged Not tainted 3.19.8-ckt4 #1
| Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
| ffffffff81ab4fef ffff8800da1bb978 ffffffff817c3760 0000000000000007
| 0000000000000000 ffff8800da1bb9b8 ffffffff8107696a ffff8800da1bb9a8
| 0000000000000026 0000000000003825 0000000000003824 ffff880016544888
| Call Trace:
| [<ffffffff817c3760>] dump_stack+0x45/0x57
| [<ffffffff8107696a>] warn_slowpath_common+0x8a/0xc0
| [<ffffffff81076a5a>] warn_slowpath_null+0x1a/0x20
| [<ffffffff812b01bb>] ext4_mb_release_inode_pa.isra.27+0x1cb/0x2c0
| [<ffffffff812739df>] ? ext4_read_block_bitmap_nowait+0x26f/0x5f0
| [<ffffffff812b3c6a>] ext4_discard_preallocations+0x30a/0x490
| [<ffffffff8127b578>] ext4_da_update_reserve_space+0x178/0x1b0
| [<ffffffff812a9129>] ext4_ext_map_blocks+0xcd9/0xe50
| [<ffffffff8127b6d9>] ext4_map_blocks+0x129/0x570
| [<ffffffff8127e89d>] ? ext4_writepages+0x35d/0xca0
| [<ffffffff812ab3a9>] ? __ext4_journal_start_sb+0x69/0xe0
| [<ffffffff8127eac2>] ext4_writepages+0x582/0xca0
| [<ffffffff81187a4e>] do_writepages+0x1e/0x30
| [<ffffffff8117bbe9>] __filemap_fdatawrite_range+0x59/0x60
| [<ffffffff8117bc4c>] filemap_write_and_wait+0x2c/0x60
| [<ffffffff8120903d>] do_vfs_ioctl+0x3fd/0x4e0
| [<ffffffff812091a1>] SyS_ioctl+0x81/0xa0
| [<ffffffff817ca84d>] system_call_fastpath+0x16/0x1b
| ---[ end trace c7de4d0d78cb95b6 ]---
| EXT4-fs error (device vdb) in ext4_writepages:2412: IO failure
| EXT4-fs (vdb): ext4_writepages: jbd2_start: 9223372036854775751 pages, ino 84149503; err -30
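
Two numbers in the last line can be decoded with a short Python sketch. On Linux, errno 30 is EROFS, which fits the read-only remount logged just before. The page count sits 56 below the largest signed 64-bit value, which would be consistent with a counter that started at LONG_MAX and was decremented; that reading of the writeback code is an assumption here and is not confirmed anywhere in this thread.

```python
import errno
import os

err = -30                      # "err -30" from the ext4_writepages message
pages = 9223372036854775751    # page count from the same message

LONG_MAX = 2**63 - 1           # largest signed 64-bit value on x86_64

# errno.errorcode maps numeric codes to names for the platform Python runs on;
# the interpretation below relies on the Linux numbering.
name = errno.errorcode.get(-err, "unknown")
print(f"err {err}: {name} ({os.strerror(-err)})")
print(f"LONG_MAX - pages = {LONG_MAX - pages}")
```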

After this, the system started logging:
| EXT4-fs error (device vdb): ext4_find_extent:900: inode #84149503: comm deluged:
| pblk 225181822 bad header/extent: invalid magic - magic 53fd, entries 37907, max 27407(0), depth 50401(0)
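
The fields in that message follow the on-disk extent header: four little-endian 16-bit values (magic, entries, max, depth) followed by a 32-bit generation, with 0xF30A as the expected magic. A rough Python sketch that rebuilds a header from the logged numbers and applies two basic sanity checks. check_extent_header is a hypothetical helper written for this illustration and does not reproduce the kernel's full validation:

```python
import struct

EXT4_EXT_MAGIC = 0xF30A           # magic of a valid ext4 extent header
# eh_magic, eh_entries, eh_max, eh_depth, eh_generation (little-endian)
HEADER = struct.Struct("<HHHHI")


def check_extent_header(raw):
    """Return a list of problems found in the first 12 bytes of an extent block."""
    magic, entries, max_entries, _depth, _generation = HEADER.unpack_from(raw)
    problems = []
    if magic != EXT4_EXT_MAGIC:
        problems.append(f"invalid magic {magic:#06x}")
    if entries > max_entries:
        problems.append(f"entries {entries} exceed max {max_entries}")
    return problems


# Header rebuilt from the logged values; eh_generation is not logged,
# so 0 is used as a placeholder.
logged = HEADER.pack(0x53FD, 37907, 27407, 50401, 0)
print(check_extent_header(logged))
```

Both checks fail for the logged values, which suggests block 225181822 held something other than an extent tree node when it was read.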

Ran e2fsck and got:
| Pass 5: Checking group summary information
| Block bitmap differences: +(1431556444--1431556445) +(2469410748-2469410749)
| Free blocks count wrong (134030133, counted=57970467).
| Free inodes count wrong (670746893, counted=670746452).
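
The second range in the e2fsck output starts at the same physical block as the preallocation in the kernel message. A small Python sketch that parses the "Block bitmap differences" line and maps each range to a block group, again assuming 32768 blocks per group (the mke2fs default for 4 KiB blocks, not stated in this thread):

```python
import re

BLOCKS_PER_GROUP = 32768   # assumed: mke2fs default for 4 KiB blocks

line = ("Block bitmap differences: +(1431556444--1431556445) "
        "+(2469410748-2469410749)")

# Accept both "a--b" and "a-b", since both spellings occur in the pasted output.
ranges = [(sign, int(a), int(b))
          for sign, a, b in re.findall(r"([+-])\((\d+)-{1,2}(\d+)\)", line)]

# From the "pa ..." line and the ext4_mb_release_inode_pa error above.
pa_phys, pa_group = 2469410748, 75360
free, pa_free = 38, 36

for sign, start, end in ranges:
    group = start // BLOCKS_PER_GROUP
    print(f"{sign}{start}..{end}: {end - start + 1} blocks in group {group}")
```

With that geometry the second range lies in group 75360 and is two blocks long, matching the difference between free 38 and pa_free 36. The first range maps to a group that does not appear in the kernel messages pasted in this mail.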

All fixed and system is running again.

HTH,
-Sander.
--
| Why does your nose run, and your feet smell?
| 4096R/20CC6CD2 - 6D40 1A20 B9AA 87D4 84C7 FBD6 F3A9 9442 20CC 6CD2