2012-11-02 14:33:59

by Ben Hutchings

Subject: Re: Bug#692104: linux-image-3.2.0-3-amd64: NULL pointer dereference in ext4fs

On Fri, 2012-11-02 at 09:50 +0100, Wilmer van der Gaast wrote:
[...]
> I don't know what exactly triggered this, but the result was that my /home
> was no longer accessible after this event. My root filesystem was still
> okay.

I assume this means it was no longer accessible until the next boot.

> Marking as important because filesystem bugs could potentially cause
> corruption/data loss, although my /home seems to be fine after a fsck.
> Don't know how lucky I was.
>
> I've done a Google search for this crash with no results other than one
> report with a tainted kernel.
>
> Sadly I have no idea how this could be reproduced. A few factors:
>
> * My laptop was up for >60d already, with many suspend-resume cycles.
> * My /home was recently (week ago?) online-resized.
> * It's on an SSD, with trim/discards enabled. LVM and dm-crypt in between
> the fs and the SSD.
[...]
> ** Kernel log:
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: [<ffffffffa0160e4b>] ext4_mb_good_group+0x39/0xcd [ext4]
> PGD 134c62067 PUD 134c1b067 PMD 0
> Oops: 0000 [#1] SMP
> CPU 1
> Modules linked in: rndis_host cdc_ether usbnet mii pl2303 nls_utf8
> nls_cp437 sg usb_storage uas usbhid hid btrfs crc32c libcrc32c
> zlib_deflate ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs
> reiserfs tun iwlwifi ftdi_sio usbserial cpufreq_conservative
> cpufreq_userspace cpufreq_powersave cpufreq_stats parport_pc ppdev lp
> parport rfcomm bnep bluetooth uinput fuse nfsd nfs nfs_acl auth_rpcgss
> fscache lockd sunrpc kvm_intel kvm ext3 jbd ext2 loop
> snd_hda_codec_conexant snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss
> snd_mixer_oss arc4 snd_pcm snd_page_alloc snd_seq_midi
> snd_seq_midi_event snd_rawmidi snd_seq i915 psmouse pcspkr serio_raw
> coretemp iTCO_wdt evdev i2c_i801 iTCO_vendor_support snd_seq_device
> snd_timer thinkpad_acpi tpm_tis mac80211 ac battery acpi_cpufreq tpm
> power_supply tpm_bios nvram drm_kms_helper cfg80211 video snd rfkill wmi
> drm mperf i2c_algo_bit i2c_core soundcore processor button ext4 crc16
> jbd2 mbcache sha256_generic cryptd aes_x86_64 ae
>
>
> Pid: 12409, comm: xulrunner-stub Not tainted 3.2.0-3-amd64 #1 LENOVO

This is based on Linux 3.2.23. There haven't been any subsequent fixes
to fs/ext4/mballoc.c in the 3.2.y series, though other fixes might be
relevant.

> 7465CTO/7465CTO
> RIP: 0010:[<ffffffffa0160e4b>] [<ffffffffa0160e4b>]
> ext4_mb_good_group+0x39/0xcd [ext4]
> RSP: 0018:ffff8800b27798c8 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: ffff88012b9888d8 RCX: 0000000000000002

This means ext4_get_group_info() returned NULL.

> RDX: ffff88013467a000 RSI: 0000000000000050 RDI: ffff880135cb2800
> RBP: 0000000000000150 R08: ffff8801191d90f0 R09: ffff8801191d90f0
> R10: ffff8801191d90f0 R11: ffff8801191d90f0 R12: 0000000000000000
> R13: 0000000000000000 R14: ffff880135cb2800 R15: 0000000000000000
> FS: 00007f6e1de8f700(0000) GS:ffff88013bc80000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 0000000092f4a000 CR4: 00000000000406e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process xulrunner-stub (pid: 12409, threadinfo ffff8800b2778000, task
> ffff880136ea8040)
> Stack:
> ffff8801191d90f0 ffff88012b9888d8 ffff880135cb2c00 ffff880135cb2800
> 0000000000000000 0000000000000148 0000000000000150 ffffffffa0162397
> 0000000200000000 ffff880135cb2ef8 00000000ffffffff ffff880136ea8040
> Call Trace:
> [<ffffffffa0162397>] ? ext4_mb_regular_allocator+0x110/0x264 [ext4]
> [<ffffffff81036457>] ? should_resched+0x5/0x23
> [<ffffffffa016350a>] ? ext4_mb_new_blocks+0x1c2/0x403 [ext4]
> [<ffffffffa015e00f>] ? __ext4_handle_dirty_metadata+0xd7/0xe8 [ext4]
> [<ffffffffa0166eab>] ? ext4_alloc_branch+0x1ab/0x468 [ext4]
> [<ffffffffa00b5bf1>] ? jbd2_journal_stop+0x209/0x21b [jbd2]
> [<ffffffffa0167922>] ? ext4_ind_map_blocks+0x289/0x4a6 [ext4]
> [<ffffffffa013d4be>] ? ext4_da_write_end+0x1f1/0x232 [ext4]
> [<ffffffff810bd5a1>] ? release_pages+0x68/0x14d
> [<ffffffff810bd5a1>] ? release_pages+0x68/0x14d
> [<ffffffff811ad035>] ? __lookup_tag+0xb6/0x120
> [<ffffffffa013a637>] ? ext4_map_blocks+0x114/0x1f0 [ext4]
> [<ffffffff811ad7a6>] ? radix_tree_gang_lookup_tag_slot+0x77/0x98
> [<ffffffff810f363e>] ? mem_cgroup_add_lru_list+0xd/0xaa
> [<ffffffffa013d58d>] ? mpage_da_map_and_submit+0x8e/0x2f9 [ext4]
> [<ffffffffa013dadf>] ? write_cache_pages_da+0x214/0x2c5 [ext4]
> [<ffffffffa013de32>] ? ext4_da_writepages+0x2a2/0x45b [ext4]
> [<ffffffff810b4c98>] ? __filemap_fdatawrite_range+0x4b/0x50
> [<ffffffffa01365aa>] ? ext4_release_file+0x1b/0x93 [ext4]
> [<ffffffff810fa285>] ? fput+0xf9/0x1a1
> [<ffffffff810f7fde>] ? filp_close+0x62/0x6a
> [<ffffffff810f8074>] ? sys_close+0x8e/0xcb
> [<ffffffff8134fb92>] ? system_call_fastpath+0x16/0x1b
> Code: 53 48 89 fb 41 52 4c 8b 77 08 49 8b 86 b0 02 00 00 4c 89 f7 44 8b
> b8 20 03 00 00 e8 fd dc ff ff 41 83 fd 03 49 89 c4 76 02 0f 0b <48> 8b
> 00 a8 01 74 0e 89 ee 4c 89 f7 e8 18 fe ff ff 85 c0 75 69
> RIP [<ffffffffa0160e4b>] ext4_mb_good_group+0x39/0xcd [ext4]
> RSP <ffff8800b27798c8>
> CR2: 0000000000000000
> ---[ end trace 160e5f4d37523c1f ]---
[...]
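
To spell out the RAX remark above: the faulting bytes `<48> 8b 00` decode to `mov (%rax),%rax`, a load from offset 0 of the pointer held in RAX, which is zero in this dump. The `a8 01` that follows tests bit 0 of the loaded value. Below is a minimal userspace sketch of that access pattern, assuming the state word is the first field of the group info structure. `struct group_info`, `GRP_NEED_INIT_BIT` and `group_needs_init()` are hypothetical names made up for illustration; this is not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the per-group info structure. Only the first
 * field matters here (assumption: the state word sits at offset 0). */
struct group_info {
	unsigned long bb_state;
};

/* Hypothetical name for the "still needs initialising" flag, bit 0. */
#define GRP_NEED_INIT_BIT 0

/*
 * Returns 1 if the flag is set, 0 if it is clear, and -1 if the lookup
 * handed back NULL. The NULL branch is what the oopsing path does not
 * have: there the load from offset 0 happens unconditionally.
 */
static int group_needs_init(const struct group_info *grp, int cr)
{
	/* Mirrors the range check on the criteria value (the ud2 in the
	 * Code: line) that precedes the faulting load. */
	assert(cr >= 0 && cr < 4);

	if (grp == NULL)
		return -1;

	return (int)((grp->bb_state >> GRP_NEED_INIT_BIT) & 1UL);
}
```

The NULL branch only marks where the dereference goes wrong; it is not meant as a proposed fix.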

(Full bug report is at <http://bugs.debian.org/692104>.)

Ben.

--
Ben Hutchings
I'm always amazed by the number of people who take up solipsism because
they heard someone else explain it. - E*Borg on alt.fan.pratchett



2012-11-02 22:00:39

by Wilmer van der Gaast

Subject: Re: Bug#692104: linux-image-3.2.0-3-amd64: NULL pointer dereference in ext4fs

Hello,

On 02-11-12 15:33, Ben Hutchings wrote:
>> I don't know what exactly triggered this, but the result was that my /home
>> was no longer accessible after this event. My root filesystem was still
>> okay.
> I assume this means it was no longer accessible until the next boot.
>
Oh yes, definitely. Sorry for not being clear.

Also, this has just happened again. This time, after me not having
touched the laptop for over ten hours. I'm starting to wonder whether my
filesystem is corrupted. I'll make an LVM snapshot and then do a full fsck.

I'll attach the backtrace. It seems to be the same as this morning's.


Wilmer v/d Gaast.

--
+-------- .''`. - -- ---+ + - -- --- ---- ----- ------+
| wilmer : :' : gaast.net | | OSS Programmer http://www.bitlbee.org |
| lintux `. `~' debian.org | | Full-time geek wilmer.gaast.net |
+--- -- - ` ---------------+ +------ ----- ---- --- -- - +


Attachments:
crash2.txt (3.87 kB)

2012-11-08 22:00:00

by Theodore Ts'o

Subject: Re: Bug#692104: linux-image-3.2.0-3-amd64: NULL pointer dereference in ext4fs

On Fri, Nov 02, 2012 at 10:52:35PM +0100, Wilmer van der Gaast wrote:
> >
> Oh yes, definitely. Sorry for not being clear.
>
> Also, this has just happened again. This time, after me not having
> touched the laptop for over ten hours. I'm starting to wonder
> whether my filesystem is corrupted. I'll make an LVM snapshot and
> then do a full fsck.

Did you perform another on-line resize on the file system before it
failed?

It looks like a problem which I ran into (and fixed) when adding
support for online resizing for > 16TB file systems, but I was pretty
sure it couldn't happen until we added support for resizing very
large file systems using the new meta_bg resizing scheme. The commit
where I cleaned this up (but which was not backported to stable
kernels since it was part of a new feature and I didn't think it could
be triggered w/o the new feature) was:

commit 28623c2f5b0dca3c3ea34fd6108940661352e276
Author: Theodore Ts'o <[email protected]>
Date: Wed Sep 5 01:31:50 2012 -0400

ext4: grow the s_group_info array as needed

Previously we allocated the s_group_info array with enough space for
any future possible growth of the file system via online resize. This
is unfortunate because it wastes memory, and it doesn't work for the
meta_bg scheme, since there is no limit based on the number of
reserved gdt blocks. So add the code to grow the s_group_info array
as needed.

Signed-off-by: "Theodore Ts'o" <[email protected]>

How big was the file system before the resize, and how much larger did
you resize it? If it is this bug, the s_group_info array is allocated
based on the file system size when the file system is mounted. So it
would only be happening after an online resize and before the file
system is unmounted and/or the system is rebooted and the file system
is mounted again.
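
A rough userspace sketch of that failure mode, under stated assumptions: the table of per-group info pointers is sized and filled for the group count seen at mount time, so a lookup for a group added later finds nothing. The two-level layout is loosely modelled on how the lookup is indexed; `struct fs_model`, `model_mount()` and `model_get_group_info()` are hypothetical names, and the value of 128 descriptors per block assumes 4 KiB blocks with 32-byte descriptors. This is an illustration, not the ext4 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption: 128 group descriptors per block (4 KiB blocks, 32-byte
 * descriptors). Other geometries give other values. */
#define DESC_PER_BLOCK_BITS 7
#define DESC_PER_BLOCK (1u << DESC_PER_BLOCK_BITS)

struct group_info {
	unsigned long bb_state;
};

struct fs_model {
	unsigned int nrows;          /* first-level slots, fixed at mount */
	struct group_info ***table;  /* table[indexv][indexh] */
};

/* Size and populate the table for the group count seen at mount time. */
static int model_mount(struct fs_model *fs, unsigned int ngroups)
{
	unsigned int g;

	fs->nrows = (ngroups + DESC_PER_BLOCK - 1) / DESC_PER_BLOCK;
	fs->table = calloc(fs->nrows, sizeof(*fs->table));
	if (fs->table == NULL)
		return -1;

	for (g = 0; g < fs->nrows; g++) {
		fs->table[g] = calloc(DESC_PER_BLOCK, sizeof(**fs->table));
		if (fs->table[g] == NULL)
			return -1;
	}

	for (g = 0; g < ngroups; g++) {
		struct group_info *gi = calloc(1, sizeof(*gi));

		if (gi == NULL)
			return -1;
		fs->table[g >> DESC_PER_BLOCK_BITS][g & (DESC_PER_BLOCK - 1)] = gi;
	}
	return 0;
}

/*
 * Two-level lookup. The bounds check on indexv is a guard added for this
 * model so it never reads outside the table; slots inside an existing
 * row that were never filled simply hold NULL.
 */
static struct group_info *model_get_group_info(const struct fs_model *fs,
					       unsigned int group)
{
	unsigned int indexv = group >> DESC_PER_BLOCK_BITS;
	unsigned int indexh = group & (DESC_PER_BLOCK - 1);

	assert(fs != NULL && fs->table != NULL);
	if (indexv >= fs->nrows)
		return NULL;
	return fs->table[indexv][indexh];
}
```

With purely illustrative numbers: mounting the model with 320 groups and then asking for group 336 returns NULL, because that slot was never filled, while a fresh mount with the larger group count populates it.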

- Ted

2012-11-08 22:25:35

by Wilmer van der Gaast

Subject: Re: Bug#692104: linux-image-3.2.0-3-amd64: NULL pointer dereference in ext4fs

Hello all,

This crash has happened to me three times now, but the last time was
five days ago. It seems to have disappeared as mysteriously as it appeared.

On 08-11-12 15:30, Theodore Ts'o wrote:
> On Fri, Nov 02, 2012 at 10:52:35PM +0100, Wilmer van der Gaast wrote:
>> whether my filesystem is corrupted. I'll make an LVM snapshot and
>> then do a full fsck.
>
That fsck was completely clean by the way.

> Did you perform another on-line resize on the file system before it
> failed?
>
I don't think so. I did one last weekend, after which I experienced
one more crash. I'm quite sure that the last on-line resize before I
reported this bug was quite long ago though, likely before my last
reboot.

> It looks like a problem which I ran into (and fixed) when adding
> support for online resizing for > 16TB file systems, [...]

The filesystem is not quite that large, just 45G (from 40G).

I've attached tune2fs output for it just in case it helps. It was
created back in 2010 already apparently, although as an ext3 at the time.

> you resize it? If it is this bug, the s_group_info array is allocated
> based on the file system size when the file system is mounted. So it
> would only be happening after an online resize and before the file
> system is unmounted and/or the system is rebooted and the file system
> is mounted again.
>
Hmm, I'm quite sure a long time (and likely a reboot) had passed in
between the last resize and the first crash last weekend.

It looked like this crash always happened while handling a close()
syscall issued by Firefox. I've tried stracing my firefox process to see
which file was causing it, but the crashes had already disappeared by then.

I'll definitely ping this bug if this happens again.


Thanks,

Wilmer v/d Gaast.

--
+-------- .''`. - -- ---+ + - -- --- ---- ----- ------+
| wilmer : :' : gaast.net | | OSS Programmer http://www.bitlbee.org |
| lintux `. `~' debian.org | | Full-time geek wilmer.gaast.net |
+--- -- - ` ---------------+ +------ ----- ---- --- -- - +


Attachments:
tune2fs.txt (1.69 kB)

2012-11-24 00:00:13

by Wilmer van der Gaast

Subject: Re: Bug#692104: linux-image-3.2.0-3-amd64: NULL pointer dereference in ext4fs

Strangely, this happened again a few days ago. The backtrace was very
similar, and it was again triggered by Firefox. Annoyingly, Firefox was
no longer running under strace at the time.


Wilmer v/d Gaast.

--
+-------- .''`. - -- ---+ + - -- --- ---- ----- ------+
| wilmer : :' : gaast.net | | OSS Programmer http://www.bitlbee.org |
| lintux `. `~' debian.org | | Full-time geek wilmer.gaast.net |
+--- -- - ` ---------------+ +------ ----- ---- --- -- - +