https://bugzilla.kernel.org/show_bug.cgi?id=45621
Summary: Kernel ooops: BUG: unable to handle kernel paging
request at 000000080000001c
Product: File System
Version: 2.5
Kernel Version: 3.4.7
Platform: All
OS/Version: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: ext4
AssignedTo: [email protected]
ReportedBy: [email protected]
Regression: No
During heavy io on my ext4 filesystems, I sometimes get this oops:
[10645.902287] BUG: unable to handle kernel paging request at 000000080000001c
[10645.902881] IP: [<ffffffff8110c4d1>] find_get_page+0x41/0xa0
[10645.903359] PGD 1e21cb067 PUD 0
[10645.903638] Oops: 0000 [#1] PREEMPT SMP
[10645.903986] CPU 1
[10645.904147] Modules linked in: md5 aes_x86_64 aes_generic xts gf128mul
dm_crypt dm_mod usb_storage uas nfsd exportfs tun w83627ehf hwmon_vid
iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_conntrack ip_tables x_tables rc_dib0700_rc5 dvb_usb_dib0700
dib3000mc dib8000 dib0070 dib7000m dib7000p dibx000_common dib0090 dvb_usb
dvb_core microcode i915 iTCO_wdt i2c_algo_bit drm_kms_helper intel_agp
iTCO_vendor_support drm psmouse ghash_clmulni_intel mei(C) evdev atl1c rc_core
intel_gtt pcspkr serio_raw i2c_i801 i2c_core acpi_cpufreq mperf processor
cryptd coretemp crc32c_intel video button loop fuse nfs nfs_acl lockd
auth_rpcgss sunrpc fscache ext4 crc16 jbd2 mbcache sd_mod ahci libahci ehci_hcd
xhci_hcd libata scsi_mod usbcore usb_common
[10645.910482]
[10645.910602] Pid: 2958, comm: rsync Tainted: G C 3.4.7-1-ARCH #1 To
Be Filled By O.E.M. To Be Filled By O.E.M./H61M/U3S3
[10645.911595] RIP: 0010:[<ffffffff8110c4d1>] [<ffffffff8110c4d1>]
find_get_page+0x41/0xa0
[10645.912276] RSP: 0018:ffff8801fe1eba28 EFLAGS: 00010246
[10645.912713] RAX: ffff880100ad1198 RBX: 0000000800000000 RCX:
00000000fffffffa
[10645.913303] RDX: 0000000000000001 RSI: ffff880100ad1198 RDI:
0000000000000000
[10645.913893] RBP: ffff8801fe1eba48 R08: 0000000800000000 R09:
ffff880100ad0f88
[10645.914481] R10: ffffffffa0188e00 R11: 0000000000000000 R12:
ffff88008a307058
[10645.915071] R13: 00000000000084bf R14: 000000000102005a R15:
0000000000000050
[10645.915663] FS: 00007f8c4d7f4700(0000) GS:ffff88021f280000(0000)
knlGS:0000000000000000
[10645.916333] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10645.916807] CR2: 000000080000001c CR3: 00000001d296a000 CR4:
00000000000407e0
[10645.917393] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[10645.917985] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[10645.918574] Process rsync (pid: 2958, threadinfo ffff8801fe1ea000, task
ffff880211fcafa0)
[10645.919250] Stack:
[10645.919413] ffff8801fe1eba48 ffff88008a307050 00000000000084bf
00000000000084bf
[10645.920059] ffff8801fe1eba78 ffffffff8110c6d6 ffff8801fe1eba68
ffffffffa00558d3
[10645.920702] 0000000000000004 ffff88008a307050 ffff8801fe1ebac8
ffffffff8110cdf2
[10645.921344] Call Trace:
[10645.942181] [<ffffffff8110c6d6>] find_lock_page+0x26/0x80
[10645.963412] [<ffffffffa00558d3>] ? jbd2_journal_start+0x13/0x20 [jbd2]
[10645.984833] [<ffffffff8110cdf2>] grab_cache_page_write_begin+0x72/0x100
[10645.984853] [<ffffffffa0149bf0>] ext4_da_write_begin+0xa0/0x230 [ext4]
[10645.984858] [<ffffffffa014c47d>] ? ext4_da_write_end+0xad/0x390 [ext4]
[10645.984861] [<ffffffff8110be74>] generic_file_buffered_write+0x124/0x2b0
[10645.984864] [<ffffffff8110da4a>] __generic_file_aio_write+0x22a/0x440
[10645.984868] [<ffffffff8146775e>] ? __mutex_lock_slowpath+0x24e/0x340
[10645.984871] [<ffffffff8110dcd1>] generic_file_aio_write+0x71/0xe0
[10645.984876] [<ffffffffa014334f>] ext4_file_write+0xaf/0x260 [ext4]
[10645.984879] [<ffffffff8116e286>] do_sync_write+0xe6/0x120
[10645.984883] [<ffffffff811f8a9c>] ? security_file_permission+0x2c/0xb0
[10645.984885] [<ffffffff8116e871>] ? rw_verify_area+0x61/0xf0
[10645.984887] [<ffffffff8116eb88>] vfs_write+0xa8/0x180
[10645.984888] [<ffffffff8116eeca>] sys_write+0x4a/0xa0
[10645.984891] [<ffffffff8146aaa9>] system_call_fastpath+0x16/0x1b
[10645.984892] Code: 89 f5 4c 8d 63 08 e8 3f 8e fc ff 4c 89 ee 4c 89 e7 e8 f4
77 13 00 48 85 c0 48 89 c6 74 44 48 8b 18 48 85 db 74 22 f6 c3 03 75 3f <8b> 53
1c 85 d2 74 d9 8d 7a 01 89 d0 f0 0f b1 7b 1c 39 c2 75 26
[10645.984908] RIP [<ffffffff8110c4d1>] find_get_page+0x41/0xa0
[10645.984910] RSP <ffff8801fe1eba28>
[10645.984911] CR2: 000000080000001c
[10646.075497] ---[ end trace 9841da8b9a0cb390 ]---
Using archlinux stable.
Anything else I can do to hunt this bug down?
--
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
https://bugzilla.kernel.org/show_bug.cgi?id=45621
--- Comment #1 from [email protected] 2012-08-04 23:45:51 ---
Ah and one thing: Filesystem is dead afterwards (e.g. shell simply does not
return and hangs forever if I do a `ls` or so), have to reboot system or
reattach usb device to mount/access it again.
--
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
https://bugzilla.kernel.org/show_bug.cgi?id=45621
Theodore Tso <[email protected]> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]
--- Comment #2 from Theodore Tso <[email protected]> 2012-08-05 22:12:14 ---
This looks like the inode's mapping->page_tree radix tree has gotten corrupted,
and find_get_page is dying in the call to radix_tree_deref_slot() (line #687 in
mm/filemap.c for v3.4.7). This radix tree is supposed to be protected by a RCU
lock, so this looks like it's some kind of mm bug.
I can try and see if we can get an mm developer to take a look at this, but
while we try to get an appropriate developer's attention, can you upload your
kernel's .config file, so whoever ends up debugging this knows your RCU
configuration options?
Thanks!!
--
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
https://bugzilla.kernel.org/show_bug.cgi?id=45621
--- Comment #3 from [email protected] 2012-08-05 22:24:31 ---
Created an attachment (id=76851)
--> (https://bugzilla.kernel.org/attachment.cgi?id=76851)
kernel .config
Here's the config. Thanks for helping out!
--
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
https://bugzilla.kernel.org/show_bug.cgi?id=45621
--- Comment #4 from Theodore Tso <[email protected]> 2012-08-05 22:36:14 ---
Hi, I'm hoping this rings a bell with an mm developer:
https://bugzilla.kernel.org/show_bug.cgi?id=45621
It looks like the user is reporting a OOPS which was caused by the
inode's mapping->page_tree having gotten corrupted. The call stack
was from a write system call while the system was undergoing heavy
I/O, on a v3.4.7 kernel.
If someone could take a quick look at this, I'd really appreciate it.
Thanks!!
- Ted
--
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
https://bugzilla.kernel.org/show_bug.cgi?id=45621
--- Comment #5 from Hugh Dickins <[email protected]> 2012-08-06 15:15:06 ---
On Sun, Aug 5, 2012 at 3:36 PM, Theodore Ts'o <[email protected]> wrote:
> Hi, I'm hoping this rings a bell with an mm developer:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=45621
>
> It looks like the user is reporting a OOPS which was caused by the
> inode's mapping->page_tree having gotten corrupted. The call stack
> was from a write system call while the system was undergoing heavy
> I/O, on a v3.4.7 kernel.
>
> If someone could take a quick look at this, I'd really appreciate it.
> Thanks!!
It looks exactly like a classic bitflip from bad DRAM.
The faulting instruction is where find_get_page()'s call to
page_cache_get_speculative(page) tries to access page->_count, at
offset 0x1c into struct page.
The page pointer found in the radix tree slot should have been NULL
(quite normal when writing like this, I'm a little surprised we mark
that as "unlikely"),
but it's got that 0x800000000 bit set.
Markus, you say that you sometimes get that oops: it might be helpful
if you could attach a few more examples to the bugzilla entry. I've
not stopped to see whether the address of the radix tree slot in
question is actually still in the registers or on the stack shown, but
if it is, then seeing the same or nearby addresses in other cases
would tend to confirm bad memory.
Worth checking with memtest86 or memtest86+. But of course, it could
also be a kernel bug putting that bit of corruption there.
Hugh
>
> - Ted
>
>
> Summary: Kernel ooops: BUG: unable to handle kernel paging
> request at 000000080000001c
> Product: File System
> Version: 2.5
> Kernel Version: 3.4.7
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: normal
> Priority: P1
> Component: ext4
> AssignedTo: [email protected]
> ReportedBy: [email protected]
> Regression: No
>
>
> During heavy io on my ext4 filesystems, I sometimes get this oops:
>
>
> [10645.902287] BUG: unable to handle kernel paging request at 000000080000001c
> [10645.902881] IP: [<ffffffff8110c4d1>] find_get_page+0x41/0xa0
> [10645.903359] PGD 1e21cb067 PUD 0
> [10645.903638] Oops: 0000 [#1] PREEMPT SMP
> [10645.903986] CPU 1
> [10645.904147] Modules linked in: md5 aes_x86_64 aes_generic xts gf128mul
> dm_crypt dm_mod usb_storage uas nfsd exportfs tun w83627ehf hwmon_vid
> iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4
> nf_defrag_ipv4 nf_conntrack ip_tables x_tables rc_dib0700_rc5 dvb_usb_dib0700
> dib3000mc dib8000 dib0070 dib7000m dib7000p dibx000_common dib0090 dvb_usb
> dvb_core microcode i915 iTCO_wdt i2c_algo_bit drm_kms_helper intel_agp
> iTCO_vendor_support drm psmouse ghash_clmulni_intel mei(C) evdev atl1c rc_core
> intel_gtt pcspkr serio_raw i2c_i801 i2c_core acpi_cpufreq mperf processor
> cryptd coretemp crc32c_intel video button loop fuse nfs nfs_acl lockd
> auth_rpcgss sunrpc fscache ext4 crc16 jbd2 mbcache sd_mod ahci libahci ehci_hcd
> xhci_hcd libata scsi_mod usbcore usb_common
> [10645.910482]
> [10645.910602] Pid: 2958, comm: rsync Tainted: G C 3.4.7-1-ARCH #1 To
> Be Filled By O.E.M. To Be Filled By O.E.M./H61M/U3S3
> [10645.911595] RIP: 0010:[<ffffffff8110c4d1>] [<ffffffff8110c4d1>]
> find_get_page+0x41/0xa0
> [10645.912276] RSP: 0018:ffff8801fe1eba28 EFLAGS: 00010246
> [10645.912713] RAX: ffff880100ad1198 RBX: 0000000800000000 RCX:
> 00000000fffffffa
> [10645.913303] RDX: 0000000000000001 RSI: ffff880100ad1198 RDI:
> 0000000000000000
> [10645.913893] RBP: ffff8801fe1eba48 R08: 0000000800000000 R09:
> ffff880100ad0f88
> [10645.914481] R10: ffffffffa0188e00 R11: 0000000000000000 R12:
> ffff88008a307058
> [10645.915071] R13: 00000000000084bf R14: 000000000102005a R15:
> 0000000000000050
> [10645.915663] FS: 00007f8c4d7f4700(0000) GS:ffff88021f280000(0000)
> knlGS:0000000000000000
> [10645.916333] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [10645.916807] CR2: 000000080000001c CR3: 00000001d296a000 CR4:
> 00000000000407e0
> [10645.917393] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [10645.917985] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [10645.918574] Process rsync (pid: 2958, threadinfo ffff8801fe1ea000, task
> ffff880211fcafa0)
> [10645.919250] Stack:
> [10645.919413] ffff8801fe1eba48 ffff88008a307050 00000000000084bf
> 00000000000084bf
> [10645.920059] ffff8801fe1eba78 ffffffff8110c6d6 ffff8801fe1eba68
> ffffffffa00558d3
> [10645.920702] 0000000000000004 ffff88008a307050 ffff8801fe1ebac8
> ffffffff8110cdf2
> [10645.921344] Call Trace:
> [10645.942181] [<ffffffff8110c6d6>] find_lock_page+0x26/0x80
> [10645.963412] [<ffffffffa00558d3>] ? jbd2_journal_start+0x13/0x20 [jbd2]
> [10645.984833] [<ffffffff8110cdf2>] grab_cache_page_write_begin+0x72/0x100
> [10645.984853] [<ffffffffa0149bf0>] ext4_da_write_begin+0xa0/0x230 [ext4]
> [10645.984858] [<ffffffffa014c47d>] ? ext4_da_write_end+0xad/0x390 [ext4]
> [10645.984861] [<ffffffff8110be74>] generic_file_buffered_write+0x124/0x2b0
> [10645.984864] [<ffffffff8110da4a>] __generic_file_aio_write+0x22a/0x440
> [10645.984868] [<ffffffff8146775e>] ? __mutex_lock_slowpath+0x24e/0x340
> [10645.984871] [<ffffffff8110dcd1>] generic_file_aio_write+0x71/0xe0
> [10645.984876] [<ffffffffa014334f>] ext4_file_write+0xaf/0x260 [ext4]
> [10645.984879] [<ffffffff8116e286>] do_sync_write+0xe6/0x120
> [10645.984883] [<ffffffff811f8a9c>] ? security_file_permission+0x2c/0xb0
> [10645.984885] [<ffffffff8116e871>] ? rw_verify_area+0x61/0xf0
> [10645.984887] [<ffffffff8116eb88>] vfs_write+0xa8/0x180
> [10645.984888] [<ffffffff8116eeca>] sys_write+0x4a/0xa0
> [10645.984891] [<ffffffff8146aaa9>] system_call_fastpath+0x16/0x1b
> [10645.984892] Code: 89 f5 4c 8d 63 08 e8 3f 8e fc ff 4c 89 ee 4c 89 e7 e8 f4
> 77 13 00 48 85 c0 48 89 c6 74 44 48 8b 18 48 85 db 74 22 f6 c3 03 75 3f <8b> 53
> 1c 85 d2 74 d9 8d 7a 01 89 d0 f0 0f b1 7b 1c 39 c2 75 26
> [10645.984908] RIP [<ffffffff8110c4d1>] find_get_page+0x41/0xa0
> [10645.984910] RSP <ffff8801fe1eba28>
> [10645.984911] CR2: 000000080000001c
> [10646.075497] ---[ end trace 9841da8b9a0cb390 ]---
>
> Using archlinux stable.
>
> Anything else I can do to hunt this bug down?
>
> --
--
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
https://bugzilla.kernel.org/show_bug.cgi?id=45621
--- Comment #6 from [email protected] 2012-08-06 22:14:42 ---
Just ran memtest for 3+ hours, 3 passes finished successfully. I know, longer
is better, but cannot run it longer atm (since it's a server with crucial
infrastructure). System is not overclocked or so, everything default.
The oops happend to me about 3-4 times in the last 3 months, often on heavy io
but maybe once when idling (can't say exactly).
So I'll wait for the next one and report back - if there's a next one.
--
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
On 08/06/2012 11:14 PM, [email protected] wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=45621
>
>
>
>
>
> --- Comment #6 from [email protected] 2012-08-06 22:14:42 ---
> Just ran memtest for 3+ hours, 3 passes finished successfully. I know, longer
> is better, but cannot run it longer atm (since it's a server with crucial
> infrastructure). System is not overclocked or so, everything default.
memtest86 is completely useless at finding marginal memory. It'll only
typically find a problem if there is a properly, permanently dead or
stuck bit in the RAM.
If Windows is an option temporarily for testing use OCCT. If it isn't,
I've found something like this to work pretty well for finding CPU and
memory stability issues:
#####
# Assuming your /dev/shm is mounted on tmpfs - make sure it's big
# enough. Adjust count= accordingly
dd if=/dev/urandom bs=1M count=1024 of=/dev/shm/test
# Run one instance of this per CPU thread
#!/bin/bash
sum=`sha512sum /tmp/test`
while (/bin/true); do
echo -n .
sum2=`sha512sum /tmp/test`
if [ "$sum" != "$sum2" ]; then
echo “”
echo “Error detected. $sum != $sum2″
exit
fi
done
#####
That should be a lot better at finding CPU/cache/MCH/RAM issues.
Gordan
https://bugzilla.kernel.org/show_bug.cgi?id=45621
Ivan Shapovalov <[email protected]> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]
--- Comment #7 from Ivan Shapovalov <[email protected]> 2012-11-07 09:58:39 ---
This (or similar) seems to be reproducible with reiser4 (another out-of-tree
filesystem) on a custom kernel.
Symptoms:
1) on newly allocated pages, sometimes page_private(pg) != 0ul
2) sometimes BUGs happen on 0000_0000_0000_000e in a following scenario:
pg = find_or_create_page(...);
// check for pg != NULL
// pg->mapping is accessed
Happens 2-3 times a day. Also different BUGs happened (addresses 0016, 00b0,
0f91) but the backtraces seem to be invalid and I can't describe the actual
in-code scenario.
Directly observed on kernels 3.2 to latest stable (but current reiser4
maintainer, Edward Shishkin, seen this on 2.6.39 - though just 1-2 times at
all).
In my case, the kernel is vanilla + -ck patchset + various config changes (KSM,
big pagetables, cleancache+frontswap+zcache, sparse memory model, transparent
hugepages).
The config is here: ftp://intelfx.homeip.net/kernel_config
Hardware is Acer 3830TG (HM65 chipset, SNB cpu, 4 GB of RAM). Running in x86_64
mode.
I know that this is too small level of detail - but can anything be done to
track down/collect more info?
--
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
https://bugzilla.kernel.org/show_bug.cgi?id=45621
Dmitry Monakhov <[email protected]> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]
--- Comment #8 from Dmitry Monakhov <[email protected]> 2012-11-07 12:42:48 ---
So you use raiser4(which is known to have such type of corruption) and
wondering that other parts of dcache becomes corrupted? Have you ever reproduce
this w/o raiser4-pc ? I'm pretty sure that his is raiser's bug and absolutely
not related with ext4
--
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
https://bugzilla.kernel.org/show_bug.cgi?id=45621
Shapovalov Ivan <[email protected]> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]
--- Comment #9 from Shapovalov Ivan <[email protected]> 2012-11-07 14:18:31 ---
(In reply to comment #8)
> So you use raiser4(which is known to have such type of corruption) and
> wondering that other parts of dcache becomes corrupted? Have you ever reproduce
> this w/o raiser4-pc ? I'm pretty sure that his is raiser's bug and absolutely
> not related with ext4
Could you please elaborate - which type of corruption reiser4 is known to have?
I'm not aware of these...
--
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
https://bugzilla.kernel.org/show_bug.cgi?id=45621
Alan <[email protected]> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
CC| |[email protected]
Resolution|--- |OBSOLETE
--
You are receiving this mail because:
You are watching the assignee of the bug.