Hi,
2.6.25-rc6 is a strong beast :)
Another[0] BUG is printed and the box is still alive:
BUG: unable to handle kernel NULL pointer dereference at 00000000
IP: [<c0179114>] __d_lookup+0x94/0x150
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: fuse sha256_generic xt_tcpudp ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_nat_ftp nf_nat nf_conntrack_ftp xt_conntrack nf_conntrack iptable_filter ip_tables ipt_ULOG x_tables nfsd lockd nfs_acl auth_rpcgss exportfs tun sunrpc twofish_i586 twofish_common eeprom w83l785ts asb100 hwmon_vid usb_storage zd1211rw firmware_class mac80211 snd_intel8x0 snd_ac97_codec i2c_nforce2 cfg80211 ac97_bus snd_pcm snd_timer snd soundcore snd_page_alloc i2c_core [last unloaded: fuse]
Pid: 15705, comm: imap Not tainted (2.6.25-rc6 #5)
EIP: 0060:[<c0179114>] EFLAGS: 00010286 CPU: 0
EIP is at __d_lookup+0x94/0x150
EAX: 00000000 EBX: 0006bc44 ECX: 00000001 EDX: d60634e8
ESI: c2020a00 EDI: c56ebf30 EBP: c478ad6c ESP: c56ebd7c
DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
Process imap (pid: 15705, ti=c56eb000 task=e153c000 task.ti=c56eb000)
Stack: 00000002 00000001 c0179080 f4826be0 00000246 c56ebe08 0000000b f66a800b
d60634e8 c56ebe08 0000002f c56ebf30 c56ebe08 c016f388 c56ebe14 f7faff80
c016ee97 01eb3b48 c56ebe08 0000002f c56ebe14 f66a8017 c0170a70 c56ebf30
Call Trace:
[<c0179080>] __d_lookup+0x0/0x150
[<c016f388>] do_lookup+0x28/0x1a0
[<c016ee97>] permission+0xb7/0x120
[<c0170a70>] __link_path_walk+0x140/0xcd0
[<c043f5e4>] _spin_unlock+0x14/0x20
[<c02c3e1a>] _atomic_dec_and_lock+0x2a/0x40
[<c0179855>] dput+0x65/0xf0
[<c017163a>] link_path_walk+0x3a/0xa0
[<c043f5e4>] _spin_unlock+0x14/0x20
[<c01662bb>] get_unused_fd_flags+0xab/0xd0
[<c017189e>] do_path_lookup+0x6e/0x180
[<c0169088>] get_empty_filp+0xa8/0x120
[<c01724b1>] __path_lookup_intent_open+0x51/0xa0
[<c0172590>] path_lookup_open+0x20/0x30
[<c0172686>] open_namei+0x66/0x5f0
[<c01665ae>] do_filp_open+0x2e/0x60
[<c043f5e4>] _spin_unlock+0x14/0x20
[<c01662bb>] get_unused_fd_flags+0xab/0xd0
[<c016662c>] do_sys_open+0x4c/0xe0
[<c01666fc>] sys_open+0x1c/0x20
[<c0102dee>] sysenter_past_esp+0x5f/0xa5
=======================
Code: 53 c0 e8 20 08 fc ff c1 e3 02 8b 14 33 89 54 24 20 8b 44 24 20 85 c0 75 10 eb 51 8b 12 89 54 24 20 8b 44 24 20 85 c0 74 43 8b 02 <0f> 18 00 90 8d 5a d8 39 6b 34 75 e4 8b 7c 24 0c 39 7b 30 75 db
EIP: [<c0179114>] __d_lookup+0x94/0x150 SS:ESP 0068:c56ebd7c
---[ end trace 274145890e21aa9a ]---
I've put some more details (.config, dmesg, some sysrq printouts) on:
http://nerdbynature.de/bits/2.6.25-rc6/Oops_d_lookup/
Please tell me not to worry :)
Christian.
[0] http://lkml.org/lkml/2008/3/23/245
--
BOFH excuse #85:
Windows 95 undocumented "feature"
On Wed, 26 Mar 2008 00:08:48 +0100 (CET) Christian Kujau <[email protected]> wrote:
> Hi,
>
> 2.6.25-rc6 is a strong beast :)
> Another[0] BUG is printed and the box is still alive:
>
> BUG: unable to handle kernel NULL pointer dereference at 00000000
> IP: [<c0179114>] __d_lookup+0x94/0x150
Markus reported what looks to be the same thing here:
http://lkml.org/lkml/2008/3/21/202 and it's already in the regression list.
I guess you've confirmed that this wasn't a mystery
once-off-on-that-machine.
I can't think what we did to cause this. Were you doing anything unusual
on that machine? I see the fuse module was loaded - was it being used?
Were any oddball (ie: non-ext3 ;)) filesystems being used? etc.
On Wednesday, 26 of March 2008, Andrew Morton wrote:
> On Wed, 26 Mar 2008 00:08:48 +0100 (CET) Christian Kujau <[email protected]> wrote:
>
> > Hi,
> >
> > 2.6.25-rc6 is a strong beast :)
> > Another[0] BUG is printed and the box is still alive:
> >
> > BUG: unable to handle kernel NULL pointer dereference at 00000000
> > IP: [<c0179114>] __d_lookup+0x94/0x150
>
> Markus reported what looks to be the same thing here:
> http://lkml.org/lkml/2008/3/21/202 and it's already in the regression list.
>
> I guess you've confirmed that this wasn't a mystery
> once-off-on-that-machine.
>
> I can't think what we did to cause this. Were you doing anything unusual
> on that machine? I see the fuse module was loaded - was it being used?
> Were any oddball (ie: non-ext3 ;)) filesystems being used? etc.
Well, we seem to get mm-related traces on x86-32 at random places.
http://www.ussg.iu.edu/hypermail/linux/kernel/0803.3/0782.html for example.
I'm starting to think there's some arch-related mm issue lurking in there.
On Tue, 25 Mar 2008, Andrew Morton wrote:
> Markus reported what looks to be the same thing here:
> http://lkml.org/lkml/2008/3/21/202 and it's already in the regression list.
Yes, I've found 3 more reports for __d_lookup on kerneloops.org, first
seen for 2.6.25-rc5-git5.
> I can't think what we did to cause this. Were you doing anything unusual
> on that machine?
Well, I was reading mail...and suddenly alpine complained that the imap
server was gone - and indeed "imap" was in the Oops message. But apart
from that, nothing exotic going on.
> I see the fuse module was loaded - was it being used?
No, it's loaded, but it was not in use.
> Were any oddball (ie: non-ext3 ;)) filesystems being used? etc.
There's ext2/3/4, jfs, xfs, reiserfs (not reiser4) - the whole family.
The only oddball coming to mind is zd1211rw with its binary firmware. But
no SMP, no ACPI, no preempt...
Christian.
--
BOFH excuse #90:
Budget cuts
On Tue, 25 Mar 2008, Andrew Morton wrote:
> > Code: 53 c0 e8 20 08 fc ff c1 e3 02 8b 14 33 89 54 24 20 8b 44 24 20 85 c0 75 10 eb 51 8b 12 89 54 24 20 8b 44 24 20 85 c0 74 43 8b 02 <0f> 18 00 90 8d 5a d8 39 6b 34 75 e4 8b 7c 24 0c 39 7b 30 75 db
It faults in a prefetch.
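To see why this is a prefetch: the byte in angle brackets on the Code: line is the one at the faulting EIP, and here it starts the sequence 0f 18 00. 0F 18 is the SSE prefetch opcode group, and a ModRM byte of 00 selects prefetchnta with (%eax) as its operand, which fits EAX: 00000000 and the fault address 00000000 in the report. The C sketch below pulls the marked bytes out of such a line and applies a two-byte opcode test. The function names are made up for illustration, and instruction prefixes are deliberately ignored, so treat it as a reading aid rather than a decoder.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy the byte marked <xx> in an Oops "Code:" line, plus the bytes that
 * follow it, into out[]. Returns the number of bytes copied, or 0 when the
 * line carries no marker. (Hypothetical helper, not kernel code.)
 */
static size_t faulting_bytes(const char *code, unsigned char *out, size_t max)
{
    assert(code != NULL && out != NULL);

    const char *p = strchr(code, '<');
    size_t n = 0;

    if (p == NULL)
        return 0;
    p++;                                    /* step over '<' */

    while (n < max && *p != '\0') {
        char *end;
        unsigned long v = strtoul(p, &end, 16);

        if (end == p)                       /* no hex digits left */
            break;
        out[n++] = (unsigned char)v;
        p = end;
        while (*p == '>' || isspace((unsigned char)*p))
            p++;                            /* skip '>' and separators */
    }
    return n;
}

/* 0F 0D is the 3DNow! prefetch, 0F 18 the SSE prefetch group. */
static int looks_like_prefetch(const unsigned char *b, size_t n)
{
    return n >= 2 && b[0] == 0x0f && (b[1] == 0x0d || b[1] == 0x18);
}

/* ModRM with mod=00 and rm=000 addresses memory through EAX alone. */
static int operand_is_eax_indirect(const unsigned char *b, size_t n)
{
    return n >= 3 && (b[2] & 0xc7) == 0x00;
}
```

Fed the tail of the Code: line from the report, faulting_bytes() yields 0f 18 00 90 8d 5a d8, and both checks succeed on it.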
> Markus reported what looks to be the same thing here:
> http://lkml.org/lkml/2008/3/21/202 and it's already in the regression list.
Same here. And both are AMD X2 early stepping machines.
> I guess you've confirmed that this wasn't a mystery
> once-off-on-that-machine.
>
> I can't think what we did to cause this.
I had a lengthy bug decoding session with Ingo and we found the root
cause:
A dropped workaround for the prefetch bug in early X2s and
Opterons. Patch below.
Thanks,
tglx
--------------->
Subject: x86: fix prefetch workaround
From: Ingo Molnar <[email protected]>
Date: Thu Mar 27 15:58:28 CET 2008
some early Athlon XP's and Opterons generate bogus faults on prefetch
instructions. The workaround for this regressed over .24 - reinstate it.
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/mm/fault.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Index: linux-x86.q/arch/x86/mm/fault.c
===================================================================
--- linux-x86.q.orig/arch/x86/mm/fault.c
+++ linux-x86.q/arch/x86/mm/fault.c
@@ -104,7 +104,8 @@ static int is_prefetch(struct pt_regs *r
 	unsigned char *max_instr;
 
 #ifdef CONFIG_X86_32
-	if (!(__supported_pte_mask & _PAGE_NX))
+	/* Catch an obscure case of prefetch inside an NX page: */
+	if ((__supported_pte_mask & _PAGE_NX) && (error_code & 16))
 		return 0;
 #endif
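
Reading only the diff as posted, the two versions of the early "return 0" differ most visibly on a CPU without NX. The removed line bails out of is_prefetch() whenever NX is unsupported, so on such a 32-bit machine the prefetch check never runs. The added lines bail out only when NX is supported and bit 4 of the page-fault error code (value 16, instruction fetch) is set. The C sketch below restates both conditions as plain functions so the cases can be compared. The function names and the PF_INSTR_FETCH macro are illustrative and not taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit 4 of the x86 page-fault error code: the fault was an instruction fetch. */
#define PF_INSTR_FETCH 16UL

/* Condition on the '-' line: skip the prefetch check if NX is unsupported. */
static bool old_skips_prefetch_check(bool cpu_has_nx, unsigned long error_code)
{
    (void)error_code;               /* the old test never looked at it */
    return !cpu_has_nx;
}

/* Condition on the '+' line: skip only for instruction fetches on NX CPUs. */
static bool new_skips_prefetch_check(bool cpu_has_nx, unsigned long error_code)
{
    return cpu_has_nx && (error_code & PF_INSTR_FETCH);
}

/* Illustrative self-check for a data-read fault on a CPU without NX. */
static void check_non_nx_data_fault(void)
{
    assert(old_skips_prefetch_check(false, 0));     /* workaround bypassed */
    assert(!new_skips_prefetch_check(false, 0));    /* workaround runs */
}
```

For a data-read fault (error code 0, as in "Oops: 0000") on a CPU without NX, the old condition skips the check and the new one lets it proceed.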
* Thomas Gleixner <[email protected]> wrote:
> On Tue, 25 Mar 2008, Andrew Morton wrote:
> > > Code: 53 c0 e8 20 08 fc ff c1 e3 02 8b 14 33 89 54 24 20 8b 44 24 20 85 c0 75 10 eb 51 8b 12 89 54 24 20 8b 44 24 20 85 c0 74 43 8b 02 <0f> 18 00 90 8d 5a d8 39 6b 34 75 e4 8b 7c 24 0c 39 7b 30 75 db
>
> It faults in a prefetch.
>
> > Markus reported what looks to be the same thing here:
> > http://lkml.org/lkml/2008/3/21/202 and it's already in the regression list.
>
> Same here. And both are AMD X2 early stepping machines.
>
> > I guess you've confirmed that this wasn't a mystery
> > once-off-on-that-machine.
> >
> > I can't think what we did to cause this.
>
> I had a lengthy bug decoding session with Ingo and we found the root
> cause:
>
> A dropped workaround for the prefetch bug in early X2s and
> Opterons. Patch below.
can also be tested by picking up x86.git/latest, which has this patch
included:
http://people.redhat.com/mingo/x86.git/README
Ingo
Thomas Gleixner schrieb:
> On Tue, 25 Mar 2008, Andrew Morton wrote:
>> http://lkml.org/lkml/2008/3/21/202 and it's already in the regression list.
>
> Same here. And both are AMD X2 early stepping machines.
> A dropped workaround for the prefetch bug in early X2s and
> Opterons. Patch below.
The patch cures it. Tested with rc5-git5, and it was 100%
reproducible here.
Markus
On Thu, 27 Mar 2008, Markus Rehbach wrote:
> Thomas Gleixner schrieb:
> > On Tue, 25 Mar 2008, Andrew Morton wrote:
>
> >> http://lkml.org/lkml/2008/3/21/202 and it's already in the regression list.
> >
> > Same here. And both are AMD X2 early stepping machines.
>
> > A dropped workaround for the prefetch bug in early X2s and
> > Opterons. Patch below.
>
> The patch cures it. Tested with rc5-git5, and it was 100%
> reproducible here.
Thanks for testing. Fix is queued for Linus.
Thanks,
tglx
On 2008.03.27 16:20:53 +0100, Thomas Gleixner wrote:
> On Tue, 25 Mar 2008, Andrew Morton wrote:
> > > Code: 53 c0 e8 20 08 fc ff c1 e3 02 8b 14 33 89 54 24 20 8b 44 24 20 85 c0 75 10 eb 51 8b 12 89 54 24 20 8b 44 24 20 85 c0 74 43 8b 02 <0f> 18 00 90 8d 5a d8 39 6b 34 75 e4 8b 7c 24 0c 39 7b 30 75 db
>
> It faults in a prefetch.
>
> > Markus reported what looks to be the same thing here:
> > http://lkml.org/lkml/2008/3/21/202 and it's already in the regression list.
>
> Same here. And both are AMD X2 early stepping machines.
>
> > I guess you've confirmed that this wasn't a mystery
> > once-off-on-that-machine.
> >
> > I can't think what we did to cause this.
>
> I had a lengthy bug decoding session with Ingo and we found the root
> cause:
>
> A dropped workaround for the prefetch bug in early X2s and
> Opterons. Patch below.
>
> Thanks,
>
> tglx
>
> --------------->
> Subject: x86: fix prefetch workaround
> From: Ingo Molnar <[email protected]>
> Date: Thu Mar 27 15:58:28 CET 2008
>
> some early Athlon XP's and Opterons generate bogus faults on prefetch
                    ^^
Umh, XP? Didn't you say X2 above? And looking at the patch, X2 seems
more plausible as well, I don't think that the XP supported the NX bit,
did it?
Björn
On Thu, 27 Mar 2008, Thomas Gleixner wrote:
> I had a lengthy bug decoding session with Ingo and we found the root
> cause:
> A dropped workaround for the prefetch bug in early X2s and
> Opterons. Patch below.
Although I reported it, I could not reproduce the bug. Anyway, I've applied
your patch to -rc7 and no BUG so far :)
Thanks!
Christian.
--
BOFH excuse #385:
Dyslexics retyping hosts file on servers
On Fri, 28 Mar 2008, Björn Steinbrink wrote:
>> Subject: x86: fix prefetch workaround
>> From: Ingo Molnar <[email protected]>
>> Date: Thu Mar 27 15:58:28 CET 2008
>>
>> some early Athlon XP's and Opterons generate bogus faults on prefetch
>                    ^^
>
> Umh, XP? Didn't you say X2 above? And looking at the patch, X2 seems
> more plausible as well, I don't think that the XP supported the NX bit,
> did it?
Hm, would be a shame because I have an XP 2600+. /proc/cpuinfo tells me:
flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr mca cmov pat pse36
mmx fxsr sse syscall mmxext 3dnowext 3dnow ts
...no NX in there. I wonder why this (already applied) patch should do
anything on my box at all.
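
The "no NX in there" reading can be reproduced mechanically. On x86 CPUs that support it, Linux lists the feature as the token "nx" on the flags line of /proc/cpuinfo. The C sketch below does a whole-word search in a flags string so that longer names such as mmxext do not count as a match for mmx. The helper name has_cpu_flag is made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if 'flag' appears as a whitespace-delimited token in 'flags'.
 * (Hypothetical helper; 'flags' is the text after "flags:" in /proc/cpuinfo.)
 */
static bool has_cpu_flag(const char *flags, const char *flag)
{
    assert(flags != NULL && flag != NULL);

    size_t len = strlen(flag);
    const char *p = flags;

    if (len == 0)
        return false;

    while ((p = strstr(p, flag)) != NULL) {
        /* A token must start at the string start or after whitespace... */
        bool starts = (p == flags) || isspace((unsigned char)p[-1]);
        /* ...and end at the string end or before whitespace. */
        bool ends = p[len] == '\0' || isspace((unsigned char)p[len]);

        if (starts && ends)
            return true;
        p += len;                   /* keep looking past this partial match */
    }
    return false;
}
```

With the flags quoted above, has_cpu_flag(flags, "nx") is false while has_cpu_flag(flags, "pae") is true.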
Thanks,
C.
--
BOFH excuse #216:
What office are you in? Oh, that one. Did you know that your building was built over the university's first nuclear research site? And wow, aren't you the lucky one, your office is right over where the core is buried!
* Christian Kujau <[email protected]> wrote:
> On Thu, 27 Mar 2008, Thomas Gleixner wrote:
>> I had a lengthy bug decoding session with Ingo and we found the root
>> cause:
>> A dropped workaround for the prefetch bug in early X2s and
>> Opterons. Patch below.
>
> Although I reported it, I could not reproduce the bug. Anyway, I've
> applied your patch to -rc7 and no BUG so far :)
yeah, the condition would normally be very sporadic and it can easily
depend on a specific layout of your kernel image, etc.
the (updated) fix is in Linus' latest git tree as well, and in
x86.git/latest.
Ingo