2004-01-23 06:21:10

by Glenn Johnson

[permalink] [raw]
Subject: 2.6.2-rc1-mm1 oops with X

I am getting the oops pasted below. It only happens when the X server
is restarting. I do not see it all of the time but frequently enough
that I can call it reproducible. I started seeing it with 2.6.1-mm4 and
can trigger it fairly regularly with 2.6.2-rc1-mm1. However, I do not
see it with 2.6.1-mm3 nor with 2.6.2-rc1. Hopefully, that narrows the
field as to what may be the culprit.

Some relevant hardware and kernel configuration information:

- 2.4GHz P4c with HyperThreading (I do have CONFIG_SMT set)
- Radeon 9100 graphics card (DRI enabled)

---begin oops---
Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c02a2c56
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP
CPU: 0
EIP: 0060:[<c02a2c56>] Not tainted VLI
EFLAGS: 00010286
EIP is at vt_ioctl+0x1e/0x1f4b
eax: 00000000 ebx: daf10000 ecx: 00000007 edx: 00000007
esi: 00005607 edi: daf10000 ebp: daf48080 esp: dc6d7ea0
ds: 007b es: 007b ss: 0068
Process X (pid: 6100, threadinfo=dc6d6000 task=dbe56d00)
Stack: 00000006 df8e6005 00000003 df8e6005 dc6d7f70 dffd8f00 00000000 dfda1480
df963b00 c01746e1 dfda1480 c0526e80 00000000 dfbaf780 df963b00 c016b982
00021480 00000000 00000000 00000001 df4f9200 c01668e6 c0579800 daf48280
Call Trace:
[<c01746e1>] dput+0x22/0x2b1
[<c016b982>] link_path_walk+0x690/0x9ea
[<c01668e6>] cdev_put+0x17/0x69
[<c0166503>] chrdev_open+0x160/0x291
[<c011dfaa>] recalc_task_prio+0x90/0x1aa
[<c012082d>] schedule+0x39b/0x6d7
[<c02a2c38>] vt_ioctl+0x0/0x1f4b
[<c029d8dd>] tty_ioctl+0x472/0x570
[<c016fb3a>] sys_ioctl+0x119/0x2a3
[<c041b3da>] sysenter_past_esp+0x43/0x65

Code: ff e8 5f 77 e6 ff e9 6d ff ff ff 90 90 55 57 56 53 81 ec b8 00 00 00 8b bc 24 cc 00 00 00 8b b4 24 d4 00 00 00 8b 87 78 09 00 00 <8b> 18 8b 04 9d 00 9e 57 c0 89 1c 24 89 44 24 50 e8 4e 6a 00 00
<1>Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c02a2c56
*pde = 00000000
Oops: 0000 [#2]
PREEMPT SMP
CPU: 0
EIP: 0060:[<c02a2c56>] Not tainted VLI
EFLAGS: 00010286
EIP is at vt_ioctl+0x1e/0x1f4b
eax: 00000000 ebx: d1ec7000 ecx: 00000008 edx: 00000008
esi: 00005607 edi: d1ec7000 ebp: daf48180 esp: d0927ea0
ds: 007b es: 007b ss: 0068
Process X (pid: 6117, threadinfo=d0926000 task=dbe56080)
Stack: 00000007 df9df005 00000003 df9df005 d0927f70 dffd8f00 00000000 dfda1480
df963b00 c01746e1 dfda1480 c0526e80 00000000 dfbaf780 df963b00 c016b982
00021480 00000000 00000000 00000001 dee34a00 c01668e6 c0579800 daeab180
Call Trace:
[<c01746e1>] dput+0x22/0x2b1
[<c016b982>] link_path_walk+0x690/0x9ea
[<c01668e6>] cdev_put+0x17/0x69
[<c0166503>] chrdev_open+0x160/0x291
[<c011dfaa>] recalc_task_prio+0x90/0x1aa
[<c012082d>] schedule+0x39b/0x6d7
[<c02a2c38>] vt_ioctl+0x0/0x1f4b
[<c029d8dd>] tty_ioctl+0x472/0x570
[<c016fb3a>] sys_ioctl+0x119/0x2a3
[<c041b3da>] sysenter_past_esp+0x43/0x65

Code: ff e8 5f 77 e6 ff e9 6d ff ff ff 90 90 55 57 56 53 81 ec b8 00 00 00 8b bc 24 cc 00 00 00 8b b4 24 d4 00 00 00 8b 87 78 09 00 00 <8b> 18 8b 04 9d 00 9e 57 c0 89 1c 24 89 44 24 50 e8 4e 6a 00 00
<1>Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c02a2c56
*pde = 00000000
Oops: 0000 [#3]
PREEMPT SMP
CPU: 0
EIP: 0060:[<c02a2c56>] Not tainted VLI
EFLAGS: 00010286
EIP is at vt_ioctl+0x1e/0x1f4b
eax: 00000000 ebx: d1ec6000 ecx: 00000009 edx: 00000009
esi: 00005607 edi: d1ec6000 ebp: ce675380 esp: cd0f9ea0
ds: 007b es: 007b ss: 0068
Process X (pid: 6134, threadinfo=cd0f8000 task=df4c8d00)
Stack: 00000008 c1738005 00000003 c1738005 cd0f9f70 dffd8f00 00000000 dfda1480
df963b00 c01746e1 dfda1480 c0526e80 00000000 dfbaf780 df963b00 c016b982
00021480 00000000 00000000 00000001 dee34800 c01668e6 c0579800 cd177e80
Call Trace:
[<c01746e1>] dput+0x22/0x2b1
[<c016b982>] link_path_walk+0x690/0x9ea
[<c01668e6>] cdev_put+0x17/0x69
[<c0166503>] chrdev_open+0x160/0x291
[<c011dfaa>] recalc_task_prio+0x90/0x1aa
[<c012082d>] schedule+0x39b/0x6d7
[<c02a2c38>] vt_ioctl+0x0/0x1f4b
[<c029d8dd>] tty_ioctl+0x472/0x570
[<c016fb3a>] sys_ioctl+0x119/0x2a3
[<c041b3da>] sysenter_past_esp+0x43/0x65

Code: ff e8 5f 77 e6 ff e9 6d ff ff ff 90 90 55 57 56 53 81 ec b8 00 00 00 8b bc 24 cc 00 00 00 8b b4 24 d4 00 00 00 8b 87 78 09 00 00 <8b> 18 8b 04 9d 00 9e 57 c0 89 1c 24 89 44 24 50 e8 4e 6a 00 00
<1>Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c02a2c56
*pde = 00000000
Oops: 0000 [#4]
PREEMPT SMP
CPU: 1
EIP: 0060:[<c02a2c56>] Not tainted VLI
EFLAGS: 00010286
EIP is at vt_ioctl+0x1e/0x1f4b
eax: 00000000 ebx: c7f3a000 ecx: 0000000a edx: 0000000a
esi: 00005607 edi: c7f3a000 ebp: d089a980 esp: cd0f9ea0
ds: 007b es: 007b ss: 0068
Process X (pid: 6151, threadinfo=cd0f8000 task=df4c8d00)
Stack: 00000009 df069005 00000003 df069005 cd0f9f70 dffd8f00 00000000 dfda1480
df963b00 c01746e1 dfda1480 c0526e80 00000000 dfbaf780 df963b00 c016b982
00021480 00000000 00000000 00000001 df4f9000 c01668e6 c0579800 daeab180
Call Trace:
[<c01746e1>] dput+0x22/0x2b1
[<c016b982>] link_path_walk+0x690/0x9ea
[<c01668e6>] cdev_put+0x17/0x69
[<c0166503>] chrdev_open+0x160/0x291
[<c011dfaa>] recalc_task_prio+0x90/0x1aa
[<c012082d>] schedule+0x39b/0x6d7
[<c02a2c38>] vt_ioctl+0x0/0x1f4b
[<c029d8dd>] tty_ioctl+0x472/0x570
[<c016fb3a>] sys_ioctl+0x119/0x2a3
[<c041b3da>] sysenter_past_esp+0x43/0x65

Code: ff e8 5f 77 e6 ff e9 6d ff ff ff 90 90 55 57 56 53 81 ec b8 00 00 00 8b bc 24 cc 00 00 00 8b b4 24 d4 00 00 00 8b 87 78 09 00 00 <8b> 18 8b 04 9d 00 9e 57 c0 89 1c 24 89 44 24 50 e8 4e 6a 00 00
---end oops---

Regards,

Glenn


2004-01-23 07:17:26

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.2-rc1-mm1 oops with X

[email protected] (Glenn Johnson) wrote:
>
> I am getting the oops pasted below. It only happens when the X server
> is restarting. I do not see it all of the time but frequently enough
> that I can call it reproducible. I started seeing it with 2.6.1-mm4 and
> can trigger it fairly regularly with 2.6.2-rc1-mm1. However, I do not
> see it with 2.6.1-mm3 nor with 2.6.2-rc1. Hopefully, that narrows the
> field as to what may be the culprit.
>
> Some relevant hardware and kernel configuration information:
>
> - 2.4GHz P4c with HyperThreading (I do have CONFIG_SMT set)
> - Radeon 9100 graphics card (DRI enabled)
>
> ---begin oops---
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> printing eip:
> c02a2c56
> *pde = 00000000
> Oops: 0000 [#1]
> PREEMPT SMP
> CPU: 0
> EIP: 0060:[<c02a2c56>] Not tainted VLI
> EFLAGS: 00010286
> EIP is at vt_ioctl+0x1e/0x1f4b
> eax: 00000000 ebx: daf10000 ecx: 00000007 edx: 00000007
> esi: 00005607 edi: daf10000 ebp: daf48080 esp: dc6d7ea0
> ds: 007b es: 007b ss: 0068
> Process X (pid: 6100, threadinfo=dc6d6000 task=dbe56d00)
> Stack: 00000006 df8e6005 00000003 df8e6005 dc6d7f70 dffd8f00 00000000 dfda1480
> df963b00 c01746e1 dfda1480 c0526e80 00000000 dfbaf780 df963b00 c016b982
> 00021480 00000000 00000000 00000001 df4f9200 c01668e6 c0579800 daf48280
> Call Trace:
> [<c01746e1>] dput+0x22/0x2b1
> [<c016b982>] link_path_walk+0x690/0x9ea
> [<c01668e6>] cdev_put+0x17/0x69
> [<c0166503>] chrdev_open+0x160/0x291
> [<c011dfaa>] recalc_task_prio+0x90/0x1aa
> [<c012082d>] schedule+0x39b/0x6d7
> [<c02a2c38>] vt_ioctl+0x0/0x1f4b
> [<c029d8dd>] tty_ioctl+0x472/0x570
> [<c016fb3a>] sys_ioctl+0x119/0x2a3
> [<c041b3da>] sysenter_past_esp+0x43/0x65

Sorry, this is the mysterious tty close race. We end up setting
tty->driver_data to zero somewhere, somehow, when someone else is still
using the tty.

If you revert

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.1-rc1/2.6.1-rc1-mm1/broken-out/sysfs-add-vc-class.patch

does it go away?

It is maddeningly hard to debug even when you can reproduce it, which I can
no longer do.

2004-01-23 09:05:05

by David Woodhouse

[permalink] [raw]
Subject: Re: 2.6.2-rc1-mm1 oops with X

On Thu, 2004-01-22 at 23:18 -0800, Andrew Morton wrote:
> It is maddeningly hard to debug even when you can reproduce it, which I can
> no longer do.

This is what GDB watchpoints were invented for, surely?

--
dwmw2


2004-01-23 09:11:23

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.2-rc1-mm1 oops with X

David Woodhouse <[email protected]> wrote:
>
> On Thu, 2004-01-22 at 23:18 -0800, Andrew Morton wrote:
> > It is maddeningly hard to debug even when you can reproduce it, which I can
> > no longer do.
>
> This is what GDB watchpoints were invented for, surely?
>

I suppose that might help. But for me the bug triggered towards the end of
initscripts (it moves around) after we've been through that code path a
zillion times. It probably needs to be solved by inspection. If one can
get it to happen reliably.

I found that sysfs-class-10-vc.patch caused it to happen and
use-kthread-primitives.patch made it go away again. Neither patch has
anything to do with tty refcounting and locking.

2004-01-23 09:40:50

by David Woodhouse

[permalink] [raw]
Subject: Re: 2.6.2-rc1-mm1 oops with X

On Fri, 2004-01-23 at 01:11 -0800, Andrew Morton wrote:
> > This is what GDB watchpoints were invented for, surely?
>
> I suppose that might help. But for me the bug triggered towards the end of
> initscripts (it moves around) after we've been through that code path a
> zillion times. It probably needs to be solved by inspection. If one can
> get it to happen reliably.

Can't you script it just to automatically show a backtrace and continue,
then peruse the logs later?

--
dwmw2


2004-01-23 09:44:08

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.2-rc1-mm1 oops with X

David Woodhouse <[email protected]> wrote:
>
> On Fri, 2004-01-23 at 01:11 -0800, Andrew Morton wrote:
> > > This is what GDB watchpoints were invented for, surely?
> >
> > I suppose that might help. But for me the bug triggered towards the end of
> > initscripts (it moves around) after we've been through that code path a
> > zillion times. It probably needs to be solved by inspection. If one can
> > get it to happen reliably.
>
> Can't you script it just to automatically show a backtrace and continue,
> then peruse the logs later?

I can't make it happen any more! I suspect Heisenberg would get in the way
anyway.

2004-01-23 16:42:37

by Glenn Johnson

[permalink] [raw]
Subject: Re: 2.6.2-rc1-mm1 oops with X

On Fri, Jan 23, 2004 at 01:44:53AM -0800, Andrew Morton wrote:

> David Woodhouse <[email protected]> wrote:
>
> > Can't you script it just to automatically show a backtrace and
> > continue, then peruse the logs later?
>
> I can't make it happen any more! I suspect Heisenberg would get in
> the way anyway.

--
Glenn Johnson
U.S. Department of Agriculture
Agricultural Research Service
Southern Regional Research Center
New Orleans, LA

2004-01-23 17:05:11

by Glenn Johnson

[permalink] [raw]
Subject: Re: 2.6.2-rc1-mm1 oops with X

I have no idea what happened to the body of the message in my previous
reply. Let me try again.

On Fri, Jan 23, 2004 at 01:44:53AM -0800, Andrew Morton wrote:

> David Woodhouse <[email protected]> wrote:
>
> > Can't you script it just to automatically show a backtrace and
> > continue, then peruse the logs later?
>
> I can't make it happen any more! I suspect Heisenberg would get in
> the way anyway.

I can make it happen fairly regularly. Let me know if there is
something I can do to provide more useful output.

--
Glenn Johnson

2004-01-23 21:56:43

by Mike Fedyk

[permalink] [raw]
Subject: Re: 2.6.2-rc1-mm1 oops with X

On Fri, Jan 23, 2004 at 11:05:04AM -0600, Glenn Johnson wrote:
> On Fri, Jan 23, 2004 at 01:44:53AM -0800, Andrew Morton wrote:
> > I can't make it happen any more! I suspect Heisenberg would get in
> > the way anyway.
>
> I can make it happen fairly regularly. Let me know if there is
> something I can do to provide more useful output.

How, starting X?

How far does it get through, and at what point in X startup does the kernel
oops?

2004-01-24 04:36:28

by Glenn Johnson

[permalink] [raw]
Subject: Re: 2.6.2-rc1-mm1 oops with X

On Fri, Jan 23, 2004 at 01:56:24PM -0800, Mike Fedyk wrote:

> On Fri, Jan 23, 2004 at 11:05:04AM -0600, Glenn Johnson wrote:
>
> > On Fri, Jan 23, 2004 at 01:44:53AM -0800, Andrew Morton wrote:
> >
> > > I can't make it happen any more! I suspect Heisenberg would get
> > > in the way anyway.
> >
> > I can make it happen fairly regularly. Let me know if there is
> > something I can do to provide more useful output.
>
> How, starting X?

Yes.

> How far does it get through, and at what point in X startup does the
> kernel oops?

I would say fairly early on but I do not know how to quantify that.

--
Glenn Johnson

2004-01-27 04:05:15

by Mike Fedyk

[permalink] [raw]
Subject: Re: 2.6.2-rc1-mm1 oops with X

On Fri, Jan 23, 2004 at 10:33:27PM -0600, Glenn Johnson wrote:
> On Fri, Jan 23, 2004 at 01:56:24PM -0800, Mike Fedyk wrote:
> > > I can make it happen fairly regularly. Let me know if there is
> > > something I can do to provide more useful output.
> >
> > How, starting X?
>
> Yes.
>
> > How far does it get through, and at what point in X startup does the
> > kernel oops?
>
> I would say fairly early on but I do not know how to quantify that.

Post your log file, and show where in the log the oops occours...

2004-01-27 06:32:22

by Glenn Johnson

[permalink] [raw]
Subject: Re: 2.6.2-rc1-mm1 oops with X

On Mon, Jan 26, 2004 at 08:04:46PM -0800, Mike Fedyk wrote:

> On Fri, Jan 23, 2004 at 10:33:27PM -0600, Glenn Johnson wrote:
>
> > On Fri, Jan 23, 2004 at 01:56:24PM -0800, Mike Fedyk wrote:
> >
> > > How far does it get through, and at what point in X startup does
> > > the kernel oops?
> >
> > I would say fairly early on but I do not know how to quantify that.
>
> Post your log file, and show where in the log the oops occours...

Okay, below is the XFree log file. It is truncated at the point where
the oops occurs. I sent an e-mail to Andrew confirming that backing out
sysfs-class-10-vc.patch fixes the problem for me.

---begin X log for kernel oops---

XFree86 Version 4.3.0
Release Date: 27 February 2003
X Protocol Version 11, Revision 0, Release 6.6
Build Operating System: Linux 2.6.1-mm5 i686 [ELF]
Build Date: 20 January 2004
Before reporting problems, check http://www.XFree86.Org/
to make sure that you have the latest version.
Module Loader present
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/XFree86.0.log", Time: Mon Jan 26 22:33:10 2004
(==) Using config file: "/etc/X11/XF86Config"
(==) ServerLayout "XFree86 Configured"
(**) |-->Screen "Screen0" (0)
(**) | |-->Monitor "Monitor0"
(**) | |-->Device "Card0"
(**) |-->Input Device "Mouse0"
(**) |-->Input Device "Keyboard0"
(==) Keyboard: CustomKeycode disabled
(**) FontPath set to "unix/:-1"
(**) RgbPath set to "/usr/X11R6/lib/X11/rgb"
(**) ModulePath set to "/usr/X11R6/lib/modules"
(++) using VT number 8

---end X log for kernel oops

The VT number it tries to use jumps around. Sometimes it is VT10, VT11,
etc. It should be using VT7 however.

Cheers,

Glenn