2004-03-24 21:14:41

by Andrew Reiter

Subject: NULL pointer in proc_pid_stat -- oops.

Hi,

We've recently been seeing crashes on a system running 2.6.3 (SMP,
HIGHMEM64G) with the basic error message "kernel: Unable to handle
kernel NULL pointer." I have seen this on a couple of different kernel
configurations, but every crash occurred on an SMP machine with SMP
configured. The process is 'ps' and the EIP is at proc_pid_stat + 141
decimal (0x8d). After disassembling array.o (from fs/proc/array.c),
I see the following (full disassembly output attached):

0x000004d4 <proc_pid_stat+124>: test %ecx,%ecx
0x000004d6 <proc_pid_stat+126>: je 0x510 <proc_pid_stat+184>
0x000004d8 <proc_pid_stat+128>: mov 0x98(%ecx),%eax
0x000004de <proc_pid_stat+134>: mov %eax,0x20(%esp,1)
0x000004e2 <proc_pid_stat+138>: mov 0x4(%ecx),%edx
0x000004e5 <proc_pid_stat+141>: movswl 0x64(%edx),%eax
0x000004e9 <proc_pid_stat+145>: movswl 0x66(%edx),%edx
0x000004ed <proc_pid_stat+149>: shl $0x14,%eax
0x000004f0 <proc_pid_stat+152>: or %edx,%eax
0x000004f2 <proc_pid_stat+154>: add 0x8(%ecx),%eax

From the attached oops trace we can see that %edx is 0x0, so at least
the immediate cause of the crash is clear. After examining the C
source, I believe we're dying in the call to task_name() (inline) from
proc_pid_stat().

This has occurred a couple of times, mostly with a lot of threads
running and at various uptimes (one week for the first crash we saw),
but I have been unable to create a test that reliably triggers it.
Please let me know how I can help track down the issue. As mentioned
above, attached are the oops dump (from /var/log/messages) and the
disassembly output of proc_pid_stat (from gdb fs/proc/array.o,
disassem proc_pid_stat). If possible, please CC me on replies, as I am
not on the mailing list.

Thanks in advance for any help.

Cheers,
Andrew


Attachments:
proc_pid_stat-disassem.txt (13.57 kB)
dmesg_oops.txt (1.72 kB)

2004-03-28 17:00:21

by OGAWA Hirofumi

Subject: Re: NULL pointer in proc_pid_stat -- oops.

"Andrew Reiter" <[email protected]> writes:

> 0x000004d4 <proc_pid_stat+124>: test %ecx,%ecx
> 0x000004d6 <proc_pid_stat+126>: je 0x510 <proc_pid_stat+184>
> 0x000004d8 <proc_pid_stat+128>: mov 0x98(%ecx),%eax
> 0x000004de <proc_pid_stat+134>: mov %eax,0x20(%esp,1)
> 0x000004e2 <proc_pid_stat+138>: mov 0x4(%ecx),%edx
> 0x000004e5 <proc_pid_stat+141>: movswl 0x64(%edx),%eax
> 0x000004e9 <proc_pid_stat+145>: movswl 0x66(%edx),%edx
> 0x000004ed <proc_pid_stat+149>: shl $0x14,%eax
> 0x000004f0 <proc_pid_stat+152>: or %edx,%eax
> 0x000004f2 <proc_pid_stat+154>: add 0x8(%ecx),%eax
>
> And from the oops trace output (that is attached), we can see that %edx
> is 0x0; so we can easily see here why we're crashing at least. After
> examining the C source, I see that we're dying in the call to
> task_name() (inline) from proc_pid_stat().

This looks like the same problem as the BSD acct oops.

	if (task->tty) {
		tty_pgrp = task->tty->pgrp;
		tty_nr = new_encode_dev(tty_devnum(task->tty));
	}

Some places don't take any lock when accessing ->tty. I think we need
to take a lock for ->tty.
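
To illustrate the check-then-dereference problem above, here is a minimal
user-space sketch, not kernel code. Every name in it (model_task, tty_lock,
model_read_tty, model_detach_tty) is hypothetical, and the C11 atomic_flag
spinlock only stands in for whichever lock the kernel would actually use.
The point is that the reader tests and dereferences ->tty under the same
lock the writer holds when clearing it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins; only the fields this sketch needs. */
struct model_tty { int index; int pgrp; };
struct model_task {
    atomic_flag tty_lock;      /* assumed lock guarding ->tty */
    struct model_tty *tty;
};

static void model_lock(atomic_flag *l)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;
}

static void model_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Reader: the NULL check and both dereferences happen under the lock,
 * so ->tty cannot be cleared in between. Returns 1 if a tty was seen. */
static int model_read_tty(struct model_task *task, int *pgrp, int *index)
{
    int found = 0;

    assert(task != NULL && pgrp != NULL && index != NULL);
    model_lock(&task->tty_lock);
    if (task->tty) {
        *pgrp = task->tty->pgrp;
        *index = task->tty->index;
        found = 1;
    }
    model_unlock(&task->tty_lock);
    return found;
}

/* Writer: clears ->tty under the same lock. */
static void model_detach_tty(struct model_task *task)
{
    model_lock(&task->tty_lock);
    task->tty = NULL;
    model_unlock(&task->tty_lock);
}
```

Without the lock, the writer could run between the reader's test and its
dereferences, which is the window suggested above.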
--
OGAWA Hirofumi <[email protected]>

2004-03-30 17:10:00

by Andrew Reiter

Subject: RE: NULL pointer in proc_pid_stat -- oops.

Oops, I got confused because proc_pid_status() is directly above
proc_pid_stat() in fs/proc/array.c, so I misread some of the
asm<-->C translation and wrongly stated that we were dying in
task_name(). Sorry for the confusion. I will apply Ricky's patch for
now as a workaround, as it seems fine to me for the moment.

Thanks for the help...

Cheers,
Andrew
