Hi everyone,
Again I cross-post, and hope that people will accept my apologies. I do
not know if this is a vanilla kernel problem, or something that has to
do with XFS.
I have been noticing problems on two of my boxes here with some
processes somehow getting stuck in "D" state, which cannot be killed
even by a SIGKILL by root, and stay on running forever. I have noticed
the problem, so far, on 2.4.18-xfs and 2.4.19-rc2-xfs kernels.
The 2.4.18-xfs kernel did not have any binary-only modules loaded (like
NVdriver) and aside from the modules from the kernel only had the i2c
and lm-sensors modules. It was compiled using gcc 2.95.4 and was running
on a single Pentium III.
The 2.4.19-rc2-xfs kernel had the NVdriver binary-only module loaded
(coincidentally, this is the same unit for which I earlier sent kernel
BUG reports in page_alloc.c:91), plus kernel modules and the i2c and
lm-sensors modules. It was compiled using gcc 3.1.1 (Debian prerelease)
and was running on a single AMD Duron.
I have not found a pattern to the occurrence of these stuck processes.
They don't happen at given intervals, or with only a particular type of
process, or on certain workloads or tasks.
They've happened on everything from X11 (consequently freezing the box),
to the apt-method of dpkg (consequently preventing me from doing any
apt-get installations), to the dhcp3 daemon (consequently preventing new
boxes from getting IP addresses). They've also happened on boxes with
two days uptime, or with 7 days uptime, or with 14.
So far the only "solution" I've found to this problem has been to
reboot.
There is unfortunately absolutely nothing I can find in the logs to help
explain what's going on with these processes. Because I have not figured
out how to reproduce this, I do not know which process to run through
strace or a similar tool to figure out what's going on.
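One way to narrow down which process to look at is to poll /proc for
tasks whose state field is "D". The following is only a minimal sketch:
it assumes the usual /proc/<pid>/stat layout (pid, command name in
parentheses, then the state letter), and the names `parse_stat_line` and
`find_d_state` are made up for illustration.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' rather than on whitespace.
    left, right = line.rsplit(")", 1)
    pid, comm = left.split(" (", 1)
    state = right.split()[0]
    return int(pid), comm, state


def find_d_state(proc_root="/proc"):
    """List (pid, comm) for every task currently in uninterruptible sleep."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError):
            continue  # process exited meanwhile, or unexpected format
        if state == "D":
            stuck.append((pid, comm))
    return sorted(stuck)


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

Run from cron or a loop, the PIDs that show up in every sample are the
ones worth reporting; a process that appears once and is gone on the
next pass was most likely in D state legitimately.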
I hope someone can help shed light on this. Thank you very much. I have
upgraded the 2.4.18-xfs box to 2.4.19-rc3-xfs, and will upgrade the
2.4.19-rc2-xfs box to 2.4.19-rc3-xfs as well, and will see if the
problem goes away. If anyone else is experiencing similar "processes
stuck in state 'D'" problems, please do chime in and add your
experiences about the problem.
--> Jijo
--
Federico Sevilla III : <http://jijo.free.net.ph/>
Network Administrator : The Leather Collection, Inc.
GnuPG Key ID : 0x93B746BE
On Sun, Jul 28, 2002 at 06:22:46PM +0800, Federico Sevilla III wrote:
> I have been noticing problems on two of my boxes here with some
> processes somehow getting stuck in "D" state, which cannot be killed
> even by a SIGKILL by root, and stay on running forever. I have noticed
> the problem, so far, on 2.4.18-xfs and 2.4.19-rc2-xfs kernels.
I do not know how small a tidbit this will be, but together with these
processes stuck in state "D", there is a continuing rise in the load
averages of the system as reported by various interfaces
(top/uptime/w/phpSysInfo). On the system running 2.4.18-xfs this rose to
up to 25 before I eventually rebooted the box.
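The rising load average fits the stuck processes: on Linux the load
average counts tasks in uninterruptible sleep as well as runnable ones,
so each task parked in state "D" adds roughly 1 to the load even if the
CPU is idle. A small sketch for reading the figures from /proc/loadavg
follows; the function name `parse_loadavg` is hypothetical, and the
sample line in the comment is invented for illustration.

```python
def parse_loadavg(text):
    """Parse the single line found in /proc/loadavg.

    Example line (made up): "25.03 24.80 20.11 3/120 4567"
    Fields: 1-, 5- and 15-minute load averages, running/total tasks,
    and the most recently assigned PID.
    """
    fields = text.split()
    load1, load5, load15 = (float(x) for x in fields[:3])
    running, total = (int(x) for x in fields[3].split("/"))
    return {
        "load1": load1,
        "load5": load5,
        "load15": load15,
        "running": running,
        "total": total,
        "last_pid": int(fields[4]),
    }


if __name__ == "__main__":
    with open("/proc/loadavg") as fh:
        print(parse_loadavg(fh.read()))
```

If the 1-minute figure stays close to the number of tasks in state "D"
while few tasks are actually running, the load is coming from the stuck
processes rather than from real work.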
Another bit: on the 2.4.18-xfs box, the number of processes getting
stuck in state "D" kept growing. Various `ps ax` and `sync` processes in
particular would get stuck and stall as I issued them. It may be
interesting to note that with 2.4.19-rc2-xfs, with which I had my "latest
encounter" with this problem, no `ps ax` or `sync` processes got stuck.
The only stuck processes were a couple of apt-method processes (and any
new ones I launched in an attempt to download a package from the
Internet for installation).
Both units have already been upgraded to 2.4.19-rc3-xfs. I will send
feedback if/when I run into any processes stuck in state "D" for
abnormally long periods of time.
--> Jijo
On 28 July 2002 09:35, Federico Sevilla III wrote:
> On Sun, Jul 28, 2002 at 06:22:46PM +0800, Federico Sevilla III wrote:
> > I have been noticing problems on two of my boxes here with some
> > processes somehow getting stuck in "D" state, which cannot be killed
> > even by a SIGKILL by root, and stay on running forever. I have noticed
> > the problem, so far, on 2.4.18-xfs and 2.4.19-rc2-xfs kernels.
>
> I do not know how small a tidbit this will be, but together with these
> processes stuck in state "D", there is a continuing rise in the load
> averages of the system as reported by various interfaces
> (top/uptime/w/phpSysInfo). On the system running 2.4.18-xfs this rose to
> up to 25 before I eventually rebooted the box.
>
> Another bit: on the 2.4.18-xfs box, the number of processes getting
> stuck in state "D" kept growing. Various `ps ax` and `sync` processes in
> particular, would get stuck and stall as I would issue them. It may be
> interesting to note that with 2.4.19-rc2-xfs, with which I had my "latest
> encounter" with this problem, no `ps ax` or `sync` processes got stuck.
> Only the couple of apt-method processes that got stuck (and any new ones
> I would launch in an attempt to download a package from the Internet for
> installation) were there.
D state processes are sitting in kernel code waiting for something
to happen. It is ok to sit in D state for milliseconds, it is acceptable
to sit for seconds. If those processes are stuck forever, it's a bug.
Capture the Alt-SysRq-T output and run the relevant part through
ksymoops. Yes, this means you should have ksymoops installed and tested,
which is easy to get wrong. I've done that too often.
--
vda
On Sun, Jul 28, 2002 at 04:09:33PM -0200, Denis Vlasenko wrote:
> D state processes are sitting in kernel code waiting for something to
> happen. It is ok to sit in D state for milliseconds, it is acceptable
> to sit for seconds. If those processes are stuck forever, it's a bug.
The processes I refer to get stuck in D state forever. I have other
processes that are in D state legitimately, and for reasonable amounts
of time depending on the task, but it is only these random processes
that occur once in a while that stay there forever and drive the load
levels way beyond their normal levels.
> Capture Alt-SysRq-T output and ksymoops relevant part. Yes it means you
> should have ksymoops installed and tested, which is easy to get wrong.
> I've done that too often.
It also requires access to the console, right? Or is it possible to get a
similar task information dump when logged on remotely via SSH? I have
ksymoops working and tested on both of these systems, and console access
to one of them. It would be great to know what to do to help the kernel
hackers figure out what's going on as soon as the next stuck process
shows up.
--> Jijo
On 29 July 2002 05:22, Federico Sevilla III wrote:
> On Sun, Jul 28, 2002 at 04:09:33PM -0200, Denis Vlasenko wrote:
> > D state processes are sitting in kernel code waiting for something to
> > happen. It is ok to sit in D state for milliseconds, it is acceptable
> > to sit for seconds. If those processes are stuck forever, it's a bug.
>
> The processes I refer to get stuck in D state forever. I have other
> processes that are in D state legitimately, and for reasonable amounts
> of time depending on the task, but it is only these random processes
> that occur once in a while that stay there forever and drive the load
> levels way beyond their normal levels.
>
> > Capture Alt-SysRq-T output and ksymoops relevant part. Yes it means you
> > should have ksymoops installed and tested, which is easy to get wrong.
> > I've done that too often.
>
> It also requires access to the console, right? Or is it possible to get a
> similar task information dump when logged on remotely via SSH?
It is logged by syslog. /var/log/messages if your conf is standard.
--
vda
On Mon, 29 Jul 2002, Denis Vlasenko wrote:
| On 29 July 2002 05:22, Federico Sevilla III wrote:
| > On Sun, Jul 28, 2002 at 04:09:33PM -0200, Denis Vlasenko wrote:
| > > D state processes are sitting in kernel code waiting for something to
| > > happen. It is ok to sit in D state for milliseconds, it is acceptable
| > > to sit for seconds. If those processes are stuck forever, it's a bug.
| >
| > The processes I refer to get stuck in D state forever. I have other
| > processes that are in D state legitimately, and for reasonable amounts
| > of time depending on the task, but it is only these random processes
| > that occur once in a while that stay there forever and drive the load
| > levels way beyond their normal levels.
| >
| > > Capture Alt-SysRq-T output and ksymoops relevant part. Yes it means you
| > > should have ksymoops installed and tested, which is easy to get wrong.
| > > I've done that too often.
| >
| > It also requires access to the console, right? Or is it possible to get a
| > similar task information dump when logged on remotely via SSH?
|
| It is logged by syslog. /var/log/messages if your conf is standard.
| --
That helps on the output side, sure, but I (mis?)understood the question
to be about the ability to do Alt-SysRq-x via ssh. Is that possible?
Not that I know of, but I could be wrong about that.
So if you really need Alt-SysRq over a network connection (or even
a serial console connection)...
A few months ago I cooked up a patch so that "echo {magickey}"
mimics SysRq via proc/sysctl. Patch against 2.4.18 is here:
http://www.osdl.org/archive/rddunlap/patches/sys-magic.dif
Usage is: echo {key} > /proc/sys/kernel/magickey
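For anyone who wants to trigger this from a script rather than a shell,
the usage line above could be wrapped roughly as follows. This is a
sketch only: it assumes a kernel with the patch applied, so that
/proc/sys/kernel/magickey exists, and it assumes the key letters match
the keyboard ones (for example "t" for the task dump that Alt-SysRq-T
gives). The helper name `send_magic_key` is hypothetical.

```python
def send_magic_key(key, path="/proc/sys/kernel/magickey"):
    """Write one SysRq key letter to the (patched-kernel) magickey file.

    Mirrors `echo {key} > /proc/sys/kernel/magickey`, including the
    trailing newline that echo appends. Writing needs root.
    """
    if len(key) != 1:
        raise ValueError("expected exactly one key character")
    with open(path, "w") as fh:
        fh.write(key + "\n")


if __name__ == "__main__":
    # Assumption: "t" requests the same task dump as Alt-SysRq-T.
    send_magic_key("t")
```

Run as root over ssh, this would request the dump without console
access; the output still has to be collected from wherever the kernel
log ends up.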
--
~Randy
On Mon, Jul 29, 2002 at 09:13:33AM -0700, Randy.Dunlap wrote:
> On Mon, 29 Jul 2002, Denis Vlasenko wrote:
> | It is logged by syslog. /var/log/messages if your conf is standard.
> That helps on the output side, sure, but I (mis?)understood the question
> to be about the ability to do Alt-SysRq-x via ssh. Is that possible?
No, you didn't misunderstand my question. Alt-SysRq-x via ssh doesn't
work, and that's what I was wondering about. :)
> Not that I know of, but I could be wrong about that.
> So if you really need Alt-SysRq over a network connection (or even
> a serial console connection)...
> A few months ago I cooked up a patch so that "echo {magickey}"
> mimics SysRq via proc/sysctl. Patch against 2.4.18 is here:
> http://www.osdl.org/archive/rddunlap/patches/sys-magic.dif
> Usage is: echo {key} > /proc/sys/kernel/magickey
I'm curious: can anyone logged on do this? With the physical Alt-SysRq-x
people have to actually go into the server room, up to the server,
connect a keyboard, and do their mumbo-jumbo. With this, anybody can, say,
unmount all filesystems, right?
:(
But thanks, anyway. I'm thinking about whether or not I should do this
(and just restrict logins to root, or something like that).
--> Jijo
On Tue, 30 Jul 2002, Federico Sevilla III wrote:
> I'm curious: can anyone logged on do this? With the physical Alt-SysRq-x
> people have to actually go into the server room, up to the server,
> connect a keyboard, and do their mumbo-jumbo. With this, anybody can, say,
> unmount all filesystems, right?
Magic SysRq works fine over serial console ...
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
On Tue, 30 Jul 2002, Federico Sevilla III wrote:
| On Mon, Jul 29, 2002 at 09:13:33AM -0700, Randy.Dunlap wrote:
| > On Mon, 29 Jul 2002, Denis Vlasenko wrote:
| > | It is logged by syslog. /var/log/messages if your conf is standard.
| > That helps on the output side, sure, but I (mis?)understood the question
| > to be about the ability to do Alt-SysRq-x via ssh. Is that possible?
|
| No, you didn't misunderstand my question. Alt-SysRq-x via ssh doesn't
| work, and that's what I was wondering about. :)
|
| > Not that I know of, but I could be wrong about that.
| > So if you really need Alt-SysRq over a network connection (or even
| > a serial console connection)...
| > A few months ago I cooked up a patch so that "echo {magickey}"
| > mimics SysRq via proc/sysctl. Patch against 2.4.18 is here:
| > http://www.osdl.org/archive/rddunlap/patches/sys-magic.dif
| > Usage is: echo {key} > /proc/sys/kernel/magickey
|
| I'm curious: can anyone logged on do this? With the physical Alt-SysRq-x
| people have to actually go into the server room, up to the server,
| connect a keyboard, and do their mumbo-jumbo. With this, anybody can, say,
| unmount all filesystems, right?
|
| :(
The 'magickey' /proc file is mode 0644 (read-write for root, read-only
for others), so jo_user can't write to it.
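To spell out what mode 0644 allows, the permission bits can be decoded
with the standard `stat` module. This is a sketch for illustration; the
helper name `describe_mode` is made up, and it only decodes the bits,
it does not check who owns the file (root, in this case).

```python
import stat


def describe_mode(mode):
    """Decode the permission bits of a regular file's mode."""
    return {
        # stat.filemode() wants the file-type bits as well, so add S_IFREG.
        "string": stat.filemode(stat.S_IFREG | mode),
        "owner_can_write": bool(mode & stat.S_IWUSR),
        "group_can_write": bool(mode & stat.S_IWGRP),
        "others_can_write": bool(mode & stat.S_IWOTH),
    }


if __name__ == "__main__":
    print(describe_mode(0o644))
```

For 0644 only the owner's write bit is set, so with the file owned by
root an ordinary login can read it but not write a key to it.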
| But thanks, anyway. I'm thinking about whether or not I should do this
| (and just restrict logins to root, or something like that).
Sure.
--
~Randy