2001-03-06 20:27:09

by Vibol Hou

Subject: System slowdown on 2.4.2-ac5 (recurring from 2.4.1-ac20 and 2.4.0)

Hi,

This is a follow up report on a server I run which is now using 2.4.2-ac5.
It was suggested that the problem might be a NIC driver issue, but that
seems unlikely at this point.

You can find my previous posts at the following links to get a better idea
of what I am encountering:

http://www.uwsg.indiana.edu/hypermail/linux/kernel/0101.3/0470.html
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0102.3/0401.html

The problem still persists with the new 2.4.2-ac5 kernel, and I have a
feeling it has to do with the VM subsystem. The system runs Apache, MySQL,
and Sendmail. It has ~900MB RAM. The first lockup in 2.4.2-ac5 occurred
right after I transferred a large and busy MySQL DB to the server. I took
down services before the big transfer, and after the DB was switched over
and services turned on, it began receiving the regular ~80 queries/second.
The key_buffer for MySQL is set to 256M, which is shared amongst the MySQL
threads. Everything ran fine until 5 minutes later, at which time the system
started crawling again. Load was normal, ~1 across the board. I was not
able to get much useful information from this failure as SSH stopped
responding before I could get commands entered. Getty on the serial console
wasn't responding (SysRQ was). The system had only been up for 1 day at the
time.

The second time it occurred was a few hours ago, 3 days after the last system
reboot (last failure). I grabbed all the SysRQ information I could before
restarting the system. I have that info attached. It includes memory
readings and process lists before and after killing/terminating processes.
I don't really know how to interpret the information given, so I am hoping
someone can help sift through the information. There was little I could
retrieve from the SSH shell I was in when the system slowed down.

Note that the serial console still responds to SysRQ, so the console itself
is working; it seems getty is also affected by the slowdown.

I would appreciate any guidance you can provide on this issue.

Thanks,
Vibol Hou
KhmerConnection


Attachments:
dmesg.txt (13.32 kB)
sysrq.txt (60.96 kB)

2001-03-07 08:49:28

by Mike Galbraith

Subject: Re: System slowdown on 2.4.2-ac5 (recurring from 2.4.1-ac20 and 2.4.0)

On Tue, 6 Mar 2001, Vibol Hou wrote:

> Hi,
>
> This is a follow up report on a server I run which is now using 2.4.2-ac5.
> It was suggested that the problem might be a NIC driver issue, but that
> seems unlikely at this point.
>
> You can find my previous posts at the following links to get a better idea
> of what I am encountering:
>
> http://www.uwsg.indiana.edu/hypermail/linux/kernel/0101.3/0470.html
> http://www.uwsg.indiana.edu/hypermail/linux/kernel/0102.3/0401.html
>
> The problem still persists with the new 2.4.2-ac5 kernel, and I have a
> feeling it has to do with the VM subsystem. The system runs Apache, MySQL,
> and Sendmail. It has ~900MB RAM. The first lockup in 2.4.2-ac5 occurred

Hi,

This portion of your log...

httpd S 7FFFFFFF 3700 15058 1684 (NOTLB) 15061 15055
Call Trace: [<IN MWaI tcWahtdochg do[] <ce01ted9c9t2edd> L] OC[K<UPc0 o1ndb
1CPadU0>], re[<gic0st1e1r3bsf:
8>] C[<c<c001f1f2f2afa66>>]] [ < c0 01c013f05:[5<>]c 010[<7ec0411>c4]3
8d>EF] LA
GS: 0 00 00 0 8[7
<c0 eac0x:2 8ac042208a[>]<c 0 13 e32bxd3: >]ce 8[4a<c00010 26 e6ec0>x:]
f7[<94ce0013803 43 5>ed]x : [0<0c00010008e00c
b>] e
i:dht tp d 0 0 Se dDi:7 A4ce3F8428a0 00 0eb 1p:5 0c601 28 a
412680 4 es p: ce8 (4NbeOTfLc
B) ds: 0018 es: 0018 ss: 0018

...leads me to believe that the NMI-Watchdog fired. (I think that points
the finger away from VM, but..) It looks like it's saying that a lock
of cpu 0 was detected if I squint at it right. It's too munged to tell.

-Mike

2001-03-07 09:33:28

by Vibol Hou

Subject: RE: System slowdown on 2.4.2-ac5 (recurring from 2.4.1-ac20 and 2.4.0)

You're right,

It looks like it says NMI Watchdog LOCKUP CPU0. I didn't think much of it,
but I suspected someone would see that piece. Now, I have the tasklist from
the lockup that occurred 3 days before this one:

httpd S CE4F3F28 148 28524 22643 (NOTLB) 28569 28523
Call Trace: [<c011473a>] [<c0114664>] [<c014276e>] [<c0142b12>] [<c013af14>]
[<c0108ehNMdoIg Wa0[<02cb01>]00 02ebte>]ct ed
LhOCttKUpdP o n CPUS0, 7 rFeFFgiFsFFteF rs :
0 CP U :0
E IP0
: E I 0P: 0 01 0:(N[<OcTL01B0) 7e4 1>28]
570EF 2LA85G2S:4
00C00al0l0 8T7
racea4ax2:[11<4c601d714>]6 d7 >] e bx[:< cf0173cc6d00b700> ] e[<cxc0:1
ff7b8820d6>a8] 0 [ <ced01x:db 080f06>0]00 0[0
<c01ec1seif: 80c100ef[5><]c0 1f 2 ef5d5i>: ] f7[3c<0c0001c0 3 fb e1>b]p: c
02 8a 42 0 [es<pc:01 fc7403cb1e>ef]c
[<c d0s:01 800 18 e s ][ <c 01000818ec b >]ss :
00h1t8t
pdssPr kolcesZsA 0F7 Ag7dF3 (A0p id 18 04, 2 s85t7ac0 kp 2ag2e64=f3 73 c
10 00 )
StacStka: ck: c10fe1f06e0

What would cause the NMI Watchdog to fire?

-Vibol

2001-03-07 09:44:38

by Andrew Morton

Subject: Re: System slowdown on 2.4.2-ac5 (recurring from 2.4.1-ac20 and 2.4.0)

Mike Galbraith wrote:
>
> On Tue, 6 Mar 2001, Vibol Hou wrote:
>
> > Hi,
> >
> > This is a follow up report on a server I run which is now using 2.4.2-ac5.
> > It was suggested that the problem might be a NIC driver issue, but that
> > seems unlikely at this point.
> >
> > You can find my previous posts at the following links to get a better idea
> > of what I am encountering:
> >
> > http://www.uwsg.indiana.edu/hypermail/linux/kernel/0101.3/0470.html
> > http://www.uwsg.indiana.edu/hypermail/linux/kernel/0102.3/0401.html
> >
> > The problem still persists with the new 2.4.2-ac5 kernel, and I have a
> > feeling it has to do with the VM subsystem. The system runs Apache, MySQL,
> > and Sendmail. It has ~900MB RAM. The first lockup in 2.4.2-ac5 occurred
>
> Hi,
>
> This portion of your log...
>
> ...
>
> ...leads me to believe that the NMI-Watchdog fired.

Yes, it did. But this is not the problem. The log was
captured on a serial console. Doing an ALT_SYSRQ-T (or
BREAK/T) will cause a large amount of output to be written
to the serial port while interrupts are disabled. It
takes so long that the NMI watchdog decides the CPU
is stuck.
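
A rough back-of-the-envelope sketch of why this happens, in Python. The numbers are assumptions, not values from this thread: the baud rates are hypothetical (the console speed is not stated), the 20,000-byte dump size is made up for illustration, 8N1 framing is assumed, and the 5-second threshold is passed in as a parameter rather than read from any kernel.

```python
def serial_write_seconds(num_bytes, baud=9600, bits_per_byte=10):
    """Seconds needed to push num_bytes out of a serial port.

    bits_per_byte=10 assumes 8N1 framing (1 start + 8 data + 1 stop bit).
    """
    return num_bytes * bits_per_byte / baud


def watchdog_would_fire(num_bytes, baud=9600, threshold_s=5.0):
    """True if the write would take longer than the assumed watchdog threshold."""
    return serial_write_seconds(num_bytes, baud) > threshold_s


if __name__ == "__main__":
    # Hypothetical 20,000-byte task dump at a few common console speeds.
    for baud in (9600, 38400, 115200):
        secs = serial_write_seconds(20_000, baud)
        fires = watchdog_would_fire(20_000, baud)
        print(f"{baud:>6} baud: {secs:5.1f} s with interrupts off, fires: {fires}")
```

Under these assumptions a slow console turns even a modest dump into many seconds with interrupts off, which is consistent with the watchdog report appearing in the middle of the task list.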

Actually, I think the remove-the-console-lock patch which
went into 2.4.2-ac13 will fix this - timer interrupts
should now continue to be serviced while the task table
is being dumped out.

I'm going to pretend I meant this to happen :)

I note that the Mem-info dump only shows the page table cache
size for the local CPU. It should be showing the info for all
CPUs. Minor thing.

But the failing of Vibol's server remains a mystery. I suggest
an upgrade to 2.4.2-ac13 would be worthwhile - at least we'll
get a full task table dump.



2001-03-07 18:27:27

by Vibol Hou

Subject: RE: System slowdown on 2.4.2-ac5 (recurring from 2.4.1-ac20 and 2.4.0)

> But the failing of Vibol's server remains a mystery. I suggest
> an upgrade to 2.4.2-ac13 would be worthwhile - at least we'll
> get a full task table dump.

I'll get it up and running and report back with the trace next time
something goes awry.

-Vibol

2001-03-08 21:12:58

by Vibol Hou

Subject: RE: System slowdown on 2.4.2-ac5 (recurring from 2.4.1-ac20 and 2.4.0)

So, after finally getting 2.4.2-ac14 compiled and installed, the system
crashed twice within an hour. The first crash was a kernel BUG in
printk.c, which was coupled with an NMI Watchdog trigger according to
Andreas Dilger. The second apparently had the same cause as the earlier
slowdowns. Not good.

Since I do not want to risk trashing the drives any more than I already have,
or to keep losing sleep babysitting this server, I am
discontinuing testing of the 2.4.x series kernel on my system. I will
attempt to set up a copy of the system locally, but I don't quite have the
resources necessary to generate the varied types of load that the remote
server receives on a daily basis. It's odd that I seem to be the only
person experiencing this problem, so it could be down to some hardware
problem that the 2.2.x series is immune to, a miracle combination of
hardware that happens to dislike 2.4, or maybe just plain bad luck ;)

Anyways, think about this one: What would cause bind to spit out many of
these messages as the system is beginning to slow down to a crawl?

Mar 8 09:26:28 omega named[167]: ns_req: sendto([216.104.96.10].2483):
Resource temporarily unavailable

Bind does this every time the system starts bogging down. The system is
obviously receiving packets, but can't send replies back out due to the
aforementioned error. The console doesn't really lock up, as SysRQ still
works, though getty isn't responsive once the system hits the slowdown state.
I hope all the other information I've collected thus far, along with this
small piece, will help determine the location of the problem and maybe a fix
for it (if it is a problem at all).
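
For reference, "Resource temporarily unavailable" is the Linux error string for EAGAIN, which a non-blocking socket returns from sendto() when the data cannot be queued immediately. The Python sketch below only illustrates that error path; it is not how named is written, and the helper name try_sendto and the loopback address and port are hypothetical.

```python
import errno
import os
import socket


def try_sendto(sock, payload, addr):
    """Attempt one non-blocking send; report whether the datagram was queued.

    Returns True on success and False when the kernel answers EAGAIN /
    EWOULDBLOCK, meaning the send would have had to wait.
    """
    try:
        sock.sendto(payload, addr)
        return True
    except BlockingIOError:
        return False


if __name__ == "__main__":
    # On Linux this prints the same text that named logged.
    print(os.strerror(errno.EAGAIN))

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setblocking(False)
    # Hypothetical destination, used only to exercise the call.
    print(try_sendto(sock, b"ping", ("127.0.0.1", 9)))
    sock.close()
```

A steady stream of these errors suggests the outbound path is not draining, which would fit the observation that packets arrive but replies do not leave.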

In the meantime, good hacking! I appreciate all the help I've gotten from
the kind people on this list. All the kernel developers have done an
excellent job of streamlining this kernel to make it faster and more
efficient, but it's still not as stable as the 2.2.x kernel (at least for
me).

-Vibol
