2002-08-23 07:54:32

by Maurice Volaski

Subject: SMP Netfinity 340 hangs under 2.4.19

A single processor Netfinity 340 running RedHat 7.1 and kernel 2.4.18
was recently upgraded to

1) 1 GB RAM
2) second processor (1 GHz Xeon)
3) 2.4.19 for SMP with bigmem and added NFS server patches and
ext3-related patches.

Heavily used processes are netatalk, samba, and NFS.

The box is now hard locking periodically (every several days).

Lore elsewhere on the Internet says Netfinity SMP boxes have had
trouble with the nmi-watchdog and the screen blanker. The former was
turned off via LILO and the latter was turned off in a script (for
both the terminal and X).

It seemed that the box was OK (for about 2 weeks) when it was not
attached to external RAID hardware (via an Adaptec 29160LP card). At
least one hang occurred during an fsck of the hardware RAID and another
during what was probably heavy disk activity on the RAID.

The memory was reverted to the original amount, but the box still hung.
Presumably, this rules out #1.

In the latest hang, the keyboard is locked up, but the Ethernet card
(e1000) has link, and ssh, https, and ping respond on a scan, but
that's it. Also, heartbeat runs on the box, and it stopped reporting
over the motherboard Ethernet and serial port being watched by the
failover node's heartbeat.

Note that another box configured virtually identically except for the
e1000 and the Adaptec card (no external RAID) has not hung.


Is there significance to the fact that keyboard and mouse are frozen
but apparently some processes are still up?

Does anyone think this could be an issue with the patched SMP kernel?

More keywords: crash, freeze, hung, frozen, locked up.
--

Maurice Volaski, [email protected]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University


2002-08-23 08:15:31

by Alan

Subject: Re: SMP Netfinity 340 hangs under 2.4.19

On Fri, 2002-08-23 at 08:58, Maurice Volaski wrote:
> A single processor Netfinity 340 running RedHat 7.1 and kernel 2.4.18
> was recently upgraded to
>
> 1) 1 GB RAM
> 2) second processor (1 Ghz Xeon)
> 3) 2.4.19 for SMP with bigmem and added NFS server patches and
> ext3-related patches.

Exactly what patches? And does unpatched 2.4.19 behave?

> Is there significance to the fact that keyboard and mouse are frozen
> but apparently some processes are still up?

Interrupts are running, but it's stuck looping in kernel space, I suspect.

2002-08-23 08:43:50

by Maurice Volaski

Subject: Re: SMP Netfinity 340 hangs under 2.4.19

>> A single processor Netfinity 340 running RedHat 7.1 and kernel 2.4.18
>> was recently upgraded to
>>
>> 1) 1 GB RAM
>> 2) second processor (1 Ghz Xeon)
>> 3) 2.4.19 for SMP with bigmem and added NFS server patches and
>> ext3-related patches.
>
>Exactly what patches? And does unpatched 2.4.19 behave?

These from Neil Brown:
patch-A-UmemWarn            Fix the compile warning ....
patch-B-0-9-18              Latest Ext3 patches
patch-C-Ext3Fixes           Fix some problems with ext3
patch-D-NfsdLookupTidy      Tidy up code in nfsd_lookup
patch-E-NfsdInit            Tidy up init/exit for nfsd module
patch-F-NfsdFsid            Support fsid= export option to be device number independent
patch-G-NfsdLocksExplock    Change export table lock to (SMP safe) rwsemaphore
patch-H-NfsdLocksCachelock  Lock reply cache with SMP safety.
patch-I-NfsdLocksNfssvc     Tidy up locking in nfssvc - preparing for BKL removal
patch-J-NfsdLocksRename     protect rename and related operations by kernel_lock
patch-K-NfsdFhLock          protect file handle lookup by kernel lock
patch-L-NfsdLocksRacache    protect read-ahead cache with SMP safe locking
patch-M-NfsdBKLgone         Remove last unneeded bit of BKL from knfsd
patch-N-RpcLists            Change sunrpc to use more list.h lists
patch-O-RpcInit             Get sunrpc to use module_init properly
patch-P-RpcSvcLocking       Tidy up SMP locking for svc_sock
patch-Q-RpcTcpCloseBad      Detect and close tcp connections that we cannot work with.
patch-R-RpcTcpCloseIdle     Close idle rpc/tcp sockets
patch-S-RpcTcpReserve       Make sure there is always adequate sndbuf space for replies.
patch-T-RpcSvcTcpLimit      Limit number of active tcp connections to an RPC service
patch-U-NfsdTcpEnable       Enable NFS over TCP via config option
patch-c-NfsfhErrFix         Correct some error codes returned in nfsfh.c

I haven't tried plain 2.4.19 yet. Should I have reason to not trust
these patches?

>> Is there significance to the fact that keyboard and mouse are frozen
>> but apparently some processes are still up?
>
>Interrupts are running, but it's stuck looping in kernel space, I suspect.

So could this be taken to mean the issue is most likely software
(presumably kernel)-related?
--

Maurice Volaski, [email protected]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University

2002-08-23 08:52:01

by Alan

Subject: Re: SMP Netfinity 340 hangs under 2.4.19

On Fri, 2002-08-23 at 09:47, Maurice Volaski wrote:
> I haven't tried plain 2.4.19 yet. Should I have reason to not trust
> these patches?

In the sense that they are not tested by the majority of 2.4.19 users,
it's always worth checking that.

> So could this be taken to mean the issue is most likely software
> (presumably kernel)-related?

It normally points to a kernel locking error.
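
To make "kernel locking error" a little more concrete: one classic form is two code paths taking the same pair of locks in opposite order, so that on SMP each CPU can end up spinning forever on the lock the other holds. The sketch below is only a userspace model of that idea in Python, not kernel code and not how the kernel itself checks anything. The lock names (lock_a, lock_b, lock_c) and the function find_lock_order_cycle are hypothetical, made up for illustration; nothing here is taken from 2.4.19 or from the patches listed in this thread.

```python
from collections import defaultdict


def find_lock_order_cycle(acquisitions):
    """Return a list of lock names forming an ordering cycle, or None.

    `acquisitions` is an iterable of sequences; each sequence lists the
    (hypothetical) locks one code path takes, in the order it takes them.
    """
    # Build a "held before" graph: an edge a -> b means some path
    # takes lock b while already holding lock a.
    graph = defaultdict(set)
    for path in acquisitions:
        for i, held in enumerate(path):
            for later in path[i + 1:]:
                graph[held].add(later)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current DFS stack, finished
    color = defaultdict(int)

    def visit(node, stack):
        color[node] = GREY
        stack.append(node)
        for nxt in sorted(graph[node]):
            if color[nxt] == GREY:
                # Back edge: the stack from nxt onwards is a cycle.
                return stack[stack.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt, stack)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in sorted(graph):
        if color[node] == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    # Two hypothetical paths that take the same locks in opposite order.
    conflicting = [("lock_a", "lock_b"), ("lock_b", "lock_a")]
    print(find_lock_order_cycle(conflicting))

    # Paths that agree on one global order have no cycle.
    consistent = [("lock_a", "lock_b"), ("lock_a", "lock_b", "lock_c")]
    print(find_lock_order_cycle(consistent))
```

A cycle in that graph means there is at least one interleaving in which both paths block forever. With spinlocks that would look much like what is described here: interrupts are still serviced, but nothing that needs those locks makes progress.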

2002-08-23 14:16:12

by Martin J. Bligh

Subject: Re: SMP Netfinity 340 hangs under 2.4.19

> I haven't tried plain 2.4.19 yet.
> Should I have reason to not trust these patches?

The fact that your box is hanging would seem like a good
reason to me ;-)

M.

2002-09-03 17:19:35

by Maurice Volaski

Subject: Re: SMP Netfinity 340 hangs under 2.4.19

Regarding this hang issue, I just had the vanilla 2.4.19 lock up, so
it looks like the problem is not with the patches. Any ideas on how
to troubleshoot it further?

>On Fri, 2002-08-23 at 09:47, Maurice Volaski wrote:
>> I haven't tried plain 2.4.19 yet. Should I have reason to not trust
>> these patches?
>
>In the sense that they are not tested by the majority of 2.4.19 users,
>it's always worth checking that.
>
>> So could this be taken to mean the issue is most likely software
>> (presumably kernel)-related?
>
>It normally points to a kernel locking error.


--

Maurice Volaski, [email protected]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University

2002-09-04 20:20:45

by Bill Davidsen

Subject: Re: SMP Netfinity 340 hangs under 2.4.19

On Fri, 23 Aug 2002, Maurice Volaski wrote:

> A single processor Netfinity 340 running RedHat 7.1 and kernel 2.4.18
> was recently upgraded to
>
> 1) 1 GB RAM
> 2) second processor (1 Ghz Xeon)
> 3) 2.4.19 for SMP with bigmem and added NFS server patches and
> ext3-related patches.
>
> Heavily used processes are netatalk, samba, and NFS.
>
> The box is now hard locking periodically (every several days).

Boot with the noapic option. Try a stock kernel before you add all those
patches.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-09-05 15:52:26

by Maurice Volaski

Subject: Re: SMP Netfinity 340 hangs under 2.4.19

>Boot with the noapic option. Try a stock kernel before you add all those
>patches.

Thanks for your reply. The crash still occurs with a stock 2.4.19 and
the noapic option.
--

Maurice Volaski, [email protected]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University