Hi David,
On Sun, 29 Feb 2004, David Luyer wrote:
> On Thu, Feb 05, 2004 at 10:44:31AM -0200, Marcelo Tosatti wrote:
> > Here goes the first release candidate.
> >
> > It contains mostly networking updates, XFS update, amongst others.
> >
> > This release contains a fix for excessive inode memory pressure on
> > highmem boxes. Help is especially wanted with testing this on
> > heavy-load highmem machines.
>
> How is this likely to manifest itself?
Basically this modification makes the inode reclaiming code rip out
inodes with highmem pagecache attached when it's necessary.
What happened before was that low memory could get filled with
unreclaimable inodes, which would screw up performance badly (and
probably crash the system in extreme situations).
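If it helps to picture it, here is a rough userspace sketch of the idea
(illustrative only -- not the actual fs/inode.c code; the names and the
tiny "lowmem" limit are made up for the demo):

/*
 * "Lowmem" is a small pool holding inode structs; each inode may pin
 * "highmem" pagecache.  With the old behaviour the pruner skips any
 * inode that still has pages attached, so lowmem can fill up with
 * unfreeable inodes.  The fix: when lowmem is tight, rip the pagecache
 * off first, then free the inode itself.
 */
#include <stdio.h>
#include <stdlib.h>

struct page  { struct page *next; };

struct inode {
    struct inode *next;
    struct page  *pages;    /* "highmem" pagecache pinned by this inode */
    int           npages;
};

static struct inode *inode_list;
static int lowmem_inodes;
static const int LOWMEM_LIMIT = 4;    /* pretend lowmem holds only 4 */

static void drop_pagecache(struct inode *ino)
{
    while (ino->pages) {
        struct page *p = ino->pages;
        ino->pages = p->next;
        free(p);                      /* give the page back */
    }
    ino->npages = 0;
}

/* rip_highmem = 0: old behaviour, skip inodes that still have pagecache.
 * rip_highmem = 1: new behaviour, drop the pagecache, free the inode. */
static void prune_icache(int rip_highmem)
{
    struct inode **pp = &inode_list;

    while (*pp && lowmem_inodes > LOWMEM_LIMIT / 2) {
        struct inode *ino = *pp;

        if (ino->npages && !rip_highmem) {
            pp = &ino->next;          /* pinned: unreclaimable */
            continue;
        }
        drop_pagecache(ino);
        *pp = ino->next;
        free(ino);
        lowmem_inodes--;
    }
}

static void new_inode_with_pages(int npages)
{
    struct inode *ino;
    int i;

    if (lowmem_inodes >= LOWMEM_LIMIT)
        prune_icache(1);              /* change to 0: lowmem fills up */

    ino = calloc(1, sizeof(*ino));
    for (i = 0; i < npages; i++) {
        struct page *p = malloc(sizeof(*p));
        p->next = ino->pages;
        ino->pages = p;
        ino->npages++;
    }
    ino->next = inode_list;
    inode_list = ino;
    lowmem_inodes++;
}

int main(void)
{
    int i;

    for (i = 0; i < 16; i++) {
        new_inode_with_pages(2);
        printf("inodes held in lowmem: %d\n", lowmem_inodes);
    }
    return 0;
}

With prune_icache(1) the inode count stays bounded; flip it to 0 and
every inode is pinned by its pages, so "lowmem" grows without limit.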
> We just had a box which crashed only 2 hours after deployment, and
> reading over the recent changes this seems like a potential cause
> (although being new, faulty hardware is always a possibility); the
> last items on its serial console were:
>
> INIT: Sending processes the TERM signal
> memory.c:100: bad pmd 000001e3.
> memory.c
This looks like a hardware fault to me, or possibly a badly behaving
driver (not sure). The inode-highmem modifications can't cause such
breakage, as far as I can see.
Rik, Andrew ?
> Was still responding to ICMP after crash.
>
> Details:
>
> * running 2.4.25 (release with small local patch to put MPT SCSI devices
> before Adaptec SCSI devices, as "scsihosts" cannot do this)
>
> * IBM x335
>
> * dual Xeon 3.066GHz (hyperthreads in 2.4.25)
>
> * 2.5GB RAM (HIGHMEM4G)
>
> * high CPU load (two processes around 75% of a CPU each at time of crash,
> being bzip2 compression of gigabytes of data, and some other processes
> using somewhat less CPU for network and disk IO)
>
> * moderate IO load (3Mbps on tg3 ethernet, more than double that on each
> of MPT SCSI and AIC SCSI)
>
> * high inode / file descriptor load -- a single process may open hundreds
> of thousands of file descriptors over a 5 minute period and then close
> them all at once; file-max is set to 1024^2
>
> * newly deployed (ie no track record of stability to refer to)
>
> Role of system is basically to receive a constant 3Mbps stream of UDP data
> which is then written to the internal RAID array; this data is then read
> from the internal array and written to an external RAID array (in the
> process doubling in volume: each piece of data ends up in two places, and
> is sometimes read/written a few times before ending up in the right place)
> and ultimately compressed and archived.
>
> I've rebooted into 2.4.25 for a second chance but if it fails again,
> will reboot to 2.4.24 and then if that fails, revert to old hardware
> and kernel (which was running kernel 2.4.24 on an old Intel ISP2150).
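For what it's worth, that descriptor churn should be easy to approximate
with something like the sketch below, if you want a standalone stress test
(the count and pacing are invented here, just to exercise the same path):

/* Rough reproducer for the descriptor churn described above: open a
 * large number of fds over time, then close them all at once.  Needs
 * ulimit -n (and fs.file-max) raised well past the default. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NFDS 100000

int main(void)
{
    int *fds = malloc(NFDS * sizeof(int));
    int i, nopen = 0;

    if (!fds)
        return 1;

    for (i = 0; i < NFDS; i++) {
        fds[i] = open("/dev/null", O_RDONLY);
        if (fds[i] < 0) {
            perror("open");           /* likely EMFILE/ENFILE */
            break;
        }
        nopen++;
        if (i % 10000 == 0)
            usleep(100000);           /* spread the opens out a bit */
    }

    printf("opened %d fds, closing them all at once\n", nopen);
    for (i = 0; i < nopen; i++)
        close(fds[i]);

    free(fds);
    return 0;
}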
OK, waiting for your input.
On Mon, 1 Mar 2004, Marcelo Tosatti wrote:
> > We just had a box which crashed only 2 hours after deployment, and
> > reading over the recent changes this seems like a potential cause
> > (although being new, faulty hardware is always a possibility); the
> > last items on its serial console were:
> >
> > INIT: Sending processes the TERM signal
> > memory.c:100: bad pmd 000001e3.
> > memory.c
>
> This looks like a hardware fault to me, or possibly a badly behaving
> driver (not sure). The inode-highmem modifications can't cause such
> breakage, as far as I can see.
Agreed, this looks like a hardware fault.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Mon, Mar 01, 2004 at 10:20:46AM -0500, Rik van Riel wrote:
> On Mon, 1 Mar 2004, Marcelo Tosatti wrote:
> > > INIT: Sending processes the TERM signal
> > > memory.c:100: bad pmd 000001e3.
> > > memory.c
> >
> > This looks like a hardware fault to me, or possibly a badly behaving
> > driver (not sure). The inode-highmem modifications can't cause such
> > breakage, as far as I can see.
>
> Agreed, this looks like a hardware fault.
I had a second failure after this and needed to resolve the fault ASAP,
so rather than troubleshooting by changing one thing at a time I swapped
the CPU, memory and kernel all at once, which resolved the fault.
I'll re-upgrade to 2.4.25 after the system has been stable for around
a week. The original CPU and memory have been placed in a test box and
have shown no faults running a memory tester for 24 hours, but perhaps
it was just a seating issue on a component. I'll report back if there
are any problems after re-upgrading.
David.