2003-08-29 16:39:48

by Marcelo Tosatti

Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)


> Ville,
>
> Which kernel doesn't hang on your box? 2.4 something?

> 2.4.20pre7 ran for over 9 months before it suddenly began locking up (I
> _suppose_ it could just mean the bug/problem is hard to trigger.)
> Nothing had been changed: the box had been up for that nine month
> period, and the same oracle dump cron job had been running each night.

Strange.

> Earlier 2.4's had too many problems with aic7xxx (crashes and so on), so
> I can't comment on them.

> After 2.4.20pre7, I tried 2.4.21-jam1 (based on -aa patchset) and
> 2.4.22-pre8. I also tried compiling 2.4.21-jam1 with gcc-3.2.1 instead
> of 2.96. All of those locked up eventually, sometimes within a day from
> reboot, some times it takes weeks. At one point, 2.4.21-jam1 seemed to
> reliably lock up when compiling kernel, but it no longer happens no
> matter how hard I try. Usually the lock up happens during nightly oracle
> backup dump.

So NMI and sysrq don't help. I suggest a few things:

Try to make the bug easy to reproduce. Force the Oracle dumps again and
again to crash the box. Can you try it, or is it a production machine?

BTW, can you describe these "Oracle dumps" in more detail? What do they do?
Save lots of data to disk and that's all, or something else?

Hope we can trace this down.


2003-08-29 20:06:21

by Ville Herva

Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Fri, Aug 29, 2003 at 01:35:25PM -0300, you [Marcelo Tosatti] wrote:
>
> So NMI and sysrq don't help. I suggest a few things:
>
> Try to make the bug easy to reproduce. Force the Oracle dumps again and
> again to crash the box.

I happened to work towards that direction this morning (before I read your
mail). Taking the stance that this very probably had something to do with io
stress, I played around with several io loads. Eventually I found out that
fsx on scsi disk reliably caused the box to either lock up or the aic7xxx
driver to barf. What's more, it took under 15 minutes to trigger.

So I copied the rootfs and everything else from the scsi disk to the ide
disk (just barely had enough space), and took all the scsi disk partitions
away from fstab. After reboot, I have been unable to lock it up with fsx
(scsi disk is not accessed at all), but it will take several weeks before
I'm confident that the lock up is cured.

aic7xxx / scsi hw seems quite a strong suspect for the lockups. 2.2 possibly
worked because it has the older aic7xxx 5.x driver.

> Can you try it, or is it a production machine?

It is a sort-of-a production machine -- that's why I have been so wary of
trying different things. Sorry for that...

> BTW, can you describe these "Oracle dumps" in more detail? What do they do?
> Save lots of data to disk and that's all, or something else?

They dump the Oracle database to a backup file.

${ORAHOME}/bin/exp \
***/*** full=Y grants=Y \
file=${DMPDIR}/fullexp.dmp 1>${LOGDIR}/fulllog.`date '+%a'` 2>&1

So basically just heavy IO afaict.
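
For forcing such a heavy-IO job "again and again", a small wrapper loop is one option. The following is only a sketch, not something used in this thread: the function name stress_loop, its arguments and the log file are all hypothetical. It writes a marker line and syncs before each run, so that after a hard lockup the log on disk shows which iteration was in progress.

```sh
#!/bin/sh
# Hypothetical helper (not from the thread): run a command repeatedly and
# record every iteration so the log survives a hard lockup.
# usage: stress_loop MAX_RUNS LOGFILE COMMAND [ARGS...]
stress_loop() {
    max_runs=$1
    logfile=$2
    shift 2
    i=1
    while [ "$i" -le "$max_runs" ]; do
        echo "run $i start $(date '+%Y-%m-%d %H:%M:%S')" >> "$logfile"
        sync    # push the marker line to disk before the risky work begins
        if ! "$@"; then
            # The command returned an error instead of hanging the box.
            echo "run $i failed" >> "$logfile"
            return 1
        fi
        i=$((i + 1))
    done
    echo "completed $max_runs runs without failure" >> "$logfile"
    return 0
}
```

It could be called as, for example, `stress_loop 50 /var/tmp/stress.log sh ./dump.sh`, where dump.sh is an assumed script wrapping the exp command above.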

> Hope we can trace this down.

I'm still not 100% sure that the aic7xxx barfs (see
http://lkml.org/lkml/2003/7/29/33 for an example) and the lockups are of
the same origin. But it seems at least 99.5% certain.

If aic7xxx/scsi is to blame, then is it the
- 2940 scsi adapter
- the disk
- the cabling or something (I've checked the termination)
- the motherboard (irq routing?)
- the aic7xxx driver?
- some other kernel issue?

The hw is:
Intel 815EEA2LU (i815 Chipset)
Celeron 1.3GHz (Tualatin)
Adaptec AHA-2940 / AIC-7871
- Disk (rootfs) SEAGATE Model: ST19171W Rev: 0024
- Tape Drive HP Model: C1537A Rev: L708
30GB IDE disk (scratch)


-- v --


2003-09-09 07:06:31

by Ville Herva

Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Fri, Aug 29, 2003 at 10:57:37PM +0300, you [Ville Herva] wrote:
> On Fri, Aug 29, 2003 at 01:35:25PM -0300, you [Marcelo Tosatti] wrote:
> >
> > So NMI and sysrq don't help. I suggest a few things:
> >
> > Try to make the bug easy to reproduce. Force the Oracle dumps again and
> > again to crash the box.
>
> I happened to work towards that direction this morning (before I read your
> mail). Taking the stance that this very probably had something to do with io
> stress, I played around with several io loads. Eventually I found out that
> fsx on scsi disk reliably caused the box to either lock up or the aic7xxx
> driver to barf. What's more, it took under 15 minutes to trigger.
>
> So I copied the rootfs and everything else from the scsi disk to the ide
> disk (just barely had enough space), and took all the scsi disk partitions
> away from fstab. After reboot, I have been unable to lock it up with fsx
> (scsi disk is not accessed at all), but it will take several weeks before
> I'm confident that the lock up is cured.

And indeed it did lock even though the scsi disk is not used at all. It just
took weeks.

At the time no heavy IO was going on afaict (but there might have been some
io.)

I'm completely out of ideas here. What the heck is the culprit...? Perhaps a
faulty motherboard?

> The hw is:
> Intel 815EEA2LU (i815 Chipset)
> Celeron 1.3GHz (Tualatin)
> Adaptec AHA-2940 / AIC-7871 (NOT USED)
> - Disk SEAGATE Model: ST19171W Rev: 0024 (NOT USED)
> - Tape Drive HP Model: C1537A Rev: L708
> 30GB IDE disk (All fs's here at the moment)



-- v --


2003-09-09 08:48:14

by Stephan von Krawczynski

Subject: Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

On Tue, 9 Sep 2003 10:05:07 +0300
Ville Herva wrote:

> On Fri, Aug 29, 2003 at 10:57:37PM +0300, you [Ville Herva] wrote:
> > [...]
> > So I copied the rootfs and everything else from the scsi disk to the ide
> > disk (just barely had enough space), and took all the scsi disk partitions
> > away from fstab. After reboot, I have been unable to lock it up with fsx
> > (scsi disk is not accessed at all), but it will take several weeks before
> > I'm confident that the lock up is cured.
>
> And indeed it did lock even though the scsi disk is not used at all. It just
> took weeks.
>
> At the time no heavy IO was going on afaict (but there might have been some
> io.)
>
> I'm completely out of ideas here. What the heck is the culprit...? Perhaps a
> faulty motherboard?

Hm, from my experience I would advise you to save time and headache and try
replacing everything but the ide disk at once. This is an easy and fast action
and gives you a chance to rule out any form of hardware error.

Regards,
Stephan