2002-03-04 07:35:23

by Oliver.Schersand

Subject: Reply: Re: Kernel Hangs 2.4.16 on heavy IO Oracle and Tivoli TSM

Hi,

On Saturday I had a nice day: 16 hours spent finding a workaround to make
Linux stable.
I moved the server from reiserfs to ext2 for all datafile areas. The move
with tar ran without any crash; I got about 60 to 75 MB/s of transfer
(read + write) while moving the Oracle datafiles.

After starting Oracle and backing up the open datafiles (I know this is
nonsense, but it's a good stress test), I got a crash. On reiserfs this
would crash immediately; on ext2 the crash happened after about 2.5 hours
of backup (about 80 GB of datafiles).
After this I switched back to kernel version 2.2.19 ---> the system now runs
without crashing.
On other servers without Oracle but with TSM backup we have had no problems
with 2.4.16 (at the moment only about 15 servers).

It seems that you are right and we have a serious VM bug. This bug only
shows up if you use Oracle and TSM (Tivoli Storage Manager) .... Strange.

Kind regards

Oliver Schersand


2002-03-05 00:12:01

by Hans Reiser

Subject: Re: Reply: Re: Kernel Hangs 2.4.16 on heavy IO Oracle and Tivoli TSM

[email protected] wrote:

>It seems that you are right and we have a serious VM bug. This bug only
>shows up if you use Oracle and TSM (Tivoli Storage Manager) .... Strange.
>
Wasn't 2.4.16 the known unstable vm release of 2.4? Why do you go to
such effort to stick with a bad kernel? Go to 2.4.18.

Hans


2002-03-05 17:08:01

by Chris Mason

Subject: Re: Reply: Re: Kernel Hangs 2.4.16 on heavy IO Oracle and Tivoli TSM



On Monday, March 04, 2002 06:07:19 PM +0300 Hans Reiser <[email protected]> wrote:


> Wasn't 2.4.16 the known unstable vm release of 2.4? Why do you go to
> such effort to stick with a bad kernel? Go to 2.4.18.

I'm not sure exactly which VM problems you mean, but he's running the
SuSE 2.4.16, which is heavily patched. When you're running big production
databases, upgrading to the kernel of the week isn't an option.

I think we've found the bug; it looks like a race in the proc code.

Oliver, someone will contact you a little later with instructions on
getting a kernel with the fix. If you only see this oops during backups,
make sure you aren't trying to back up /proc.

-chris
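Chris's advice to keep /proc out of the backup can be illustrated with a minimal sketch, assuming a file-level backup that walks the directory tree itself (TSM's own include/exclude configuration is not shown here). The function name backup_candidates and the PSEUDO_FS_ROOTS tuple are hypothetical; the point is only that the /proc subtree is pruned before the walk enters it, so no per-task proc files are opened during the backup.

```python
import os

# Hypothetical list of pseudo-filesystem roots a file-level backup should skip.
PSEUDO_FS_ROOTS = ("/proc",)


def backup_candidates(root, skip=PSEUDO_FS_ROOTS):
    """Yield file paths under root, never descending into any skipped subtree."""
    skip = {os.path.normpath(p) for p in skip}
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from recursing into those
        # directories (this only works with the default topdown=True).
        dirnames[:] = [
            d for d in dirnames
            if os.path.normpath(os.path.join(dirpath, d)) not in skip
        ]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

For example, iterating over backup_candidates("/") would list files from the whole tree while never entering /proc. Pruning during the walk, rather than filtering the results afterwards, matters here: a filter applied afterwards would still have read the /proc entries.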

2002-03-07 07:13:44

by Petro

Subject: Re: Reply: Re: Kernel Hangs 2.4.16 on heavy IO Oracle and Tivoli TSM

On Mon, Mar 04, 2002 at 08:35:36AM +0100, [email protected] wrote:
> happened after about 2.5 hours of backup (about 80 GB of datafiles).
> After this I switched back to kernel version 2.2.19 ---> the system now runs
> without crashing.
> On other servers without Oracle but with TSM backup we have had no
> problems with 2.4.16 (at the moment only about 15 servers).
>
> It seems that you are right and we have a serious VM bug. This bug only
> shows up if you use Oracle and TSM (Tivoli Storage Manager) .... Strange.

Are you getting a complete OS crash, or just Oracle going bang?

--
Share and Enjoy.

2002-03-07 07:14:44

by Petro

Subject: Re: Reply: Re: Kernel Hangs 2.4.16 on heavy IO Oracle and Tivoli TSM

On Tue, Mar 05, 2002 at 12:06:43PM -0500, Chris Mason wrote:
> On Monday, March 04, 2002 06:07:19 PM +0300 Hans Reiser <[email protected]> wrote:
> > Wasn't 2.4.16 the known unstable vm release of 2.4? Why do you go to
> > such effort to stick with a bad kernel? Go to 2.4.18.
> I'm not sure exactly which VM problems you mean, but he's running the
> SuSE 2.4.16, which is heavily patched. When you're running big production
> databases, upgrading to the kernel of the week isn't an option.
> I think we've found the bug; it looks like a race in the proc code.
> Oliver, someone will contact you a little later with instructions on
> getting a kernel with the fix. If you only see this oops during backups,
> make sure you aren't trying to back up /proc.

Is this in the generic kernel, or the patches?

--
Share and Enjoy.

2002-03-08 23:05:10

by James Washer

Subject: Re: Reply: Re: Kernel Hangs 2.4.16 on heavy IO Oracle and Tivoli TSM


Chris,

I just took a look at what little information I have available on this
situation, namely the 'block-o-oops' from many ps processes.

I'm not sure I agree with you that it is a race in the proc code. Several
ps processes oopsed over a period of 58 seconds. My guess is that there
is (or was) a process out there with a corrupt p->sig (== 0x00003296).
Hence, each time the user runs ps, the new ps trips over the same corrupt
task.

What really confuses me is what any of this has to do with the original
complaint about the system hanging. Has that behaviour gone away?

- jim
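Jim's argument rests on how ps gathers its data: every run lists the numeric /proc/<pid> directories and reads each task's files, so a single task with corrupt state is visited again by every new ps. The sketch below is a hypothetical user-space illustration of that enumeration pattern, not procps's actual code; in the failing case the oops happens inside the kernel while it generates the file contents, which a user-space exception handler cannot catch. The function name list_task_stat and its proc_root parameter are assumptions made so the function can be pointed at a test directory.

```python
import os


def list_task_stat(proc_root="/proc"):
    """Read the stat file of every numeric <proc_root>/<pid> entry, as ps does."""
    stats = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-task entries such as "meminfo" or "self"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stats[int(entry)] = f.read()
        except OSError:
            continue  # the task exited between listdir() and open()
    return stats
```

Because the loop visits every task on every run, a bad entry is not a one-off: it is hit again each time the tool is started, which fits a burst of oopses spread over many seconds.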

Chris Mason <[email protected]>@vger.kernel.org on 03/05/2002 09:06:43 AM

Sent by: [email protected]


To: Hans Reiser <[email protected]>,
[email protected]
cc: Alessandro Suardi <[email protected]>,
[email protected], [email protected],
[email protected]
Subject: Re: Reply: Re: Kernel Hangs 2.4.16 on heavy IO Oracle and
Tivoli TSM

2002-03-11 08:14:44

by Oliver.Schersand

Subject: Re: Reply: Re: Kernel Hangs 2.4.16 on heavy IO Oracle and Tivoli TSM

Hi,

I have switched to kernel 2.2.19, but after about 5 days (on Friday) I had
the same hang on the system. This leads me to the diagnosis that we may have
a hardware problem, or a problem in the Compaq Smart Array or Compaq 5300
RAID Array controller driver. On 2.2.19 I have cpqarray 1.0.12 and cciss
1.0.4; on the 2.4.16 kernel I have cpqarray 2.4.5 and cciss 2.4.6.

Kind regards

Oliver Schersand

---------------------- Forwarded by Oliver Schersand/BCS/BASF on
11.03.2002 08:43 ---------------------------


"James Washer" <[email protected]> on 09.03.2002 00:07:07

To: Chris Mason <[email protected]>
Cc: Hans Reiser <[email protected]>, Oliver Schersand/BCS/BASF@EUROPE,
Alessandro Suardi <[email protected]>,
[email protected], [email protected],
[email protected]
Subject: Re: Reply: Re: Kernel Hangs 2.4.16 on heavy IO Oracle and Tivoli
TSM