2003-01-08 00:33:08

by Brian Tinsley

Subject: long stalls

We have been having terrible problems with long stalls, meaning from a
couple of minutes to an hour, happening when filesystem I/O load gets
high. The system time as reported by vmstat or sar will increase up to
99% and as it spreads to each processor, the system becomes completely
unresponsive (except that it responds to pings just fine -
interesting!). When the system finally returns to the world of the
living, the only evidence that something bad has happened is the runtime
for kswapd is abnormally high. I have seen this happen with the stock
2.4.17, 2.4.19, and 2.4.20 kernels on SMP PIII and PIV machines (either
4GB or 8GB RAM, all SCSI disks, dual GigE NICs). I've searched the lkml
archives and google and have found several similar postings, but there
is never an explanation or resolution. Any help would be *very* much
appreciated! If any info from the system in question is desired, I will
be glad to provide it.
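
For anyone trying to capture the symptom as it develops, the system-time share that vmstat and sar report can be approximated per CPU from two snapshots of /proc/stat. The following is only a minimal sketch, not something taken from the affected systems: the function names are made up for illustration, and it assumes the usual "cpuN user nice system idle ..." field order.

```python
def parse_cpu_lines(stat_text):
    """Map each 'cpu'/'cpuN' line of /proc/stat text to its tick counters.

    Only the first eight counters (user .. steal) are kept; older kernels
    export just four (user, nice, system, idle), which also works.
    """
    counters = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            counters[fields[0]] = [int(v) for v in fields[1:9]]
    return counters


def system_share(before_text, after_text):
    """Return {cpu_name: percent of elapsed ticks spent in system mode}."""
    before = parse_cpu_lines(before_text)
    after = parse_cpu_lines(after_text)
    shares = {}
    for name, new in after.items():
        old = before.get(name)
        if old is None:
            continue  # CPU not present in the first snapshot
        deltas = [n - o for o, n in zip(old, new)]
        total = sum(deltas)
        # Index 2 is the "system" counter in the assumed field order.
        shares[name] = 100.0 * deltas[2] / total if total > 0 else 0.0
    return shares
```

To use it, read /proc/stat twice a few seconds apart and pass both strings to system_share(); a value near 100 on every "cpuN" entry would match the behaviour described above.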



--

-[========================]-
-[ Brian Tinsley ]-
-[ Chief Systems Engineer ]-
-[ Emageon ]-
-[========================]-



2003-01-08 01:49:14

by Russell Leighton

Subject: Re: long stalls


I can't help, but I can echo a "me too".

We only see it when we have 2 file I/O intensive processes...they both
will just stop for a few seconds, system seems idle...then
they just start again. RH7.3 SMP, Dual PIII, 4GB RAM, 3com RAID Controller.

Brian Tinsley wrote:

> We have been having terrible problems with long stalls, meaning from a
> couple of minutes to an hour, happening when filesystem I/O load gets
> high. The system time as reported by vmstat or sar will increase up to
> 99% and as it spreads to each processor, the system becomes completely
> unresponsive (except that it responds to pings just fine -
> interesting!). When the system finally returns to the world of the
> living, the only evidence that something bad has happened is the
> runtime for kswapd is abnormally high. I have seen this happen with
> the stock 2.4.17, 2.4.19, and 2.4.20 kernels on SMP PIII and PIV
> machines (either 4GB or 8GB RAM, all SCSI disks, dual GigE NICs). I've
> searched the lkml archives and google and have found several similar
> postings, but there is never an explanation or resolution. Any help
> would be *very* much appreciated! If any info from the system in
> question is desired, I will be glad to provide it.
>
>
>


2003-01-08 02:07:41

by Brian Tinsley

Subject: Re: long stalls

Out of curiosity, which RH kernel are you using? I moved on to 2.4.19
and 2.4.20 primarily because the RH 2.4.18 series of kernels apparently
has a scheduler bug (at least one) that causes the heartbeat software
from Linux-HA to lose heartbeat signals and fail over. Not a good
scenario when you are trying to provide HA systems to hospitals!


Russell Leighton wrote:

>
> I can't help, but I can echo a "me too".
>
> We only see it when we have 2 file I/O intensive processes...they both
> will just stop for a few seconds, system seems idle...then
> they just start again. RH7.3 SMP, Dual PIII, 4GB RAM, 3com RAID
> Controller.




2003-01-08 02:39:08

by Brian Tinsley

Subject: Re: long stalls

Thanks for the reply!

I thought highmem I/O was addressed in 2.4.20? Am I off-base here?

I actually just built a 2.4.20 kernel with highmem debugging turned on.
We'll see if anything pops up.


Brian Gerst wrote:

> With 4GB of memory you are likely bouncing I/O requests to low memory.
> This has been fixed in 2.5. I do not know if a backport exists for 2.4.
>
> --
> Brian Gerst





2003-01-08 02:33:09

by Brian Gerst

Subject: Re: long stalls

Brian Tinsley wrote:

> We have been having terrible problems with long stalls, meaning from a
> couple of minutes to an hour, happening when filesystem I/O load gets
> high. The system time as reported by vmstat or sar will increase up to
> 99% and as it spreads to each processor, the system becomes completely
> unresponsive (except that it responds to pings just fine -
> interesting!). When the system finally returns to the world of the
> living, the only evidence that something bad has happened is the runtime
> for kswapd is abnormally high. I have seen this happen with the stock
> 2.4.17, 2.4.19, and 2.4.20 kernels on SMP PIII and PIV machines (either
> 4GB or 8GB RAM, all SCSI disks, dual GigE NICs). I've searched the lkml
> archives and google and have found several similar postings, but there
> is never an explanation or resolution. Any help would be *very* much
> appreciated! If any info from the system in question is desired, I will
> be glad to provide it.
>
>
>
With 4GB of memory you are likely bouncing I/O requests to low memory.
This has been fixed in 2.5. I do not know if a backport exists for 2.4.
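
One way to see how much of a machine's RAM sits in high memory, where I/O may need bouncing, is to read HighTotal and LowTotal from /proc/meminfo on a 32-bit highmem kernel. The sketch below is illustrative only: the helper names are hypothetical, it reports how memory is split rather than whether bounce buffers are actually in use, and it returns None on kernels that do not export those two fields.

```python
def parse_meminfo(text):
    """Return {field: first integer after the colon} for /proc/meminfo text.

    Lines that do not look like 'Name:   123 kB' (for example the legacy
    column header printed by 2.4 kernels) are skipped.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        try:
            values[name.strip()] = int(parts[0])
        except ValueError:
            continue
    return values


def highmem_fraction(text):
    """Fraction of RAM reported as high memory, or None if not reported."""
    info = parse_meminfo(text)
    high = info.get("HighTotal")
    low = info.get("LowTotal")
    if high is None or low is None or high + low == 0:
        return None
    return high / (high + low)
```

Calling highmem_fraction(open("/proc/meminfo").read()) on a box like the ones described in this thread would show what share of memory lies above the low-memory zone.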

--
Brian Gerst

2003-01-08 03:58:12

by Russell Leighton

Subject: Re: long stalls


Minor correction: 3ware RAID controller.

Russell Leighton wrote:

>
> I can't help, but I can echo a "me too".
>
> We only see it when we have 2 file I/O intensive processes...they both
> will just stop for a few seconds, system seems idle...then
> they just start again. RH7.3 SMP, Dual PIII, 4GB RAM, 3com RAID
> Controller.


2003-01-08 04:04:48

by Russell Leighton

Subject: Re: long stalls


I am pretty sure we are at 2.4.19.

Brian Tinsley wrote:

> Out of curiosity, which RH kernel are you using? I moved on to 2.4.19
> and 2.4.20 primarily because the RH 2.4.18 series of kernels
> apparently has a scheduler bug (at least one) that causes the
> heartbeat software from Linux-HA to lose heartbeat signals and
> fail over. Not a good scenario when you are trying to provide HA
> systems to hospitals!


2003-01-08 15:03:32

by Juergen Sawinski

Subject: Re: long stalls

On Wed, 2003-01-08 at 02:51, Russell Leighton wrote:
>
> I can't help, but I can echo a "me too".
>
> We only see it when we have 2 file I/O intensive processes...they both
> will just stop for a few seconds, system seems idle...then
> they just start again. RH7.3 SMP, Dual PIII, 4GB RAM, 3com RAID Controller.

Same thing here with a Promise SX6000 RAID controller (P4, 1GB RAM,
system is completely on RAID, 2.4.20-pre10-ac1). But this does not seem
to be related. At least in my case, it's the controller that causes the
stalls, because only processes depending on file I/O (including swap)
get into D state. Everything else just runs fine.
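
To check which processes are stuck in D (uninterruptible sleep) state during a stall, the state field of each /proc/<pid>/stat file can be inspected. This is a rough sketch under the assumption that the stat line has the usual "pid (comm) state ..." layout; the function names are invented here and are not part of any existing tool.

```python
import os


def parse_stat_line(line):
    """Split one /proc/<pid>/stat line into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so the last ')' marks where it ends.
    """
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible(stat_lines):
    """Return (pid, comm) for every process whose state is 'D'."""
    parsed = (parse_stat_line(line) for line in stat_lines)
    return [(pid, comm) for pid, comm, state in parsed if state == "D"]


def read_stat_lines(proc_root="/proc"):
    """Collect the stat line of every numeric entry under proc_root."""
    lines = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                lines.append(fh.read())
        except OSError:
            continue  # the process exited between listdir() and open()
    return lines
```

Running uninterruptible(read_stat_lines()) while the stall is in progress would list the blocked processes; if only I/O-bound ones show up, that points toward the storage path, as suggested above.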

George

--
Juergen "George" Sawinski | Phone: +49-6221-486-308
Max-Planck Institute for Medical Research | Fax: +49-6221-486-325
Dept. of Biomedical Optics | Mobile: +49-171-532 5302
Jahnstr. 29 |
D-69120 Heidelberg |
Germany |

GPG Key/Fingerprint: 9A5F7A31/86F2E5D5EDF4D9983BDD3F23986F154F9A5F7A31