2004-03-05 18:13:22

by Lawrence Walton

Subject: server migration

Hi all!

I tried about four months ago to migrate a busy server to 2.6.0-test9,
and failed miserably. Lightly loaded it worked well but as the number
of users increased, the number of processes in uninterruptible sleep
increased to the hundreds and then the server fell on its face. I never
found out exactly why, or which processes were hanging; if I had to
guess, it would be openldap.

I'd like to take another shot at it with 2.6.3, and I'd also like some
hints on how better to debug the problem. Remember it is a live server
with live users, so I can't spend much time before rebooting back to a
2.4 kernel. And yes, 2.4.25 runs fine.

Things that are non-standard

Lots of open files, it's not unusual to have 50000 open files.
ext3 is mounted noatime,data=writeback on /home and /var
Total processes are usually around 300 to 350.

Main applications are:

imap, exim and openldap running on Debian.


Questions, comments, flames are welcome.



--
*--* Mail: [email protected]
*--* Voice: 425.739.4247
*--* Fax: 425.827.9577
*--* HTTP://the-penguin.otak.com/~lawrence
--------------------------------------
- - - - - - O t a k i n c . - - - - -



2004-03-05 18:22:24

by dth

Subject: Re: server migration

Lawrence Walton <[email protected]> wrote:
>I'd like to take another shot at it with 2.6.3,

Don't!

<personal experience, ymmv!>
Problems after sync, difficulties in the block layer/queuing/plugging.
Our news gateway has gone back to 2.6.0-test11 since that's the
only one that seems to survive "hard work".

2.6.4-rc1(-mm1) crashed hard on me, doing single-user stuff.
_I_ would wait a while if I were in your position.

Danny
--
 /"\                         | Dying is to be avoided because
 \ /  ASCII RIBBON CAMPAIGN  | it can ruin your whole career
  X   against HTML MAIL      |
 / \  and POSTINGS           | - Bob Hope

2004-03-06 23:33:52

by Denis Vlasenko

Subject: Re: server migration

On Friday 05 March 2004 20:13, Lawrence Walton wrote:
> Hi all!
>
> I tried about four months ago to migrate a busy server to 2.6.0-test9,
> and failed miserably. Lightly loaded it worked well but as the number
> of users increased, the number of processes in uninterruptible sleep
> increased to the hundreds and then the server fell on its face. I never
> found out exactly why, or which processes were hanging; if I had to
> guess, it would be openldap.

Why do you guess? Determine what processes are stuck.

> I'd like to take another shot at it with 2.6.3, I'd also like to get
> some hints on how better to debug the problem; remember it is a live
> server with live users, I can't spend much time before rebooting back to
> a 2.4 kernel and yes 2.4.25 runs fine.
>
> Things that are non-standard
>
> Lots of open files, it's not unusual to have 50000 open files.
> ext3 is mounted noatime,data=writeback on /home and /var
> Total processes are usually around 300 to 350.
>
> Main applications are:
>
> imap, exim and openldap running on Debian.
>
> Questions, comments, flames are welcome.

Compile with stack pointers, capture SysRq-T, post stack traces
of D processes to lkml.
--
vda
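
A minimal sketch of the "determine what processes are stuck" step, for
use before (or alongside) a SysRq-T dump. It assumes a Linux /proc where
the third field of /proc/<pid>/stat is the process state; the function
names are made up for illustration. It only names the tasks in D state;
the kernel stack traces still have to come from SysRq-T. "Stack
pointers" above presumably means frame pointers (CONFIG_FRAME_POINTER);
that reading is an assumption.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    lpar = text.index("(")
    rpar = text.rindex(")")
    pid = int(text[:lpar])
    comm = text[lpar + 1:rpar]
    state = text[rpar + 1:].split()[0]
    return pid, comm, state


def d_state_processes(proc_root="/proc"):
    """List (pid, comm) for every task currently in uninterruptible sleep."""
    found = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found  # no /proc here (not Linux, or not mounted)
    for name in entries:
        if not name.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        try:
            with open(os.path.join(proc_root, name, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry unreadable; ignore it
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

Calling d_state_processes() every few seconds while the load builds and
appending the result to a file would show which daemons pile up first,
without taking the box down.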


2004-03-07 01:34:58

by Lawrence Walton

Subject: Re: server migration

Denis Vlasenko [[email protected]] wrote:
> On Friday 05 March 2004 20:13, Lawrence Walton wrote:
> > Hi all!
> >
> > I tried about four months ago to migrate a busy server to 2.6.0-test9,
> > and failed miserably. Lightly loaded it worked well but as the number
> > of users increased, the number of processes in uninterruptible sleep
> > increased to the hundreds and then the server fell on its face. I never
> > found out exactly why, or which processes were hanging; if I had to
> > guess, it would be openldap.
>
> Why do you guess? Determine what processes are stuck.
>
Because I did not expect it to happen, and when it did I had lots of
users screaming at me to fix it now. The server had been up since the
night before. It was not until users started showing up in the morning
that the problem manifested itself.

The point is, I was hoping to get a list of things to capture in case
it happens again. Testing is all well and good, but getting
information from a production box can be valuable, as long as it's not
some odd corner case.

Capturing SysRq-T was on my list to do.
I'll investigate stack pointers and, if I can, post stack traces.

I was hoping to get pointers like below before I tried it again.

<snip>
> Compile with stack pointers, capture SysRq-T, post stack traces
> of D processes to lkml.
> --
> vda
>



2004-03-07 10:49:22

by Denis Vlasenko

Subject: Re: server migration

On Sunday 07 March 2004 03:35, Lawrence Walton wrote:
> Denis Vlasenko [[email protected]] wrote:
> > On Friday 05 March 2004 20:13, Lawrence Walton wrote:
> > > Hi all!
> > >
> > > I tried about four months ago to migrate a busy server to 2.6.0-test9,
> > > and failed miserably. Lightly loaded it worked well but as the number
> > > of users increased, the number of processes in uninterruptible sleep
> > > increased to the hundreds and then the server fell on its face. I
> > > never found out exactly why, or which processes were hanging; if I
> > > had to guess, it would be openldap.
> >
> > Why do you guess? Determine what processes are stuck.
>
> Because I did not expect it to happen, I had lots of users screaming at
> me to fix it now, when it did happen. The server had been up since the
> night before. It was not until users started showing up in the morning
> that the problem manifested itself.
>
> The point is I was hoping to get a list of things to try to capture in
> case it happened again, testing is all well and good, but getting
> information from a production box can be valuable, as long as it's not
> some odd corner case.
>
> Capturing SysRq-T was on my list to do.
> I'll investigate stack pointers and, if I can, post stack traces.

Well. That's easy. Just press SysRq-T and look into syslog.
--
vda
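
On a remote box with no console keyboard, the same task dump can
usually be requested from a root shell. A minimal sketch, assuming the
kernel was built with CONFIG_MAGIC_SYSRQ and exposes
/proc/sysrq-trigger; the function name is hypothetical, and the path is
a parameter only so the write can be tried against a scratch file first.

```python
def request_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Ask the kernel for a SysRq-T style task dump (needs root).

    Writing the single character 't' to the trigger file requests the
    same dump as pressing SysRq-T on the console. The output goes to
    the kernel log, not to this process.
    """
    with open(trigger_path, "w") as fh:
        fh.write("t")
```

The traces then appear in the kernel log (dmesg, or wherever syslog
files kernel messages). With 300+ tasks the dump is long and may
overflow the kernel log buffer, so it is worth checking that the first
tasks are still present before posting.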

2004-03-07 12:21:33

by Michael Frank

Subject: Re: server migration

>> > On Friday 05 March 2004 20:13, Lawrence Walton wrote:
>> > > Hi all!
>> > >
>> > > I tried about four months ago to migrate a busy server to 2.6.0-test9,
>> > > and failed miserably. Lightly loaded it worked well but as the number
>> > > of users increased, the number of processes in uninterruptible sleep
>> > > increased to the hundreds and then the server fell on its face. I
>> > > never found out exactly why, or which processes were hanging; if I
>> > > had to guess, it would be openldap.

-test9 was the "oddest" kernel I ever ran (since 2.2.x) - I even got it
to hard lock repeatably by loading it a bit with dd ;)

Since then, Nick Piggin has put a hell of an effort into the
anticipatory scheduler, and much else has been refined all over too.

I have done a bit of stress testing of I/O, network and CPU and,
IMO, 2.6.3 will perform nicely in a server environment without
significant problems.

Input from production use is essential though and it would be much
appreciated if you would go for it :)

Regards
Michael



2004-03-08 21:51:15

by Mike Fedyk

Subject: Re: server migration

Danny ter Haar wrote:
> Lawrence Walton <[email protected]> wrote:
>
>>I'd like to take another shot at it with 2.6.3,
>
>
> Don't!
>
> <personal experience, ymmv!>
> Problems after sync, difficulties in the blocklayer/queuing/plugging.
> Our newsgateway has gone back to 2.6.0-test11 since that's the
> only one that seems to survive "hard-work".
>
> 2.6.4-rc1(-mm1) crashed hard on me, doing single-user stuff.
> _i_ would wait a while if i were in your position.

I have everything except for my GW/Firewall running 2.6.3 + two NFS
patches and everything is working great.

Maybe you should find out which driver is giving you trouble, and help
debug that.

Did you enable the NMI watchdog?
What about sysrq, did that still respond during your "hang"?

Mike
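
Both of those questions can be checked ahead of the next attempt. A
rough sketch, assuming the usual /proc/sys/kernel/sysrq switch (0 means
disabled) and the nmi_watchdog= boot parameter on x86; the function
names are invented for illustration. Finding the parameter on the
command line only shows that the watchdog was requested, not that NMIs
are actually being delivered.

```python
def sysrq_enabled(path="/proc/sys/kernel/sysrq"):
    """Return True/False for the sysrq switch, or None if it cannot be read."""
    try:
        with open(path) as fh:
            return int(fh.read().strip()) != 0
    except (OSError, ValueError):
        return None


def nmi_watchdog_param(cmdline):
    """Return the value of nmi_watchdog= from a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("nmi_watchdog="):
            return token.split("=", 1)[1]
    return None
```

The command line of the running kernel is in /proc/cmdline; passing its
contents to nmi_watchdog_param() gives the requested mode, if any.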