2002-06-18 19:44:31

by Andrew Morton

Subject: Re: [BUG] 2.4 VM sucks. Again

Roy Sigurd Karlsbakk wrote:
>
> > > > Any plans to merge this into the main kernel, giving a choice
> > > > (in config or /proc) to enable this?
> > >
> > > I don't think Andrew is ready to submit this yet ... before anything
> > > gets merged back, it'd be very worthwhile testing the relative
> > > performance of both solutions ... the more testers we have the
> > > better ;-)
> >
> > Cripes no. It's pretty experimental. Andrea spotted a bug, too. Fixed
> > version is below.
>
> Any more plans?
> The patch has been working great for some time now, and I'd really like to see
> this in the official tree

Roy, all we know is that "nuke-buffers stops your machine from locking up".
But we don't know why your machine locks up in the first place. This just
isn't sufficient grounds to apply it! We need to know exactly why your
kernel is failing. We don't know what the bug is.

You have two gigabytes of RAM, yes? It's very weird that stripping buffers
prevents a lockup on a machine with such a small highmem/lowmem ratio.

I'll have yet another shot at reproducing it. So, again, could you please
tell me *exactly*, in great detail, what I need to do to reproduce this
problem?

- memory size
- number of CPUs
- IO system
- kernel version, any applied patches, compiler version
- exact sequence of commands
- anything else you can think of

Have you been able to reproduce the failure on any other machine?

> Also - I guess this patch will eliminate any
> caching whatsoever, and therefore not really a good thing for file or web
> servers?

No, not at all. All the pagecache is still there - the patch just
throws away the buffer_heads which are attached to those pagecache
pages.
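
To get a rough idea of how much memory the buffer_heads themselves occupy,
one can look at the buffer_head slab cache in /proc/slabinfo. The sketch
below is not part of the patch or of this thread. It assumes the leading
slabinfo columns are name, active objects, total objects and object size,
and the helper name buffer_head_bytes is hypothetical. It ignores per-slab
overhead, so the result is only a lower bound.

```python
def buffer_head_bytes(slabinfo_text):
    """Return total_objects * object_size for the buffer_head slab cache.

    Assumes each data line starts with:
        name  active_objs  num_objs  objsize  ...
    Returns None if no buffer_head line is found.
    """
    for line in slabinfo_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "buffer_head":
            num_objs = int(fields[2])   # total allocated objects
            objsize = int(fields[3])    # bytes per object
            return num_objs * objsize
    return None


# Usage on a live system (reading /proc/slabinfo may require root):
#
#     with open("/proc/slabinfo") as f:
#         print(buffer_head_bytes(f.read()))
```

Sampling this value while the load is running would show whether the
buffer_head cache grows along with the pagecache.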

The 2.5 kernel does it tons better. Have you tried it?



2002-06-19 11:26:55

by Roy Sigurd Karlsbakk

Subject: Re: [BUG] 2.4 VM sucks. Again

> Roy, all we know is that "nuke-buffers stops your machine from locking up".
> But we don't know why your machine locks up in the first place. This just
> isn't sufficient grounds to apply it! We need to know exactly why your
> kernel is failing. We don't know what the bug is.

The bug, as previously described, occurs when multiple (20+) clients download
large files (3-6 GB each) at a speed of ~5Mbps each. The error does _not_ occur
when fewer clients are downloading at speeds close to disk speed.
All testing is being done on gigE crossover.

> You have two gigabytes of RAM, yes? It's very weird that stripping buffers
> prevents a lockup on a machine with such a small highmem/lowmem ratio.

No. I have 1GB, with highmem disabled, giving me ~900MB usable.

> I'll have yet another shot at reproducing it. So, again, could you please
> tell me *exactly*, in great detail, what I need to do to reproduce this
> problem?

> - memory size

1GB, highmem disabled (~900MB usable)

> - number of CPUs

1 Athlon 1133MHz, 256kB cache

> - IO system

standard 33MHz/32-bit single peer PCI motherboard (SiS based)
on-board SiS IDE/ATA 100 controller
Promise 20269 controller
Realtek 100Mbps NIC
e1000 gigE NIC
4 IBM 40gig 120GXP drives, one on each IDE channel
data partition on RAID-0 across all drives

> - kernel version, any applied patches, compiler version

kernel 2.4.19-pre8 + tux + akpm buffer patch
I have tried _many_ different kernels, and as I needed the 20269 support, I
chose 2.4.19-pre. Tux is there because I did some testing with it. The problem
is _not_ Tux specific, as I've tried with other server software (custom or
standard) as well.
gcc 2.95.3

> - exact sequence of commands

start http server software
start 20+ downloads, each at ~5Mbps; each downloaded file is 3-6 gigs
after some time most processes are killed by the OOM killer
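
The client side of this load can be approximated with a small script. This
is a sketch under assumptions, not the tool used in this report: the
function names, the 64kB chunk size and the URL in the usage note are
hypothetical, and throttling is done by pacing reads on the client, which
relies on TCP backpressure to slow the server down.

```python
import threading
import time
import urllib.request

RATE_BITS_PER_SEC = 5_000_000  # ~5Mbps per client, as in the report
NUM_CLIENTS = 20               # "20+" clients
CHUNK = 64 * 1024              # read size in bytes (arbitrary choice)


def throttled_read(src, rate_bps, chunk=CHUNK,
                   clock=time.monotonic, sleep=time.sleep):
    """Read src to EOF, discarding the data, paced to about rate_bps.

    Returns the number of bytes read.
    """
    bytes_per_sec = rate_bps / 8
    start = clock()
    total = 0
    while True:
        data = src.read(chunk)
        if not data:
            return total
        total += len(data)
        # Time by which `total` bytes should have arrived at the target rate.
        due = start + total / bytes_per_sec
        delay = due - clock()
        if delay > 0:
            sleep(delay)


def run_clients(url, n=NUM_CLIENTS, rate_bps=RATE_BITS_PER_SEC):
    """Start n concurrent throttled downloads of url and wait for them."""
    def worker():
        with urllib.request.urlopen(url) as resp:
            throttled_read(resp, rate_bps)

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Usage would be something like run_clients("http://server/bigfile"), with
the URL pointing at one of the 3-6 gig files. At 5Mbps a 3 gig file takes
roughly 80 minutes, so 20 clients keep about 100Mbps of slow, long-lived
streams going for over an hour.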

> - anything else you can think of

I have not tried to give it coffee yet, although that might help. I'm usually
pretty pissed if I haven't got my morning coffee

> Have you been able to reproduce the failure on any other machine?

yes. I have set up one other machine with the exact same setup and one with
a slightly different setup, and reproduced it on both.

> No, not at all. All the pagecache is still there - the patch just
> throws away the buffer_heads which are attached to those pagecache
> pages.

oh. that's good.

> The 2.5 kernel does it tons better. Have you tried it?

I haven't. I've tried to compile it a few times, but it has failed. And I
don't want to run 2.5 on a production server.

But - If you ask me to test it, I will

thanks for all help

roy

--
Roy Sigurd Karlsbakk, Datavaktmester

Computers are like air conditioners.
They stop working when you open Windows.