2008-06-10 17:16:04

by J. Bruce Fields

Subject: Re: Problems with large number of clients and reads

On Tue, Jun 10, 2008 at 09:30:18AM -0500, Weathers, Norman R. wrote:
> Unfortunately, I cannot stop the clients (middle of long running
> jobs). I might be able to test this soon. If I have the number of
> threads high, yes I can reduce the number of threads and it appears to
> lower some of the memory, but even with as few as three threads,
> the memory usage climbs very high, just not as high as if there are
> say 8 threads. When the memory usage climbs high, it can cause the
> box to not respond over the network (ssh, rsh), and even be very
> sluggish when I am connected over our serial console to the server(s).
> This same scenario has been happening with kernels that I have tried
> from 2.6.22.x on to the 2.6.25 series. The 2.6.25 series is
> interesting in that I can push the same load from a box with the
> 2.6.25 kernel and not have a load over .3 (with 3 threads), but with
> the 2.6.22.x kernel, I have a load of over 3 when I hit the same
> conditions.

OK, I think what we want to do is turn on CONFIG_DEBUG_SLAB_LEAK. I've
never used it before, but it looks like it will report which functions
are allocating from each slab cache, which may be exactly what we need
to know. So:

1. Install a kernel with both CONFIG_DEBUG_SLAB ("Debug slab
memory allocations") and CONFIG_DEBUG_SLAB_LEAK ("Memory leak
debugging") turned on. They're both under the "kernel hacking"
section of the kernel config. (If you have a file
/proc/slab_allocators, then you already have these turned on and
you can skip this step.)

2. Do whatever you need to do to reproduce the problem.

3. Get a copy of /proc/slabinfo and /proc/slab_allocators.

Then we can take a look at that and see if it sheds any light.
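
Steps 1 and 3 above could be scripted roughly as follows. This is only a sketch: the helper names (leak_debugging_enabled, snapshot), the label argument, and the output file naming are illustrative assumptions, and reading /proc/slabinfo may require root.

```python
from pathlib import Path

# The two proc files named in step 3.
PROC_NAMES = ("slabinfo", "slab_allocators")


def leak_debugging_enabled(proc_root="/proc"):
    """Step 1 shortcut: slab_allocators only exists when
    CONFIG_DEBUG_SLAB_LEAK is turned on."""
    return (Path(proc_root) / "slab_allocators").exists()


def snapshot(label, out_dir=".", proc_root="/proc"):
    """Step 3: copy both proc files to <name>_<label>.txt in out_dir.

    The files are read and rewritten as text because proc files
    report a size of zero, which can confuse some copy helpers.
    """
    saved = []
    for name in PROC_NAMES:
        src = Path(proc_root) / name
        dst = Path(out_dir) / f"{name}_{label}.txt"
        dst.write_text(src.read_text())
        saved.append(dst)
    return saved


# Hypothetical usage, once the problem has been reproduced (step 2):
#     if leak_debugging_enabled():
#         snapshot("bad1")
```

Taking one snapshot before the load starts and another while memory is high makes it easier to compare the two afterwards.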

I think that debugging will hurt the server performance, so you won't
want to keep it turned on all the time.

>
> Also, this is all with the SLAB cache option. SLUB crashes every time
> I use it under heavy load.

Have you reported the SLUB bugs to lkml?

--b.


2008-06-10 22:12:38

by Weathers, Norman R.

Subject: RE: Problems with large number of clients and reads



> -----Original Message-----
> From: J. Bruce Fields [mailto:[email protected]]
> Sent: Tuesday, June 10, 2008 12:16 PM
> To: Weathers, Norman R.
> Cc: [email protected]
> Subject: Re: Problems with large number of clients and reads
>
> On Tue, Jun 10, 2008 at 09:30:18AM -0500, Weathers, Norman R. wrote:
> > Unfortunately, I cannot stop the clients (middle of long running
> > jobs). I might be able to test this soon. If I have the number of
> > threads high, yes I can reduce the number of threads and it appears to
> > lower some of the memory, but even with as few as three threads,
> > the memory usage climbs very high, just not as high as if there are
> > say 8 threads. When the memory usage climbs high, it can cause the
> > box to not respond over the network (ssh, rsh), and even be very
> > sluggish when I am connected over our serial console to the server(s).
> > This same scenario has been happening with kernels that I have tried
> > from 2.6.22.x on to the 2.6.25 series. The 2.6.25 series is
> > interesting in that I can push the same load from a box with the
> > 2.6.25 kernel and not have a load over .3 (with 3 threads), but with
> > the 2.6.22.x kernel, I have a load of over 3 when I hit the same
> > conditions.
>
> OK, I think what we want to do is turn on CONFIG_DEBUG_SLAB_LEAK. I've
> never used it before, but it looks like it will report which functions
> are allocating from each slab cache, which may be exactly what we need
> to know. So:
>
> 1. Install a kernel with both CONFIG_DEBUG_SLAB ("Debug slab
> memory allocations") and CONFIG_DEBUG_SLAB_LEAK ("Memory leak
> debugging") turned on. They're both under the "kernel hacking"
> section of the kernel config. (If you have a file
> /proc/slab_allocators, then you already have these turned on and
> you can skip this step.)
>
> 2. Do whatever you need to do to reproduce the problem.
>
> 3. Get a copy of /proc/slabinfo and /proc/slab_allocators.
>
> Then we can take a look at that and see if it sheds any light.


I have taken several snapshots of /proc/slab_allocators and
/proc/slabinfo as requested. Since there is a lot of data in them, and
I didn't think anyone wanted to go cross-eyed reading it in an email, I
have put them up on a website:

http://shashi-weathers.net/linux/cluster/NFS/

The order of data collection is:

slab_allocators_bad1.txt and corresponding slabinfo
slab_allocators_after_bad1.txt and corresponding slabinfo
slab_allocators_16_threads.txt and corresponding slabinfo
slab_allocators_16_threads_1.txt and corresponding slabinfo
slab_allocators_32_threads.txt and corresponding slabinfo
slab_allocators_really_bad.txt and corresponding slabinfo.


You will have to forgive my ignorance at this point, but I was looking
through the slabinfo and slab_allocators, and noticed that size-4096
does not show up in slab_allocators... I hope that is by design. You
can see it growing into the gigabytes in the slabinfo files....
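
A rough way to rank the caches in one of those slabinfo snapshots by approximate memory use is sketched below. It assumes the usual slabinfo 2.1 column order (name, active_objs, num_objs, objsize, ...) and estimates each cache as num_objs * objsize, which ignores per-slab overhead; the function names are illustrative only.

```python
def cache_sizes(slabinfo_text):
    """Map cache name -> approximate bytes (num_objs * objsize).

    Assumes the slabinfo 2.1 layout, where the first four
    whitespace-separated fields of each data line are:
    name, active_objs, num_objs, objsize.
    """
    sizes = {}
    for line in slabinfo_text.splitlines():
        # Skip the "slabinfo - version" banner and the "# name ..." header.
        if line.startswith("slabinfo") or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        name = fields[0]
        num_objs = int(fields[2])
        objsize = int(fields[3])
        sizes[name] = num_objs * objsize
    return sizes


def top_caches(slabinfo_text, n=5):
    """Return the n largest caches as (name, bytes), biggest first."""
    sizes = cache_sizes(slabinfo_text)
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical usage against a saved snapshot file:
#     with open("slabinfo_bad1.txt") as f:
#         for name, nbytes in top_caches(f.read()):
#             print(f"{name:24s} {nbytes / 2**20:10.1f} MiB")
```

Running this over successive snapshots would show whether size-4096 is the only cache growing or just the largest one.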



>
> I think that debugging will hurt the server performance, so you won't
> want to keep it turned on all the time.
>
> >
> > Also, this is all with the SLAB cache option. SLUB crashes every time
> > I use it under heavy load.
>
> Have you reported the SLUB bugs to lkml?

No, I haven't yet. I wasn't sure whether I was doing something wrong or
whether SLUB was the problem there. Since the failures I had gone back
to using SLAB anyway, but I probably should report them.

>
> --b.
>


Norman Weathers