2008-08-01 18:36:22

by Christoph Lameter

Subject: [patch 00/19] Slab Fragmentation Reduction V13

V12->V13:
- Rebase onto Linux 2.6.27-rc1 (deal with page flags conversion, ctor parameters etc)
- Fix uninitialized variable issue

Slab fragmentation is mainly an issue if Linux is used as a fileserver
and large amounts of dentries, inodes and buffer heads accumulate. In some
load situations the slabs become very sparsely populated so that a lot of
memory is wasted by slabs that only contain one or a few objects. In
extreme cases the performance of a machine will become sluggish since
we are continually running reclaim without much success.
Slab defragmentation adds the capability to recover the memory that
is wasted.

Memory reclaim for the following slab caches is possible:

1. dentry cache
2. inode cache (with a generic interface to allow easy setup of more
filesystems than the currently supported ext2/3/4, reiserfs, XFS
and proc)
3. buffer_heads

One typical mechanism that triggers slab defragmentation on my systems
is the daily run of

updatedb

Updatedb scans all files on the system which causes a high inode and dentry
use. After updatedb is complete we need to go back to the regular use
patterns (typical on my machine: kernel compiles). Those need the memory now
for different purposes. The inodes and dentries used for updatedb will
gradually be aged by the dentry/inode reclaim algorithm, which will free
up the dentries and inodes randomly throughout the slabs that were
allocated. As a result the slabs will become sparsely populated. If they
become empty then they can be freed, but a lot of them will remain sparsely
populated. That is where slab defrag comes in: it removes the objects from
the slabs with just a few entries, reclaiming more memory for other uses.
In the simplest case (as provided here) this is done by simply reclaiming
the objects.

However, if the logic in the kick() function is made more
sophisticated then we will be able to move the objects out of the slabs.
If slabs are fragmented, objects can be allocated without involving the
page allocator because a large number of free slots is available. Moving
an object into such a slab will reduce fragmentation in the slab it is moved to.
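
As an illustration of the kick() function mentioned above and its companion
get() (the "2 functions that now take arrays of objects" from the V1->V2
notes below), here is a rough sketch of what a cache could register. The
registration call and the helper names are assumptions made for this sketch,
not necessarily the exact interface of the series:

/*
 * Hypothetical sketch of the two per-cache defrag callbacks.  get()
 * pins the objects that were picked out of a sparsely populated slab,
 * kick() then tries to get rid of them.  my_object_get()/my_object_put()
 * and kmem_cache_setup_defrag() are assumed names, not the real API.
 */
static void *my_cache_get(struct kmem_cache *s, int nr, void **objects)
{
        int i;

        for (i = 0; i < nr; i++)
                my_object_get(objects[i]);      /* pin so the object cannot vanish */
        return NULL;                            /* no private state needed here */
}

static void my_cache_kick(struct kmem_cache *s, int nr, void **objects,
                          void *private)
{
        int i;

        /*
         * In the simplest case (as in this series) kick() just drops the
         * objects; a more sophisticated kick() could reallocate and move
         * them into other, partially filled slabs instead.
         */
        for (i = 0; i < nr; i++)
                my_object_put(objects[i]);
}

/* Registered once at cache creation time (function name assumed): */
/* kmem_cache_setup_defrag(my_cache, my_cache_get, my_cache_kick); */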

V11->V12:
- Pekka and I fixed various minor issues pointed out by Andrew.
- Split ext2/3/4 defrag support patches.
- Add more documentation
- Revise the way that slab defrag is triggered from reclaim. No longer
use a timeout but track the amount of slab reclaim done by the shrinkers.
Add a field in /proc/sys/vm/slab_defrag_limit to control the threshold.
- Display current slab_defrag_counters in /proc/zoneinfo (for a zone) and
/proc/sys/vm/slab_defrag_count (for global reclaim).
- Add new config value slab_defrag_limit to /proc/sys/vm/slab_defrag_limit
- Add a patch that obsoletes SLAB and explains why SLOB does not support
defrag (Either of those could be theoretically equipped to support
slab defrag in some way but it seems that Andrew/Linus want to reduce
the number of slab allocators).

V10->V11
- Simplify determination when to reclaim: Just scan over all partials
and check if they are sparsely populated.
- Add support for performance counters
- Rediff on top of current slab-mm.
- Reduce frequency of scanning. A look at the stats showed that we
were calling into reclaim very frequently when the system was under
memory pressure which slowed things down. Various measures to
avoid scanning the partial list too frequently were added and the
earlier (expensive) method of determining the defrag ratio of the slab
cache as a whole was dropped. I think this addresses the issues that
Mel saw with V10.

V9->V10
- Rediff against upstream

V8->V9
- Rediff against 2.6.24-rc6-mm1

V7->V8
- Rediff against 2.6.24-rc3-mm2

V6->V7
- Rediff against 2.6.24-rc2-mm1
- Remove lumpy reclaim support. No point anymore given that the antifrag
handling in 2.6.24-rc2 puts reclaimable slabs into different sections.
Targeted reclaim never triggers. This has to wait until we make
slabs movable or we need to perform a special version of lumpy reclaim
in SLUB while we scan the partial lists for slabs to kick out.
Removal simplifies handling significantly since we
get to slabs in a more controlled way via the partial lists.
The patchset now provides pure reduction of fragmentation levels.
- SLAB/SLOB: Provide inlines that do nothing
- Fix various smaller issues that were brought up during review of V6.

V5->V6
- Rediff against 2.6.24-rc2 + mm slub patches.
- Add reviewed by lines.
- Take out the experimental code to make slab pages movable. That
has to wait until this has been considered by Mel.

V4->V5:
- Support lumpy reclaim for slabs
- Support reclaim via slab_shrink()
- Add constructors to ensure a consistent object state at all times.

V3->V4:
- Optimize scan for slabs that need defragmentation
- Add /sys/slab/*/defrag_ratio to allow setting defrag limits
per slab.
- Add support for buffer heads.
- Describe how the cleanup after the daily updatedb can be
improved by slab defragmentation.

V2->V3
- Support directory reclaim
- Add infrastructure to trigger defragmentation after slab shrinking if we
have slabs with a high degree of fragmentation.

V1->V2
- Clean up control flow using a state variable. Simplify API. Back to 2
functions that now take arrays of objects.
- Inode defrag support for a set of filesystems
- Fix up dentry defrag support to work on negative dentries by adding
a new dentry flag that indicates that a dentry is not in the process
of being freed or allocated.

--


2008-08-03 01:58:58

by Matthew Wilcox

Subject: No, really, stop trying to delete slab until you've finished making slub perform as well

On Fri, May 09, 2008 at 07:21:01PM -0700, Christoph Lameter wrote:
> - Add a patch that obsoletes SLAB and explains why SLOB does not support
> defrag (Either of those could be theoretically equipped to support
> slab defrag in some way but it seems that Andrew/Linus want to reduce
> the number of slab allocators).

Do we have to once again explain that slab still outperforms slub on at
least one important benchmark? I hope Nick Piggin finds time to finish
tuning slqb; it already outperforms slub.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-08-03 21:29:19

by Pekka Enberg

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

Hi Matthew,

Matthew Wilcox wrote:
> Do we have to once again explain that slab still outperforms slub on at
> least one important benchmark? I hope Nick Piggin finds time to finish
> tuning slqb; it already outperforms slub.

No, you don't have to. I haven't merged that patch nor do I intend to do
so until the regressions are fixed.

And yes, I'm still waiting to hear from you how we're now doing with
higher order page allocations...

Pekka

2008-08-04 02:39:20

by Rene Herman

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

On 03-08-08 23:25, Pekka Enberg wrote:

> Matthew Wilcox wrote:

>> Do we have to once again explain that slab still outperforms slub on at
>> least one important benchmark? I hope Nick Piggin finds time to finish
>> tuning slqb; it already outperforms slub.
>
> No, you don't have to. I haven't merged that patch nor do I intend to do
> so until the regressions are fixed.
>
> And yes, I'm still waiting to hear from you how we're now doing with
> higher order page allocations...

General interest question -- I recently "accidentally" read some of
slub and I believe that it doesn't feature the cache colouring support
that slab did? Is that true, and if so, wasn't it needed/useful?

Rene.

2008-08-04 13:44:57

by Christoph Lameter

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

Matthew Wilcox wrote:
> On Fri, May 09, 2008 at 07:21:01PM -0700, Christoph Lameter wrote:
>> - Add a patch that obsoletes SLAB and explains why SLOB does not support
>> defrag (Either of those could be theoretically equipped to support
>> slab defrag in some way but it seems that Andrew/Linus want to reduce
>> the number of slab allocators).
>
> Do we have to once again explain that slab still outperforms slub on at
> least one important benchmark? I hope Nick Piggin finds time to finish
> tuning slqb; it already outperforms slub.
>

Uhh. I forgot to delete that statement. I did not include the patch in the series.

We have a fundamental design issue there. Queuing on free can result in
better performance as in SLAB. However, it limits concurrency (per node lock
taking) and causes latency spikes due to queue processing (f.e. one test load
had 118.65 vs. 34 usecs just by switching to SLUB).

Could you address the performance issues in different ways? F.e. try to free
when the object is hot or free from multiple processors? SLAB has to take the
list_lock rather frequently under high concurrent loads (depends on queue
size). That will not occur with SLUB. So you actually can free (and allocate)
concurrently with high performance.

2008-08-04 14:48:44

by Jamie Lokier

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

Christoph Lameter wrote:
> Matthew Wilcox wrote:
> > On Fri, May 09, 2008 at 07:21:01PM -0700, Christoph Lameter wrote:
> >> - Add a patch that obsoletes SLAB and explains why SLOB does not support
> >> defrag (Either of those could be theoretically equipped to support
> >> slab defrag in some way but it seems that Andrew/Linus want to reduce
> >> the number of slab allocators).
> >
> > Do we have to once again explain that slab still outperforms slub on at
> > least one important benchmark? I hope Nick Piggin finds time to finish
> > tuning slqb; it already outperforms slub.
> >
>
> Uhh. I forgot to delete that statement. I did not include the patch
> in the series.
>
> We have a fundamental design issue there. Queuing on free can result in
> better performance as in SLAB. However, it limits concurrency (per node lock
> taking) and causes latency spikes due to queue processing (f.e. one test load
> had 118.65 vs. 34 usecs just by switching to SLUB).

Vaguely on this topic, has anyone studied the effects of SLAB/SLUB
etc. on MMUless systems?

-- Jamie

2008-08-04 15:13:37

by Rik van Riel

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

On Mon, 04 Aug 2008 08:43:21 -0500
Christoph Lameter <[email protected]> wrote:
> Matthew Wilcox wrote:
> > On Fri, May 09, 2008 at 07:21:01PM -0700, Christoph Lameter wrote:
> >> - Add a patch that obsoletes SLAB and explains why SLOB does not support
> >> defrag (Either of those could be theoretically equipped to support
> >> slab defrag in some way but it seems that Andrew/Linus want to reduce
> >> the number of slab allocators).
> >
> > Do we have to once again explain that slab still outperforms slub on at
> > least one important benchmark? I hope Nick Piggin finds time to finish
> > tuning slqb; it already outperforms slub.
> >
>
> Uhh. I forgot to delete that statement. I did not include the patch in the series.
>
> We have a fundamental design issue there. Queuing on free can result in
> better performance as in SLAB. However, it limits concurrency (per node lock
> taking) and causes latency spikes due to queue processing (f.e. one test load
> had 118.65 vs. 34 usecs just by switching to SLUB).
>
> Could you address the performance issues in different ways? F.e. try to free
> when the object is hot or free from multiple processors? SLAB has to take the
> list_lock rather frequently under high concurrent loads (depends on queue
> size). That will not occur with SLUB. So you actually can free (and allocate)
> concurrently with high performance.

I guess you could bypass the queueing on free for objects that
come from a "local" SLUB page, only queueing objects that go
onto remote pages.

That way workloads that already perform well with SLUB should
keep the current performance, while workloads that currently
perform badly with SLUB should get an improvement.
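
A minimal sketch of that idea, with the queue structure and all helper
names invented for illustration (this is not existing SLUB code):

/*
 * Sketch of the suggestion: free directly when the object's slab page
 * sits on the local NUMA node, otherwise batch the free on a small
 * queue.  struct remote_queue, this_cpu_remote_queue(), __free_to_slab()
 * and flush_remote_queue() are invented names.
 */
struct remote_queue {
        int     count;
        void    *objects[32];
};

static void sketch_free(struct kmem_cache *s, void *object)
{
        struct page *page = virt_to_head_page(object);
        struct remote_queue *q;

        if (page_to_nid(page) == numa_node_id()) {
                __free_to_slab(s, page, object);        /* fast path, as today */
                return;
        }

        /* Remote page: defer the free and flush in batches. */
        q = this_cpu_remote_queue(s);
        q->objects[q->count++] = object;
        if (q->count == ARRAY_SIZE(q->objects))
                flush_remote_queue(s, q);
}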

--
All Rights Reversed

2008-08-04 15:22:08

by Jamie Lokier

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

Jamie Lokier wrote:
> Vaguely on this topic, has anyone studied the effects of SLAB/SLUB
> etc. on MMUless systems?

The reason I ask is that MMU-less systems are extremely sensitive to
fragmentation. Every program started on those systems must allocate a
large contiguous block for its code and data, and every malloc() larger
than a page is in the same situation. If memory is too fragmented,
starting new programs fails.

The high-order page-allocator defragmentation lately should help with
that.

The different behaviours of SLAB/SLUB might result in different levels
of fragmentation, so I wonder if anyone has compared them on MMU-less
systems or fragmentation-sensitive workloads on general systems.

Thanks,
-- Jamie

2008-08-04 16:03:27

by Christoph Lameter

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

Rik van Riel wrote:

> I guess you could bypass the queueing on free for objects that
> come from a "local" SLUB page, only queueing objects that go
> onto remote pages.

Tried that already. The logic to decide if an object is local is creating
significant overhead. Plus you need queues for the remote nodes. Back to alien
queues?

2008-08-04 16:36:48

by Christoph Lameter

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

Jamie Lokier wrote:

> The different behaviours of SLAB/SLUB might result in different levels
> of fragmentation, so I wonder if anyone has compared them on MMU-less
> systems or fragmentation-sensitive workloads on general systems.

Never heard of such a comparison.

MMU-less systems typically have a minimal number of processors. For that
configuration the page orders SLUB uses are roughly equivalent to SLAB's.
Larger orders only come into play with large numbers of processors.

2008-08-04 16:47:55

by KOSAKI Motohiro

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

Hi

> Could you address the performance issues in different ways? F.e. try to free
> when the object is hot or free from multiple processors? SLAB has to take the
> list_lock rather frequently under high concurrent loads (depends on queue
> size). That will not occur with SLUB. So you actually can free (and allocate)
> concurrently with high performance.

Just for information (off topic?):

When hackbench is running, SLUB consumes much more memory than SLAB.
Then SLAB often outperforms SLUB under memory starvation.

I don't know why the memory consumption differs.
Does anyone know?

2008-08-04 17:14:52

by Christoph Lameter

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

KOSAKI Motohiro wrote:

> When hackbench is running, SLUB consumes much more memory than SLAB.
> Then SLAB often outperforms SLUB under memory starvation.
>
> I don't know why the memory consumption differs.
> Does anyone know?

Can you quantify the difference?

SLAB buffers objects in its queues. SLUB does rely more on the page allocator.
So SLAB may have its own reserves to fall back on.

2008-08-04 17:20:25

by Pekka Enberg

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

On Mon, Aug 4, 2008 at 8:13 PM, Christoph Lameter
<[email protected]> wrote:
> KOSAKI Motohiro wrote:
>
>> When hackbench is running, SLUB consumes much more memory than SLAB.
>> Then SLAB often outperforms SLUB under memory starvation.
>>
>> I don't know why the memory consumption differs.
>> Does anyone know?
>
> Can you quantify the difference?
>
> SLAB buffers objects in its queues. SLUB does rely more on the page allocator.
> So SLAB may have its own reserves to fall back on.

Also, what kind of machine are we talking about here? If there are a
lot of CPUs, SLUB will allocate higher order pages more aggressively
than SLAB by default.

2008-08-04 17:20:57

by Christoph Lameter

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

KOSAKI Motohiro wrote:
>
> When hackbench is running, SLUB consumes much more memory than SLAB.
> Then SLAB often outperforms SLUB under memory starvation.

Re memory use: If SLUB finds that there is lock contention on a slab page then
it will allocate a new one and dedicate it to a cpu in order to avoid future
contention.
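
For context, the partial-list scan that leads to this behaviour looks
roughly like the sketch below (condensed from the 2.6.27-era mm/slub.c,
not a verbatim copy):

/*
 * Under n->list_lock, take the first partial slab whose page lock we
 * can grab with a trylock.  If every partial slab is currently locked
 * by another CPU, NULL is returned and the caller falls back to
 * allocating a brand-new slab page -- which is how lock contention can
 * turn into extra memory consumption.
 */
static struct page *get_partial_node_sketch(struct kmem_cache_node *n)
{
        struct page *page;

        if (!n || !n->nr_partial)
                return NULL;

        spin_lock(&n->list_lock);
        list_for_each_entry(page, &n->partial, lru) {
                if (lock_and_freeze_slab(n, page))      /* slab_trylock() inside */
                        goto out;
        }
        page = NULL;
out:
        spin_unlock(&n->list_lock);
        return page;
}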

2008-08-04 21:26:30

by Pekka Enberg

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

Rene Herman wrote:
> On 03-08-08 23:25, Pekka Enberg wrote:
>
>> Matthew Wilcox wrote:
>
>>> Do we have to once again explain that slab still outperforms slub on at
>>> least one important benchmark? I hope Nick Piggin finds time to finish
>>> tuning slqb; it already outperforms slub.
>>
>> No, you don't have to. I haven't merged that patch nor do I intend to
>> do so until the regressions are fixed.
>>
>> And yes, I'm still waiting to hear from you how we're now doing with
>> higher order page allocations...
>
> General interest question -- I recently "accidentally" read some of
> slub and I believe that it doesn't feature the cache colouring support
> that slab did? Is that true, and if so, wasn't it needed/useful?

I don't know why Christoph decided not to implement it. Christoph?

2008-08-04 21:44:19

by Christoph Lameter

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

Pekka Enberg wrote:
>> General interest question -- I recently "accidentally" read some of
>> slub and I believe that it doesn't feature the cache colouring support
>> that slab did? Is that true, and if so, wasn't it needed/useful?
>
> I don't know why Christoph decided not to implement it. Christoph?

IMHO cache coloring issues seem to be mostly taken care of by newer more
associative cpu caching designs.

Note that the SLAB design origin is Solaris (See the paper by Jeff Bonwick in
1994 that is quoted in mm/slab.c). Logic for cache coloring is mostly avoided
today due to the complexity it would introduce. See also
http://en.wikipedia.org/wiki/CPU_cache.

What one could add to support cache coloring in SLUB is a prearrangement of
the object allocation order by constructing the initial freelist for
a page in a certain way. See mm/slub.c::new_slab().
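
Purely as an illustration of that suggestion (this is not what new_slab()
currently does), the initial freelist could be built starting at a per-slab
colour index, so that the first objects handed out from successive slabs
fall into different cache sets:

/*
 * Sketch: link the free objects of a fresh slab page into a freelist
 * that starts at object 'colour' and wraps around, instead of always
 * starting at object 0.  Returns the head of the freelist.
 */
static void *build_coloured_freelist(void *addr, unsigned int objects,
                                     unsigned int size, unsigned int colour)
{
        void *head = NULL, *last = NULL;
        unsigned int i;

        if (!objects)
                return NULL;

        for (i = 0; i < objects; i++) {
                void *p = (char *)addr + ((colour + i) % objects) * size;

                if (last)
                        *(void **)last = p;     /* link previous object to this one */
                else
                        head = p;               /* first object becomes the head */
                last = p;
        }
        *(void **)last = NULL;                  /* terminate the freelist */
        return head;
}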

2008-08-04 23:09:48

by Rene Herman

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

On 04-08-08 23:41, Christoph Lameter wrote:

>>> General interest question -- I recently "accidentally" read some of
>>> slub and I believe that it doesn't feature the cache colouring support
>>> that slab did? Is that true, and if so, wasn't it needed/useful?
>> I don't know why Christoph decided not to implement it. Christoph?
>
> IMHO cache coloring issues seem to be mostly taken care of by newer more
> associative cpu caching designs.

I see. Just gathered a bit of data on this (from sandpile.org):

32-byte lines:

P54 : L1 I 8K, 2-Way
D 8K, 2-Way
L2 External

P55 : L1 I 16K, 4-Way
D 16K, 4-Way
L2 External

P2 : L1 I 16K 4-Way
D 16K 4-Way
L2 128K to 2MB 4-Way

P3 : L1 I 16K 4-Way
D 16K 4-Way
L2 128K to 2MB 4-Way or
256K to 2MB 8-Way

64-byte lines:

P4 : L1 I 12K uOP Trace (8-Way, 6 uOP line)
D 8K 4-Way or
16K 8-Way
L2 128K 2-Way or
128K, 256K 4-Way or
512K, 1M, 2M 8-Way
L3 512K 4-Way or
1M to 8M 8-Way or
2M to 16M 16-Way

Core: L1 I 32K 8-Way
D 32K 8-Way
L2 512K 2-Way or
1M 4-Way or
2M 8-Way or
3M 12-Way or
4M 16-Way

K7 : L1 I 64K 2-Way
D 64K 2-Way
L2 512, 1M, 2M 2-Way or
4M, 8M 1-Way or
64K, 256K, 512K 16-Way

K8 : L1 I 64K 2-Way
D 64K 2-Way
L2 128K to 1M 16-Way


The L1 on K7 and K8 especially still seems a bit of a worry here.

> Note that the SLAB design origin is Solaris (See the paper by Jeff Bonwick in
> 1994 that is quoted in mm/slab.c). Logic for cache coloring is mostly avoided
> today due to the complexity it would introduce. See also
> http://en.wikipedia.org/wiki/CPU_cache.
>
> What one could add to support cache coloring in SLUB is a prearrangement of
> the object allocation order by constructing the initial freelist for
> a page in a certain way. See mm/slub.c::new_slab().

<remains silent>

To me, colouring always seemed like a fairly promising thing but I won't
pretend to have any sort of data.

Rene.

2008-08-05 12:09:09

by KOSAKI Motohiro

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

> KOSAKI Motohiro wrote:
>
> > When hackbench is running, SLUB consumes much more memory than SLAB.
> > Then SLAB often outperforms SLUB under memory starvation.
> >
> > I don't know why the memory consumption differs.
> > Does anyone know?
>
> Can you quantify the difference?

machine spec:
CPU: IA64 x 8
MEM: 8G (4G x2node)

test method

1. echo 3 >/proc/sys/vm/drop_caches
2. % ./hackbench 90 process 1000 <- to fill the page table cache
3. % ./hackbench 90 process 1000


vmstat result

<SLAB (without CONFIG_DEBUG_SLAB)>

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 3223168 6016 38336 0 0 0 0 3181 4314 0 15 85 0 0
2039 2 0 2022144 6016 38336 0 0 0 0 2364 13622 0 49 51 0 0
634 0 0 2629824 6080 38336 0 0 0 64 83582 2538927 5 95 0 0 0
596 0 0 2842624 6080 38336 0 0 0 0 6864 675841 6 94 0 0 0
590 0 0 2993472 6080 38336 0 0 0 0 9514 456085 6 94 0 0 0
503 0 0 3138560 6080 38336 0 0 0 0 8042 276024 4 96 0 0 0

about 3G remain.

<SLUB>
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
1066 0 0 323008 3584 18240 0 0 0 0 12037 47353 1 99 0 0 0
1101 0 0 324672 3584 18240 0 0 0 0 6029 25100 1 99 0 0 0
913 0 0 330240 3584 18240 0 0 0 0 9694 54951 2 98 0 0 0

about 300M remain.


So, about 2.5G - 3G difference in 8G mem.



2008-08-05 15:10:24

by Christoph Lameter

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

KOSAKI Motohiro wrote:

>> Can you quantify the difference?
>
> machine spec:
> CPU: IA64 x 8
> MEM: 8G (4G x2node)

16k or 64k page size?

> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 2 0 0 3223168 6016 38336 0 0 0 0 3181 4314 0 15 85 0 0
> 2039 2 0 2022144 6016 38336 0 0 0 0 2364 13622 0 49 51 0 0
> 634 0 0 2629824 6080 38336 0 0 0 64 83582 2538927 5 95 0 0 0
> 596 0 0 2842624 6080 38336 0 0 0 0 6864 675841 6 94 0 0 0
> 590 0 0 2993472 6080 38336 0 0 0 0 9514 456085 6 94 0 0 0
> 503 0 0 3138560 6080 38336 0 0 0 0 8042 276024 4 96 0 0 0
>
> about 3G remain.
>
> <SLUB>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 1066 0 0 323008 3584 18240 0 0 0 0 12037 47353 1 99 0 0 0
> 1101 0 0 324672 3584 18240 0 0 0 0 6029 25100 1 99 0 0 0
> 913 0 0 330240 3584 18240 0 0 0 0 9694 54951 2 98 0 0 0
>
> about 300M remain.
>
>
> So, about 2.5G - 3G difference in 8G mem.

Well not sure if that tells us much. Please show us the output of
/proc/meminfo after each run. The slab counters indicate how much memory is
used by the slabs.

It would also be interesting to see the output of the slabinfo command after
the slub run?

2008-08-06 12:36:58

by KOSAKI Motohiro

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

>>> Can you quantify the difference?
>>
>> machine spec:
>> CPU: IA64 x 8
>> MEM: 8G (4G x2node)
>
> 16k or 64k page size?

64k.


>> So, about 2.5G - 3G difference in 8G mem.
>
> Well not sure if that tells us much. Please show us the output of
> /proc/meminfo after each run. The slab counters indicate how much memory is
> used by the slabs.
>
> It would also be interesting to see the output of the slabinfo command after
> the slub run?

OK.
But I can't do that this week,
so I'll do it next week.

Honestly, I don't know how to use the slabinfo command :-)

2008-08-06 14:25:50

by Christoph Lameter

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

KOSAKI Motohiro wrote:
>>>> Can you quantify the difference?
>>> machine spec:
>>> CPU: IA64 x 8
>>> MEM: 8G (4G x2node)
>> 16k or 64k page size?
>
> 64k.
>
>
>>> So, about 2.5G - 3G difference in 8G mem.
>> Well not sure if that tells us much. Please show us the output of
>> /proc/meminfo after each run. The slab counters indicate how much memory is
>> used by the slabs.
>>
>> It would also be interesting to see the output of the slabinfo command after
>> the slub run?
>
> OK.
> But I can't do that this week,
> so I'll do it next week.
>
> Honestly, I don't know how to use the slabinfo command :-)

It's in linux/Documentation/vm/slabinfo.c

Do

gcc -o slabinfo Documentation/vm/slabinfo.c

./slabinfo

(./slabinfo -h if you are curious and want to use more advanced options)

2008-08-13 10:48:58

by KOSAKI Motohiro

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

> Well not sure if that tells us much. Please show us the output of
> /proc/meminfo after each run. The slab counters indicate how much memory is
> used by the slabs.
>
> It would also be interesting to see the output of the slabinfo command after
> the slub run?

Sorry for the late response.

SLAB uses 123M vs. SLUB's 1.5G.

Thoughts?


<slab>

% cat /proc/meminfo
MemTotal: 7701760 kB
MemFree: 5940096 kB
Buffers: 6400 kB
Cached: 27712 kB
SwapCached: 52544 kB
Active: 51520 kB
Inactive: 53248 kB
Active(anon): 26752 kB
Inactive(anon): 41792 kB
Active(file): 24768 kB
Inactive(file): 11456 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 2031488 kB
SwapFree: 1958400 kB
Dirty: 192 kB
Writeback: 0 kB
AnonPages: 38400 kB
Mapped: 23232 kB
Slab: 123840 kB
SReclaimable: 30272 kB
SUnreclaim: 93568 kB
PageTables: 10688 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 5882368 kB
Committed_AS: 397568 kB
VmallocTotal: 17592177655808 kB
VmallocUsed: 29184 kB
VmallocChunk: 17592177626240 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 262144 kB

% cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
dm_mpath_io 0 0 40 1488 1 : tunables 120 60 8 : slabdata 0 0 0
dm_snap_tracked_chunk 0 0 24 2338 1 : tunables 120 60 8 : slabdata 0 0 0
dm_snap_pending_exception 0 0 112 564 1 : tunables 120 60 8 : slabdata 0 0 0
dm_snap_exception 0 0 32 1818 1 : tunables 120 60 8 : slabdata 0 0 0
kcopyd_job 0 0 408 158 1 : tunables 54 27 8 : slabdata 0 0 0
dm_target_io 515 2338 24 2338 1 : tunables 120 60 8 : slabdata 1 1 0
dm_io 515 1818 32 1818 1 : tunables 120 60 8 : slabdata 1 1 0
scsi_sense_cache 26 496 128 496 1 : tunables 120 60 8 : slabdata 1 1 0
scsi_cmd_cache 26 168 384 168 1 : tunables 54 27 8 : slabdata 1 1 0
uhci_urb_priv 0 0 56 1091 1 : tunables 120 60 8 : slabdata 0 0 0
flow_cache 0 0 96 654 1 : tunables 120 60 8 : slabdata 0 0 0
cfq_io_context 48 760 168 380 1 : tunables 120 60 8 : slabdata 2 2 0
cfq_queue 41 934 136 467 1 : tunables 120 60 8 : slabdata 2 2 0
mqueue_inode_cache 1 56 1152 56 1 : tunables 24 12 8 : slabdata 1 1 0
fat_inode_cache 1 77 840 77 1 : tunables 54 27 8 : slabdata 1 1 0
fat_cache 0 0 32 1818 1 : tunables 120 60 8 : slabdata 0 0 0
hugetlbfs_inode_cache 1 83 776 83 1 : tunables 54 27 8 : slabdata 1 1 0
ext2_inode_cache 0 0 1024 63 1 : tunables 54 27 8 : slabdata 0 0 0
ext2_xattr 0 0 88 711 1 : tunables 120 60 8 : slabdata 0 0 0
jbd2_journal_handle 0 0 24 2338 1 : tunables 120 60 8 : slabdata 0 0 0
jbd2_journal_head 0 0 96 654 1 : tunables 120 60 8 : slabdata 0 0 0
jbd2_revoke_table 0 0 16 3274 1 : tunables 120 60 8 : slabdata 0 0 0
jbd2_revoke_record 0 0 32 1818 1 : tunables 120 60 8 : slabdata 0 0 0
journal_handle 48 4676 24 2338 1 : tunables 120 60 8 : slabdata 2 2 0
journal_head 41 1308 96 654 1 : tunables 120 60 8 : slabdata 2 2 0
revoke_table 4 3274 16 3274 1 : tunables 120 60 8 : slabdata 1 1 0
revoke_record 0 0 32 1818 1 : tunables 120 60 8 : slabdata 0 0 0
ext4_inode_cache 0 0 1192 54 1 : tunables 24 12 8 : slabdata 0 0 0
ext4_xattr 0 0 88 711 1 : tunables 120 60 8 : slabdata 0 0 0
ext4_alloc_context 0 0 168 380 1 : tunables 120 60 8 : slabdata 0 0 0
ext4_prealloc_space 0 0 120 528 1 : tunables 120 60 8 : slabdata 0 0 0
ext3_inode_cache 367 5696 1016 64 1 : tunables 54 27 8 : slabdata 89 89 0
ext3_xattr 99 1422 88 711 1 : tunables 120 60 8 : slabdata 2 2 0
dnotify_cache 1 1488 40 1488 1 : tunables 120 60 8 : slabdata 1 1 0
kioctx 0 0 384 168 1 : tunables 54 27 8 : slabdata 0 0 0
kiocb 0 0 256 251 1 : tunables 120 60 8 : slabdata 0 0 0
inotify_event_cache 0 0 40 1488 1 : tunables 120 60 8 : slabdata 0 0 0
inotify_watch_cache 1 861 72 861 1 : tunables 120 60 8 : slabdata 1 1 0
fasync_cache 0 0 24 2338 1 : tunables 120 60 8 : slabdata 0 0 0
shmem_inode_cache 864 1105 1000 65 1 : tunables 54 27 8 : slabdata 17 17 0
pid_namespace 0 0 184 348 1 : tunables 120 60 8 : slabdata 0 0 0
nsproxy 0 0 56 1091 1 : tunables 120 60 8 : slabdata 0 0 0
posix_timers_cache 0 0 184 348 1 : tunables 120 60 8 : slabdata 0 0 0
uid_cache 6 502 256 251 1 : tunables 120 60 8 : slabdata 2 2 0
ia64_partial_page_cache 0 0 48 1259 1 : tunables 120 60 8 : slabdata 0 0 0
UNIX 32 126 1024 63 1 : tunables 54 27 8 : slabdata 2 2 0
UDP-Lite 0 0 1024 63 1 : tunables 54 27 8 : slabdata 0 0 0
tcp_bind_bucket 4 1924 64 962 1 : tunables 120 60 8 : slabdata 2 2 0
inet_peer_cache 0 0 64 962 1 : tunables 120 60 8 : slabdata 0 0 0
secpath_cache 0 0 64 962 1 : tunables 120 60 8 : slabdata 0 0 0
xfrm_dst_cache 0 0 384 168 1 : tunables 54 27 8 : slabdata 0 0 0
ip_fib_alias 3 1818 32 1818 1 : tunables 120 60 8 : slabdata 1 1 0
ip_fib_hash 15 1722 72 861 1 : tunables 120 60 8 : slabdata 2 2 0
ip_dst_cache 50 336 384 168 1 : tunables 54 27 8 : slabdata 2 2 0
arp_cache 1 251 256 251 1 : tunables 120 60 8 : slabdata 1 1 0
RAW 129 216 896 72 1 : tunables 54 27 8 : slabdata 3 3 0
UDP 9 126 1024 63 1 : tunables 54 27 8 : slabdata 2 2 0
tw_sock_TCP 0 0 256 251 1 : tunables 120 60 8 : slabdata 0 0 0
request_sock_TCP 0 0 128 496 1 : tunables 120 60 8 : slabdata 0 0 0
TCP 5 72 1792 36 1 : tunables 24 12 8 : slabdata 2 2 0
eventpoll_pwq 0 0 72 861 1 : tunables 120 60 8 : slabdata 0 0 0
eventpoll_epi 0 0 128 496 1 : tunables 120 60 8 : slabdata 0 0 0
sgpool-128 2 30 4096 15 1 : tunables 24 12 8 : slabdata 2 2 0
sgpool-64 2 62 2048 31 1 : tunables 24 12 8 : slabdata 2 2 0
sgpool-32 2 126 1024 63 1 : tunables 54 27 8 : slabdata 2 2 0
sgpool-16 2 252 512 126 1 : tunables 54 27 8 : slabdata 2 2 0
sgpool-8 18 502 256 251 1 : tunables 120 60 8 : slabdata 2 2 0
scsi_data_buffer 0 0 24 2338 1 : tunables 120 60 8 : slabdata 0 0 0
scsi_io_context 0 0 112 564 1 : tunables 120 60 8 : slabdata 0 0 0
blkdev_queue 26 70 1864 35 1 : tunables 24 12 8 : slabdata 2 2 0
blkdev_requests 44 212 304 212 1 : tunables 54 27 8 : slabdata 1 1 0
blkdev_ioc 38 1308 96 654 1 : tunables 120 60 8 : slabdata 2 2 0
biovec-256 34 60 4096 15 1 : tunables 24 12 8 : slabdata 4 4 0
biovec-128 34 93 2048 31 1 : tunables 24 12 8 : slabdata 3 3 0
biovec-64 34 126 1024 63 1 : tunables 54 27 8 : slabdata 2 2 0
biovec-16 34 502 256 251 1 : tunables 120 60 8 : slabdata 2 2 0
biovec-4 34 1924 64 962 1 : tunables 120 60 8 : slabdata 2 2 0
biovec-1 37 6548 16 3274 1 : tunables 120 60 8 : slabdata 2 2 0
bio 37 992 128 496 1 : tunables 120 60 8 : slabdata 2 2 0
sock_inode_cache 188 288 896 72 1 : tunables 54 27 8 : slabdata 4 4 0
skbuff_fclone_cache 16 126 512 126 1 : tunables 54 27 8 : slabdata 1 1 0
skbuff_head_cache 1812 11546 256 251 1 : tunables 120 60 8 : slabdata 46 46 0
file_lock_cache 4 668 192 334 1 : tunables 120 60 8 : slabdata 2 2 0
Acpi-Operand 24947 26691 72 861 1 : tunables 120 60 8 : slabdata 31 31 0
Acpi-ParseExt 0 0 72 861 1 : tunables 120 60 8 : slabdata 0 0 0
Acpi-Parse 0 0 48 1259 1 : tunables 120 60 8 : slabdata 0 0 0
Acpi-State 0 0 80 779 1 : tunables 120 60 8 : slabdata 0 0 0
Acpi-Namespace 18877 21816 32 1818 1 : tunables 120 60 8 : slabdata 12 12 0
page_cgroup 1183 142848 40 1488 1 : tunables 120 60 8 : slabdata 96 96 0
proc_inode_cache 197 902 792 82 1 : tunables 54 27 8 : slabdata 11 11 0
sigqueue 0 0 160 399 1 : tunables 120 60 8 : slabdata 0 0 0
radix_tree_node 719 7254 552 117 1 : tunables 54 27 8 : slabdata 62 62 0
bdev_cache 30 126 1024 63 1 : tunables 54 27 8 : slabdata 2 2 0
sysfs_dir_cache 11089 12464 80 779 1 : tunables 120 60 8 : slabdata 16 16 0
mnt_cache 24 502 256 251 1 : tunables 120 60 8 : slabdata 2 2 0
inode_cache 54 696 744 87 1 : tunables 54 27 8 : slabdata 8 8 0
dentry 1577 17794 224 287 1 : tunables 120 60 8 : slabdata 62 62 0
filp 706 3765 256 251 1 : tunables 120 60 8 : slabdata 15 15 0
names_cache 46 105 4096 15 1 : tunables 24 12 8 : slabdata 7 7 0
buffer_head 3557 125442 104 606 1 : tunables 120 60 8 : slabdata 207 207 0
mm_struct 76 288 896 72 1 : tunables 54 27 8 : slabdata 4 4 0
vm_area_struct 1340 2178 176 363 1 : tunables 120 60 8 : slabdata 6 6 36
fs_cache 61 992 128 496 1 : tunables 120 60 8 : slabdata 2 2 0
files_cache 62 336 768 84 1 : tunables 54 27 8 : slabdata 4 4 0
signal_cache 161 588 768 84 1 : tunables 54 27 8 : slabdata 7 7 0
sighand_cache 157 390 1664 39 1 : tunables 24 12 8 : slabdata 10 10 0
anon_vma 657 2976 40 1488 1 : tunables 120 60 8 : slabdata 2 2 0
pid 160 992 128 496 1 : tunables 120 60 8 : slabdata 2 2 0
shared_policy_node 0 0 48 1259 1 : tunables 120 60 8 : slabdata 0 0 0
numa_policy 7 244 264 244 1 : tunables 54 27 8 : slabdata 1 1 0
idr_layer_cache 150 476 544 119 1 : tunables 54 27 8 : slabdata 4 4 0
size-33554432(DMA) 0 0 33554432 1 512 : tunables 1 1 0 : slabdata 0 0 0
size-33554432 0 0 33554432 1 512 : tunables 1 1 0 : slabdata 0 0 0
size-16777216(DMA) 0 0 16777216 1 256 : tunables 1 1 0 : slabdata 0 0 0
size-16777216 0 0 16777216 1 256 : tunables 1 1 0 : slabdata 0 0 0
size-8388608(DMA) 0 0 8388608 1 128 : tunables 1 1 0 : slabdata 0 0 0
size-8388608 0 0 8388608 1 128 : tunables 1 1 0 : slabdata 0 0 0
size-4194304(DMA) 0 0 4194304 1 64 : tunables 1 1 0 : slabdata 0 0 0
size-4194304 0 0 4194304 1 64 : tunables 1 1 0 : slabdata 0 0 0
size-2097152(DMA) 0 0 2097152 1 32 : tunables 1 1 0 : slabdata 0 0 0
size-2097152 0 0 2097152 1 32 : tunables 1 1 0 : slabdata 0 0 0
size-1048576(DMA) 0 0 1048576 1 16 : tunables 1 1 0 : slabdata 0 0 0
size-1048576 0 0 1048576 1 16 : tunables 1 1 0 : slabdata 0 0 0
size-524288(DMA) 0 0 524288 1 8 : tunables 1 1 0 : slabdata 0 0 0
size-524288 0 0 524288 1 8 : tunables 1 1 0 : slabdata 0 0 0
size-262144(DMA) 0 0 262144 1 4 : tunables 1 1 0 : slabdata 0 0 0
size-262144 0 0 262144 1 4 : tunables 1 1 0 : slabdata 0 0 0
size-131072(DMA) 0 0 131072 1 2 : tunables 8 4 0 : slabdata 0 0 0
size-131072 1 1 131072 1 2 : tunables 8 4 0 : slabdata 1 1 0
size-65536(DMA) 0 0 65536 1 1 : tunables 24 12 8 : slabdata 0 0 0
size-65536 4 4 65536 1 1 : tunables 24 12 8 : slabdata 4 4 0
size-32768(DMA) 0 0 32768 2 1 : tunables 24 12 8 : slabdata 0 0 0
size-32768 12 14 32768 2 1 : tunables 24 12 8 : slabdata 7 7 0
size-16384(DMA) 0 0 16384 4 1 : tunables 24 12 8 : slabdata 0 0 0
size-16384 15 28 16384 4 1 : tunables 24 12 8 : slabdata 7 7 0
size-8192(DMA) 0 0 8192 8 1 : tunables 24 12 8 : slabdata 0 0 0
size-8192 2455 2472 8192 8 1 : tunables 24 12 8 : slabdata 309 309 0
size-4096(DMA) 0 0 4096 15 1 : tunables 24 12 8 : slabdata 0 0 0
size-4096 1607 1665 4096 15 1 : tunables 24 12 8 : slabdata 111 111 0
size-2048(DMA) 0 0 2048 31 1 : tunables 24 12 8 : slabdata 0 0 0
size-2048 2706 2914 2048 31 1 : tunables 24 12 8 : slabdata 94 94 0
size-1024(DMA) 0 0 1024 63 1 : tunables 54 27 8 : slabdata 0 0 0
size-1024 2414 2583 1024 63 1 : tunables 54 27 8 : slabdata 41 41 0
size-512(DMA) 0 0 512 126 1 : tunables 54 27 8 : slabdata 0 0 0
size-512 1805 2142 512 126 1 : tunables 54 27 8 : slabdata 17 17 0
size-256(DMA) 0 0 256 251 1 : tunables 120 60 8 : slabdata 0 0 0
size-256 44889 48945 256 251 1 : tunables 120 60 8 : slabdata 195 195 0
size-128(DMA) 0 0 128 496 1 : tunables 120 60 8 : slabdata 0 0 0
size-64(DMA) 0 0 64 962 1 : tunables 120 60 8 : slabdata 0 0 0
size-128 28119 30256 128 496 1 : tunables 120 60 8 : slabdata 61 61 0
size-64 14597 22126 64 962 1 : tunables 120 60 8 : slabdata 23 23 0
kmem_cache 151 155 12416 5 1 : tunables 24 12 8 : slabdata 31 31 0


<SLUB>

% cat /proc/meminfo
MemTotal: 7701376 kB
MemFree: 4740928 kB
Buffers: 4544 kB
Cached: 35584 kB
SwapCached: 0 kB
Active: 119104 kB
Inactive: 9920 kB
Active(anon): 90240 kB
Inactive(anon): 0 kB
Active(file): 28864 kB
Inactive(file): 9920 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 2031488 kB
SwapFree: 2031488 kB
Dirty: 64 kB
Writeback: 0 kB
AnonPages: 89152 kB
Mapped: 31232 kB
Slab: 1591680 kB
SReclaimable: 12608 kB
SUnreclaim: 1579072 kB
PageTables: 11904 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 5882176 kB
Committed_AS: 446848 kB
VmallocTotal: 17592177655808 kB
VmallocUsed: 29056 kB
VmallocChunk: 17592177626432 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 262144 kB

% cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kcopyd_job 0 0 408 160 1 : tunables 0 0 0 : slabdata 0 0 0
cfq_io_context 3120 3120 168 390 1 : tunables 0 0 0 : slabdata 8 8 0
cfq_queue 3848 3848 136 481 1 : tunables 0 0 0 : slabdata 8 8 0
mqueue_inode_cache 56 56 1152 56 1 : tunables 0 0 0 : slabdata 1 1 0
fat_inode_cache 77 77 848 77 1 : tunables 0 0 0 : slabdata 1 1 0
fat_cache 0 0 40 1638 1 : tunables 0 0 0 : slabdata 0 0 0
hugetlbfs_inode_cache 83 83 784 83 1 : tunables 0 0 0 : slabdata 1 1 0
ext2_inode_cache 0 0 1032 63 1 : tunables 0 0 0 : slabdata 0 0 0
journal_handle 21840 21840 24 2730 1 : tunables 0 0 0 : slabdata 8 8 0
journal_head 4774 4774 96 682 1 : tunables 0 0 0 : slabdata 7 7 0
revoke_table 4096 4096 16 4096 1 : tunables 0 0 0 : slabdata 1 1 0
revoke_record 2048 2048 32 2048 1 : tunables 0 0 0 : slabdata 1 1 0
ext4_inode_cache 0 0 1200 54 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_alloc_context 0 0 168 390 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_prealloc_space 0 0 120 546 1 : tunables 0 0 0 : slabdata 0 0 0
ext3_inode_cache 750 2624 1024 64 1 : tunables 0 0 0 : slabdata 41 41 0
ext3_xattr 4464 4464 88 744 1 : tunables 0 0 0 : slabdata 6 6 0
shmem_inode_cache 1256 1365 1008 65 1 : tunables 0 0 0 : slabdata 21 21 0
nsproxy 0 0 56 1170 1 : tunables 0 0 0 : slabdata 0 0 0
posix_timers_cache 0 0 184 356 1 : tunables 0 0 0 : slabdata 0 0 0
ip_dst_cache 1360 1360 384 170 1 : tunables 0 0 0 : slabdata 8 8 0
TCP 180 180 1792 36 1 : tunables 0 0 0 : slabdata 5 5 0
scsi_data_buffer 21840 21840 24 2730 1 : tunables 0 0 0 : slabdata 8 8 0
scsi_io_context 0 0 112 585 1 : tunables 0 0 0 : slabdata 0 0 0
blkdev_queue 140 140 1864 70 2 : tunables 0 0 0 : slabdata 2 2 0
blkdev_requests 1720 1720 304 215 1 : tunables 0 0 0 : slabdata 8 8 0
sock_inode_cache 758 949 896 73 1 : tunables 0 0 0 : slabdata 13 13 0
file_lock_cache 2289 2289 200 327 1 : tunables 0 0 0 : slabdata 7 7 0
Acpi-ParseExt 29117 29120 72 910 1 : tunables 0 0 0 : slabdata 32 32 0
page_cgroup 14660 24570 40 1638 1 : tunables 0 0 0 : slabdata 15 15 0
proc_inode_cache 732 810 800 81 1 : tunables 0 0 0 : slabdata 10 10 0
sigqueue 3272 3272 160 409 1 : tunables 0 0 0 : slabdata 8 8 0
radix_tree_node 1200 1755 560 117 1 : tunables 0 0 0 : slabdata 15 15 0
bdev_cache 256 256 1024 64 1 : tunables 0 0 0 : slabdata 4 4 0
sysfs_dir_cache 16376 16380 80 819 1 : tunables 0 0 0 : slabdata 20 20 0
inode_cache 707 957 752 87 1 : tunables 0 0 0 : slabdata 11 11 0
dentry 3503 11096 224 292 1 : tunables 0 0 0 : slabdata 38 38 0
buffer_head 6920 23985 112 585 1 : tunables 0 0 0 : slabdata 41 41 0
mm_struct 741 1022 896 73 1 : tunables 0 0 0 : slabdata 14 14 0
vm_area_struct 4015 5208 176 372 1 : tunables 0 0 0 : slabdata 14 14 0
signal_cache 801 1020 768 85 1 : tunables 0 0 0 : slabdata 12 12 0
sighand_cache 433 546 1664 39 1 : tunables 0 0 0 : slabdata 14 14 0
anon_vma 10920 10920 48 1365 1 : tunables 0 0 0 : slabdata 8 8 0
shared_policy_node 5460 5460 48 1365 1 : tunables 0 0 0 : slabdata 4 4 0
numa_policy 248 248 264 248 1 : tunables 0 0 0 : slabdata 1 1 0
idr_layer_cache 944 944 552 118 1 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-65536 32 32 65536 4 4 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-32768 128 128 32768 16 8 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-16384 160 160 16384 32 8 : tunables 0 0 0 : slabdata 5 5 0
kmalloc-8192 448 448 8192 64 8 : tunables 0 0 0 : slabdata 7 7 0
kmalloc-4096 819 14336 4096 64 4 : tunables 0 0 0 : slabdata 224 224 0
kmalloc-2048 2409 8384 2048 64 2 : tunables 0 0 0 : slabdata 131 131 0
kmalloc-1024 1848 14912 1024 64 1 : tunables 0 0 0 : slabdata 233 233 0
kmalloc-512 2306 2432 512 128 1 : tunables 0 0 0 : slabdata 19 19 0
kmalloc-256 13919 123904 256 256 1 : tunables 0 0 0 : slabdata 484 484 0
kmalloc-128 28739 10747904 128 512 1 : tunables 0 0 0 : slabdata 20992 20992 0
kmalloc-64 10224 10240 64 1024 1 : tunables 0 0 0 : slabdata 10 10 0
kmalloc-32 34806 34816 32 2048 1 : tunables 0 0 0 : slabdata 17 17 0
kmalloc-16 32768 32768 16 4096 1 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-8 65536 65536 8 8192 1 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-192 4609 447051 192 341 1 : tunables 0 0 0 : slabdata 1311 1311 0
kmalloc-96 5456 5456 96 682 1 : tunables 0 0 0 : slabdata 8 8 0
kmem_cache_node 3276 3276 80 819 1 : tunables 0 0 0 : slabdata 4 4 0

% slabinfo
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
:at-0000016 4096 16 65.5K 0/0/1 4096 0 0 100 *a
:at-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *a
:at-0000032 2048 32 65.5K 0/0/1 2048 0 0 100 *Aa
:at-0000088 4464 88 393.2K 0/0/6 744 0 0 99 *a
:at-0000096 4774 96 458.7K 0/0/7 682 0 0 99 *a
:t-0000016 32768 16 524.2K 0/0/8 4096 0 0 100 *
:t-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *
:t-0000032 34806 32 1.1M 9/1/8 2048 0 5 99 *
:t-0000040 14660 40 983.0K 7/7/8 1638 0 46 59 *
:t-0000048 5460 48 262.1K 0/0/4 1365 0 0 99 *
:t-0000064 10224 64 655.3K 2/1/8 1024 0 10 99 *
:t-0000072 29117 72 2.0M 26/2/6 910 0 6 99 *
:t-0000080 16376 80 1.3M 12/1/8 819 0 5 99 *
:t-0000096 5456 96 524.2K 0/0/8 682 0 0 99 *
:t-0000128 28739 128 1.3G 20984/20984/8 512 0 99 0 *
:t-0000256 15285 256 31.7M 476/438/8 256 0 90 12 *
:t-0000384 1360 352 524.2K 0/0/8 170 0 0 91 *A
:t-0000512 2306 512 1.2M 11/3/8 128 0 15 94 *
:t-0000768 801 768 786.4K 4/4/8 85 0 33 78 *A
:t-0000896 741 880 917.5K 6/5/8 73 0 35 71 *A
:t-0001024 1848 1024 15.2M 225/214/8 64 0 91 12 *
:t-0002048 2406 2048 17.1M 123/115/8 64 1 87 28 *
:t-0004096 819 4096 58.7M 216/216/8 64 2 96 5 *
anon_vma 10920 40 524.2K 0/0/8 1365 0 0 83
bdev_cache 256 1008 262.1K 0/0/4 64 0 0 98 Aa
blkdev_queue 140 1864 262.1K 0/0/2 70 1 0 99
blkdev_requests 1720 304 524.2K 0/0/8 215 0 0 99
buffer_head 7493 104 2.6M 33/32/8 585 0 78 29 a
cfq_io_context 3120 168 524.2K 0/0/8 390 0 0 99
cfq_queue 3848 136 524.2K 0/0/8 481 0 0 99
dentry 3793 224 2.4M 30/29/8 292 0 76 34 a
ext3_inode_cache 750 1016 2.6M 33/33/8 64 0 80 28 a
fat_inode_cache 77 840 65.5K 0/0/1 77 0 0 98 a
file_lock_cache 2289 192 458.7K 0/0/7 327 0 0 95
hugetlbfs_inode_cache 83 776 65.5K 0/0/1 83 0 0 98
idr_layer_cache 944 544 524.2K 0/0/8 118 0 0 97
inode_cache 1044 744 786.4K 4/0/8 87 0 0 98 a
kmalloc-16384 160 16384 2.6M 0/0/5 32 3 0 100
kmalloc-192 4609 192 85.9M 1303/1303/8 341 0 99 1
kmalloc-32768 128 32768 4.1M 0/0/8 16 3 0 100
kmalloc-65536 32 65536 2.0M 0/0/8 4 2 0 100
kmalloc-8 65536 8 524.2K 0/0/8 8192 0 0 100
kmalloc-8192 448 8192 3.6M 0/0/7 64 3 0 100
kmem_cache_node 3276 80 262.1K 0/0/4 819 0 0 99 *
mqueue_inode_cache 56 1064 65.5K 0/0/1 56 0 0 90 A
numa_policy 248 264 65.5K 0/0/1 248 0 0 99
proc_inode_cache 732 792 655.3K 2/1/8 81 0 10 88 a
radix_tree_node 1200 552 983.0K 7/7/8 117 0 46 67 a
shmem_inode_cache 1256 1000 1.3M 13/4/8 65 0 19 91
sighand_cache 433 1608 917.5K 6/4/8 39 0 28 75 A
sigqueue 3272 160 524.2K 0/0/8 409 0 0 99
sock_inode_cache 758 832 851.9K 5/4/8 73 0 30 74 Aa
TCP 180 1712 327.6K 0/0/5 36 0 0 94 A
vm_area_struct 4015 176 917.5K 6/6/8 372 0 42 77






2008-08-13 13:11:17

by Christoph Lameter

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

KOSAKI Motohiro wrote:


> <SLUB>
>
> % cat /proc/meminfo
>
> Slab: 1591680 kB
> SReclaimable: 12608 kB
> SUnreclaim: 1579072 kB

Unreclaimable grew very big.


> :t-0000128 28739 128 1.3G 20984/20984/8 512 0 99 0 *

Argh. Most slabs contain a single object. Probably due to the conflict resolution.


> kmalloc-192 4609 192 85.9M 1303/1303/8 341 0 99 1

And a similar but not so severe issue here.

The obvious fix is to avoid allocating another slab on conflict but how will
this impact performance?


Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-08-13 08:06:00.000000000 -0500
+++ linux-2.6/mm/slub.c 2008-08-13 08:07:59.000000000 -0500
@@ -1253,13 +1253,11 @@
static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
struct page *page)
{
- if (slab_trylock(page)) {
- list_del(&page->lru);
- n->nr_partial--;
- __SetPageSlubFrozen(page);
- return 1;
- }
- return 0;
+ slab_lock(page);
+ list_del(&page->lru);
+ n->nr_partial--;
+ __SetPageSlubFrozen(page);
+ return 1;
}

2008-08-13 14:15:00

by KOSAKI Motohiro

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

>> :t-0000128 28739 128 1.3G 20984/20984/8 512 0 99 0 *
>
> Argh. Most slabs contain a single object. Probably due to the conflict resolution.

agreed with the issue exist in lock contention code.


> The obvious fix is to avoid allocating another slab on conflict but how will
> this impact performance?
>
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2008-08-13 08:06:00.000000000 -0500
> +++ linux-2.6/mm/slub.c 2008-08-13 08:07:59.000000000 -0500
> @@ -1253,13 +1253,11 @@
> static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
> struct page *page)
> {
> - if (slab_trylock(page)) {
> - list_del(&page->lru);
> - n->nr_partial--;
> - __SetPageSlubFrozen(page);
> - return 1;
> - }
> - return 0;
> + slab_lock(page);
> + list_del(&page->lru);
> + n->nr_partial--;
> + __SetPageSlubFrozen(page);
> + return 1;
> }

I haven't measured it yet. I don't like this patch;
maybe it decreases other typical benchmarks.

So, I think a better way is:

1. slab_trylock(); if it succeeds, goto 10.
2. check the fragmentation ratio; if it is low, goto 10.
3. slab_lock()
10. return from the function

I think this way doesn't cause a performance regression,
because high fragmentation causes defrag and compaction later.
So, preventing fragmentation often increases performance.

Thoughts?
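
Spelled out as code, the proposed flow would look roughly like the sketch
below; the fragmentation test is an invented placeholder, and (as comes up
later in this thread) the blocking slab_lock() path has a list_lock/slab_lock
ordering problem:

static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
                                        struct page *page)
{
        if (!slab_trylock(page)) {
                /* Contended: only insist on this slab if it is nearly empty. */
                if (!slab_is_sparsely_populated(page))  /* hypothetical test */
                        return 0;
                slab_lock(page);        /* blocking; see the inversion discussed below */
        }
        list_del(&page->lru);
        n->nr_partial--;
        __SetPageSlubFrozen(page);
        return 1;
}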

2008-08-13 14:17:44

by Pekka Enberg

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

On Wed, 2008-08-13 at 23:14 +0900, KOSAKI Motohiro wrote:
> >> :t-0000128 28739 128 1.3G 20984/20984/8 512 0 99 0 *
> >
> > Argh. Most slabs contain a single object. Probably due to the conflict resolution.
>
> agreed with the issue exist in lock contention code.
>
>
> > The obvious fix is to avoid allocating another slab on conflict but how will
> > this impact performance?
> >
> >
> > Index: linux-2.6/mm/slub.c
> > ===================================================================
> > --- linux-2.6.orig/mm/slub.c 2008-08-13 08:06:00.000000000 -0500
> > +++ linux-2.6/mm/slub.c 2008-08-13 08:07:59.000000000 -0500
> > @@ -1253,13 +1253,11 @@
> > static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
> > struct page *page)
> > {
> > - if (slab_trylock(page)) {
> > - list_del(&page->lru);
> > - n->nr_partial--;
> > - __SetPageSlubFrozen(page);
> > - return 1;
> > - }
> > - return 0;
> > + slab_lock(page);
> > + list_del(&page->lru);
> > + n->nr_partial--;
> > + __SetPageSlubFrozen(page);
> > + return 1;
> > }
>
> I haven't measured it yet. I don't like this patch;
> maybe it decreases other typical benchmarks.
>
> So, I think a better way is:
>
> 1. slab_trylock(); if it succeeds, goto 10.
> 2. check the fragmentation ratio; if it is low, goto 10.
> 3. slab_lock()
> 10. return from the function
>
> I think this way doesn't cause a performance regression,
> because high fragmentation causes defrag and compaction later.
> So, preventing fragmentation often increases performance.
>
> Thoughts?

I guess that would work. But how exactly would you quantify
"fragmentation ratio?"

2008-08-13 14:32:19

by Christoph Lameter

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

KOSAKI Motohiro wrote:
>
> I haven't measured it yet. I don't like this patch;
> maybe it decreases other typical benchmarks.

Yes but running with this patch would allow us to verify that we understand
what is causing the problem. There are other solutions like skipping to the
next partial slab on the list that could fix performance issues that the patch
may cause. A test will give us:

1. Confirmation that the memory use is caused by the trylock.

2. Some performance numbers. If these show a regression then we have some
markers that we can measure other solutions against.

2008-08-13 15:06:14

by KOSAKI Motohiro

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

> Yes but running with this patch would allow us to verify that we understand
> what is causing the problem. There are other solutions like skipping to the
> next partial slab on the list that could fix performance issues that the patch
> may cause. A test will give us:
>
> 1. Confirmation that the memory use is caused by the trylock.
>
> 2. Some performance numbers. If these show a regression then we have some
> markers that we can measure other solutions against.

OK.
I will test the patch next week.

(Unfortunately, my company is closed for the rest of this week.)

Thanks.

2008-08-14 07:19:17

by Pekka Enberg

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

Hi Christoph,

Christoph Lameter wrote:
> The obvious fix is to avoid allocating another slab on conflict but how will
> this impact performance?
>
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2008-08-13 08:06:00.000000000 -0500
> +++ linux-2.6/mm/slub.c 2008-08-13 08:07:59.000000000 -0500
> @@ -1253,13 +1253,11 @@
> static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
> struct page *page)
> {
> - if (slab_trylock(page)) {
> - list_del(&page->lru);
> - n->nr_partial--;
> - __SetPageSlubFrozen(page);
> - return 1;
> - }
> - return 0;
> + slab_lock(page);
> + list_del(&page->lru);
> + n->nr_partial--;
> + __SetPageSlubFrozen(page);
> + return 1;
> }

This patch hard locks on my 2-way 64-bit x86 machine (sysrq doesn't
respond) when I run hackbench.

2008-08-14 14:46:51

by Christoph Lameter

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

Pekka Enberg wrote:
> This patch hard locks on my 2-way 64-bit x86 machine (sysrq doesn't
> respond) when I run hackbench.
Hmmm.. Then the issue may be different from what we thought. The lock may be
taken recursively in some situations.
Can you enable lockdep?

2008-08-14 15:08:03

by Christoph Lameter

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

Pekka Enberg wrote:
>
> This patch hard locks on my 2-way 64-bit x86 machine (sysrq doesn't
> respond) when I run hackbench.
At that point we take the list_lock and then the slab lock, which is a
lock inversion if we do not use a trylock here. Crap.

Hmmm.. The code already goes to the next slab if an earlier one is
already locked. So I do not see how the large partial lists could be
generated.
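
For reference, the inversion described here is the classic ABBA pattern;
roughly, simplified from the 2.6.27-era code paths (an illustration, not a
patch):

/*
 * Allocation path (get_partial_node and friends):
 *      spin_lock(&n->list_lock);       lock A
 *      slab_lock(page);                lock B  <- blocking here can deadlock
 *
 * Free path (__slab_free putting a page back on the partial list):
 *      slab_lock(page);                lock B
 *      spin_lock(&n->list_lock);       lock A
 *
 * With a trylock in the allocation path the cycle cannot close; with a
 * blocking slab_lock() two CPUs can each hold one lock while waiting
 * for the other.
 */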

2008-08-14 19:46:45

by Christoph Lameter

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

This is a NUMA system right? Then we have another mechanism that will avoid
off node memory references by allocating new slabs. Can you set the
node_defrag parameter to 0? (Noted by Adrian).

2008-08-15 16:44:41

by KOSAKI Motohiro

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

> This is a NUMA system right?

True.
My system is

CPU: ia64 x8
MEM: 8G (4G x 2node)

> Then we have another mechanism that will avoid
> off node memory references by allocating new slabs. Can you set the
> node_defrag parameter to 0? (Noted by Adrian).

Please let me know how to do that operation.

2008-08-15 18:25:54

by Christoph Lameter

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

KOSAKI Motohiro wrote:

>> Then we have another mechanism that will avoid
>> off node memory references by allocating new slabs. Can you set the
>> node_defrag parameter to 0? (Noted by Adrian).
>
> Please let me know how to do that operation.

The control over the preference of node-local vs. remote defrag occurs
via /sys/kernel/slab/<slabcache>/remote_node_defrag_ratio. The default is 10%.
Comments in get_any_partial explain the operation.

The default setting means that in 9 out of 10 cases slub will prefer creating
a new slab over taking one from the remote node (meaning the memory is node
local, probably not important in your 2 node case). It will therefore waste
memory because local memory may be more efficient to use.

Setting remote_node_defrag_ratio to 100 will make slub always take the remote
slab instead of allocating a new one.


/*
* The defrag ratio allows a configuration of the tradeoffs between
* inter node defragmentation and node local allocations. A lower
* defrag_ratio increases the tendency to do local allocations
* instead of attempting to obtain partial slabs from other nodes.
*
* If the defrag_ratio is set to 0 then kmalloc() always
* returns node local objects. If the ratio is higher then kmalloc()
* may return off node objects because partial slabs are obtained
* from other nodes and filled up.
*
* If /sys/kernel/slab/xx/defrag_ratio is set to 100 (which makes
* defrag_ratio = 1000) then every (well almost) allocation will
* first attempt to defrag slab caches on other nodes. This means
* scanning over all nodes to look for partial slabs which may be
* expensive if we do it every time we are trying to find a slab
* with available objects.
*/
if (!s->remote_node_defrag_ratio ||
get_cycles() % 1024 > s->remote_node_defrag_ratio)
return NULL;



2008-08-15 19:43:26

by Christoph Lameter

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

Christoph Lameter wrote:

> Setting remote_node_defrag_ratio to 100 will make slub always take the remote
> slab instead of allocating a new one.

As pointed out by Adrian D. off list:

The max remote_node_defrag_ratio is 99.

Maybe we need to change the comparison in remote_node_defrag_ratio_store() to
allow 100 to switch off any node local allocs?

2008-08-18 10:10:25

by KOSAKI Motohiro

Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

> Christoph Lameter wrote:
>
> > Setting remote_node_defrag_ratio to 100 will make slub always take the remote
> > slab instead of allocating a new one.
>
> As pointed out by Adrian D. off list:
>
> The max remote_node_defrag_ratio is 99.
>
> Maybe we need to change the comparison in remote_node_defrag_ratio_store() to
> allow 100 to switch off any node local allocs?

Hmmm,
it doesn't change the behavior at all.

I did the following:

1. slub code change (see below)


Index: b/mm/slub.c
===================================================================
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4056,7 +4056,7 @@ static ssize_t remote_node_defrag_ratio_
if (err)
return err;

- if (ratio < 100)
+ if (ratio <= 100)
s->remote_node_defrag_ratio = ratio * 10;

return length;


2. change remote defrag ratio
# echo 100 > /sys/kernel/slab/:t-0000128/remote_node_defrag_ratio
# cat /sys/kernel/slab/:t-0000128/remote_node_defrag_ratio
100

3. ran hackbench
4. ./slabinfo

Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
:at-0000016 4096 16 65.5K 0/0/1 4096 0 0 100 *a
:at-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *a
:at-0000032 2048 32 65.5K 0/0/1 2048 0 0 100 *Aa
:at-0000088 4464 88 393.2K 0/0/6 744 0 0 99 *a
:at-0000096 5456 96 524.2K 0/0/8 682 0 0 99 *a
:t-0000016 32768 16 524.2K 0/0/8 4096 0 0 100 *
:t-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *
:t-0000032 34806 32 1.1M 9/1/8 2048 0 5 99 *
:t-0000040 14417 40 917.5K 6/6/8 1638 0 42 62 *
:t-0000048 5460 48 262.1K 0/0/4 1365 0 0 99 *
:t-0000064 10224 64 655.3K 2/1/8 1024 0 10 99 *
:t-0000072 29120 72 2.0M 26/0/6 910 0 0 99 *
:t-0000080 16376 80 1.3M 12/1/8 819 0 5 99 *
:t-0000096 5456 96 524.2K 0/0/8 682 0 0 99 *
:t-0000128 28917 128 1.3G 21041/21041/8 512 0 99 0 *
:t-0000256 15280 256 31.4M 472/436/8 256 0 90 12 *
:t-0000384 1360 352 524.2K 0/0/8 170 0 0 91 *A
:t-0000512 2388 512 1.3M 12/4/8 128 0 20 93 *
:t-0000768 851 768 851.9K 5/5/8 85 0 38 76 *A
:t-0000896 742 880 851.9K 5/4/8 73 0 30 76 *A
:t-0001024 1819 1024 15.1M 223/211/8 64 0 91 12 *
:t-0002048 2641 2048 17.9M 129/116/8 64 1 84 30 *
:t-0004096 817 4096 57.1M 210/210/8 64 2 96 5 *
anon_vma 10920 40 524.2K 0/0/8 1365 0 0 83
bdev_cache 256 1008 262.1K 0/0/4 64 0 0 98 Aa
blkdev_queue 140 1864 262.1K 0/0/2 70 1 0 99
blkdev_requests 1720 304 524.2K 0/0/8 215 0 0 99
buffer_head 7284 104 2.5M 31/30/8 585 0 76 29 a
cfq_io_context 3120 168 524.2K 0/0/8 390 0 0 99
cfq_queue 3848 136 524.2K 0/0/8 481 0 0 99
dentry 3775 224 2.5M 31/29/8 292 0 74 33 a
ext3_inode_cache 740 1016 2.4M 30/30/8 64 0 78 30 a
fat_inode_cache 77 840 65.5K 0/0/1 77 0 0 98 a
file_lock_cache 2616 192 524.2K 0/0/8 327 0 0 95
hugetlbfs_inode_cache 83 776 65.5K 0/0/1 83 0 0 98
idr_layer_cache 944 544 524.2K 0/0/8 118 0 0 97
inode_cache 1050 744 851.9K 5/1/8 87 0 7 91 a
kmalloc-16384 160 16384 2.6M 0/0/5 32 3 0 100
kmalloc-192 4578 192 87.5M 1328/1328/8 341 0 99 1
kmalloc-32768 128 32768 4.1M 0/0/8 16 3 0 100
kmalloc-65536 32 65536 2.0M 0/0/8 4 2 0 100
kmalloc-8 65536 8 524.2K 0/0/8 8192 0 0 100
kmalloc-8192 512 8192 4.1M 0/0/8 64 3 0 100
kmem_cache_node 3276 80 262.1K 0/0/4 819 0 0 99 *
mqueue_inode_cache 56 1064 65.5K 0/0/1 56 0 0 90 A
numa_policy 248 264 65.5K 0/0/1 248 0 0 99
proc_inode_cache 655 792 720.8K 3/3/8 81 0 27 71 a
radix_tree_node 1142 552 917.5K 6/6/8 117 0 42 68 a
shmem_inode_cache 1230 1000 1.3M 12/3/8 65 0 15 93
sighand_cache 434 1608 917.5K 6/4/8 39 0 28 76 A
sigqueue 3272 160 524.2K 0/0/8 409 0 0 99
sock_inode_cache 774 832 851.9K 5/3/8 73 0 23 75 Aa
TCP 144 1712 262.1K 0/0/4 36 0 0 94 A
vm_area_struct 4034 176 851.9K 5/5/8 372 0 38 83




2008-08-18 10:37:04

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

> > Christoph Lameter wrote:
> >
> > > Setting remote_node_defrag_ratio to 100 will make slub always take the remote
> > > slab instead of allocating a new one.
> >
> > As pointed out by Adrian D. off list:
> >
> > The max remote_node_defrag_ratio is 99.
> >
> > Maybe we need to change the comparison in remote_node_defrag_ratio_store() to
> > allow 100 to switch off any node local allocs?
>
> Hmmm,
> it doesn't change the behavior at all.

Ah, ok.
I made a mistake.

The new patch is here.

Index: b/mm/slub.c
===================================================================
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1326,9 +1326,11 @@ static struct page *get_any_partial(stru
* expensive if we do it every time we are trying to find a slab
* with available objects.
*/
+#if 0
if (!s->remote_node_defrag_ratio ||
get_cycles() % 1024 > s->remote_node_defrag_ratio)
return NULL;
+#endif

zonelist = node_zonelist(slab_node(current->mempolicy), flags);
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {


The new result is here.

% cat /proc/meminfo
MemTotal: 7701504 kB
MemFree: 5986432 kB
Buffers: 7872 kB
Cached: 38208 kB
SwapCached: 0 kB
Active: 120256 kB
Inactive: 14656 kB
Active(anon): 90304 kB
Inactive(anon): 0 kB
Active(file): 29952 kB
Inactive(file): 14656 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 2031488 kB
SwapFree: 2031488 kB
Dirty: 448 kB
Writeback: 0 kB
AnonPages: 89088 kB
Mapped: 31360 kB
Slab: 69952 kB
SReclaimable: 13376 kB
SUnreclaim: 56576 kB
PageTables: 11648 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 5882240 kB
Committed_AS: 453440 kB
VmallocTotal: 17592177655808 kB
VmallocUsed: 29312 kB
VmallocChunk: 17592177626112 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 262144 kB


% slabinfo
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
:at-0000016 4096 16 65.5K 0/0/1 4096 0 0 100 *a
:at-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *a
:at-0000032 2048 32 65.5K 0/0/1 2048 0 0 100 *Aa
:at-0000088 2976 88 262.1K 0/0/4 744 0 0 99 *a
:at-0000096 4774 96 458.7K 0/0/7 682 0 0 99 *a
:t-0000016 32768 16 524.2K 0/0/8 4096 0 0 100 *
:t-0000024 21840 24 524.2K 0/0/8 2730 0 0 99 *
:t-0000032 34806 32 1.1M 9/1/8 2048 0 5 99 *
:t-0000040 14279 40 851.9K 5/5/8 1638 0 38 67 *
:t-0000048 5460 48 262.1K 0/0/4 1365 0 0 99 *
:t-0000064 10224 64 655.3K 2/1/8 1024 0 10 99 *
:t-0000072 29109 72 2.0M 26/4/6 910 0 12 99 *
:t-0000080 16379 80 1.3M 12/1/8 819 0 5 99 *
:t-0000096 5456 96 524.2K 0/0/8 682 0 0 99 *
:t-0000128 27831 128 3.6M 48/8/8 512 0 14 97 *
:t-0000256 15401 256 9.8M 143/96/8 256 0 63 39 *
:t-0000384 1360 352 524.2K 0/0/8 170 0 0 91 *A
:t-0000512 2307 512 1.2M 11/3/8 128 0 15 94 *
:t-0000768 755 768 720.8K 3/3/8 85 0 27 80 *A
:t-0000896 728 880 851.9K 5/4/8 73 0 30 75 *A
:t-0001024 1810 1024 1.9M 21/4/8 64 0 13 97 *
:t-0002048 2621 2048 5.5M 34/15/8 64 1 35 97 *
:t-0004096 775 4096 3.4M 5/2/8 64 2 15 93 *
anon_vma 10920 40 524.2K 0/0/8 1365 0 0 83
bdev_cache 192 1008 196.6K 0/0/3 64 0 0 98 Aa
blkdev_queue 140 1864 262.1K 0/0/2 70 1 0 99
blkdev_requests 1720 304 524.2K 0/0/8 215 0 0 99
buffer_head 8020 104 2.7M 34/32/8 585 0 76 30 a
cfq_io_context 3120 168 524.2K 0/0/8 390 0 0 99
cfq_queue 3848 136 524.2K 0/0/8 481 0 0 99
dentry 3798 224 2.5M 31/30/8 292 0 76 33 a
ext3_inode_cache 1127 1016 2.7M 34/34/8 64 0 80 41 a
fat_inode_cache 77 840 65.5K 0/0/1 77 0 0 98 a
file_lock_cache 2289 192 458.7K 0/0/7 327 0 0 95
hugetlbfs_inode_cache 83 776 65.5K 0/0/1 83 0 0 98
idr_layer_cache 944 544 524.2K 0/0/8 118 0 0 97
inode_cache 1044 744 786.4K 4/0/8 87 0 0 98 a
kmalloc-16384 160 16384 2.6M 0/0/5 32 3 0 100
kmalloc-192 3883 192 1.0M 8/8/8 341 0 50 71
kmalloc-32768 128 32768 4.1M 0/0/8 16 3 0 100
kmalloc-65536 32 65536 2.0M 0/0/8 4 2 0 100
kmalloc-8 65536 8 524.2K 0/0/8 8192 0 0 100
kmalloc-8192 512 8192 4.1M 0/0/8 64 3 0 100
kmem_cache_node 3276 80 262.1K 0/0/4 819 0 0 99 *
mqueue_inode_cache 56 1064 65.5K 0/0/1 56 0 0 90 A
numa_policy 248 264 65.5K 0/0/1 248 0 0 99
proc_inode_cache 653 792 655.3K 2/2/8 81 0 20 78 a
radix_tree_node 1221 552 983.0K 7/7/8 117 0 46 68 a
shmem_inode_cache 1218 1000 1.3M 12/3/8 65 0 15 92
sighand_cache 416 1608 851.9K 5/3/8 39 0 23 78 A
sigqueue 3272 160 524.2K 0/0/8 409 0 0 99
sock_inode_cache 758 832 786.4K 4/3/8 73 0 25 80 Aa
TCP 180 1712 327.6K 0/0/5 36 0 0 94 A
vm_area_struct 4054 176 851.9K 5/5/8 372 0 38 83


2008-08-18 14:10:17

by Christoph Lameter

[permalink] [raw]
Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

KOSAKI Motohiro wrote:

> The new patch is here.
>
> Index: b/mm/slub.c
> ===================================================================
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1326,9 +1326,11 @@ static struct page *get_any_partial(stru
> * expensive if we do it every time we are trying to find a slab
> * with available objects.
> */
> +#if 0
> if (!s->remote_node_defrag_ratio ||
> get_cycles() % 1024 > s->remote_node_defrag_ratio)
> return NULL;
> +#endif
>
> zonelist = node_zonelist(slab_node(current->mempolicy), flags);
> for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {

Hmmm.... So always taking from the partial lists works? That is the same
effect that setting remote_node_defrag_ratio to 100 should have had (it's
multiplied by 10 when storing it).

So it's a NUMA-only phenomenon. How is performance affected?

2008-08-19 10:35:36

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

> > +#if 0
> > if (!s->remote_node_defrag_ratio ||
> > get_cycles() % 1024 > s->remote_node_defrag_ratio)
> > return NULL;
> > +#endif
> >
> > zonelist = node_zonelist(slab_node(current->mempolicy), flags);
> > for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>
> Hmmm.... So always taking from the partial lists works? That is the same
> effect that setting remote_node_defrag_ratio to 100 should have had (it's
> multiplied by 10 when storing it).

Sorry, I don't know the reason yet.
OK, I'll dig into it more.

> So it's a NUMA-only phenomenon. How is performance affected?

Unfortunately, I can't measure it, because:

- The Fujitsu server can access remote nodes faster than a typical NUMA server,
  so my performance numbers often aren't typical.
- My box (4G x 2 nodes) is very small for a NUMA machine,
  but this mechanism is aimed at improving large servers.

IOW, my box didn't show a performance regression,
but I don't think that is typical.


2008-08-19 13:53:09

by Christoph Lameter

[permalink] [raw]
Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

KOSAKI Motohiro wrote:

> IOW, my box didn't show a performance regression,
> but I don't think that is typical.

Well, that is typical for a small NUMA system. Maybe this patch will fix it for
now? Large systems can be tuned by setting the ratio lower.


Subject: slub/NUMA: Disable remote node defragmentation by default

Switch remote node defragmentation off by default. The current settings can
cause excessive node local allocations with hackbench. (Note that this feature
is not related to slab defragmentation).

Signed-off-by: Christoph Lameter <[email protected]>

---
mm/slub.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-08-19 06:45:54.732348449 -0700
+++ linux-2.6/mm/slub.c 2008-08-19 06:46:12.442348249 -0700
@@ -2312,7 +2312,7 @@ static int kmem_cache_open(struct kmem_c

s->refcount = 1;
#ifdef CONFIG_NUMA
- s->remote_node_defrag_ratio = 100;
+ s->remote_node_defrag_ratio = 1000;
#endif
if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
goto error;
@@ -4058,7 +4058,7 @@ static ssize_t remote_node_defrag_ratio_
if (err)
return err;

- if (ratio < 100)
+ if (ratio <= 100)
s->remote_node_defrag_ratio = ratio * 10;

return length;

2008-08-20 11:48:14

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: No, really, stop trying to delete slab until you've finished making slub perform as well

> KOSAKI Motohiro wrote:
>
> > IOW, my box didn't show a performance regression,
> > but I don't think that is typical.
>
> Well that is typical for small NUMA system. Maybe this patch will fix it for
> now? Large systems can be tuned by setting the ratio lower.
>
>
> Subject: slub/NUMA: Disable remote node defragmentation by default
>
> Switch remote node defragmentation off by default. The current settings can
> cause excessive node local allocations with hackbench. (Note that this feature
> is not related to slab defragmentation).

OK.
I confirmed this patch works well.

Tested-by: KOSAKI Motohiro <[email protected]>


>
> Signed-off-by: Christoph Lameter <[email protected]>
>
> ---
> mm/slub.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2008-08-19 06:45:54.732348449 -0700
> +++ linux-2.6/mm/slub.c 2008-08-19 06:46:12.442348249 -0700
> @@ -2312,7 +2312,7 @@ static int kmem_cache_open(struct kmem_c
>
> s->refcount = 1;
> #ifdef CONFIG_NUMA
> - s->remote_node_defrag_ratio = 100;
> + s->remote_node_defrag_ratio = 1000;
> #endif
> if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
> goto error;
> @@ -4058,7 +4058,7 @@ static ssize_t remote_node_defrag_ratio_
> if (err)
> return err;
>
> - if (ratio < 100)
> + if (ratio <= 100)
> s->remote_node_defrag_ratio = ratio * 10;
>
> return length;