2007-12-07 17:54:28

by Steven Rostedt

Subject: Major regression on hackbench with SLUB

Hi,

I've been doing some benchmarks for RT task migrations and I stumbled
over this regression. I first thought it was a regression with CFS and
started doing a git bisect. But I later found that the "good" kernel
also failed. Then I started looking at the config options and discovered
that SLUB has become the default.

When I put SLAB back, the regression went away, even on the latest git
tree.

So I ran the hackbench benchmarks with the latest git commit
(f194d132e4971111f85c18c96067acffb13cee6d at the time of this writing).
And put up the results on my web page.

http://people.redhat.com/srostedt/slab/

I ran Ingo's cfs-debug-info.sh and have the results of that on the web
page. That script records lots of good information to see what kind of
machine I ran this on. This is a 64-way box (yes, lots of CPUs!).

The hackbench code and the script I used to run and record the results
is also present. I ran "hackbench 90" 20 times, 10 times as SCHED_OTHER
and 10 times as SCHED_FIFO (chrt -f 10 hackbench 90). The graph shows
both runs (min, max and average). The versions that start with
"rt:" (although the graph somehow truncated it to just "t:") are the
SCHED_FIFO runs. (click on the graph to see a bigger version)

The regression is that hackbench slows down tremendously. It goes from
running just under 2 seconds to running around 14 seconds.

Hackbench seems to show this regression the most. In my tests I didn't
see much change with kernel builds and such, but the focus was on
scheduling not memory management. I'll run my kernel tests next for both
SLAB and SLUB and see if there's any difference there.

But this regression might be a reason not to make SLUB the default for
now, at least until we find out why hackbench is affected so badly.

-- Steve


2007-12-07 17:56:16

by Ingo Molnar

Subject: Re: Major regression on hackbench with SLUB


* Steven Rostedt <[email protected]> wrote:

> The regression is that hackbench slows down tremendously. It goes from
> running just under 2 seconds to running around 14 seconds.

ouch ...

it might make sense to check the SLUB patches from -mm on top of vanilla
as well - while that might be of little comfort as far as 2.6.24 goes,
it at least gives us the info whether this is something that has been
fixed in the latest SLUB code.

Ingo

2007-12-07 18:00:51

by Linus Torvalds

Subject: Re: Major regression on hackbench with SLUB



On Fri, 7 Dec 2007, Steven Rostedt wrote:
>
> The regression is that hackbench slows down tremendously. It goes from
> running just under 2 seconds to running around 14 seconds.

Can you do one run with oprofile, and see exactly where the cost is? It
should hopefully be pretty darn obvious, considering your timing.

Linus

2007-12-07 18:37:40

by Linus Torvalds

Subject: Re: Major regression on hackbench with SLUB



On Fri, 7 Dec 2007, Linus Torvalds wrote:
>
> Can you do one run with oprofile, and see exactly where the cost is? It
> should hopefully be pretty darn obvious, considering your timing.

So I checked hackbench on my machine with oprofile, and while I'm not
seeing anything that says "ten times slower", I do see it hitting the slow
path all the time, and __slab_alloc() is 15% of the profile.

With __slab_free() being another 8-9%.

I assume that with many CPUs, those get horrendously worse, with cache
thrashing.

The biggest cost of __slab_alloc() in my profile is the "slab_lock()", but
that may not be the one that causes problems in a 64-cpu setup, so it
would be good to have that verified.

Anyway, it's unclear whether it goes into the slow path because the
freelist is just always empty, or whether it hits the

... || !node_match(c, node)

case which can trigger on NUMA. That's another potential explanation of
why you'd see such a *huge* slowdown (ie if the whole node-match thing
doesn't work out, slub just never gets the fast-case at all). That said,
the number of places that actually pass a specific node to slub is very
limited, so I suspect it's not the node matching. But just disabling that
test in slab_alloc() might be one thing to test.

[ The whole node match thing is likely totally bogus. I suspect we pay
*more* in trying to match nodes than we'd ever likely pay in just
returning the wrong node for an allocation, but that's neither here nor
there ]
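The fast path and the node check Linus is describing can be sketched as follows. This is a simplified standalone model, with field names loosely following the 2.6.24-era slub.c, not the actual kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)

/* Simplified stand-in for the per-cpu slab state in 2.6.24-era slub.c. */
struct kmem_cache_cpu {
    void **freelist;   /* next free object in the cpu slab */
    int node;          /* NUMA node the cpu slab came from */
};

/* The check Linus suggests disabling: for node == NUMA_NO_NODE (the
 * common case) it always matches, so it should only force the slow
 * path for node-specific allocations -- or when c->node is bogus. */
static bool node_match(struct kmem_cache_cpu *c, int node)
{
    return node == NUMA_NO_NODE || c->node == node;
}

/* Fast path: pop the head of the per-cpu freelist. Returns NULL where
 * the real code would fall back to __slab_alloc(). The first word of
 * each free object holds the pointer to the next free object. */
static void *slab_alloc_fastpath(struct kmem_cache_cpu *c, int node)
{
    void **object = c->freelist;

    if (object == NULL || !node_match(c, node))
        return NULL;                   /* slow path: __slab_alloc() */

    c->freelist = (void **)*object;
    return object;
}
```

Note that with an always-empty freelist the second condition is never even reached, which is why the two possible causes are hard to tell apart from a profile alone.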

But yeah, I'm not entirely sure SLUB is working out.

Linus

2007-12-07 18:59:28

by Steven Rostedt

Subject: Re: Major regression on hackbench with SLUB


[ Note: I'm currently having problems with my email:
http://www.domaindirect.com/network.html
I'm in the process of setting up my own personal email server. ]


> > Can you do one run with oprofile, and see exactly where the cost is? It
> > should hopefully be pretty darn obvious, considering your timing.

I kicked off my kernel tests which will take a few hours.
I run "make -j256" (4*nr_CPUS) 10 times at SCHED_OTHER and 10 times
at SCHED_FIFO (chrt -f 10) for both SLUB and SLAB.

This is automated, so I need to wait for it to finish before I can
continue other tests.

> The biggest cost of __slab_alloc() in my profile is the "slab_lock()", but
> that may not be the one that causes problems in a 64-cpu setup, so it
> would be good to have that verified.

I'll run oprofile as soon as the kernel build test is done.

> case which can trigger on NUMA. That's another potential explanation of
> why you'd see such a *huge* slowdown (ie if the whole node-match thing
> doesn't work out, slub just never gets the fast-case at all). That said,
> the number of places that actually pass a specific node to slub is very
> limited, so I suspect it's not the node matching. But just disabling that
> test in slab_alloc() might be one thing to test.

I'll try that after the oprofile.

Thanks,

-- Steve

2007-12-08 22:17:39

by Steven Rostedt

Subject: Re: Major regression on hackbench with SLUB


Hi Linus,

> On Fri, 7 Dec 2007, Linus Torvalds wrote:
> >
> > Can you do one run with oprofile, and see exactly where the cost is? It
> > should hopefully be pretty darn obvious, considering your timing.

The results are here:

http://people.redhat.com/srostedt/slub/results/slab.op
http://people.redhat.com/srostedt/slub/results/slub.op


>
> Anyway, it's unclear whether the reason it goes into the slow-path because
> the freelist is just always empty, or whether it hits the
>
> ... || !node_match(c, node)

I did two things: first I tried making node_match always return true,
and second I just commented out the above check. Both had pretty much
the exact same results: the run time doubled! Instead of taking 15
seconds to complete, both took 30 seconds.

Here's the oprofiles of those runs.

node_match return 1:
http://people.redhat.com/srostedt/slub/results/slub-nonode.op

comment out node_match check:
http://people.redhat.com/srostedt/slub/results/slub-nochecknode.op


>
> [ The whole node match thing is likely totally bogus. I suspect we pay
> *more* in trying to match nodes than we'd ever likely pay in just
> returning the wrong node for an allocation, but that's neither here nor
> there ]

I don't think it's bogus, given that removing it slowed things down by a
factor of two.


I'm wondering if there's not some other cache thrashing happening
somewhere else in the slub code. I'll start adding some stats in the code
to see where the congestion might lie. I'll look into this on Monday.

Seems that both SLAB and SLUB run kernel compiles the same. Here's the
results of my compile test. (make -j256 and chrt -f 10 make -j256)

http://people.redhat.com/srostedt/slub/

-- Steve

2007-12-09 00:29:46

by Steven Rostedt

Subject: Re: Major regression on hackbench with SLUB


Ingo,

Due to email problems, I never got your reply on this thread (nor did I
receive Linus's first reply). But I saw your reply when reading this
thread on lkml.org.

Anyway, I booted 2.6.24-rc4-mm1 with the same config, accepting the
defaults for the new config options, as with the git version I checked. And it
gave me an average of 12.5 second runs.

Here's the oprofile for it:

http://people.redhat.com/srostedt/slub/results/slub-mm.op

-- Steve


2007-12-10 07:38:36

by Björn Steinbrink

Subject: Re: Major regression on hackbench with SLUB

On 2007.12.08 17:16:24 -0500, Steven Rostedt wrote:
>
> Hi Linus,
>
> > On Fri, 7 Dec 2007, Linus Torvalds wrote:
> > >
> > > Can you do one run with oprofile, and see exactly where the cost is? It
> > > should hopefully be pretty darn obvious, considering your timing.
>
> The results are here:
>
> http://people.redhat.com/srostedt/slub/results/slab.op
> http://people.redhat.com/srostedt/slub/results/slub.op

Hm, you seem to be hitting the "another_slab" stuff in __slab_alloc
a lot. I wonder if !node_match triggers too often. We always start with
the per cpu slab; if that one is on the wrong node, you'll always hit
that "another_slab" path.

After searching for way too long (given that I have no clue about that
stuff anyway and just read the code out of curiosity), I noticed that
the cpu_to_node stuff on x86_64 seems to be initialized to 0xff
(arch/x86/mm/numa_64.c), and Google brought me this dmesg output [1],
which, AFAICT, shows that the per cpu slab setup is done _before_
cpu_to_node is correctly set up. That would lead to the per cpu slabs all
having node == 0xff, which looks pretty bad.

Disclaimer: I read the slub/numa/$WHATEVER_I_SAW_THERE for the first
time, so this might be total bull ;-)

Björn

[1] http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-10/msg04648.html
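Björn's hypothesis can be modelled in a few lines of C. Everything here (the array size, the fixup function, the toy topology) is hypothetical scaffolding; only the 0xff initial value comes from the 2.6.24-era arch/x86/mm/numa_64.c:

```c
#include <assert.h>

#define NR_CPUS      4
#define NUMA_NO_NODE (-1)

/* cpu_to_node on 2.6.24-era x86_64 started out as 0xff everywhere and
 * was only later filled in with the real topology. */
static unsigned char cpu_to_node_map[NR_CPUS] = { 0xff, 0xff, 0xff, 0xff };

struct cpu_slab { int node; };

/* If the per cpu slab is set up at this point, it captures 0xff. */
static void early_cpu_slab_init(struct cpu_slab *s, int cpu)
{
    s->node = cpu_to_node_map[cpu];
}

/* Later boot code fills in the real mapping (hypothetical topology:
 * two cpus per node). */
static void fixup_cpu_to_node(void)
{
    int i;

    for (i = 0; i < NR_CPUS; i++)
        cpu_to_node_map[i] = (unsigned char)(i / 2);
}

static int node_match(const struct cpu_slab *s, int node)
{
    return node == NUMA_NO_NODE || s->node == node;
}
```

If the init order really is wrong, node-specific allocations would never match the stale cpu slab and would always take the "another_slab" path, exactly as Björn suggests.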

2007-12-10 14:02:19

by Steven Rostedt

Subject: Re: Major regression on hackbench with SLUB


On Mon, 10 Dec 2007, Björn Steinbrink wrote:
> >
> > The results are here:
> >
> > http://people.redhat.com/srostedt/slub/results/slab.op
> > http://people.redhat.com/srostedt/slub/results/slub.op
>
> Hm, you seem to be hitting the "another_slab" stuff in __slab_alloc
> a lot. I wonder if !node_match triggers too often. We always start with
> the per cpu slab; if that one is on the wrong node, you'll always hit
> that "another_slab" path.

Well, I commented out the node_match part and it got 100% worse. It took
30 seconds to complete.

>
> After searching for way too long (given that I have no clue about that
> stuff anyway and just read the code out of curiosity), I noticed that
> the cpu_to_node stuff on x86_64 seems to be initialized to 0xff
> (arch/x86/mm/numa_64.c), and Google brought me this dmesg output [1],
> which, AFAICT, shows that the per cpu slab setup is done _before_
> cpu_to_node is correctly set up. That would lead to the per cpu slabs all
> having node == 0xff, which looks pretty bad.

I didn't check whether the internal setup of the node is correct,
though. I can put in some debugging to see what I get.

>
> Disclaimer: I read the slub/numa/$WHATEVER_I_SAW_THERE for the first
> time, so this might be total bull ;-)

Well I'm a newbie on NUMA stuff too. I just got lucky enough to be able to
reserve this box ;-)

-- Steve

2007-12-11 14:34:18

by Ingo Molnar

Subject: Re: Major regression on hackbench with SLUB (more numbers)


* Steven Rostedt <[email protected]> wrote:

> Hackbench seems to show this regression the most. In my tests I didn't
> see much change with kernel builds and such, but the focus was on
> scheduling not memory management. I'll run my kernel tests next for
> both SLAB and SLUB and see if there's any difference there.

i just ran various benchmarks on an 8-way (8x 700 MHz Xeon, 4GB RAM):

AVG             v2.6.24.slab   v2.6.24.slub   [ smaller is better ]
--------------------------------------------------------------------
mmap:                1052.66        1049.33    (  0%)
ctx-2:                  4.32           4.30    (  0%)
select:                41.95          43.69    (  4%)
proc-exec:            394.45         391.92    (  0%)
hackbench-10:           1.12           2.99    (166%)
hackbench-20:           2.04           6.67    (226%)
hackbench-50:           5.03          17.50    (247%)

and hackbench overhead stands out, by a huge margin. Other stuff is
within measurement noise. Neither SLUB nor SLAB debugging was turned on,
all other debugging options were off too.

Ingo

2007-12-13 22:11:37

by Christoph Lameter

Subject: Re: Major regression on hackbench with SLUB

On Fri, 7 Dec 2007, Linus Torvalds wrote:

> The biggest cost of __slab_alloc() in my profile is the "slab_lock()", but
> that may not be the one that causes problems in a 64-cpu setup, so it
> would be good to have that verified.

Hmmmm.. This would indicate lock contention on a slab.

> [ The whole node match thing is likely totally bogus. I suspect we pay
> *more* in trying to match nodes than we'd ever likely pay in just
> returning the wrong node for an allocation, but that's neither here nor
> there ]

Node match is necessary in order to make the allocator able to get memory
for the node that the caller requested.

2007-12-13 22:22:29

by Christoph Lameter

Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Tue, 11 Dec 2007, Ingo Molnar wrote:

> hackbench-10: 1.12 2.99 (166%)
> hackbench-20: 2.04 6.67 (226%)
> hackbench-50: 5.03 17.50 (247%)
>
> and hackbench overhead stands out, by a huge margin. Other stuff is
> within measurement noise. Neither SLUB nor SLAB debugging was turned on,
> all other debugging options were off too.

I just came back from vacation. The non-linear growth of the regression
indicates lock contention somewhere. It must be something special that was
triggered by hackbench.

We have had a regression like that in 2.6.24 before, due to order-1 allocs
in the network layer. The .24 modifications for SLUB introduced page
allocator pass-through for allocations > PAGE_SIZE/2. Order-1 allocs could
serialize on zone locks.
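The pass-through rule mentioned above amounts to a simple size check. A sketch, assuming a 4K page (the real threshold is based on the per-arch PAGE_SIZE):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096  /* assumed; the real value is per-arch */

/* .24-era SLUB hands allocations larger than half a page straight to
 * the page allocator instead of managing them as slab objects, which
 * is how order-1 network allocations could end up serializing on zone
 * locks. Simplified sketch, not the kernel code. */
static int passes_through_to_page_allocator(size_t size)
{
    return size > PAGE_SIZE / 2;
}
```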

2007-12-14 04:24:54

by Christoph Lameter

Subject: Re: Major regression on hackbench with SLUB (more numbers)

Hmmmm... Some tests here on an 8p 8G machine:

SLAB

N=10 Time: 0.341
N=20 Time: 0.605
N=50 Time: 1.487

SLUB

N=10 Time: 0.675
N=20 Time: 1.434
N=50 Time: 3.996

So it's a factor of 2 for untuned SLUB. Looking at hackbench: this is a
test that allocates objects that are then consumed by N cpus that then
return them. The allocating processor can allocate from the per cpu slab
(the freelist is copied to a per cpu structure when it is activated),
avoiding touching the page struct. However, the freeing processors
must update the freelist of the slab page directly. So they all contend
for the cacheline of the page struct.

Since we free directly, there is the chance of lots of contention
between the N cpus that free the objects. This is particularly high in a
synthetic benchmark like this.

However, if the object to be freed is still in a cpu slab, then freeing
it reduces how often the list_lock has to be taken. So we can actually
decrease the overhead by increasing the slab page size.

In an extreme case (boot with slub_min_order=9 to get huge page sized
slabs) SLUB can win against SLAB:

N=10 Time: 0.338 Minimally faster
N=20 Time: 0.560 10% faster
N=50 Time: 1.353 15% faster

Not sure how to get that mainstream yet, but there certainly is a way to
get there. Need to get an idea of how to reduce list_lock contention in
the remote free case.
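The alloc/free asymmetry Christoph describes can be sketched as a toy model: the owning cpu pops from a private freelist with no lock at all, while every remote free must take the slab lock to push onto the page's own freelist. The lock counter here stands in for slab_lock() contention; this is illustrative scaffolding, not kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a slab page and its cpu-slab state. */
struct page_slab {
    void **freelist;        /* remote frees push here under slab_lock */
    int lock_acquisitions;  /* counts slab_lock() round trips */
};

struct cpu_slab {
    void **freelist;        /* moved here when the page became the cpu
                             * slab; private to the owning cpu */
};

/* Allocating cpu: lock-free pop, never touches the page struct. */
static void *cpu_alloc(struct cpu_slab *c)
{
    void **obj = c->freelist;

    if (obj == NULL)
        return NULL;
    c->freelist = (void **)*obj;
    return obj;
}

/* Remote cpu freeing an object: must lock the slab page, so N freeing
 * cpus all contend for the same page-struct cacheline. */
static void remote_free(struct page_slab *p, void *object)
{
    p->lock_acquisitions++;          /* slab_lock(page) */
    *(void **)object = p->freelist;
    p->freelist = object;            /* slab_unlock(page) */
}
```

In this model, increasing the slab page size keeps more objects on the cpu slab side, which is exactly why slub_min_order=9 helps the benchmark.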

2007-12-14 06:17:10

by Steven Rostedt

Subject: Re: Major regression on hackbench with SLUB (more numbers)


Christoph,

Welcome back from your vacation! :-)

On Thu, 13 Dec 2007, Christoph Lameter wrote:
>
> In an extreme case (boot with slub_min_order=9 to get huge page sized
> slabs) SLUB can win against SLAB:
>
> N=10 Time: 0.338 Minimally faster
> N=20 Time: 0.560 10% faster
> N=50 Time: 1.353 15% faster
>

I booted back to slub with slub_min_order=9 on the 64way and you are
indeed right about this.

# tests/hackbench 90
Time: 2.109
# chrt -f 10 tests/hackbench 90
Time: 3.775
# tests/hackbench 90
Time: 1.583
# chrt -f 10 tests/hackbench 90
Time: 1.833
# tests/hackbench 90
Time: 1.601
# chrt -f 10 tests/hackbench 90
Time: 3.078
# tests/hackbench 90
Time: 1.321
# chrt -f 10 tests/hackbench 90
Time: 1.933



Times fluctuated between 1.321 and 3.775 seconds (*) (the RT runs were
higher). But still, good job at narrowing this down. Now the hard part:
figuring out a way that will be nice to all users.

-- Steve

(*) The previous runs (without the command line option) were 13 seconds or
so.

2007-12-21 12:13:41

by Ingo Molnar

Subject: Re: Major regression on hackbench with SLUB (more numbers)


* Christoph Lameter <[email protected]> wrote:

> Hmmmm... Some tests here on an 8p 8G machine:

> In an extreme case (boot with slub_min_order=9 to get huge page sized
> slabs) SLUB can win against SLAB:
>
> N=10 Time: 0.338 Minimally faster
> N=20 Time: 0.560 10% faster
> N=50 Time: 1.353 15% faster

what's up with this regression? There's been absolutely no activity
about it in the last 8 days: upstream still regresses, -mm still
regresses and there are no patches posted for testing.

being able to utilize order-0 pages was supposed to be one of the big
plusses of SLUB, so booting with _2MB_ sized slabs cannot seriously be
the "fix", right?

Ingo

2007-12-21 12:27:15

by Ingo Molnar

Subject: Re: Major regression on hackbench with SLUB (more numbers)


* Ingo Molnar <[email protected]> wrote:

> * Christoph Lameter <[email protected]> wrote:
>
> > Hmmmm... Some tests here on an 8p 8G machine:
>
> > In an extreme case (boot with slub_min_order=9 to get huge page sized
> > slabs) SLUB can win against SLAB:
> >
> > N=10 Time: 0.338 Minimally faster
> > N=20 Time: 0.560 10% faster
> > N=50 Time: 1.353 15% faster
>
> what's up with this regression? There's been absolutely no activity
> about it in the last 8 days: upstream still regresses, -mm still
> regresses and there are no patches posted for testing.
>
> being able to utilize order-0 pages was supposed to be one of the big
> plusses of SLUB, so booting with _2MB_ sized slabs cannot be seriously
> the "fix", right?

and this is not the only regression:

http://lkml.org/lkml/2007/10/4/290

_6%_ TPC-C regression. That's _a lot_ in TPC-C terms.

and just like in this case there were very clear profiles posted. I
proffer, reading back the whole thread, that if you fix hackbench you
have fixed TPC-C as well.

So i believe you should either send some sensible fixes _NOW_, or admit
that the "no queues" NIH nonsense of SLUB doesn't work and do an edible,
incremental patchset against SLAB to bring in the debuggability features
of SLUB without killing SLAB's performance. (And fix the NUMA alien
cache problem along the lines suggested before - perhaps initially by
making 'noaliencache' the default bootup option.) And we obviously must
revert the default in 2.6.24 to SLAB as well.

Ingo

2007-12-21 16:18:37

by Christoph Lameter

Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Fri, 21 Dec 2007, Ingo Molnar wrote:

> what's up with this regression? There's been absolutely no activity
> about it in the last 8 days: upstream still regresses, -mm still
> regresses and there are no patches posted for testing.

I added a test that does this 1-alloc-N-free behavior to the in-kernel
slab benchmark (cpu 0 continually allocs, cpus 1-7 continuously free),
all order 0. The test makes the regression worse because it runs the
scenario with raw slab allocs/frees in kernel context without the use
of the scheduler:

SLAB
1 alloc N free(8): 0=2115 1=549 2=551 3=550 4=550 5=550 6=551 7=549 Average=746
1 alloc N free(16): 0=2159 1=573 2=574 3=573 4=574 5=573 6=574 7=573 Average=772
1 alloc N free(32): 0=2168 1=577 2=576 3=577 4=579 5=578 6=577 7=576 Average=776
1 alloc N free(64): 0=2276 1=555 2=554 3=554 4=555 5=556 6=555 7=554 Average=770
1 alloc N free(128): 0=2635 1=549 2=548 3=548 4=550 5=549 6=548 7=549 Average=809
1 alloc N free(256): 0=3261 1=543 2=542 3=542 4=544 5=542 6=542 7=542 Average=882
1 alloc N free(512): 0=4642 1=571 2=572 3=570 4=573 5=571 6=572 7=572 Average=1080
1 alloc N free(1024): 0=7116 1=582 2=581 3=583 4=581 5=582 6=582 7=583 Average=1399
1 alloc N free(2048): 0=11612 1=667 2=667 3=667 4=668 5=667 6=668 7=667 Average=2035
1 alloc N free(4096): 0=20135 1=673 2=674 3=674 4=675 5=674 6=675 7=674 Average=3107

SLUB
1 alloc N free test
===================
1 alloc N free(8): 0=1223 1=645 2=417 3=643 4=538 5=612 6=557 7=564 Average=650
1 alloc N free(16): 0=3828 1=3140 2=3116 3=3166 4=3122 5=3172 6=3129 7=3172 Average=3231
1 alloc N free(32): 0=4116 1=3207 2=3180 3=3210 4=3183 5=3202 6=3182 7=3214 Average=3312
1 alloc N free(64): 0=4116 1=3017 2=3005 3=3013 4=3006 5=3016 6=3004 7=3020 Average=3149
1 alloc N free(128): 0=5282 1=3013 2=3010 3=3014 4=3009 5=3010 6=3005 7=3013 Average=3295
1 alloc N free(256): 0=5271 1=2750 2=2751 3=2751 4=2751 5=2751 6=2752 7=2751 Average=3066
1 alloc N free(512): 0=5304 1=2293 2=2295 3=2295 4=2295 5=2294 6=2295 7=2294 Average=2671
1 alloc N free(1024): 0=6017 1=2044 2=2043 3=2043 4=2043 5=2045 6=2044 7=2044 Average=2541
1 alloc N free(2048): 0=7162 1=2086 2=2084 3=2086 4=2086 5=2085 6=2086 7=2086 Average=2720
1 alloc N free(4096): 0=7133 1=1795 2=1796 3=1796 4=1797 5=1795 6=1796 7=1795 Average=2463

So we have 3 different regimes here (order 0):

1. SLUB wins at size 8 because the cpu slab is never given up, given that
we have 512 8-byte objects in a 4K slab (the same scenario that we
can produce with slub_min_order=9).

2. At the high end (>= 2048) the additional overhead of SLAB when
allocating objects makes SLUB again performance-wise better or similar.

3. There is an intermediate range where the cpu slab has to be changed
(and thus the list_lock has to be used) and where multiple cpus that free
objects can simultaneously hit the slab lock of the same slab page. That
is the cause of the regression.

The regression is contained because:

1. The contention only exists when concurrently freeing and allocating
from the same slab page. SLUB has provisions for keeping a single
allocator and a single freer in the clear by moving the freelist
pointer when a slab page becomes a cpu slab. Concurrent allocs and
frees are possible without cacheline contention. The problem only comes
about when multiple free operations occur simultaneously to the same
slab page.

2. The number of objects in a slab is limited. For the 128 byte size of
interest here we have 32 objects in a slab. The maximum theoretical
contention on the slab lock is through 32 cpus if they are all freeing
simultaneously.

3. The tests are an artificial scenario. Typically one would have
additional processing in both the alloc and the free paths, which would
reduce the contention. And without complex processing on alloc, we
typically allocate the object on the remote processor (which warms up
that processor's cache) instead of allocating objects on one processor
while freeing them on remote processors.

4. The contention is reduced the larger the object size becomes, since
this will reduce the number of objects per slab page. Thus the number
of processors taking the lock simultaneously is also reduced.

5. The contention is also reduced through normal operations that will
lead to the creation of partially populated slabs. If we cannot
allocate a large number of objects from the same slab (typical if
the system has been run for a while) then the contention is
constrained.
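The object counts quoted above (512 8-byte objects per 4K slab, 32 objects at 128 bytes, huge counts with slub_min_order=9) are simple arithmetic. A sketch, assuming a 4K base page:

```c
#include <assert.h>

#define PAGE_SIZE 4096  /* assumed base page size */

/* Objects per slab page at a given allocation order; this bounds how
 * many cpus can contend on one slab lock while freeing. Sketch of the
 * arithmetic only, ignoring per-object alignment and metadata. */
static int objs_per_slab(int order, int object_size)
{
    return (PAGE_SIZE << order) / object_size;
}
```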

We could address this issue by:

1. Detecting the contention while taking the slab lock and then deferring
the free for these special situations (f.e. by using RCU to free these
objects).

2. Not allocating from the same slab if we detect contention.

But given all the boundaries for the contention I would think that it is
not worth addressing.

> being able to utilize order-0 pages was supposed to be one of the big
> plusses of SLUB, so booting with _2MB_ sized slabs cannot be seriously
> the "fix", right?

Booting with 2MB-sized slabs is certainly not the mainstream fix, although I
expect that some SGI machines may run with that as the default,
given that they have a couple of terabytes of memory, and the 2MB
configurations also reduce TLB pressure significantly. The use of 2MB slabs is
not motivated by the contention that we see here but rather because of the
improved scaling and nice interaction with the huge page management
framework provided by the antifrag patches.

Order 0 allocations were never the emphasis of SLUB. The emphasis was on
being able to use higher order allocs in order to make slab operations
scale higher.

2007-12-21 16:26:56

by Christoph Lameter

Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Fri, 21 Dec 2007, Ingo Molnar wrote:

> and this is not the only regression:
>
> http://lkml.org/lkml/2007/10/4/290
>
> _6%_ TPC-C regression. That's _a lot_ in TPC-C terms.
>
> and just like in this case there were very clear profiles posted. I
> proffer, reading back the whole thread, that if you fix hackbench you
> have fixed TPC-C as well.

There are patches pending to address these issues. AFAICT Intel is
testing whether the regression is still there. There is no way for me to verify
what is going on there, and there is the constant difficulty of getting
detailed information about what is going on at Intel. Every couple of
months I get a result from that test. It's a really crappy situation where a
lot of confusing information is passed around.

> So i believe you should either send some sensible fixes _NOW_, or admit
> that the "no queues" NIH nonsense of SLUB doesn't work and do an edible,
> incremental patchset against SLAB to bring in the debuggability features
> of SLUB without killing SLAB's performance. (And fix the NUMA alien
> cache problem along the lines suggested before - perhaps initially by
> making 'noaliencache' the default bootup option.) And we obviously must
> revert the default in 2.6.24 to SLAB as well.

The fixes to the alien that you proposed do not work since we would still
have to check for the need to put the object into alien cache. The checks
for the locality of each object are the main cause of trouble when freeing
in SLAB.

SLUB is generally performance wise superior to SLAB as demonstrated by my
measurements that you reviewed too. In addition SLUB has many improvements
over SLAB. Its not only the runtime debuggability which would be difficult
to add since there would be numerous runtime checks in critical code paths
that would cause regressions.

2007-12-21 16:37:52

by Ingo Molnar

Subject: Re: Major regression on hackbench with SLUB (more numbers)


* Christoph Lameter <[email protected]> wrote:

> On Fri, 21 Dec 2007, Ingo Molnar wrote:
>
> > and this is not the only regression:
> >
> > http://lkml.org/lkml/2007/10/4/290
> >
> > _6%_ TPC-C regression. That's _a lot_ in TPC-C terms.
> >
> > and just like in this case there were very clear profiles posted. I
> > proffer, reading back the whole thread, that if you fix hackbench
^^^^^^^^^^^^^^^^^^^^
> > you have fixed TPC-C as well.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> There are patches pending to address these issues. AFAICT Intel is
> testing whether the regression is still there. There is no way for me to
> verify what is going on there, and there is the constant difficulty of
> getting detailed information about what is going on at Intel. Every
> couple of months I get a result from that test. It's a really crappy
> situation where a lot of confusing information is passed around.

of course there is a way to find out, and that's why i mailed you: fix
the hackbench regression and i'm quite sure you'll improve the TPC-C
numbers as well. It shows the same kind of overhead in the profile and
takes just a few seconds to run. Are your pending SLUB patches in
2.6.24-rc5-mm1 already?

Ingo

2007-12-21 16:45:36

by Linus Torvalds

Subject: Re: Major regression on hackbench with SLUB (more numbers)



On Fri, 21 Dec 2007, Christoph Lameter wrote:
>
> So we have 3 different regimes here (order 0):
[ ... ]
> The regression is contained because:
[ ... ]

Christoph, *NONE* of these arguments matter at all.

The only thing that matters is the simple fact that real-life benchmarks
show that SLUB is worse. It doesn't matter one whit that you can say that
it's better under some circumstance that either isn't triggered, or when
setting unrealistic and unusable parameters (ie big internal slab orders).

> We could address this issue by:
[...]
> But given all the boundaries for the contention I would think that it is
> not worth addressing.

.. and this seems to be the single biggest reason to just say "revert SLUB
entirely".

If you aren't even motivated to fix the problems that have been reported,
then SLUB isn't even a _potential_ solution, it's purely a problem, and
since I am not IN THE LEAST interested in having three different
allocators around, SLUB is going to get axed.

In other words, here's the low-down:

a) either SLUB can replace SLAB, or SLUB is going away

This isn't open for discussion, Christoph. This was the rule back when
it got merged in the first place.

b) if SLUB performs worse than SLAB on real loads (TPC-C certainly being
one, and while hackbench may be borderline, it's certainly less
borderline than others), then SLUB cannot replace SLAB.

See (a)

c) If you aren't even interested in trying to fix it and are ignoring
the reports, there is not even a _potential_ for SLUB for getting over
these problems, and I should disable it and we should get over it as
soon as possible, and this whole discussion is pretty much ended.

See? It really is that simple. Either you say "Hell yes, I'll fix it", or
SLUB goes away. There simply is no other choice.

When you say "not worth addressing", that to me is a clear and unambiguous
"let's remove SLUB".

The main reason for SLUB in the first place was SGI concerns. You seem to
think that _only_ SGI concerns matter. Wrong. If SLUB remains a SGI-only
thing and you don't fix it for other users, then SLUB gets reverted from
the standard kernel.

That's all. And it's not really worth discussing.

Linus

2007-12-21 17:44:28

by Pekka Enberg

Subject: Re: Major regression on hackbench with SLUB (more numbers)

Hi,

Christoph Lameter <[email protected]> wrote:
> > In an extreme case (boot with slub_min_order=9 to get huge page sized
> > slabs) SLUB can win against SLAB:
> >
> > N=10 Time: 0.338 Minimally faster
> > N=20 Time: 0.560 10% faster
> > N=50 Time: 1.353 15% faster

On Dec 21, 2007 2:09 PM, Ingo Molnar <[email protected]> wrote:
> what's up with this regression? There's been absolutely no activity
> about it in the last 8 days: upstream still regresses, -mm still
> regresses and there are no patches posted for testing.

Christoph, did you see Steven's oprofile results for the hackbench
runs (http://lkml.org/lkml/2007/12/8/198)? Any ideas why we're hitting
add_partial like crazy?

Pekka

2007-12-21 21:56:22

by Christoph Lameter

Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Fri, 21 Dec 2007, Ingo Molnar wrote:

> > There are patches pending to address these issues. AFAICT Intel is
> > testing whether the regression is still there. There is no way for me to
> > verify what is going on there, and there is the constant difficulty of
> > getting detailed information about what is going on at Intel. Every
> > couple of months I get a result from that test. It's a really crappy
> > situation where a lot of confusing information is passed around.
>
> of course there is a way to find out, and that's why i mailed you: fix
> the hackbench regression and i'm quite sure you'll improve the TPC-C
> numbers as well. It shows the same kind of overhead in the profile and
> takes just a few seconds to run. Are your pending SLUB patches in
> 2.6.24-rc5-mm1 already?

The tests that I wrote emulate the test behavior that was described to
me.

The fixes in 2.6.24-rc5-mm1 improved those numbers. See
http://lkml.org/lkml/2007/10/27/245 which I quoted earlier to you.
However, I have no TPC-C setup here and from what I hear it takes weeks to
run and requires a large support team for tuning.

You can find the slab test suite for that at

http://git.kernel.org/?p=linux/kernel/git/christoph/vm.git;a=shortlog;h=tests

AFAICT the fixes in 2.6.24-rc5-mm1 result in double the alloc performance
(fastpath) of SLAB.

There are fixes that are not merged yet (the cpu alloc patchset) that
seem to make that factor 3 because we can use the segment register to
avoid per cpu array lookups in the fast path.

2007-12-21 22:12:06

by Christoph Lameter

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Fri, 21 Dec 2007, Linus Torvalds wrote:

> If you aren't even motivated to fix the problems that have been reported,
> then SLUB isn't even a _potential_ solution, it's purely a problem, and
> since I am not IN THE LEAST interested in having three different
> allocators around, SLUB is going to get axed.

Not motivated? I have analyzed the problem in detail and when it comes
down to it there is not much impact that I can see in real life
applications. I have always responded to the regression reported also via
TPC-C. There is still no answer to my post at
http://lkml.org/lkml/2007/10/27/245

SLAB has similar issues when freeing to remote nodes under NUMA. It's
even worse, since the lock is not per slab page but per slab cache. The
alien caches that are then used have one lock per node.

> In other words, here's the low-down:
>
> a) either SLUB can replace SLAB, or SLUB is going away
>
> This isn't open for discussion, Christoph. This was the rule back when
> it got merged in the first place.

It obviously can replace it and has replaced it for a while now.

> b) if SLUB performs worse than SLAB on real loads (TPC-C certainly being
> one, and while hackbench may be borderline, it's certainly less
> borderline than others), then SLUB cannot replace SLAB.

Looks like I need to find someone to get that going here to be able to
test under it.

> c) If you aren't even interested in trying to fix it and are ignoring
> the reports, there is not even a _potential_ for SLUB for getting over
> these problems, and I should disable it and we should get over it as
> soon as possible, and this whole discussion is pretty much ended.

I have looked at this issue under various circumstances for the last week.
Nothing really convinced me that this is so serious. I could not figure
out if it was really worth worrying about, but if you put it that way then of
course it is.

> The main reason for SLUB in the first place was SGI concerns. You seem to
> think that _only_ SGI concerns matter. Wrong. If SLUB remains a SGI-only
> thing and you don't fix it for other users, then SLUB gets reverted from
> the standard kernel.

The reason for SLUB was dysfunctional behavior of SLAB in many areas. It
was not just SGI's concerns that drove it.

> That's all. And it's not really worth discussing.

Well still SLUB is really superior to SLAB in many ways....

- SLUB is performance-wise much faster than SLAB. This can be more than a
factor of 10 (case of concurrent allocations / frees on multiple
processors). See http://lkml.org/lkml/2007/10/27/245

- Single threaded allocation speed is up to double that of SLAB

- Remote freeing of objects in NUMA systems is typically 30% faster.

- Debugging on SLAB is difficult. Requires recompile of the kernel
and the resulting output is difficult to interpret. SLUB can apply
debugging options to a subset of the slabcaches in order to allow
the system to work with maximum speed. This is necessary to detect
difficult to reproduce race conditions.

- SLAB can capture huge amounts of memory in its queues. The problem
gets worse the more processors and NUMA nodes are in the system.

- SLAB requires a pass through all slab caches every 2 seconds to
expire objects. This is a problem both for realtime and MPI jobs
that cannot take such a processor outage.

- SLAB does not have a sophisticated slabinfo tool to report the
state of slab objects on the system. SLUB's tool can provide details of
object use.

- SLAB requires the update of two words for freeing
and allocation. SLUB can do that by updating a single
word, which makes it possible to avoid enabling and disabling interrupts if
the processor supports an atomic instruction for that purpose.
This is important for realtime kernels where special measures
may have to be implemented.

- SLAB requires memory to be set aside for queues (processors
times number of slabs times queue size).
SLUB requires none of that.

- SLUB merges slab caches with similar characteristics to
reduce the memory footprint even further.

- SLAB performs object level NUMA management, which adds considerable
allocator complexity. SLUB manages NUMA on the level of
slab pages.

- SLUB allows remote node defragmentation to avoid the buildup
of large partial lists on a single node.

- SLUB can actively reduce the fragmentation of slabs through
slab cache specific callbacks (not merged yet)

- SLUB has resiliency features that allow it to isolate a problem
object and continue after diagnostics have been performed.

- SLUB creates rarely used DMA caches on demand instead of creating
them all on bootup (SLAB).

2007-12-21 22:17:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)


On Fri, 2007-12-21 at 14:11 -0800, Christoph Lameter wrote:
> On Fri, 21 Dec 2007, Linus Torvalds wrote:
>
> > If you aren't even motivated to fix the problems that have been reported,
> > then SLUB isn't even a _potential_ solution, it's purely a problem, and
> > since I am not IN THE LEAST interested in having three different
> > allocators around, SLUB is going to get axed.
>
> Not motivated? I have analyzed the problem in detail and when it comes
> down to it there is not much impact that I can see in real life
> applications. I have always responded to the regression reported also via
> TPC-C.

But you are dismissing the hackbench regression, which is not a small
one. It runs an astonishing 10x slower.


2007-12-21 22:18:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)


On Fri, 2007-12-21 at 23:16 +0100, Peter Zijlstra wrote:
> On Fri, 2007-12-21 at 14:11 -0800, Christoph Lameter wrote:
> > On Fri, 21 Dec 2007, Linus Torvalds wrote:
> >
> > > If you aren't even motivated to fix the problems that have been reported,
> > > then SLUB isn't even a _potential_ solution, it's purely a problem, and
> > > since I am not IN THE LEAST interested in having three different
> > > allocators around, SLUB is going to get axed.
> >
> > Not motivated? I have analyzed the problem in detail and when it comes
> > down to it there is not much impact that I can see in real life
> > applications. I have always responded to the regression reported also via
> > TPC-C.
>
> But you are dismissing the hackbench regression, which is not a small
> one. It runs an astonishing 10x slower.
>

BTW, does /proc/slabinfo exist again? I thought we set that as a
requirement for SLUB to be the default and a full replacement for SLAB.

2007-12-21 22:22:21

by Christoph Lameter

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Fri, 21 Dec 2007, Pekka Enberg wrote:

> Christoph, did you see Steven's oprofile results for the hackbench
> runs (http://lkml.org/lkml/2007/12/8/198)? Any ideas why we're hitting
> add_partial like crazy?

Hmmm... SLUB is cycling through partial slabs. If it repeatedly gets fed
slabs with a single free object from the partial list, then this is the
behavior that one would see.

The worst case scenario would be:

1. Processor 0 gets a slab with one free object from the partial list.

2. Processor 0 allocates the object and deactivates the slab since it is full.

3. Processor 1 frees the object and finds that the slab was not on the partial
list (it had no free objects before this free), so put_partial() is called.

4. Processor 0 gets the slab with one free object from the partial list again.

If we made the partial list a minimum size (which is already done
through MIN_PARTIAL; maybe just increase it) then we may be able to avoid
this particular case. Also we could make sure that the slab is not put at
the beginning of the partial list on free, in order to increase the time
that the slab spends on the partial list. That way it will gather more
objects before it is picked up.
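The ping-pong in steps 1-4 can be sketched with a small user-space toy model. None of this is kernel code: the list, slab sizes and workload are all made up for illustration. One "hot" slab receives a steady stream of remote frees while the local CPU drains the partial list, and we count how often the hot slab has to be put back on the list (the put_partial()/add_partial() cost) when freed slabs go to the head versus the tail.

```c
#include <stdio.h>
#include <string.h>

/* Toy model (illustrative only, not kernel code) of the worst case
 * described above: count re-additions of a hot slab to the partial
 * list for head insertion versus tail insertion on free. */

#define ITERS 6

struct slab { int free_objs; };

static struct slab *list[8];
static int len;

static void add_head(struct slab *s)
{
	memmove(&list[1], &list[0], len * sizeof(list[0]));
	list[0] = s;
	len++;
}

static void add_tail(struct slab *s)
{
	list[len++] = s;
}

static struct slab *pop_head(void)
{
	struct slab *s = list[0];

	len--;
	memmove(&list[0], &list[1], len * sizeof(list[0]));
	return s;
}

static int on_list(struct slab *s)
{
	for (int i = 0; i < len; i++)
		if (list[i] == s)
			return 1;
	return 0;
}

static int run(int to_tail)
{
	struct slab f1 = { 3 }, f2 = { 3 }, hot = { 0 };
	struct slab *cpu = NULL;	/* the active per-cpu slab */
	int put_partials = 0;

	len = 0;
	add_tail(&f1);			/* two filler partial slabs */
	add_tail(&f2);

	for (int i = 0; i < ITERS; i++) {
		/* remote CPU frees one object into the hot slab */
		hot.free_objs++;
		if (!on_list(&hot)) {	/* prior == NULL: put_partial() */
			put_partials++;
			if (to_tail)
				add_tail(&hot);
			else
				add_head(&hot);
		}
		/* local CPU allocates one object */
		if (!cpu)
			cpu = pop_head();
		if (--cpu->free_objs == 0)
			cpu = NULL;	/* slab now full: deactivate it */
	}
	return put_partials;
}

int main(void)
{
	printf("free to head: %d put_partial calls in %d iterations\n",
	       run(0), ITERS);
	printf("free to tail: %d put_partial calls in %d iterations\n",
	       run(1), ITERS);
	return 0;
}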

Hmmmm... This is a bit different from what I got to here.





2007-12-21 22:27:18

by Christoph Lameter

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Fri, 21 Dec 2007, Peter Zijlstra wrote:

> BTW, does /proc/slabinfo exist again? I thought we set that as a
> requirement for SLUB to be the default and a full replacement for SLAB.

Well the information that would be exposed in /proc/slabinfo would only be
faked up stuff from SLUB. The queues that are described in /proc/slabinfo do not
exist f.e. (tunables do not make sense) and there are more details available via /sys/kernel/slab.

There is a discussion on linux-mm with some of the people working on tools
to make them use the information that SLUB provides.


2007-12-21 22:37:46

by Christoph Lameter

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

Actually, thanks for pointing that out Pekka.... Here is a patch that takes
the regression almost completely away. (Too much jetlag combined with flu
seems to have impaired my thinking; I should have seen this earlier.) I
still need to run my slab performance tests on this. But hackbench
improves.


SLUB: Improve hackbench speed

Increase the mininum number of partial slabs to keep around and put
partial slabs to the end of the partial queue so that they can add
more objects.

Signed-off-by: Christoph Lameter <[email protected]>

---
mm/slub.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-12-21 14:30:11.096324462 -0800
+++ linux-2.6/mm/slub.c 2007-12-21 14:30:33.656441994 -0800
@@ -172,7 +172,7 @@ static inline void ClearSlabDebug(struct
* Mininum number of partial slabs. These will be left on the partial
* lists even if they are empty. kmem_cache_shrink may reclaim them.
*/
-#define MIN_PARTIAL 2
+#define MIN_PARTIAL 5

/*
* Maximum number of desirable partial slabs.
@@ -1616,7 +1616,7 @@ checks_ok:
* then add it.
*/
if (unlikely(!prior))
- add_partial(get_node(s, page_to_nid(page)), page);
+ add_partial_tail(get_node(s, page_to_nid(page)), page);

out_unlock:
slab_unlock(page);

2007-12-21 22:52:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)



On Fri, 21 Dec 2007, Christoph Lameter wrote:
>
> It obviously can replace it and has replaced it for awhile now.

No. If there are 6% performance regressions on TPC-C, then it CAN NOT
replace it!

> Well still SLUB is really superior to SLAB in many ways....
>
> - SLUB is performance-wise much faster than SLAB.

No.

Christoph, statements like this are *exactly* why I don't think SLUB can
make it.

You're closing your eyes to real performance *regression* reports, and then
you claim that SLUB is faster, because you find your own microbenchmarks
that show so for specific loads.

But those specific loads are apparently never the issue.

So stop saying that SLUB is "much faster", as long as hackbench shows that
that is simply NOT THE CASE.

It doesn't matter one whit if SLUB is lots faster, if it's faster for
cases that never matter. So far, I don't think we've *ever* seen any
actual benchmarks that involve any kind of real use where SLUB is
really faster, and we have some major examples of SLUB being disastrously
slower!

Your special-case kmalloc performance tests don't matter, when real user
programs show the exact opposite effect.

And the fact that you dismiss those real user programs just because you
have your own test harness is why I'm ready to throw in the towel on SLUB.

I don't mind performance regressions, but I *do* mind performance
regressions when the main author then says "look, a helicopter!" and
points to some totally different thing and claims that the performance
regression doesn't even exist!

Because those kinds of performance regressions never get fixed, because
they are ignored.

Linus

2007-12-21 22:52:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)



On Fri, 21 Dec 2007, Christoph Lameter wrote:
>
> Actually thanks for pointing that out Pekka.... Here is a patch that takes
> the regression almost completely away

Now *this* is what I wanted to see - rather than argue against other
peoples performance regression reports, an actual patch that acknowledges
the problem.

Thanks.

And now, can the people who made the problem reports and complained about
SLUB please test the patch - the ball is now in your court!

Linus

2007-12-21 22:55:13

by Steven Rostedt

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)


On Fri, 21 Dec 2007, Christoph Lameter wrote:

> Actually, thanks for pointing that out Pekka.... Here is a patch that takes
> the regression almost completely away. (Too much jetlag combined with flu
> seems to have impaired my thinking; I should have seen this earlier.) I
> still need to run my slab performance tests on this. But hackbench
> improves.
>
>
> SLUB: Improve hackbench speed
>
> Increase the mininum number of partial slabs to keep around and put
> partial slabs to the end of the partial queue so that they can add
> more objects.
>
> Signed-off-by: Christoph Lameter <[email protected]>
>

FYI, I'm currently on holiday till Jan 2nd, and I also lost access to the
machine I did the original benchmarks on. But I'll try to reserve it again
after the New Year, and I will run the benchmarks including this patch and
report my findings.

Thanks,

-- Steve

2007-12-21 22:58:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)


* Christoph Lameter <[email protected]> wrote:

> On Fri, 21 Dec 2007, Peter Zijlstra wrote:
>
> > BTW, does /proc/slabinfo exist again? I thought we set that as a
> > requirement for SLUB to be the default and a full replacement for
> > SLAB.
>
> Well the information that would be exposed in /proc/slabinfo would
> only be faked up stuff from SLUB. The queues that are described in
> /proc/slabinfo do not exist f.e. (tunables do not make sense) and
> there are more details available via /sys/kernel/slab.

Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. slabtop
relies on it, people use it every day to monitor memory consumption.

I'm really getting worried that you are apparently incapable of grasping
such _SIMPLE_ concepts. Who the heck cares whether you put in zeros or
whatever else in some of the fields? People use it to know how many
objects are allocated and sure SLUB knows that count, sheesh. How on
earth can you come up with a lame excuse like that? You don't like the
'SLAB' portion of the name perhaps? Is it NIH again?

Really, if your behavior is representative of how our SLAB allocator
will be maintained in the future then i'm very, very worried :-( You
ignore and downplay clear-cut regressions, you insult and attack
testers, you are incredibly stupid about user ABIs (or pretend to be so)
and you distort and mislead all the way. What will you be able to do in
the much less clear-cut cases??

Ingo

2007-12-21 23:00:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)


* Christoph Lameter <[email protected]> wrote:

> if (unlikely(!prior))
> - add_partial(get_node(s, page_to_nid(page)), page);
> + add_partial_tail(get_node(s, page_to_nid(page)), page);

FYI, this gives me:

mm/slub.c: In function 'kfree':
mm/slub.c:2604: warning: passing argument 3 of 'slab_free' discards qualifiers from pointer target type

looks harmless though at first sight.

Ingo

2007-12-21 23:20:21

by David Miller

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

From: Ingo Molnar <[email protected]>
Date: Fri, 21 Dec 2007 23:54:13 +0100

> Really, if your behavior is representative of how our SLAB allocator
> will be maintained in the future then i'm very, very worried :-(

Actually, to the contrary, I actually think Christoph responds to
every problem I've ever reported to him about his per-cpu counters
work and SLUB much better than most people who call themselves
"maintainers" around here.

And I say that without any reservations.

He doesn't deserve the ad-hominem attacks he is getting today, because
he does resolve every problem reported to him.

The guy wrote test cases, he analyzed every problem, he wrote test
patches, and he doesn't stop doing any of that until the issue really
is reported as resolved by the testers.

I'll take Christoph as the implementor and maintainer of anything, any
day of the week. He rocks.

2007-12-21 23:21:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)


* Linus Torvalds <[email protected]> wrote:

> And now, can the people who made the problem reports and complained
> about SLUB please test the patch - the ball is now in your court!

yep, and i ran a quick comparison test on a 2-core box with 3 kernels:

[ best of 5 runs in a row which had a relative jitter of less than 10% ]

MIN v2.6.24.slab v2.6.24.slub v2.6.24.slub.fix
----------------------------------------------------------
mmap: 429.00 402.00 ( -6%) 385.00 (-10%)
select: 11.38 10.46 ( -8%) 11.41 ( 0%)
proc-exec: 121.52 116.77 ( -3%) 120.77 ( 0%)
proc-fork: 106.84 106.19 ( 0%) 107.92 ( 1%)
syscall-open: 3.09 3.13 ( 1%) 3.25 ( 4%)
hackbench-50: 2.85 3.47 ( 21%) 2.88 ( 1%)

and the regression seems to be largely fixed! Not only is the hackbench
one fixed, but mmap shows an above-noise improvement as well.

Acked-by: Ingo Molnar <[email protected]>

And i hereby nominate Pekka as SLUB/SLAB co-maintainer and spokesperson
;-)

Ingo

2007-12-21 23:28:34

by Andrew Morton

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Fri, 21 Dec 2007 23:54:13 +0100
Ingo Molnar <[email protected]> wrote:

> * Christoph Lameter <[email protected]> wrote:
>
> > On Fri, 21 Dec 2007, Peter Zijlstra wrote:
> >
> > > BTW, does /proc/slabinfo exist again? I thought we set that as a
> > > requirement for SLUB to be the default and a full replacement for
> > > SLAB.
> >
> > Well the information that would be exposed in /proc/slabinfo would
> > only be faked up stuff from SLUB. The queues that are described in
> > /proc/slabinfo do not exist f.e. (tunables do not make sense) and
> > there are more details available via /sys/kernel/slab.
>
> Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it.

That would be really bad.

/proc/slabinfo intimately exposes implementation-specific internals and
strictly speaking should never have been introduced. Obviously that isn't
a _practical_ position, but there are no easy solutions here.

We don't want to have to drag a pile of be-compatible-with-slab gunk along
with us for ever.

> slabtop
> relies on it, people use it every day to monitor memory consumption.

slabtop needs to be changed asap.

Possibly we could look at adding some interim make-it-look-like-slab hack
into slub.c for a couple of releases, but no longer.

If we'd updated slabtop six months ago this issue wouldn't have arisen now.

But we didn't do that. We're quite bad at this sort of thing.

2007-12-21 23:40:49

by Pekka Enberg

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

Hi,

On Dec 22, 2007 1:17 AM, Ingo Molnar <[email protected]> wrote:
> yep, and i ran a quick comparison test on a 2-core box with 3 kernels:
>
> [ best of 5 runs in a row which had a relative jitter of less than 10% ]
>
> MIN v2.6.24.slab v2.6.24.slub v2.6.24.slub.fix
> ----------------------------------------------------------
> mmap: 429.00 402.00 ( -6%) 385.00 (-10%)
> select: 11.38 10.46 ( -8%) 11.41 ( 0%)
> proc-exec: 121.52 116.77 ( -3%) 120.77 ( 0%)
> proc-fork: 106.84 106.19 ( 0%) 107.92 ( 1%)
> syscall-open: 3.09 3.13 ( 1%) 3.25 ( 4%)
> hackbench-50: 2.85 3.47 ( 21%) 2.88 ( 1%)
>
> and the regression seems to be largely fixed! Not only is the hackbench
> one fixed, but mmap shows an above-noise improvement as well.
>
> Acked-by: Ingo Molnar <[email protected]>

I thought it might be a bug but couldn't figure it out. The patch
looks good to me too, Christoph.

Reviewed-by: Pekka Enberg <[email protected]>

On Dec 22, 2007 1:17 AM, Ingo Molnar <[email protected]> wrote:
> And i hereby nominate Pekka as SLUB/SLAB co-maintainer and spokesperson
> ;-)

Heh, thanks, but grep me from MAINTAINERS some day, Ingo :-).

Pekka

2007-12-21 23:51:35

by Christoph Lameter

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Fri, 21 Dec 2007, Ingo Molnar wrote:

> I'm really getting worried that you are apparently incapable of grasping
> such _SIMPLE_ concepts. Who the heck cares whether you put in zeros or
> whatever else in some of the fields? People use it to know how many
> objects are allocated and sure SLUB knows that count, sheesh. How on
> earth can you come up with a lame excuse like that? You dont like the
> 'SLAB' portion of the name perhaps? Is it NIH again?

NIH? I wrote major portions of SLAB. I would be hating my own product.
Could you get the facts straight at some point? This is getting weird.

> Really, if your behavior is representative of how our SLAB allocator
> will be maintained in the future then i'm very, very worried :-( You
> ignore and downplay clear-cut regressions, you insult and attack
> testers, you are incredibly stupid about user ABIs (or pretend to be so)
> and you distort and mislead all the way. What will you be able to do in
> the much less clear-cut cases??

I analyzed the issue and argued that the behavior that one test showed in
SLUB is a really special case, and then you conclude that I ignore all
regressions? I have addressed and responded to all reports of regressions
that came to me.

2007-12-21 23:54:43

by Christoph Lameter

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Sat, 22 Dec 2007, Ingo Molnar wrote:

> and the regression seems to be largely fixed! Not only is the hackbench
> one fixed, but mmap shows an above-noise improvement as well.
>
> Acked-by: Ingo Molnar <[email protected]>

Well, let's use the version attached to this patch. It turns out that it is
not important to increase the number of partial slabs. Just putting the
slab onto the tail of the list does it.

> And i hereby nominate Pekka as SLUB/SLAB co-maintainer and spokesperson
> ;-)

He is already the co-maintainer of the slab allocators. See the
MAINTAINERS file.



SLUB: Improve hackbench speed

Increase the mininum number of partial slabs to keep around and put
partial slabs to the end of the partial queue so that they can add
more objects.

Signed-off-by: Christoph Lameter <[email protected]>

---
mm/slub.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-12-21 14:30:11.096324462 -0800
+++ linux-2.6/mm/slub.c 2007-12-21 15:10:17.611022381 -0800
@@ -1611,12 +1611,12 @@ checks_ok:
goto slab_empty;

/*
- * Objects left in the slab. If it
- * was not on the partial list before
- * then add it.
+ * Objects left in the slab. If it was not on the partial list before
+ * then add it. Add it to the end since there is only a single object
+ * which would make slab_alloc inefficient.
*/
if (unlikely(!prior))
- add_partial(get_node(s, page_to_nid(page)), page);
+ add_partial_tail(get_node(s, page_to_nid(page)), page);

out_unlock:
slab_unlock(page);

2007-12-22 00:03:37

by Chuck Ebbert

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On 12/21/2007 06:54 PM, Christoph Lameter wrote:
>
> SLUB: Improve hackbench speed
>
> Increase the mininum number of partial slabs to keep around and put
> partial slabs to the end of the partial queue so that they can add
> more objects.
>
> Signed-off-by: Christoph Lameter <[email protected]>
>


You forgot to update the commit message.

Should be:


Put partial slabs to the end of the partial queue so that they can add
more objects.

2007-12-22 00:28:45

by Andi Kleen

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

Ingo Molnar <[email protected]> writes:

> Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. slabtop
> relies on it, people use it every day to monitor memory consumption.

It's definitely not a stable ABI. slabtop tends to exit without any
error message on any slabinfo version number increase and I've seen
that happen several times in not so old kernels.

Requiring just another slabtop update isn't really a big deal.

Also it's not that it's a critical functionality like udev.

-Andi

2007-12-22 00:35:22

by Chuck Ebbert

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On 12/21/2007 07:28 PM, Andi Kleen wrote:
> Ingo Molnar <[email protected]> writes:
>
>> Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. slabtop
>> relies on it, people use it every day to monitor memory consumption.
>
> It's definitely not a stable ABI. slabtop tends to exit without any
> error message on any slabinfo version number increase and I've seen
> that happen several times in not so old kernels.
>
> Requiring just another slabtop update isn't really a big deal.
>
> Also it's not that it's a critical functionality like udev.
>

There's also Documentation/vm/slabinfo.c, which really belongs in
some kind of 'kernel utilities' package. Since it's so intimately
tied to kernel version I don't think it belongs in util-linux.

2007-12-22 02:22:16

by Matt Mackall

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)


On Sat, 2007-12-22 at 01:28 +0100, Andi Kleen wrote:
> Ingo Molnar <[email protected]> writes:
>
> > Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. slabtop
> > relies on it, people use it every day to monitor memory consumption.
>
> It's definitely not a stable ABI. slabtop tends to exit without any
> error message on any slabinfo version number increase and I've seen
> that happen several times in not so old kernels.
>
> Requiring just another slabtop update isn't really a big deal.

Might as well mention that SLOB also doesn't provide this functionality,
nor would it make any sense for it to try (because there are no
slab-like objects to report about).

--
Mathematics is the supreme nostalgia of our time.

2007-12-22 02:43:47

by Andi Kleen

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Fri, Dec 21, 2007 at 08:19:42PM -0600, Matt Mackall wrote:
>
> On Sat, 2007-12-22 at 01:28 +0100, Andi Kleen wrote:
> > Ingo Molnar <[email protected]> writes:
> >
> > > Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. slabtop
> > > relies on it, people use it every day to monitor memory consumption.
> >
> > It's definitely not a stable ABI. slabtop tends to exit without any
> > error message on any slabinfo version number increase and I've seen
> > that happen several times in not so old kernels.
> >
> > Requiring just another slabtop update isn't really a big deal.
>
> Might as well mention that SLOB also doesn't provide this functionality,
> nor would it make any sense for it to try (because there are no
> slab-like objects to report about).

% slabinfo
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
...
anon_vma 1551 24 81.9K 20/18/2 128 0 90 45

Seems certainly like information that slabtop could report, even if it's
not exactly the same.

-Andi

2007-12-22 09:50:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)


* David Miller <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
> Date: Fri, 21 Dec 2007 23:54:13 +0100
>
> > Really, if your behavior is representative of how our SLAB allocator
> > will be maintained in the future then i'm very, very worried :-(
>
> Actually, to the contrary, I actually think Christoph responds to
> every problem I've ever reported to him about his per-cpu counters
> work and SLUB much better than most people who call themselves
> "maintainers" around here.
>
> And I say that without any reservations.
>
> He doesn't deserve the ad-hominem attacks he is getting today, because
> he does resolve every problem reported to him.
>
> The guy wrote test cases, he analyzed every problem, he wrote test
> patches, and he doesn't stop doing any of that until the issue really
> is reported as resolved by the testers.
>
> I'll take Christoph as the implementor and maintainer of anything, any
> day of the week. He rocks.

well, maybe i got unlucky, this hackbench thing being the first time i'm
exposed to a major SLUB regression. The hackbench problem was dead easy
to reproduce, i (and others) offered immediate testing of whatever test
patches, it also matched the profiles of the TPC-C regression but still
i was only offered explanations about why this workload does not matter
and how others suck because they are unable to give immediate test
feedback from millions-of-dollars test equipment that is barely able to
run our devel kernels. The regression is fixed now and i'm a happy
camper!

Christoph, i'd like to apologize for all overly harsh words i said. (and
i said quite a few :-/ )

Ingo

2007-12-22 10:04:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)


* Andi Kleen <[email protected]> wrote:

> Ingo Molnar <[email protected]> writes:
>
> > Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it.
> > slabtop relies on it, people use it every day to monitor memory
> > consumption.
>
> It's definitely not a stable ABI. slabtop tends to exit without any
> error message on any slabinfo version number increase and I've seen
> that happen several times in not so old kernels.

so because we sucked in the past we can continue to suck? ;)

Why are we still arguing about this? We kernel developers are foxes
amongst the hens and if a compatibility issue comes up we have to act
_doubly_ conservatively.

You think it's reasonable to not offer /proc/slabinfo? You can fairly
assume that a user considers it absolutely unreasonable. If other kernel
developers tell you: "no, it's fine", then it's as if other foxes told
you "no, it's fine to eat that hen, we do it all the time too!" ;-)

> Requiring just another slabtop update isn't really a big deal.

certainly. But consider this from the user's perspective who tries one
of our devel kernels. He suspects a memory leak. Runs slabtop and
gets:

$ slabtop
fopen /proc/slabinfo: No such file or directory

and would fairly conclude: "ok, this new Linux kernel looks quite
apparently unfinished, i'm outta here".

We do this way too frequently and many silly details like this _do_
mount up.

> Also it's not that it's a critical functionality like udev.

Sure, we can argue about details, like that not all fields in /proc/slabinfo
are relevant, and that slabtop should be a bit more careful, etc., but
we've got what we've got because _we_ built the current code, so we
might as well accept the consequences it brings. Most of the basic
output of slabtop:

Active / Total Objects (% used) : 648754 / 747076 (86.8%)
Active / Total Slabs (% used) : 39394 / 39394 (100.0%)
Active / Total Caches (% used) : 103 / 143 (72.0%)
Active / Total Size (% used) : 138884.36K / 151075.96K (91.9%)
Minimum / Average / Maximum Object : 0.01K / 0.20K / 128.00K

OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
261928 239808 91% 0.13K 9032 29 36128K dentry_cache
222048 174144 78% 0.05K 3084 72 12336K buffer_head
187232 178929 95% 0.48K 23404 8 93616K ext3_inode_cache
24416 17908 73% 0.27K 1744 14 6976K radix_tree_node

could be offered on SLUB too.
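
The emulation is mostly arithmetic: the summary header is just a sum
over the per-cache rows. A minimal sketch (assuming the slabinfo 2.0
row layout - name, active_objs, num_objs, ... - arriving on stdin):

```shell
# Sketch: recompute the slabtop "Active / Total Objects" header from
# slabinfo-2.0-style rows (name active_objs num_objs ...) on stdin.
# The two header lines of the file are skipped.
summarize_slabinfo() {
    awk 'NR > 2 { active += $2; total += $3 }
         END { printf "Active / Total Objects (%% used) : %d / %d (%.1f%%)\n",
                      active, total, total ? 100 * active / total : 0 }'
}
```

The other summary lines fall out the same way from the remaining
columns.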

'top' isnt critical functionality either like udev, and the ABI does not
only cover 'critical' functionality. A utility suddenly not working
gives Linux a pretty amateurish feeling. Should we tell users/admins:
"Hehe, gotcha! Didnt you know /proc/slabinfo was not an ABI? Poor sob.
If you want your stuff to continue working, use Windows next time around
or what. Sheesh, what do these people want!" ;-)

the rule is very simple: unless you have really, really, really, REALLY
good reasons, just dont break stuff.

Ingo

2007-12-22 12:36:18

by Andi Kleen

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

> could be offered on SLUB too.

Sure (I argued that in an earlier mail, in fact), but it doesn't make
sense to fake the old slabinfo format.

>
> 'top' isnt critical functionality either like udev, and the ABI does not
> only cover 'critical' functionality. A utility suddenly not working
> gives Linux a pretty amateurish feeling. Should we tell users/admins:
> "Hehe, gotcha! Didnt you know /proc/slabinfo was not an ABI? Poor sob.
> If you want your stuff to continue working, use Windows next time around
> or what. Sheesh, what do these people want!" ;-)

While I sympathise with this point, in practice non-critical ABIs change
regularly (and sometimes even critical ones like sched_yield ...)

>
> the rule is very simple: unless you have really, really, really, REALLY
> good reasons, just dont break stuff.

Removing an interface that exposes lots of internal details when you
rewrite the subsystem and those internal details all change seems
like a good reason to me.

Besides, the original interface was broken anyway. The version
number was one of the interface's worst ideas ever imho, and slabtop's
handling of it quite dumb. The better way would have been to always add
new fields and never remove old ones. Hopefully the new /sys based
interface will be better.

-Andi

2007-12-22 12:51:57

by Pekka Enberg

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

Hi,

On Dec 22, 2007 2:37 PM, Andi Kleen <[email protected]> wrote:
> Removing an interface that exposes lots of internal details when you
> rewrite the subsystem and those internal details all change seems
> like a good reason to me.

Yeah but then again removing an interface that has been around for
ever is a real PITA for users. Besides, emulating /proc/slabinfo ain't
so bad:

http://lkml.org/lkml/2007/12/22/58

Pekka

2007-12-22 12:54:11

by Andi Kleen

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

> Yeah but then again removing an interface that has been around for
> ever

Version 2.1 hasn't been around forever and at least slabtop does not
work over version number changes in my experience.

-Andi

2007-12-22 13:02:04

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

> > the rule is very simple: unless you have really, really, really, REALLY
> > good reasons, just dont break stuff.

Could we please drop "kset: move /sys/slab to /sys/kernel/slab" from -mm?
Old slabinfo complains that SLUB_DEBUG is not on and refuses to do anything.

2007-12-22 13:27:18

by Pekka Enberg

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

Hi Andi,

On Dec 22, 2007 2:54 PM, Andi Kleen <[email protected]> wrote:
> > Yeah but then again removing an interface that has been around for
> > ever
>
> Version 2.1 hasn't been around forever and at least slabtop does not
> work over version number changes in my experience.

True but the default user-visible ABI (the one without statistics) is
unchanged since version 2.0:

http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commitdiff;h=7ee832030b3a5d3a87f8f85b1fc773be15e98608

http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commitdiff;h=e79fbf181ba738ca410f2c457ced220b738e8856

But anyway, I don't care too much either way so I'll just let Andrew
sort this out, as usual.

Pekka

2007-12-22 18:02:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)



On Sat, 22 Dec 2007, Ingo Molnar wrote:
> >
> > It's definitely not a stable ABI. slabtop tends to exit without any
> > error message on any slabinfo version number increase and I've seen
> > that happen several times in not so old kernels.
>
> so because we sucked in the past we can continue to suck? ;)

Well, I do have to admit that I'm not a huge fan of /proc/slabinfo, and
the fact that there hasn't been a lot of complaints about it going away
does seem to imply that it wasn't a very important ABI.

I'm the first to stand up for backwards compatibility, but I also try to
be realistic, and so far nobody has really complained about the fact that
/proc/slabinfo went away on any grounds but on the general principle of
"it was a user-visible ABI".

We've changed user-visible ABI's in the past in the hope that they weren't
actually used. Sometimes it worked, sometimes it didn't. In this case, I
think it still has the potential of working.

That said, I'm not a *huge* fan of /sys/slab/ either. I can get the info I
as a developer tend to want from there, but it's really rather less
convenient than I'd like. It is really quite hard, for example, to get any
kind of "who actually uses the most *memory*" information out of there.

You have to use something like this:

for i in *
do
	order=$(cat "$i/order")
	slabs=$(cat "$i/slabs")
	object_size=$(cat "$i/object_size")
	objects=$(cat "$i/objects")
	truesize=$(( $slabs<<($order+12) ))
	size=$(( $object_size*$objects ))
	percent=$(( $truesize/100 ))
	if [ $percent -gt 0 ]; then
		percent=$(( $size / $percent ))
	fi
	mb=$(( $size >> 10 ))
	printf "%10d MB %3d %s\n" $mb $percent $i
done | sort -n | tail

which works fine, but while it's actually _much_ more readable than doing
the same thing with /proc/slabinfo was, I worry about the lack of
atomicity in getting the statistics.

I dunno. I do think /sys/slab/ is a better interface than /proc/slabinfo
was. I just wonder if it could be better still.

Linus

2007-12-22 19:15:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)


* Linus Torvalds <[email protected]> wrote:

> > > It's definitely not a stable ABI. slabtop tends to exit without
> > > any error message on any slabinfo version number increase and I've
> > > seen that happen several times in not so old kernels.
> >
> > so because we sucked in the past we can continue to suck? ;)
>
> Well, I do have to admit that I'm not a huge fan of /proc/slabinfo,
> and the fact that there hasn't been a lot of complaints about it going
> away does seem to imply that it wasn't a very important ABI.
>
> I'm the first to stand up for backwards compatibility, but I also try
> to be realistic, and so far nobody has really complained about the
> fact that /proc/slabinfo went away on any grounds but on the general
> principle of "it was a user-visible ABI".
>
> We've changed user-visible ABI's in the past in the hope that they
> weren't actually used. Sometimes it worked, sometimes it didn't. In
> this case, I think it still has the potential of working.
>
> That said, I'm not a *huge* fan of /sys/slab/ either. I can get the
> info I as a developer tend to want from there, but it's really rather
> less convenient than I'd like. It is really quite hard, for example,
> to get any kind of "who actually uses the most *memory*" information
> out of there.

Yes, i agree that me calling it an 'ABI' stretches the term - we dont
want to put ourselves into a forced compatibility corner in every case
where /proc does something really stupid. But /proc/slabinfo is being
used, slabtop is installed and deployed, so why break it unnecessarily?
It's not like we couldnt remove /proc/slabinfo later on, via the normal
route of feature removal. I think Pekka's patch that adds /proc/slabinfo
is simple and reasonable enough for 2.6.24.

We can get rid of /proc/slabinfo but then it should i think be done via
the normal route of Documentation/feature-removal-schedule.txt. Was
there any particular problem with /proc/slabinfo, at least as far as the
read-only access that slabtop does goes? I dont think there was. So why
should we go out on a limb _not_ providing it when there's a patch
available, etc. I just googled a bit and people _have_ asked about
slabtop availability, and the impression was that it would be offered by
the time SLUB becomes the default. (and this was mentioned by others in
this discussion)

I'd also not rely on the fact that only a few people are complaining.
Most people, even 2.6.24-rc early adopters, still use SLAB because early
adopters typically use their .23 .config and do a 'make oldconfig' -
which picks up SLAB. So SLUB use will become more widespread only once
2.6.24 is out and is packaged in distros. Distros will likely pick SLUB
if there's no performance worries and if it's the default. Fedora
rawhide already uses SLUB.

Ingo

2007-12-22 19:26:19

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Sat, Dec 22, 2007 at 10:01:37AM -0800, Linus Torvalds wrote:
> Well, I do have to admit that I'm not a huge fan of /proc/slabinfo, and
> the fact that there hasn't been a lot of complaints about it going away
> does seem to imply that it wasn't a very important ABI.
>
> I'm the first to stand up for backwards compatibility, but I also try to
> be realistic, and so far nobody has really complained about the fact that
> /proc/slabinfo went away on any grounds but on the general principle of
> "it was a user-visible ABI".

That's probably because the people who are most likely to be using
/proc/slabinfo tend to be people using kernels in production
environments, and they probably haven't started playing with SLUB yet.
So the fact we haven't heard the yelling doesn't necessarily mean that
people won't notice.

I know of a number of debugging scripts that periodically poll
/proc/slabinfo to find out what is using memory and to track trends
and predict potential problems with memory usage and balance. Indeed,
I've left several such scripts behind at various customer
installations after debugging memory usage problems, and some system
administrators very much like being able to collect that information
easily. So I can say for sure that there are scripts out there
that depend on /proc/slabinfo, but they generally tend to be at sites
where people won't be running bleeding edge kernels.

> That said, I'm not a *huge* fan of /sys/slab/ either. I can get the info I
> as a developer tend to want from there, but it's really rather less
> convenient than I'd like. It is really quite hard, for example, to get any
> kind of "who actually uses the most *memory*" information out of there.

I have a general problem with things in /sys/slab, and that's just
because they are *ugly*. So yes, you can write ugly shell scripts
like this to get out information:

> You have to use something like this:
>
> for i in *
> do
> 	order=$(cat "$i/order")
> 	slabs=$(cat "$i/slabs")
> 	object_size=$(cat "$i/object_size")
> 	objects=$(cat "$i/objects")
> 	truesize=$(( $slabs<<($order+12) ))
> 	size=$(( $object_size*$objects ))
> 	percent=$(( $truesize/100 ))
> 	if [ $percent -gt 0 ]; then
> 		percent=$(( $size / $percent ))
> 	fi
> 	mb=$(( $size >> 10 ))
> 	printf "%10d MB %3d %s\n" $mb $percent $i
> done | sort -n | tail

But sometimes when trying to eyeball what is going on, it's a lot
nicer just to use "cat /proc/slabinfo".

Another problem with using /sys/slab is that it is downright *ugly*.
Why is it for example, that /sys/slab/dentry is a symlink to
../slab/:a-0000160? Because of this, Linus's script reports dentry twice:

10942 MB 86 buffer_head
10942 MB 86 journal_head
10943 MB 86 :a-0000056
21923 MB 86 dentry
21924 MB 86 :a-0000160
78437 MB 94 ext3_inode_cache

Once as "dentry" and once as ":a-0000160". A similar situation
happens with journal_head and ":a-0000056". I'm sure this could be
filtered out with the right shell or perl script, but now we're
getting even farther from /proc/slabinfo.
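
For what it's worth, the filtering is doable: the aliases are symlinks,
so each merged cache can be reported once under a readable name. A
sketch, assuming the /sys/slab layout described above (the directory
argument is there so it can be tried elsewhere):

```shell
# Sketch: list each slab cache once, preferring a readable alias name
# (e.g. "dentry") over the ":a-..." merged-cache directory it links to.
# Assumes aliases are symlinks into the same directory, as in /sys/slab.
list_slabs_once() {
    dir=${1:-/sys/slab}
    for i in "$dir"/*; do
        name=${i##*/}
        if [ -L "$i" ]; then
            t=$(readlink "$i")
            # key by the link target; the "0" makes alias names sort first
            printf '%s 0 %s\n' "${t##*/}" "$name"
        else
            printf '%s 1 %s\n' "$name" "$name"
        fi
    done | LC_ALL=C sort | awk '$1 != prev { print $3; prev = $1 }'
}
```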

Further, if you look at Documentation/sysfs-rules.txt (does anyone
ever read it or bother to update it?), /sys/slab isn't documented so
it falls in the category of "everything else is a kernel-level
implementation detail which is not guaranteed to be stable". Nice....

Also, looking at sysfs-rules.txt, it talks an awful lot about reading
symlinks, and implying that actually trying to *dereference* the
symlinks might not actually work. So I'm pretty sure Linus's "cd
/sys/slab; for i in *" violates the Documentation/sysfs-rules.txt
guidelines even if /sys/slab were documented.

In general, /sys has some real issues that need some careful scrutiny.
Documentation/sysfs-rules.txt is *not* enough. For example, various
official websites are telling people that the way to enable or disable
power-saving is:

echo 5 > /sys/bus/pci/drivers/iwl4965/*/power_level

(Reference: http://www.lesswatts.org/tips/wireless.php)

Aside from the fact that "iwconfig wlan0 power on" is easier to type
and simpler to understand, I'm pretty sure that the above violates
sysfs-rules.txt. From looking at sysfs-rules.txt, the fact that you
have to read symlinks as the only valid and guaranteed way for things
to work means that you have to write a perl script to safely
manipulate things in /sys. People are ignoring sysfs-rules.txt, and
giving advice in websites contrary to sysfs-rules.txt because it's too
hard to follow.

> I worry about the lack of atomicity in getting the statistics.

And that's another problem with trying to use /sys when trying to get
statistics in bulk... Very often the tables in /proc/* were not only
more convenient, but allowed for atomic access.

- Ted

2007-12-22 19:30:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)


* Ingo Molnar <[email protected]> wrote:

> I'd also not rely on the fact that only a few people are complaining.
> Most people, even 2.6.24-rc early adopters, still use SLAB because
> early adopters typically use their .23 .config and do a 'make
> oldconfig' - which picks up SLAB. So SLUB use will become more
> widespread only once 2.6.24 is out and is packaged in distros. Distros
> will likely pick SLUB if there's no performance worries and if it's
> the default. Fedora rawhide already uses SLUB.

here's some silly statistics about allocator coverage on lkml, based on
configs reported to lkml in the past 4 months:

$ for N in SLAB SLUB SLOB; do printf "%s: " $N; grep ^CONFIG_$N=y linux-kernel | wc -l; done
SLAB: 70
SLUB: 77
SLOB: 4

so SLUB and SLAB are utilized about equally amongst people who reported
configs to lkml.

But people who use SLUB enabled it intentionally - and they are thus
much less likely to complain about this choice of theirs.

Reporting:

"I just enabled SLUB instead of SLAB, and hey guys, it does not have
SLABinfo"

has a foolish ring to it, doesnt it? I'd rather ask carefully, like this
person did:

http://kerneltrap.org/mailarchive/linux-kernel/2007/10/12/335765

This is one of the reasons why i think the whole SLAB->SLUB renaming was
bad - we should have done SLAB2 or SLAB^2 instead, so that people expect
_at least as good_ behavior (performance, features, etc.) from SLUB as
from SLAB. Instead of "something different".

anyway ... i think we still generally suck _alot_ at providing
near-transparent kernel upgrades to users. (kernel upgrades are still a
pain and risk, and often just due to poor developer choices on our
side.)

It's nowhere near as bad as the 2.4->2.6 transition was (in fact it
shouldnt even be mentioned in the same sentence), and it's getting
better gradually, but i think we should just by default be 10 times more
careful about these things - whenever it is borderline technically sane
to do so. We induce enough unintentional breakage of user-space, we
shouldnt compound it with intentional breakages. The kernel community
still has _a lot_ of user and distro trust to win back. So being seen as
over-cautious wont harm.

Ingo

2007-12-22 19:45:34

by Andi Kleen

[permalink] [raw]
Subject: slabtop replacement was Re: Major regression on hackbench with SLUB (more numbers)


As a followup to the thread regarding slabinfo and Andrew's
earlier query about updating slabtop.

watch -n1 slabinfo -S

seems to be a reasonable replacement for slabtop. The only
missing functionality is the hotkeys: you have to restart it
to get a different sort order (and read the slabinfo source
to find out the correct option), but other than that it works.

So I would propose to just retire slabtop and replace
it with a simple shell script doing that.

A manpage for slabinfo would be useful though. Anybody
volunteering to write one?

-Andi

2007-12-22 21:01:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)



On Sat, 22 Dec 2007, Theodore Tso wrote:
>
> I have a general problem with things in /sys/slab, and that's just
> because they are *ugly*. So yes, you can write ugly shell scripts
> like this to get out information:

[ script deleted ]

> But sometimes when trying to eyeball what is going on, it's a lot
> nicer just to use "cat /proc/slabinfo".

.. and I call BS on this claim.

/proc/slabinfo was (and is) totally pointless for "trying to eyeball
what's going on". The output is totally unreadable, and useless. You end
up with exactly the same script as above, except it reads as

cat /proc/slabinfo | (read headerline
while read name active num objsize objsperslab pagesperslab rest
do
	realsize=$(( num * objsize ))
	size=$(( active * objsize ))
	.. exact same rest of loop ..
done | sort -n | ..

so no, "cat /proc/slabinfo" was almost never practical on its own.

The *one* advantage it does have is that you can forward it to others.
That's a big advantage. But no, it wasn't ever readable for eyeballing,
because it doesn't even give you a memory usage thing (just "number of
objects and object size" as separate numbers).

But the "everything in one file" indubitably did make it a lot easier for
just attaching it to bug-reports.

> Another problem with using /sys/slab is that it is downright *ugly*.
> Why is it for example, that /sys/slab/dentry is a symlink to
> ../slab/:a-0000160?

That's the only really ugly thing there. Otherwise, it's pretty nice, but
having a million files makes for problems when trying to send somebody
else the full info.

Linus

2007-12-22 21:50:50

by Al Viro

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Sat, Dec 22, 2007 at 01:00:09PM -0800, Linus Torvalds wrote:

> > Another problem with using /sys/slab is that it is downright *ugly*.
> > Why is it for example, that /sys/slab/dentry is a symlink to
> > ../slab/:a-0000160?
>
> That's the only really ugly thing there. Otherwise, it's pretty nice, but
> having a million files makes for problems when trying to send somebody
> else the full info.

*raised brows*

I would say that there's that really ugly thing with embedding kobject
into a struct with the lifetime rules of its own and then having that
struct kfree()d while kobject might still have references, but hey,
maybe it's just me and my lack of appreciation of the glory that is
sysfs.

Al, too tired of ranting about sysfs being a major architecture
mistake and a recurring source of turds like that one...

2007-12-22 22:15:25

by Willy Tarreau

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Sat, Dec 22, 2007 at 01:00:09PM -0800, Linus Torvalds wrote:
>
>
> On Sat, 22 Dec 2007, Theodore Tso wrote:
> >
> > I have a general problem with things in /sys/slab, and that's just
> > because they are *ugly*. So yes, you can write ugly shell scripts
> > like this to get out information:
>
> [ script deleted ]
>
> > But sometimes when trying to eyeball what is going on, it's a lot
> > nicer just to use "cat /proc/slabinfo".
>
> .. and I call BS on this claim.
>
> /proc/slabinfo was (and is) totally pointless for "trying to eyeball
> what's going on". The output is totally unreadable, and useless. You end
> up with exactly the same script as above, except it reads as
>
> cat /proc/slabinfo | (read headerline
> while read name active num objsize objsperslab pagesperslab rest
> do
> 	realsize=$(( num * objsize ))
> 	size=$(( active * objsize ))
> 	.. exact same rest of loop ..
> done | sort -n | ..
>
> so no, "cat /proc/slabinfo" was almost never practical on its own.
>
> The *one* advantage it does have is that you can forward it to others.
> That's a big advantage. But no, it wasn't ever readable for eyeballing,
> because it doesn't even give you a memory usage thing (just "number of
> objects and object size" as separate numbers).

I don't agree with you Linus. I'm one of those used to quickly take a
look at its contents when I don't know where all my memory has gone.
I know by experience that if I find 6-digit values in dentry_cache or
inode_cache, all I was looking for is there, and that's enough for a
quick diag. It doesn't give anything accurate, but it's very useful
to quickly diag out some problems on production systems.

It was this week that I noticed for the first time that slabinfo did
not exist anymore, and I did not know what replaced it. Lacking time,
I finally gave up and *supposed* that the memory was eaten by the
usual suspects.

I can understand that it has to go away for technical reasons, but Ted
is right, please don't believe that nobody uses it just because you got
no complaint. While people are not likely to perform all computations
in scripts, at least they're used to find some quickly identifiable
patterns there.

Willy

2007-12-22 23:28:43

by Karol Swietlicki

[permalink] [raw]
Subject: Re: slabtop replacement was Re: Major regression on hackbench with SLUB (more numbers)

On 22/12/2007, Andi Kleen <[email protected]> wrote:
> A manpage for slabinfo would be useful though. Anybody
> volunteering to write one?
>
> -Andi

That would be me.
I'm a newbie and never wrote a man page before, so it will take a few
days, but I'm bored and out of ideas for any new code for the time
being. Should be fun.

Karol Swietlicki

2007-12-22 23:29:46

by Al Viro

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Sat, Dec 22, 2007 at 09:50:09PM +0000, Al Viro wrote:
> On Sat, Dec 22, 2007 at 01:00:09PM -0800, Linus Torvalds wrote:
>
> > > Another problem with using /sys/slab is that it is downright *ugly*.
> > > Why is it for example, that /sys/slab/dentry is a symlink to
> > > ../slab/:a-0000160?
> >
> > That's the only really ugly thing there. Otherwise, it's pretty nice, but
> > having a million files makes for problems when trying to send somebody
> > else the full info.
>
> *raised brows*
>
> I would say that there's that really ugly thing with embedding kobject
> into a struct with the lifetime rules of its own and then having that
> struct kfree()d while kobject might still have references, but hey,
> maybe it's just me and my lack of appreciation of the glory that is
> sysfs.
>
> Al, too tired of ranting about sysfs being a major architecture
> mistake and a recurring source of turds like that one...

BTW, the fact that presence of that kobject is conditional makes life even
uglier - we have to either kfree() or drop a reference to kobject leaving
actual kfree() to its ->release(). While we are at it, when do we remove
the symlinks? That got to add even more fun for the lifetime rules...

Sigh... How does one set up a script that would mail a warning upon a
git pull that introduces any instances of keywords from a given set? I
hadn't noticed that slub was using sysfs when it went into the tree back
in May ;-/

2007-12-22 23:32:39

by Jason L Tibbitts III

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

>>>>> "IM" == Ingo Molnar <[email protected]> writes:

IM> Distros will likely pick SLUB if there's no performance worries
IM> and if it's the default. Fedora rawhide already uses SLUB.

Actually, it seems to me that not only does Fedora rawhide use SLUB,
but Fedora 8 and 7 use it as well. They don't have /proc/slabinfo and
they all seem to have CONFIG_SLUB=y:

> grep -r CONFIG_SLUB=y kernel
kernel/devel/config-generic:CONFIG_SLUB=y
kernel/F-7/configs/config-generic:CONFIG_SLUB=y
kernel/F-8/config-generic:CONFIG_SLUB=y

- J<

2007-12-23 01:01:19

by Steven Rostedt

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)


[/me sneaks away from the family]

On Sat, 22 Dec 2007, Willy Tarreau wrote:

> On Sat, Dec 22, 2007 at 01:00:09PM -0800, Linus Torvalds wrote:
> > On Sat, 22 Dec 2007, Theodore Tso wrote:
> > > But sometimes when trying to eyeball what is going on, it's a lot
> > > nicer just to use "cat /proc/slabinfo".
> >
> > .. and I call BS on this claim.
> >

[...]

>
> I can understand that it has to go away for technical reasons, but Ted
> is right, please don't believe that nobody uses it just because you got
> no complaint. While people are not likely to perform all computations
> in scripts, at least they're used to find some quickly identifiable
> patterns there.
>

I know when I'm looking for memory leaks, I've asked customers to give
snapshots of slabinfo at periodic times (once a day even, matters how bad
the leak is). This has been helpful in seeing if something did indeed
leak.

If you have a slab cache that constantly grows, and never shrinks, that's
a good indication that something might be leaking. Not always, since there
can be legitimate reasons for that, but sometimes it helps.
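
The comparison itself is mechanical - something like the following
sketch, which diffs two slabinfo snapshots and lists the caches whose
object count grew (field positions assumed per the 2.0 layout; the
snapshot filenames are just examples):

```shell
# Sketch: compare two slabinfo snapshots (e.g. taken a day apart) and
# print caches whose total object count grew, biggest growth first.
# Assumes the 2.0 row layout: name active_objs num_objs ...
slab_growth() {
    awk 'NR == FNR { before[$1] = $3; next }
         ($1 in before) && ($3 + 0 > before[$1] + 0) {
             printf "%8d %s\n", $3 - before[$1], $1
         }' "$1" "$2" | sort -rn
}
# usage: slab_growth slabinfo.monday slabinfo.tuesday
```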

But I still scratch my head when ever I need to touch sysfs.

[/me runs back to the family]

-- Steve

2007-12-23 05:16:03

by Willy Tarreau

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Sat, Dec 22, 2007 at 07:59:47PM -0500, Steven Rostedt wrote:
[...]
> But I still scratch my head when ever I need to touch sysfs.

Same here. In fact, I've always considered that procfs was for
humans while sysfs was for tools. sysfs reminds me too much of the
unexploitable /devices in Solaris. With the proper tools, I think
we can do a lot with it, but it's not as intuitive to find the
proper tools as it was to do "ls" followed by "cat" in /proc.

Willy

2007-12-23 14:14:14

by Andi Kleen

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

> Same here. In fact, I've always considered that procfs was for
> humans while sysfs was for tools. sysfs reminds me too much the
> unexploitable /devices in Solaris. With the proper tools, I think
> we can do a lot with it, but it's not as intuitive to find the
> proper tools as it was to do "ls" followed by "cat" in /proc.

find /sys/... -type f | while read i ; do echo "$i: $(<$i)" ; done

tends to work reasonably well for a quick overview, but yes
cat was nicer for humans.

-Andi

2007-12-24 03:57:28

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Sun, Dec 23, 2007 at 03:15:00PM +0100, Andi Kleen wrote:
> > Same here. In fact, I've always considered that procfs was for
> > humans while sysfs was for tools. sysfs reminds me too much the
> > unexploitable /devices in Solaris. With the proper tools, I think
> > we can do a lot with it, but it's not as intuitive to find the
> > proper tools as it was to do "ls" followed by "cat" in /proc.
>
> find /sys/... -type f | while read i ; do echo "$i: $(<$i)" ; done
>
> tends to work reasonably well for a quick overview, but yes
> cat was nicer for humans.

Until you start to wonder what the heck :a-0000136 is:

/sys/slab/:a-0000136/objs_per_slab: 30

Sigh...

- Ted

2007-12-24 03:59:23

by Alessandro Suardi

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On 22 Dec 2007 16:52:56 -0600, Jason L Tibbitts III <[email protected]> wrote:
> >>>>> "IM" == Ingo Molnar <[email protected]> writes:
>
> IM> Distros will likely pick SLUB if there's no performance worries
> IM> and if it's the default. Fedora rawhide already uses SLUB.
>
> Actually, it seems to me that not only does Fedora rawhide use SLUB,
> but Fedora 8 and 7 use it as well. They don't have /proc/slabinfo and
> they all seem to have CONFIG_SLUB=y:
>
> > grep -r CONFIG_SLUB=y kernel
> kernel/devel/config-generic:CONFIG_SLUB=y
> kernel/F-7/configs/config-generic:CONFIG_SLUB=y
> kernel/F-8/config-generic:CONFIG_SLUB=y

Also true for at least recent versions of FC6 (which my bittorrent
machine is still on), which currently run 2.6.22-14 based kernels.
And no, I didn't notice - that box usually hits triple-digit uptime
before I reboot; in fact, it's still running a 2.6.22-9 based kernel.

--alessandro

"Damaged people are dangerous. They know they can survive."

(Anna Barton/Juliette Binoche, 'Damage')

2007-12-24 19:21:25

by Christoph Lameter

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Sun, 23 Dec 2007, Theodore Tso wrote:

> > tends to work reasonably well for a quick overview, but yes
> > cat was nicer for humans.
>
> Until you start to wonder what the heck :a-0000136 is:
>
> /sys/slab/:a-0000136/objs_per_slab: 30
>
> Sigh...

That is why there is a slabinfo tool that does all the nice formatting.

Do a

gcc -o slabinfo Documentation/vm/slabinfo.c

Then run slabinfo instead of doing cat /proc/slabinfo

2007-12-24 19:25:06

by Christoph Lameter

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Sat, 22 Dec 2007, Ingo Molnar wrote:

> Christoph, i'd like to apologize for all overly harsh words i said. (and
> i said quite a few :-/ )

Looks like a bad interaction due to overload, vacation, a flu epidemic
hitting me and my family, and also Christmas coming up, so that I could
not do my usual workload. I need to find some way to cut back on the
stuff that I am involved with, it seems. Thanks for the SLUB patches.

2007-12-24 23:38:20

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Mon, Dec 24, 2007 at 11:21:14AM -0800, Christoph Lameter wrote:
>
> That is why there is a slabinfo tool that does all the nice formatting.
>
> Do a
>
> gcc -o slabinfo Documentation/vm/slabinfo.c
>
> Then run slabinfo instead of doing cat /proc/slabinfo

So two questions: why isn't -f the default? And is /sys/slab
guaranteed to be a stable and permanent interface if the SLAB
implementation ever gets ripped out? If so, maybe this should go into
util-linux-ng? I am a bit concerned about the lack of atomicity of
/sys/slab, but this is a heck of a lot better than many kernel drivers
or subsystems which use /sys and then completely punt on any kind of
userspace utility, forcing users to type crazy things like

echo 5 > /sys/bus/pci/drivers/iwl4965/*/power_level

- Ted

2007-12-25 03:33:19

by Andi Kleen

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

> And is /sys/slab
> guaranteed to be a stable and permanent interface if the SLAB

Debugging feature and "stable and permanent" just don't
fit together. It's like requesting stable and permanent
sysrq output.

-Andi

2007-12-26 21:31:45

by Christoph Lameter

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Mon, 24 Dec 2007, Theodore Tso wrote:

> So two questions: why isn't -f the default? And is /sys/slab

Because it gives misleading output. It displays the name of the first
of multiple slabs that share the same storage structures.

2007-12-26 22:17:32

by Al Viro

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Wed, Dec 26, 2007 at 01:31:35PM -0800, Christoph Lameter wrote:
> On Mon, 24 Dec 2007, Theodore Tso wrote:
>
> > So two questions: why isn't -f the default? And is /sys/slab
>
> Because it gives misleading output. It displays the name of the first
> of multiple slabs that share the same storage structures.

Erm... Let me spell it out: current lifetime rules are completely broken.
As it is, create/destroy/create cache sequence will do kobject_put() on
kfree'd object. Even without people playing with holding sysfs files
open or doing IO on those.

a) you have kobject embedded into struct with the lifetime rules of its
own. When its refcount hits zero you kfree() the sucker, even if you
still have references to embedded kobject.

b) your symlinks stick around. Even when cache is long gone you still
have a sysfs symlink with its embedded kobject as a target. They are
eventually removed when cache with the same name gets created. _Then_
you get the target kobject dropped - when the memory it used to be in
had been freed for hell knows how long and reused by something that would
not appreciate slub.c code suddenly deciding to decrement some word in
that memory.

c) you leak references to these kobject; kobject_del() only removes it
from the tree undoing the effect of kobject_add() and you still need
kobject_put() to deal with the last reference.

2007-12-27 05:51:56

by Christoph Lameter

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Wed, 26 Dec 2007, Al Viro wrote:

> Erm... Let me spell it out: current lifetime rules are completely broken.
> As it is, create/destroy/create cache sequence will do kobject_put() on
> kfree'd object. Even without people playing with holding sysfs files
> open or doing IO on those.

Urgh. Help! Isn't there some sanity with sysfs?

> a) you have kobject embedded into struct with the lifetime rules of its
> own. When its refcount hits zero you kfree() the sucker, even if you
> still have references to embedded kobject.

So allocate it separately?

> b) your symlinks stick around. Even when cache is long gone you still
> have a sysfs symlink with its embedded kobject as a target. They are
> eventually removed when cache with the same name gets created. _Then_
> you get the target kobject dropped - when the memory it used to be in
> had been freed for hell knows how long and reused by something that would
> not appreciate slub.c code suddenly deciding to decrement some word in
> that memory.

Hmmmm.. That would also be addressed by having a separately allocated
kobject?

> c) you leak references to these kobject; kobject_del() only removes it
> from the tree undoing the effect of kobject_add() and you still need
> kobject_put() to deal with the last reference.

Patches would be appreciated.... Had a hard time getting the sysfs support
to work right.

2007-12-27 20:28:30

by Christoph Lameter

[permalink] [raw]
Subject: SLUB sysfs support

Hmmm.. If I separately allocate the kobject then I can no longer get to
the kmem_cache structure from the kobject.

I need to add a second kobject_del to sysfs_slab_remove() to make sysfs
completely forget about the object?

Probably should track down any remaining symlinks at that point and nuke
them too. Isn't there some way to convince sysfs to remove the symlinks
if the target vanishes?




2007-12-27 23:00:29

by Al Viro

[permalink] [raw]
Subject: Re: SLUB sysfs support

On Thu, Dec 27, 2007 at 12:28:14PM -0800, Christoph Lameter wrote:
> Hmmm.. If I separately allocate the kobject then I can no longer get to
> the kmem_cache structure from the kobject.
>
> I need to add a second kobject_del to sysfs_slab_remove() to make sysfs
> completely forget about the object?
>
> Probably should track down any remaining symlinks at that point and nuke
> them too. Isn't there some way to convince sysfs to remove the symlinks
> if the target vanishes?

Don't bother with separate allocation.

a) remove symlink when slab goes away
b) instead of kfree() in slab removal do kobject_put() if you have sysfs stuff
c) have ->release() of these kobjects do kfree()

2007-12-27 23:22:42

by Christoph Lameter

[permalink] [raw]
Subject: Re: SLUB sysfs support

On Thu, 27 Dec 2007, Al Viro wrote:

> On Thu, Dec 27, 2007 at 12:28:14PM -0800, Christoph Lameter wrote:
> > Hmmm.. If I separately allocate the kobject then I can no longer get to
> > the kmem_cache structure from the kobject.
> >
> > I need to add a second kobject_del to sysfs_slab_remove() to make sysfs
> > completely forget about the object?
> >
> > Probably should track down any remaining symlinks at that point and nuke
> > them too. Isn't there some way to convince sysfs to remove the symlinks
> > if the target vanishes?
>
> Don't bother with separate allocation.
>
> a) remove symlink when slab goes away

Ok. Need to think about how to code that.

> b) instead of kfree() in slab removal do kobject_put() if you have sysfs stuff

Hmmmm.... Okay. Patch follows, but it's strange to do a kobject_put after a
kobject_del().

> c) have ->release() of these kobjects do kfree()

In slab_ktype?

---
mm/slub.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-12-27 15:09:38.000000000 -0800
+++ linux-2.6/mm/slub.c 2007-12-27 15:21:12.000000000 -0800
@@ -247,7 +247,10 @@ static void sysfs_slab_remove(struct kme
static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; }
static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p)
{ return 0; }
-static inline void sysfs_slab_remove(struct kmem_cache *s) {}
+static inline void sysfs_slab_remove(struct kmem_cache *s)
+{
+ kfree(s);
+}
#endif

/********************************************************************
@@ -2322,7 +2325,6 @@ void kmem_cache_destroy(struct kmem_cach
if (kmem_cache_close(s))
WARN_ON(1);
sysfs_slab_remove(s);
- kfree(s);
} else
up_write(&slub_lock);
}
@@ -3940,6 +3942,13 @@ static ssize_t slab_attr_store(struct ko
return err;
}

+static void slab_release(struct kobject *kobj)
+{
+ struct kmem_cache *s = to_slab(kobj);
+
+ kfree(s);
+}
+
static struct sysfs_ops slab_sysfs_ops = {
.show = slab_attr_show,
.store = slab_attr_store,
@@ -3947,6 +3956,7 @@ static struct sysfs_ops slab_sysfs_ops =

static struct kobj_type slab_ktype = {
.sysfs_ops = &slab_sysfs_ops,
+ .release = slab_release
};

static int uevent_filter(struct kset *kset, struct kobject *kobj)
@@ -4048,6 +4058,7 @@ static void sysfs_slab_remove(struct kme
{
kobject_uevent(&s->kobj, KOBJ_REMOVE);
kobject_del(&s->kobj);
+ kobject_put(&s->kobj);
}

/*

2007-12-27 23:53:56

by Christoph Lameter

[permalink] [raw]
Subject: Re: SLUB sysfs support

On Thu, 27 Dec 2007, Christoph Lameter wrote:

> > a) remove symlink when slab goes away
>
> Ok. Need to think about how to code that.

How do I iterate over all symlinks in /sys/kernel/slab/*?

I remember trying to do it before and not being able to find a sysfs
method for that.

2007-12-27 23:55:26

by Al Viro

[permalink] [raw]
Subject: Re: SLUB sysfs support

On Thu, Dec 27, 2007 at 03:22:28PM -0800, Christoph Lameter wrote:
> > a) remove symlink when slab goes away
>
> Ok. Need to think about how to code that.

Huh? Just call it from kmem_cache_destroy(); what business does that symlink
have being around after that point?

> > b) instead of kfree() in slab removal do kobject_put() if you have sysfs stuff
>
> Hmmmm.... Okay. Patch follows but its strange to do a kobject_put after a
> kobject_del().

kobject_del() undoes the effect of kobject_add(). And then you are left
with the refcount you had before kobject_add(), i.e. from kobject_init().

Think of it this way: the kobject refcount controls the lifetime of the structure
the sucker's embedded into. Depending on kobject methods, you might
be unable to do cleanup of non-kobject parts of that animal until after
kobject_del(). IOW, you _can't_ have kobject_del() dropping the (hopefully)
final reference to kobject - only the caller knows what might have to be
done in between.

2007-12-27 23:58:18

by Al Viro

[permalink] [raw]
Subject: Re: SLUB sysfs support

On Thu, Dec 27, 2007 at 03:53:43PM -0800, Christoph Lameter wrote:
> On Thu, 27 Dec 2007, Christoph Lameter wrote:
>
> > > a) remove symlink when slab goes away
> >
> > Ok. Need to think about how to code that.
>
> How do I iterate over all symlinks in /sys/kernel/slab/*?
>
> I remember trying to do it before and not being able to find a sysfs
> method for that.

What the hell do you need that for? You have an obvious moment when
/sys/kernel/slab/cache_name -> <whatever> should be gone - it's when
you remove the slab called cache_name. And at that point you don't
need to iterate over anything - you have the exact name, for fsck sake...

Why would you want these symlinks to stick around for longer than that?

2007-12-28 00:03:11

by Christoph Lameter

[permalink] [raw]
Subject: Re: SLUB sysfs support

On Thu, 27 Dec 2007, Al Viro wrote:

> On Thu, Dec 27, 2007 at 03:53:43PM -0800, Christoph Lameter wrote:
> > On Thu, 27 Dec 2007, Christoph Lameter wrote:
> >
> > > > a) remove symlink when slab goes away
> > >
> > > Ok. Need to think about how to code that.
> >
> > How do I iterate over all symlinks in /sys/kernel/slab/*?
> >
> > I remember trying to do it before and not being able to find a sysfs
> > method for that.
>
> What the hell do you need that for? You have an obvious moment when
> /sys/kernel/slab/cache_name -> <whatever> should be gone - it's when
> you remove the slab called cache_name. And at that point you don't
> need to iterate over anything - you have the exact name, for fsck sake...
>
> Why would you want these symlinks to stick around for longer than that?

/sys/kernel/slab/cache_name is a real directory but then there are the
aliases in /sys/kernel/slab/alias_name pointing to that directory that
also have to be removed.

So I need to scan for symlinks in /sys/kernel/slab/* pointing to
/sys/kernel/slab/cache_name.

2007-12-28 00:45:26

by Al Viro

[permalink] [raw]
Subject: Re: SLUB sysfs support

On Thu, Dec 27, 2007 at 04:02:56PM -0800, Christoph Lameter wrote:
> > Why would you want these symlinks to stick around for longer than that?
>
> /sys/kernel/slab/cache_name is a real directory but then there are the
> aliases in /sys/kernel/slab/alias_name pointing to that directory that
> also have to be removed.
>
> So I need to scan for symlinks in /sys/kernel/slab/* pointing to
> /sys/kernel/slab/cache_name.

Oh, lovely. So we can have module A do kmem_cache_create(), calling
cache "foo". Then module B does (without any knowledge about A)
completely unrelated kmem_cache_create(), calling the sucker "bar".
mm/slub decides that they are mergable. Then we get rmmod A... and
no way to find out if that's foo or bar going away - kmem_cache_destroy()
doesn't have enough information to figure that out. So we have to keep
both slab/foo and slab/bar around until both are gone or until somebody
kind enough creates a cache called foo. Better yet, on some systems you
get things like slab/nfsd4_delegations sticking around long after nfsd is
gone just because slub.c decides that it's sharable, but in the next
kernel version the thing it's shared with gets different object size and
behaviour suddenly changes - now it's gone as soon as nfsd is gone.
Brilliant interface, that...

2007-12-28 02:19:58

by Christoph Lameter

[permalink] [raw]
Subject: Re: SLUB sysfs support

On Fri, 28 Dec 2007, Al Viro wrote:

> Oh, lovely. So we can have module A do kmem_cache_create(), calling
> cache "foo". Then module B does (without any knowledge about A)
> completely unrelated kmem_cache_create(), calling the sucker "bar".
> mm/slub decides that they are mergable. Then we get rmmod A... and
> no way to find out if that's foo or bar going away - kmem_cache_destroy()
> doesn't have enough information to figure that out. So we have to keep
> both slab/foo and slab/bar around until both are gone or until somebody
> kind enough creates a cache called foo. Better yet, on some systems you
> get things like slab/nfsd4_delegations sticking around long after nfsd is
> gone just because slub.c decides that it's sharable, but in the next
> kernel version the thing it's shared with gets different object size and
> behaviour suddenly changes - now it's gone as soon as nfsd is gone.
> Brilliant interface, that...

Just enable slab debugging and all of that goes away and you get the usual
error messages. Otherwise SLUB tries to compress memory use as much as
possible. The advantage here is that you can add an arbitrary number of
kmem_cache_create() calls without increasing memory use significantly. It's
cheap to create kmem_caches. Thus you can create arbitrary caches without
worrying about memory use and have meaningful variables on which you do
kmem_cache_alloc instead of kmalloc/kfree. kmalloc/kfree wastes space
since they round up to powers of two. And kmalloc/kfree have to calculate
the cache at runtime whereas kmem_cache_alloc/free does the lookup once.

The amount of idle per-cpu stuff we keep around is reduced by 50%,
increasing cpu cache utilization and reducing memory footprint. Per-cpu
stuff has to hold per-cpu reserves and it is not good to waste too much
memory there.

nfsd4_delegations? What is this about?

How do I scan for the symlinks in sysfs?

2007-12-28 03:26:57

by Al Viro

[permalink] [raw]
Subject: Re: SLUB sysfs support

On Thu, Dec 27, 2007 at 06:19:46PM -0800, Christoph Lameter wrote:
> nfsd4_delegations? What is this about?

The random lifetimes of user-visible files you create in sysfs.

> How do I scan for the symlinks in sysfs?

At which point are you going to do that? AFAICS, the fundamental problem
is that you
* have aliases indistinguishable, so kmem_cache_destroy() can't tell
which one is going away, no matter what
* have per-alias objects in sysfs
As the result, you have a user-visible mess in that directory in sysfs.
And I don't see how you would deal with that - at the level of "the
contents of the directory change in so-and-so way when such-and-such
operation is done", not at the implementation-details level.

BTW, I'm rather sceptical about free use of slabs; keep in mind that their
names have to be unique, given your sysfs layout, so...

2007-12-28 09:08:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)


* Al Viro <[email protected]> wrote:

> > > So two questions: why isn't -f the default? And is /sys/slab
> >
> > Because it gives misleading output. It displays the name of the
> > first of multiple slabs that share the same storage structures.
>
> Erm... Let me spell it out: current lifetime rules are completely
> broken. As it is, create/destroy/create cache sequence will do
> kobject_put() on kfree'd object. Even without people playing with
> holding sysfs files open or doing IO on those.
>
> a) you have kobject embedded into struct with the lifetime rules of
> its own. When its refcount hits zero you kfree() the sucker, even if
> you still have references to embedded kobject.
>
> b) your symlinks stick around. Even when cache is long gone you still
> have a sysfs symlink with its embedded kobject as a target. They are
> eventually removed when cache with the same name gets created. _Then_
> you get the target kobject dropped - when the memory it used to be in
> had been freed for hell knows how long and reused by something that
> would not appreciate slub.c code suddenly deciding to decrement some
> word in that memory.
>
> c) you leak references to these kobject; kobject_del() only removes it
> from the tree undoing the effect of kobject_add() and you still need
> kobject_put() to deal with the last reference.

as a sidenote: bugs like this seem to be recurring. People implement
sysfs bindings (without being sysfs internals experts - and why should
they be) - and create hard-to-debug problems. We've seen that with the
scheduler's recent sysfs changes too.

shouldn't the sysfs code be designed in a way that does not allow such
bugs? The primary usecase of sysfs is by people who do _not_ deal with
it on a daily basis. So if they pick APIs that look obvious and create
hard-to-debug problems (and userspace incompatibilities) that's a primary
failure of sysfs, not a failure of those who utilize it.

At a minimum there should be some _strong_ debugging facility that
transparently detects and reports such bugs as they occur.
CONFIG_DEBUG_KOBJECT is totally unusable right now: it spams the syslog
(so no distro ever enables it - i disable it in random bootups as well
because it takes _ages_ to even get to a boot prompt) and never finds
any of these hard-to-find-but-easy-to-explain bugs.

or, if sysfs/kobjects should be scrapped and rewritten, do you have any
insight into what kind of abstraction could/should replace it? Should we
go back to procfs and get rid of kobjects altogether? (as it's slowly
turning into a /proc-like problem of its own, with worse compatibility
and sneakier bugs.)

Ingo

2007-12-28 14:55:23

by Steven Rostedt

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)


On Fri, 28 Dec 2007, Ingo Molnar wrote:
>
> or, if sysfs/kobjects should be scrapped and rewritten, do you have any
> insight into what kind of abstraction could/should replace it? Should we
> go back to procfs and get rid of kobjects altogether? (as it's slowly
> turning into a /proc-like problem of its own, with worse compatibility
> and sneakier bugs.)
>

The few times I've used kobjects/sysfs, I would always need to "relearn"
how to use it. It's not trivial at all, and even when I finally get it
working, I'm not sure it's working correctly.

Although something like debugfs took me a few minutes to use. Now whenever
I need to do something that userspace uses, I simply choose debugfs.

I thought the switch to sysfs was to keep procfs cleaner and used only
for processes. Why did it need something as complex as the kobject
structure? Was it so that udev could integrate with it? If sysfs were as
simple as debugfs, I think a lot of these bugs would not have occurred.

-- Steve

2007-12-29 18:08:33

by Karol Swietlicki

[permalink] [raw]
Subject: Re: slabtop replacement was Re: Major regression on hackbench with SLUB (more numbers)

On 23/12/2007, Karol Swietlicki <[email protected]> wrote:
> On 22/12/2007, Andi Kleen <[email protected]> wrote:
> > A manpage for slabinfo would be useful though. Anybody
> > volunteering to write one?
> >
> > -Andi
>
> That would be me.
> I'm a newbie and never wrote a man page before, so it will take a few
> days, but I'm bored and out of ideas for any new code for the time
> being. Should be fun.
>
> Karol Swietlicki

I'm dropping out of this one.
I worked on it for a good while, and I really see no way of improving on
what the usage information (slabinfo -h) already provides. It's a good
description, and I'm starting to think that adding a man page is
overkill. If I am misunderstanding something and what is desired is in
fact the usage information in man page format, I can provide such a
page, no problem.

At least now I know how to write man pages.

Karol Swietlicki

2008-01-01 12:53:44

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

On Tue, Dec 25, 2007 at 04:34:18AM +0100, Andi Kleen wrote:
> > And is /sys/slab
> > guaranteed to be a stable and permanent interface if the SLAB
>
> Debugging feature and "stable and permanent" just don't
> fit together. It's like requesting stable and permanent
> sysrq output.

Yeah, unfortunately, querying to find out how much memory is being
used by which parts of the system is something which I think needs to
be more than just a debugging feature. One could argue that "vmstat",
"iostat", and "sar" are debugging features as well, but other people
would consider them "critical programs to get information necessary to
monitor the health of their system".

Perhaps /proc/slabinfo should be that relatively stable interface, if
/sys/slab can't be guaranteed to be stable. But there *are* people
who will want to monitor information like this on an ongoing basis
on production systems. One of the major downsides of /sys seems to be
that it's very hard to keep it stable, so maybe it's not the right
tool for this. /proc/slabinfo has the advantage that it's a simple
text file whose format is relatively well understood - and doesn't a
patch to provide /proc/slabinfo for SLUB already exist?

- Ted

2008-01-01 15:18:07

by Andi Kleen

[permalink] [raw]
Subject: Re: Major regression on hackbench with SLUB (more numbers)

> Yeah, unfortunately, querying to find out how much memory is being
> used by which parts of the system is something which I think needs to
> be more than just a debugging feature. One could argue that "vmstat",

slabinfo is about querying kernel internals, and requiring stable
and permanent interfaces to that would imply freezing the kernel
internals.

People who want to have information about slabs and other similar
kernel internal allocations will always have to deal with version
dependencies.

No way around that.

> Perhaps /proc/slabinfo should be that relatively stable interface, if

That would require at least never changing any slab caches in any
subsystem. Obviously not an option.

-Andi

2008-01-01 16:24:17

by Ingo Molnar

[permalink] [raw]
Subject: [patch] slub: provide /proc/slabinfo


* Theodore Tso <[email protected]> wrote:

> [...] doesn't a patch to provide /proc/slabinfo for SLUB already
> exist?

yes, the complete (and tested) patch is below. It has been through a few
thousand random-bootup tests on x86 32-bit and 64-bit so it's v2.6.24
material i think.

Ingo

----------------->
Subject: slub: provide /proc/slabinfo
From: Pekka J Enberg <[email protected]>

This adds a read-only /proc/slabinfo file on SLUB, that makes slabtop work.

[ [email protected]: build fix. ]

Cc: Andi Kleen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Pekka Enberg <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/slub_def.h | 2
mm/slub.c | 105 +++++++++++++++++++++++++++++++++++++++++------
2 files changed, 94 insertions(+), 13 deletions(-)

Index: linux/include/linux/slub_def.h
===================================================================
--- linux.orig/include/linux/slub_def.h
+++ linux/include/linux/slub_def.h
@@ -200,4 +200,6 @@ static __always_inline void *kmalloc_nod
}
#endif

+extern const struct seq_operations slabinfo_op;
+
#endif /* _LINUX_SLUB_DEF_H */
Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -3076,6 +3076,19 @@ void *__kmalloc_node_track_caller(size_t
return slab_alloc(s, gfpflags, node, caller);
}

+static unsigned long count_partial(struct kmem_cache_node *n)
+{
+ unsigned long flags;
+ unsigned long x = 0;
+ struct page *page;
+
+ spin_lock_irqsave(&n->list_lock, flags);
+ list_for_each_entry(page, &n->partial, lru)
+ x += page->inuse;
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ return x;
+}
+
#if defined(CONFIG_SYSFS) && defined(CONFIG_SLUB_DEBUG)
static int validate_slab(struct kmem_cache *s, struct page *page,
unsigned long *map)
@@ -3458,19 +3471,6 @@ static int list_locations(struct kmem_ca
return n;
}

-static unsigned long count_partial(struct kmem_cache_node *n)
-{
- unsigned long flags;
- unsigned long x = 0;
- struct page *page;
-
- spin_lock_irqsave(&n->list_lock, flags);
- list_for_each_entry(page, &n->partial, lru)
- x += page->inuse;
- spin_unlock_irqrestore(&n->list_lock, flags);
- return x;
-}
-
enum slab_stat_type {
SL_FULL,
SL_PARTIAL,
@@ -4123,3 +4123,82 @@ static int __init slab_sysfs_init(void)

__initcall(slab_sysfs_init);
#endif
+
+/*
+ * The /proc/slabinfo ABI
+ */
+#ifdef CONFIG_PROC_FS
+
+static void print_slabinfo_header(struct seq_file *m)
+{
+ seq_puts(m, "slabinfo - version: 2.1\n");
+ seq_puts(m, "# name <active_objs> <num_objs> <objsize> "
+ "<objperslab> <pagesperslab>");
+ seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>");
+ seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
+ seq_putc(m, '\n');
+}
+
+static void *s_start(struct seq_file *m, loff_t *pos)
+{
+ loff_t n = *pos;
+
+ down_read(&slub_lock);
+ if (!n)
+ print_slabinfo_header(m);
+
+ return seq_list_start(&slab_caches, *pos);
+}
+
+static void *s_next(struct seq_file *m, void *p, loff_t *pos)
+{
+ return seq_list_next(p, &slab_caches, pos);
+}
+
+static void s_stop(struct seq_file *m, void *p)
+{
+ up_read(&slub_lock);
+}
+
+static int s_show(struct seq_file *m, void *p)
+{
+ unsigned long nr_partials = 0;
+ unsigned long nr_slabs = 0;
+ unsigned long nr_inuse = 0;
+ unsigned long nr_objs;
+ struct kmem_cache *s;
+ int node;
+
+ s = list_entry(p, struct kmem_cache, list);
+
+ for_each_online_node(node) {
+ struct kmem_cache_node *n = get_node(s, node);
+
+ if (!n)
+ continue;
+
+ nr_partials += n->nr_partial;
+ nr_slabs += atomic_long_read(&n->nr_slabs);
+ nr_inuse += count_partial(n);
+ }
+
+ nr_objs = nr_slabs * s->objects;
+ nr_inuse += (nr_slabs - nr_partials) * s->objects;
+
+ seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, nr_inuse,
+ nr_objs, s->size, s->objects, (1 << s->order));
+ seq_printf(m, " : tunables %4u %4u %4u", 0, 0, 0);
+ seq_printf(m, " : slabdata %6lu %6lu %6lu", nr_slabs, nr_slabs,
+ 0UL);
+ seq_putc(m, '\n');
+ return 0;
+}
+
+const struct seq_operations slabinfo_op = {
+ .start = s_start,
+ .next = s_next,
+ .stop = s_stop,
+ .show = s_show,
+};
+
+#endif /* CONFIG_PROC_FS */

2008-01-02 20:25:20

by Christoph Lameter

[permalink] [raw]
Subject: Re: SLUB sysfs support

On Fri, 28 Dec 2007, Al Viro wrote:

> On Thu, Dec 27, 2007 at 06:19:46PM -0800, Christoph Lameter wrote:
> > nfsd4_delegations? What is this about?
>
> The random lifetimes of user-visible files you create in sysfs.

Well these are symlinks.

> > How do I scan for the symlinks in sysfs?
>
> At which point are you going to do that? AFAICS, the fundamental problem
> is that you
> * have aliases indistinguishable, so kmem_cache_destroy() can't tell
> which one is going away, no matter what
> * have per-alias objects in sysfs
> As the result, you have a user-visible mess in that directory in sysfs.
> And I don't see how you would deal with that - on the "the contents of
> directory changes in so-and-so way when such-and-such operation is
> done", not the implementation details one.

If the alias count of a kmem_cache structure reaches zero then all aliases
can be taken down. At that point we know that no aliases exist anymore.

> BTW, I'm rather sceptical about free use of slabs; keep in mind that their
> names have to be unique with your sysfs layout, so...

What do you mean by free use of slabs? Creation of new slab caches to
avoid rounding up to power-of-two sizes?

2008-01-05 00:32:19

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] slub: provide /proc/slabinfo

On Tue, 1 Jan 2008 17:23:28 +0100
Ingo Molnar <[email protected]> wrote:

>
> * Theodore Tso <[email protected]> wrote:
>
> > [...] doesn't a patch to provide /proc/slabinfo for SLUB already
> > exist?
>
> yes, the complete (and tested) patch is below. It has been through a
> few thousand random-bootup tests on x86 32-bit and 64-bit so it's
> v2.6.24 material i think.


> +static void *s_start(struct seq_file *m, loff_t *pos)
> +{
> + loff_t n = *pos;
> +
> + down_read(&slub_lock);
> + if (!n)
> + print_slabinfo_header(m);
> +
> + return seq_list_start(&slab_caches, *pos);
> +}
> +


this is the part I'm not very thrilled about... at least on first sight
it looks like a user can now hold the read sem over system calls, and
for as long as it wants. Either that's no problem (but then we wouldn't
need the lock in the first place) or it is a problem that then wants to
be fixed.

2008-01-05 00:50:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] slub: provide /proc/slabinfo



On Fri, 4 Jan 2008, Arjan van de Ven wrote:
>
> this is the part I'm not very thrilled about... at least on first sight
> it looks like a user can now hold the read sem over system calls, and
> for as long as it wants.

No, you misunderstand how seq-files work.

The start/stop sequence is done only over one single kernel buffer
instance, not over the whole open/close (or even the copy to user space).

It's expressly designed so that you can hold locks (including spinlocks
etc), and will not do any blocking ops (well - your *callbacks* can do
blocking ops, but the seq_file stuff itself won't) between start/stop.

Linus