Updates for slub are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm.git slub-linus
Christoph Lameter (5):
SLUB: Fix sysfs refcounting
Move count_partial before kmem_cache_shrink
SLUB: rename defrag to remote_node_defrag_ratio
Add parameter to add_partial to avoid having two functions
Explain kmem_cache_cpu fields
Harvey Harrison (1):
slub: fix shadowed variable sparse warnings
Pekka Enberg (1):
SLUB: Fix coding style violations
root (1):
SLUB: Do not upset lockdep
include/linux/slub_def.h | 15 ++--
mm/slub.c | 182 +++++++++++++++++++++++++---------------------
2 files changed, 108 insertions(+), 89 deletions(-)
On Mon, 4 Feb 2008 12:08:34 -0800 (PST)
Christoph Lameter <[email protected]> wrote:
> Updates for slub are available in the git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm.git slub-linus
>
> Christoph Lameter (5):
> SLUB: Fix sysfs refcounting
> Move count_partial before kmem_cache_shrink
> SLUB: rename defrag to remote_node_defrag_ratio
> Add parameter to add_partial to avoid having two functions
> Explain kmem_cache_cpu fields
>
> Harvey Harrison (1):
> slub: fix shadowed variable sparse warnings
>
> Pekka Enberg (1):
> SLUB: Fix coding style violations
>
> root (1):
> SLUB: Do not upset lockdep
>
err, what? I thought I was going to merge these:
slub-move-count_partial.patch
slub-rename-numa-defrag_ratio-to-remote_node_defrag_ratio.patch
slub-consolidate-add_partial-and-add_partial_tail-to-one-function.patch
slub-use-non-atomic-bit-unlock.patch
slub-fix-coding-style-violations.patch
slub-noinline-some-functions-to-avoid-them-being-folded-into-alloc-free.patch
slub-move-kmem_cache_node-determination-into-add_full-and-add_partial.patch
slub-avoid-checking-for-a-valid-object-before-zeroing-on-the-fast-path.patch
slub-__slab_alloc-exit-path-consolidation.patch
slub-provide-unique-end-marker-for-each-slab.patch
slub-avoid-referencing-kmem_cache-structure-in-__slab_alloc.patch
slub-optional-fast-path-using-cmpxchg_local.patch
slub-do-our-own-locking-via-slab_lock-and-slab_unlock.patch
slub-restructure-slab-alloc.patch
slub-comment-kmem_cache_cpu-structure.patch
slub-fix-sysfs-refcounting.patch
before you went and changed things under my feet.
Please clarify.
On Mon, 4 Feb 2008 14:28:45 -0800
Andrew Morton <[email protected]> wrote:
> > root (1):
> > SLUB: Do not upset lockdep
> >
>
> err, what? I thought I was going to merge these:
>
> slub-move-count_partial.patch
> slub-rename-numa-defrag_ratio-to-remote_node_defrag_ratio.patch
> slub-consolidate-add_partial-and-add_partial_tail-to-one-function.patch
> slub-use-non-atomic-bit-unlock.patch
> slub-fix-coding-style-violations.patch
> slub-noinline-some-functions-to-avoid-them-being-folded-into-alloc-free.patch
> slub-move-kmem_cache_node-determination-into-add_full-and-add_partial.patch
> slub-avoid-checking-for-a-valid-object-before-zeroing-on-the-fast-path.patch
> slub-__slab_alloc-exit-path-consolidation.patch
> slub-provide-unique-end-marker-for-each-slab.patch
> slub-avoid-referencing-kmem_cache-structure-in-__slab_alloc.patch
> slub-optional-fast-path-using-cmpxchg_local.patch
> slub-do-our-own-locking-via-slab_lock-and-slab_unlock.patch
> slub-restructure-slab-alloc.patch
> slub-comment-kmem_cache_cpu-structure.patch
> slub-fix-sysfs-refcounting.patch
>
> before you went and changed things under my feet.
erk, sorry, I misremembered. I was about to merge all the patches we
weren't going to merge. oops.
Linus, please proceed with this merge.
Hope to have the slub-mm repository set up tonight, which will simplify
things for the future. Hope you still remember ....
On Tuesday 05 February 2008 09:30, Andrew Morton wrote:
> On Mon, 4 Feb 2008 14:28:45 -0800
>
> Andrew Morton <[email protected]> wrote:
> > > root (1):
> > > SLUB: Do not upset lockdep
> >
> > err, what? I thought I was going to merge these:
> >
> > slub-move-count_partial.patch
> > slub-rename-numa-defrag_ratio-to-remote_node_defrag_ratio.patch
> > slub-consolidate-add_partial-and-add_partial_tail-to-one-function.patch
> > slub-use-non-atomic-bit-unlock.patch
> > slub-fix-coding-style-violations.patch
> > slub-noinline-some-functions-to-avoid-them-being-folded-into-alloc-free.patch
> > slub-move-kmem_cache_node-determination-into-add_full-and-add_partial.patch
> > slub-avoid-checking-for-a-valid-object-before-zeroing-on-the-fast-path.patch
> > slub-__slab_alloc-exit-path-consolidation.patch
> > slub-provide-unique-end-marker-for-each-slab.patch
> > slub-avoid-referencing-kmem_cache-structure-in-__slab_alloc.patch
> > slub-optional-fast-path-using-cmpxchg_local.patch
> > slub-do-our-own-locking-via-slab_lock-and-slab_unlock.patch
> > slub-restructure-slab-alloc.patch
> > slub-comment-kmem_cache_cpu-structure.patch
> > slub-fix-sysfs-refcounting.patch
> >
> > before you went and changed things under my feet.
>
> erk, sorry, I misremembered. I was about to merge all the patches we
> weren't going to merge. oops.
While you're there, can you drop the patch(es?) I commented on
and didn't get an answer to. Like the ones that open code their
own locking primitives and do risky looking things with barriers
to boot...
Also, WRT this one:
slub-use-non-atomic-bit-unlock.patch
It is strange that this one is unwanted. Avoiding atomic operations
is a pretty good idea. The fact that it appears to be slower on
some microbenchmark on some architecture IMO either means that
their __clear_bit_unlock or the CPU isn't implemented so well...
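The difference being argued about can be modelled in userspace with the GCC/Clang __atomic builtins. This is only a sketch of the two unlock flavours; the function names are made up, and the real kernel helpers are clear_bit_unlock() and __clear_bit_unlock():

```c
#include <assert.h>

#define PG_LOCKED_BIT 0UL	/* hypothetical lock bit in a flags word */

/* Atomic unlock, in the style of clear_bit_unlock(): a bus-locked
 * read-modify-write with release semantics.  Safe even if other CPUs
 * concurrently modify other bits in the same word. */
static void unlock_atomic(unsigned long *flags)
{
	__atomic_fetch_and(flags, ~(1UL << PG_LOCKED_BIT), __ATOMIC_RELEASE);
}

/* Non-atomic unlock, in the style of __clear_bit_unlock(): a plain
 * read-modify-write followed by a release store.  Cheaper (no locked
 * instruction), but only correct when no other CPU can change other
 * bits in the word while the lock is held -- exactly the property
 * being debated for page->flags here. */
static void unlock_nonatomic(unsigned long *flags)
{
	unsigned long v = *flags & ~(1UL << PG_LOCKED_BIT);
	__atomic_store_n(flags, v, __ATOMIC_RELEASE);
}
```

In a single-threaded run both leave the flags word identical; the whole argument is about the cost of the locked instruction versus the concurrent-writer safety it buys.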
On Tue, 5 Feb 2008, Nick Piggin wrote:
> > erk, sorry, I misremembered. I was about to merge all the patches we
> > weren't going to merge. oops.
>
> While you're there, can you drop the patch(es?) I commented on
> and didn't get an answer to. Like the ones that open code their
> own locking primitives and do risky looking things with barriers
> to boot...
That patch will be moved to a special archive for
microbenchmarks. It shows the same issues as the __unlock patch.
> Also, WRT this one:
> slub-use-non-atomic-bit-unlock.patch
>
> It is strange that this one is unwanted. Avoiding atomic operations
> is a pretty good idea. The fact that it appears to be slower on
> some microbenchmark on some architecture IMO either means that
> their __clear_bit_unlock or the CPU isn't implemented so well...
It's slower on x86_64, and that is a pretty important arch, so
I am going to defer this until we have analyzed the situation some more. Could
there be some effect of atomic ops on the speed with which a cacheline is
released?
On Tuesday 05 February 2008 10:47, Christoph Lameter wrote:
> On Tue, 5 Feb 2008, Nick Piggin wrote:
> > > erk, sorry, I misremembered. I was about to merge all the patches we
> > > weren't going to merge. oops.
> >
> > While you're there, can you drop the patch(es?) I commented on
> > and didn't get an answer to. Like the ones that open code their
> > own locking primitives and do risky looking things with barriers
> > to boot...
>
> That patch will be moved to a special archive for
> microbenchmarks. It shows the same issues as the __unlock patch.
Ok. But the approach is just not so good. If you _really_ need something
like that and it is a win over the regular non-atomic unlock, then you
just have to implement it as a generic locking / atomic operation and
allow all architectures to implement the optimal (and correct) memory
barriers.
Anyway....
> > Also, WRT this one:
> > slub-use-non-atomic-bit-unlock.patch
> >
> > It is strange that this one is unwanted. Avoiding atomic operations
> > is a pretty good idea. The fact that it appears to be slower on
> > some microbenchmark on some architecture IMO either means that
> > their __clear_bit_unlock or the CPU isn't implemented so well...
>
> It's slower on x86_64, and that is a pretty important arch, so
> I am going to defer this until we have analyzed the situation some more. Could
> there be some effect of atomic ops on the speed with which a cacheline is
> released?
I'm sure it could have an effect. But why is the common case in SLUB
for the cacheline to be bouncing? What's the benchmark? What does SLAB
do in that benchmark, is it faster than SLUB there? What does the
non-atomic bit unlock do to Willy's database workload?
On Tue, 5 Feb 2008, Nick Piggin wrote:
> I'm sure it could have an effect. But why is the common case in SLUB
> for the cacheline to be bouncing? What's the benchmark? What does SLAB
> do in that benchmark, is it faster than SLUB there? What does the
> non-atomic bit unlock do to Willy's database workload?
I saw this in tbench and the test was done on a recent quad core Intel
cpu. SLAB is 1 - 4% faster than SLUB on the 2 x Quad setup that I am using
here to test. Not sure if what I suspect is really the issue. I added
some statistics to SLUB to figure out what exactly is going on and it
seems that the remote handoff may not be the issue:
Name Objects Alloc Free %Fast
skbuff_fclone_cache 33 111953835 111953835 99 99
:0000192 2666 5283688 5281047 99 99
:0001024 849 5247230 5246389 83 83
vm_area_struct 1349 119642 118355 91 22
:0004096 15 66753 66751 98 98
:0000064 2067 25297 23383 98 78
dentry 10259 28635 18464 91 45
:0000080 11004 18950 8089 98 98
:0000096 1703 12358 10784 99 98
:0000128 762 10582 9875 94 18
:0000512 184 9807 9647 95 81
:0002048 479 9669 9195 83 65
anon_vma 777 9461 9002 99 71
kmalloc-8 6492 9981 5624 99 97
:0000768 258 7174 6931 58 15
slabinfo -a | grep 000192
:0000192 <- xfs_btree_cur filp kmalloc-192 uid_cache tw_sock_TCP
request_sock_TCPv6 tw_sock_TCPv6 skbuff_head_cache xfs_ili
Likely skbuff_head_cache.
slabinfo skbuff_fclone_cache
Slab Perf Counter Alloc Free %Al %Fr
--------------------------------------------------
Fastpath 111953360 111946981 99 99
Slowpath 1044 7423 0 0
Page Alloc 272 264 0 0
Add partial 25 325 0 0
Remove partial 86 264 0 0
RemoteObj/SlabFrozen 350 4832 0 0
Total 111954404 111954404
Flushes 49 Refill 0
Deactivate Full=325(92%) Empty=0(0%) ToHead=24(6%) ToTail=1(0%)
There is only minimal handoff here.
skbuff_head_cache:
Slab Perf Counter Alloc Free %Al %Fr
--------------------------------------------------
Fastpath 5297262 5259882 99 99
Slowpath 4477 39586 0 0
Page Alloc 937 824 0 0
Add partial 0 2515 0 0
Remove partial 1691 824 0 0
RemoteObj/SlabFrozen 2621 9684 0 0
Total 5301739 5299468
Deactivate Full=2620(100%) Empty=0(0%) ToHead=0(0%) ToTail=0(0%)
Some more handoff but still basically the same.
Need to dig into this some more.
On Tue, 5 Feb 2008, Nick Piggin wrote:
> Ok. But the approach is just not so good. If you _really_ need something
> like that and it is a win over the regular non-atomic unlock, then you
> just have to implement it as a generic locking / atomic operation and
> allow all architectures to implement the optimal (and correct) memory
> barriers.
Assuming this really gives a benefit on several benchmarks then we need
to think about how to do this some more. It's a rather strange form of
locking.
Basically you lock the page with a single atomic operation that sets
PageLocked and retrieves the page flags. Then we shovel the page state
around a couple of functions in a register and finally store the page
state back which at the same time unlocks the page. So two memory
references with one of them being atomic with none in between. We have
nothing that can do something like that right now.
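A userspace sketch of the scheme described above, again using the __atomic builtins. The function names are hypothetical; the point of the paragraph is precisely that no such generic primitive exists in the kernel today:

```c
#include <assert.h>

#define PG_LOCKED 0UL	/* hypothetical PageLocked bit */

/* Lock the slab page and fetch its flags in one atomic RMW: spin until
 * the fetch_or observes the lock bit clear.  The return value is the
 * page state at the moment the lock was taken. */
static unsigned long slab_lock_get_state(unsigned long *flags)
{
	unsigned long old;

	do {
		old = __atomic_fetch_or(flags, 1UL << PG_LOCKED,
					__ATOMIC_ACQUIRE);
	} while (old & (1UL << PG_LOCKED));
	return old;	/* state travels through callers in a register */
}

/* Store the (possibly modified) state back.  Writing the word with the
 * lock bit clear both publishes the new state and releases the lock:
 * one release store, no second atomic operation. */
static void slab_unlock_put_state(unsigned long *flags, unsigned long state)
{
	__atomic_store_n(flags, state & ~(1UL << PG_LOCKED), __ATOMIC_RELEASE);
}
```

So the whole critical section costs exactly one atomic RMW (the lock/fetch) plus one plain release store (the unlock/publish), which is the "two memory references, one atomic" shape being described.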
On Tuesday 05 February 2008 11:32, Christoph Lameter wrote:
> On Tue, 5 Feb 2008, Nick Piggin wrote:
> > Ok. But the approach is just not so good. If you _really_ need something
> > like that and it is a win over the regular non-atomic unlock, then you
> > just have to implement it as a generic locking / atomic operation and
> > allow all architectures to implement the optimal (and correct) memory
> > barriers.
>
> Assuming this really gives a benefit on several benchmarks then we need
> to think about how to do this some more. It's a rather strange form of
> locking.
>
> Basically you lock the page with a single atomic operation that sets
> PageLocked and retrieves the page flags.
This operation is not totally unusual. I could use it for my optimised
page lock patches for example (although I need an operation that clears
a flag and has release semantics, but similar class of "thing").
> Then we shovel the page state
> around a couple of functions in a register and finally store the page
> state back which at the same time unlocks the page.
And this is a store-for-unlock (eg. with release semantics).
Nothing too special about that either I guess. (it is almost the word
equivalent of clear_bit_unlock).
> So two memory
> references with one of them being atomic with none in between. We have
> nothing that can do something like that right now.
The load you are trying to avoid in the lock really isn't that
expensive. The cacheline is in L1. Even after a store, many CPUs
have store forwarding so it is probably not going to matter at all
on those.
Anyway, not saying the operations are useless, but they should be
made available to core kernel and implemented per-arch. (if they are
found to be useful)
On Tue, 5 Feb 2008, Nick Piggin wrote:
> Anyway, not saying the operations are useless, but they should be
> made available to core kernel and implemented per-arch. (if they are
> found to be useful)
The problem is to establish the usefulness. These measures may bring 1-2%
in a pretty unstable operation mode assuming that the system is doing
repetitive work. The micro-optimizations often seem to be drowned out
by other small changes to the system.
There is the danger that a gain is seen that is not due to the patch but
due to other changes coming about because code is moved since patches
change execution paths.
Plus they may be only possible on a specific architecture. I know that our
IA64 hardware has special measures ensuring certain behavior of atomic ops
etc.; I guess Intel has similar tricks up their sleeve. At 8p there are
likely increasing problems with lock starvation where your ticketlock
helps. That is why I thought we better defer the stuff until there is some
more evidence that these are useful.
I got particularly nervous about these changes after I saw small
performance drops due to the __unlock patch on the dual quad. That should
have been a consistent gain.