Optimize the cpuset impact on page allocation, the most
performance-critical cpuset hook in the kernel.

On each page allocation, the cpuset hook needs to check for a
possible change in the current task's cpuset. It can now handle
the common case of no change without taking any spinlock or
semaphore, thanks to RCU.

Convert a spinlock on the current task to an rcu_read_lock(),
saving approximately a memory barrier and an atomic op, depending
on the architecture.

This is done by putting struct cpusets in a slab cache marked
SLAB_DESTROY_BY_RCU, so that an rcu_read_lock() can ensure that
the slab won't free the memory behind a cpuset struct.
Thanks to Andi Kleen and Nick Piggin for the suggestion.
Signed-off-by: Paul Jackson <[email protected]>
---
kernel/cpuset.c | 42 +++++++++++++++++++++++++++++++-----------
1 files changed, 31 insertions(+), 11 deletions(-)
--- 2.6.15-rc3-mm1.orig/kernel/cpuset.c 2005-12-11 13:31:16.925866678 -0800
+++ 2.6.15-rc3-mm1/kernel/cpuset.c 2005-12-11 14:26:57.184694603 -0800
@@ -63,6 +63,18 @@
*/
int number_of_cpusets;
+/*
+ * Take cpusets from a slab cache marked SLAB_DESTROY_BY_RCU, so
+ * that on the critical code path cpuset_update_task_memory_state(),
+ * we can safely read task->cpuset->mems_generation with just
+ * an rcu lock, w/o risk of a slab cache free invalidating that
+ * memory location. It might not be our task's cpuset anymore,
+ * but at least it will be safe to load from it, which is good
+ * enough when checking for a changed mems_generation.
+ */
+
+static kmem_cache_t *cpuset_cache;
+
/* See "Frequency meter" comments, below. */
struct fmeter {
@@ -248,6 +260,10 @@ static struct super_block *cpuset_sb;
* a task's cpuset pointer we use task_lock(), which acts on a spinlock
* (task->alloc_lock) already in the task_struct routinely used for
* such matters.
+ *
+ * P.S. One more locking exception. See the use of rcu_read_lock
+ * when checking a task's cpuset->mems_generation in the routine
+ * cpuset_update_task_memory_state below.
*/
static DECLARE_MUTEX(manage_sem);
@@ -289,7 +305,7 @@ static void cpuset_diput(struct dentry *
if (S_ISDIR(inode->i_mode)) {
struct cpuset *cs = dentry->d_fsdata;
BUG_ON(!(is_removed(cs)));
- kfree(cs);
+ kmem_cache_free(cpuset_cache, cs);
}
iput(inode);
}
@@ -612,12 +628,11 @@ static void guarantee_online_mems(const
* cpuset pointer. This routine also might acquire callback_sem and
* current->mm->mmap_sem during call.
*
- * The task_lock() is required to dereference current->cpuset safely.
- * Without it, we could pick up the pointer value of current->cpuset
- * in one instruction, and then attach_task could give us a different
- * cpuset, and then the cpuset we had could be removed and freed,
- * and then on our next instruction, we could dereference a no longer
- * valid cpuset pointer to get its mems_generation field.
+ * Reading current->cpuset->mems_generation doesn't need task_lock,
+ * because cpusets are on a slab cache marked SLAB_DESTROY_BY_RCU,
+ * so the rcu_read_lock() ensures the memory read will not yet be
+ * unmapped. It's ok if attach_task() replaces our cpuset with
+ * another while we are reading mems_generation, and even frees it.
*
* This routine is needed to update the per-task mems_allowed data,
* within the tasks context, when it is trying to allocate memory
@@ -634,9 +649,9 @@ void cpuset_update_task_memory_state()
if (unlikely(!cs))
return;
- task_lock(tsk);
+ rcu_read_lock();
my_cpusets_mem_gen = cs->mems_generation;
- task_unlock(tsk);
+ rcu_read_unlock();
if (my_cpusets_mem_gen != tsk->cpuset_mems_generation) {
down(&callback_sem);
@@ -1742,7 +1757,7 @@ static long cpuset_create(struct cpuset
struct cpuset *cs;
int err;
- cs = kmalloc(sizeof(*cs), GFP_KERNEL);
+ cs = kmem_cache_alloc(cpuset_cache, GFP_KERNEL);
if (!cs)
return -ENOMEM;
@@ -1784,7 +1799,7 @@ static long cpuset_create(struct cpuset
err:
list_del(&cs->sibling);
up(&manage_sem);
- kfree(cs);
+ kmem_cache_free(cpuset_cache, cs);
return err;
}
@@ -1847,6 +1862,11 @@ int __init cpuset_init(void)
struct dentry *root;
int err;
+ cpuset_cache = kmem_cache_create("cpuset_cache",
+ sizeof(struct cpuset), 0,
+ SLAB_DESTROY_BY_RCU | SLAB_PANIC,
+ NULL, NULL);
+
top_cpuset.cpus_allowed = CPU_MASK_ALL;
top_cpuset.mems_allowed = NODE_MASK_ALL;
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373
> Thanks to Andi Kleen and Nick Piggin for the suggestion.
Thanks. But I guess it would still be a good idea to turn
the "check that there is no cpuset" test into an inline,
so that it can be done without a function call. Only when
it fails, call the out-of-line full cpuset-checking function.
This would make the common case of a kernel with cpusets
compiled in but nobody using them faster.
This could be done even without any memory barriers of
any kind.
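Sketched in plain C (a userspace model, not kernel code; every name here is hypothetical): a cheap inline test short-circuits the common case, and only when it fails do we pay for an out-of-line call to the full checking function.

```c
#include <stdbool.h>

/* Userspace model; all names are made up, not real kernel symbols. */
static int nr_cpusets = 1;      /* only the root cpuset exists */
static int slow_path_calls;     /* counts out-of-line invocations */

/* Out-of-line: the full cpuset checking function. */
static bool cpuset_check_slow(void)
{
	slow_path_calls++;
	/* ... real locking and full checking would live here ... */
	return true;
}

/* Inline fast path: no function call and no barriers in the common
 * case of a kernel with cpusets compiled in but nobody using them. */
static inline bool cpuset_check(void)
{
	if (nr_cpusets <= 1)    /* cheap test short-circuits */
		return true;
	return cpuset_check_slow();
}
```

The point is that the common-case cost collapses to one compare of a counter that is almost never written.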
-Andi
On Sun, Dec 11, 2005 at 10:11:10PM -0800, Paul Jackson wrote:
> > Thanks. But I guess it would still be a good idea to turn
> > the "check that there is no cpuset" test into an inline
> > so that it can be done without a function call.
>
> Do you mean this check, in cpuset_update_task_memory_state()
>
> if (unlikely(!cs))
> return;
No - I mean a quick check if the cpuset allows all nodes
for memory allocation.
-Andi
> Thanks. But I guess it would still be a good idea to turn
> the "check that there is no cpuset" test into an inline
> so that it can be done without a function call.
Do you mean this check, in cpuset_update_task_memory_state()
if (unlikely(!cs))
return;
Inlining this check will -not- help systems not using cpusets.
Unlike mempolicies, which use NULL for the default case, tasks
almost always have non-NULL cpuset pointers (on any system with
CONFIG_CPUSET).
This unlikely check was here because during the early boot process,
there was a window during which the memory subsystem was healthy
enough to take calls, but the cpuset system was not initialized yet.
I can remove the above check entirely, by adding a cpuset_init_early()
that fabricates enough of a valid cpuset for the init task to get
through this code without stumbling.
Yes - that works - patch forthcoming soon to remove the above check,
using a cpuset_init_early() call instead.
===
I have a different scheme in place to minimize the load on systems
with CONFIG_CPUSET enabled. See the number_of_cpusets global kernel
counter, in the *-mm patch:
cpuset_number_of_cpusets_optimization
It enables short-circuiting, with inline code, the other key routine
on the memory allocation path: cpuset_zone_allowed().
However, looking at it just now, I realize that this number_of_cpusets
hack doesn't help with cpuset_update_task_memory_state(). Even though
the currently running system has only one cpuset, it might have had
more in the past, and some task that hasn't had to allocate memory
since the system went back down to one cpuset might still be pointing
at the old cpuset it had.
I'm still looking for a clean way to short circuit this call to
cpuset_update_task_memory_state() in the case that a system is compiled
with CONFIG_CPUSET, but no one has used cpusets.
Any other ideas?
Andi wrote:
> No - i mean a quick check if the cpuset allows all nodes
> for memory allocation.
Good - I think I've already done what you're suggesting, then.
Your comment pertains to the cpuset_zone_allowed() call in
mm/page_alloc.c __alloc_pages(), not to the routine we are
discussing here - cpuset_update_task_memory_state().
There are two key cpuset hooks in the page allocation code path:
1) cpuset_update_task_memory_state() - which transfers
changes in the task's cpuset (where we need locking
to read) into the task->mems_allowed nodemask (where
the task can look w/o locks).
2) cpuset_zone_allowed() - which checks each zone examined
in the page_alloc() loops to see if it is on an allowed
node.
The first of these was the primary victim of my patches today,
reducing the cost using RCU, and just now removing the NULL
cpuset check that was only needed during init.
The first of these is called once per page allocation, from
a couple of spots in mm/mempolicy.c. The second of these is
called from within (now beneath) __alloc_pages(), once for each
zone that is considered for providing the requested pages.
==
The second of these was already short circuited, using inline
code as you recommend, in a patch a couple of days ago (I forgot
to cc you directly on the patch, so sent you a separate message
pointing to those patches).
For that one, see the number_of_cpusets global kernel counter, in the
*-mm patch:
cpuset_number_of_cpusets_optimization
It enables short-circuiting, with inline code, this other key routine
on the memory allocation path: cpuset_zone_allowed().
Paul Jackson a écrit :
> +
> +static kmem_cache_t *cpuset_cache;
> +
Hi Paul
Please do use __read_mostly for new kmem_cache :
static kmem_cache_t *cpuset_cache __read_mostly;
If not, the pointer can sit in the middle of a highly modified cache line, and
multiple CPUs will have memory cache misses to access the cpuset_cache, while
the slab code/data layout itself is very NUMA/SMP friendly.
Eric
Eric Dumazet wrote:
> Please do use __read_mostly for new kmem_cache :
>
> static kmem_cache_t *cpuset_cache __read_mostly;
Is it a good idea for all kmem_cache_t? If so, can we move
__read_mostly to the type definition?
--
SUSE Labs, Novell Inc.
Nick Piggin a écrit :
> Is it a good idea for all kmem_cache_t? If so, can we move
> __read_mostly to the type definition?
Well, this question was already asked.
You cannot move __read_mostly to the type itself, as some kmem_cache_t
pointers are embedded in other data structures:
struct {
...
kmem_cache_t *slab;
...
};
And in this case you cannot automatically add __read_mostly
Eric
Nick Piggin <[email protected]> wrote:
>
> Eric Dumazet wrote:
> > Paul Jackson a ?crit :
> >
> >> +
> >> +static kmem_cache_t *cpuset_cache;
> >> +
> >
> >
> > Hi Paul
> >
> > Please do use __read_mostly for new kmem_cache :
> >
> > static kmem_cache_t *cpuset_cache __read_mostly;
> >
> > If not, the pointer can sit in the midle of a highly modified cache
> > line, and multiple CPUS will have memory cache misses to access the
> > cpuset_cache, while slab code/data layout itself is very NUMA/SMP friendly.
> >
>
> Is it a good idea for all kmem_cache_t? If so, can we move
> __read_mostly to the type definition?
>
Yes, I suppose that's worthwhile.
We've been shuffling towards removing kmem_cache_t in favour of `struct
kmem_cache', but this is an argument against doing that.
If we can work out how:
void foo()
{
kmem_cache_t *p;
}
That'll barf.
Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
> > Is it a good idea for all kmem_cache_t? If so, can we move
> > __read_mostly to the type definition?
>
> Yes, I suppose that's worthwhile.
>
> We've been shuffling towards removing kmem_cache_t in favour of `struct
> kmem_cache', but this is an argument against doing that.
>
> If we can work out how:
>
> void foo()
> {
> kmem_cache_t *p;
> }
>
> That'll barf.
Mmm. And the structure within structure, which Eric points out. I assumed
without grepping that those were mostly confined to slab itself and would
be easy to special case, but it turns out networking makes some use of
them too.
> Please do use __read_mostly for new kmem_cache :
>
> static kmem_cache_t *cpuset_cache __read_mostly;
Is there any downside to this? I ask because accesses through
this 'cpuset_cache' pointer are rather infrequent - only when
the sysadmin or the batch scheduler is creating or removing
cpusets, which for the purposes of 'back of the envelope'
estimates, might be once a minute or less. Further, it is
not at all a performance-critical path.
So I really don't give a dang if it takes a few milliseconds
to pick up this pointer, at least so far as cpusets matters.
That said, would you still advise marking this __read_mostly?
Andrew wrote:
> We've been shuffling towards removing kmem_cache_t in favour of `struct
> kmem_cache',
So I suppose what I should really have is:
static struct kmem_cache *cpuset_cache __read_mostly;
??
Paul Jackson <[email protected]> wrote:
>
> > Please do use __read_mostly for new kmem_cache :
> >
> > static kmem_cache_t *cpuset_cache __read_mostly;
>
> Is there any downside to this? I ask because accesses through
> this 'cpuset_cache' pointer are rather infrequent - only when
> the sysadmin or the batch scheduler is creating or removing
> cpusets, which for the purposes of 'back of the envelope'
> estimates, might be once a minute or less. Further, it is
> not at all a performance critical path.
>
> So I really don't give a dang if it takes a few milliseconds
> to pick up this pointer, at least so far as cpusets matters.
There's no downside, really. It just places the storage into a different
section. There's a small downside to having __read_mostly at all: up to a
page more memory used. But once it's there, adding to it is just moving
things around in memory.
__read_mostly is simply a new (page-aligned) section into which we put
things that are considered to not be written to very often.
> That said, would you still advise marking this __read_mostly?
Not at this stage - it'd be better if someone did a big sweep and changed
all kmem_cache_t's in one hit.
Eric Dumazet wrote:
> Please do use __read_mostly for new kmem_cache :
Paul Jackson wrote:
> Is there any downside to this?
Andrew Morton wrote:
> There's no downside, really.
Hmmm ... I suspect one possible downside.
I would think we would want to spread the hot spots out, to reduce the
chance of getting two hot spots in the same cache line, and starting a
bidding war for that line.
So my intuition is:
If read a lot but seldom written, mark "__read_mostly".
If seldom read or written, leave unmarked.
so as to leave plenty of the rarely used (neither read nor written on
kernel hot path code) as "cannon fodder" to fill the rest of the cache
lines favored by the hot data.
This leads me to ask, of any item marked "__read_mostly":
Is it accessed (for read, presumably) frequently, on a hot path?
If not, then I'd favor (absent actual measurements to the contrary) not
marking it.
By this criteria:
1) I would -not- mark "struct kmem_cache *cpuset" __read_mostly, as it
is rarely accessed on -any- code path, much less a hot one. It is
ideal cannon fodder.
2) I -would- (following a private email suggestion of Christoph Lameter)
mark my recently added "int number_of_cpusets" __read_mostly,
because it is accessed for every zone considered in the loops
within^Wbeneath __alloc_pages().
Disclaimer -- none of the above speculation is tempered by the heat of any
actual performance measurements. Hence, it is worth about as much as my
legal advice.
Paul Jackson a écrit :
>
>
> Hmmm ... I suspect one possible downside.
>
> I would think we would want to spread the hot spots out, to reduce the
> chance of getting two hot spots in the same cache line, and starting a
> bidding war for that line.
>
> So my intuition is:
> If read alot but seldom written, mark "__read_mostly".
> If seldom read or written, leave unmarked.
>
Your analysis is very good but not complete :)
There are different kinds of hot cache lines, depending on whether they are:
- Mostly read
- Read/written
Say you move most struct kmem_cache * pointers to read-mostly; they are
guaranteed to stay 'mostly read'.
Mixing for example filp_cachep and dcache_lock in the same cache line is not a
good thing. And this is what happens on a typical kernel:
c04f15f0 B dcache_lock
c04f15f4 B names_cachep
c04f15f8 B filp_cachep
c04f15fc b rename_lock
I do think we should have defined a special section for very hot (and written)
spots. It's easier to locate those hot spots than 'mostly read and shared by
all cpus without cache ping pongs' places...
On Tue, 13 Dec 2005, Paul Jackson wrote:
> So my intuition is:
> If read alot but seldom written, mark "__read_mostly".
> If seldom read or written, leave unmarked.
Correct.
> 1) I would -not- mark "struct kmem_cache *cpuset" __read_mostly, as it
> is rarely accessed on -any- code path, much less a hot one. It is
> ideal cannon fodder.
Agree.
On Tue, 13 Dec 2005, Eric Dumazet wrote:
> Say you move to read mostly most of struct kmem_cache *, they are guaranteed
> to stay in 'mostly read'.
True but then this variable is not frequently read. So false sharing would
not have much of an impact.
> Mixing for example filp_cachep and dcache_lock in the same cache line is not a
> good thing. And this is what happening on typical kernel :
>
> c04f15f0 B dcache_lock
> c04f15f4 B names_cachep
> c04f15f8 B filp_cachep
> c04f15fc b rename_lock
>
> I do think we should have defined a special section for very hot (and written)
> spots. It's more easy to locate thos hot spots than 'mostly read and shared by
> all cpus without cache ping pongs' places...
In that case I would think we would want to place these very hot
spots in their own cacheline and control which variables end up in
that particular cacheline. Define a struct something {} and put all
elements that need to go into the same cacheline into that struct.
Then cacheline align the struct.
Christoph Lameter a écrit :
> On Tue, 13 Dec 2005, Eric Dumazet wrote:
>
>
>>Say you move to read mostly most of struct kmem_cache *, they are guaranteed
>>to stay in 'mostly read'.
>
>
> True but then this variable is not frequently read. So false sharing would
> not have much of an impact.
>
If this variable is not frequently used, why then define its own cache?
I.e., why not use kmalloc() and let the kernel use a general cache?
On a 32-CPU machine, a kmem_cache_create() costs a *lot* of ram.
This overhead is acceptable if and only if:
- A lot of objects can be allocated in some cases (and using a special cache
can save ram because of power-of-two general cache alignment)
- A lot of objects are created/deleted per unit of time (performance is critical)
For example, pipe() could use its own cache to speed up some benchmarks, but
the general kmalloc()/kfree() was chosen instead.
Eric
On Tue, 13 Dec 2005, Eric Dumazet wrote:
> If this variable is not frequently used, why then define its own cache ?
>
> Ie why not use kmalloc() and let kernel use a general cache ?
I was also wondering why Paul switched to using the slab allocator. One
reason may have been to simplify RCU?
Detail question ...
Eric wrote:
> Say you move to read mostly most of struct kmem_cache *
Does the following:
struct kmem_cache *cpuset_cache __read_mostly;
mark just the one word pointer 'cpuset_cache' as __read_mostly,
or does it mark the whole dang cpuset cache?
I presume it just marks the one pointer word. Am I wrong?
I ask because the subtle phrasing of your comment reads to
my ear as if you knew it marked the entire cache. I can't
tell if that is due to my ears having a different language
accent than yours, or if it is due to my getting this wrong.
Paul Jackson a écrit :
> Detail question ...
>
> Eric wrote:
>
>>Say you move to read mostly most of struct kmem_cache *
>
>
> Does the following:
>
> struct kmem_cache *cpuset_cache __read_mostly;
>
> mark just the one word pointer 'cpuset_cache' as __read_mostly,
> or does it mark the whole dang cpuset cache?
>
> I presume it just marks the one pointer word. Am I wrong?
Only the pointer is placed in the read-mostly section.
In fact the pointer is written once at boot time, then it is only read.
The kmem_cache implementation is SMP/NUMA friendly, taking care of
false-sharing issues and node-aware memory.
But the initial pointer *should* be in a cache line shared by all cpus to get
best performance. It's easy to achieve this since the pointer is only read.
(well... mostly... )
>
> I ask because the subtle phrasing of your comment reads to
> my ear as if you knew it marked the entire cache. I can't
> tell if that is due to my ears having a different language
> accent than yours, or if it is due to my getting this wrong.
>
Sorry...
Eric
Eric wrote:
> If this variable is not frequently used, why then define its own cache ?
>
> Ie why not use kmalloc() and let kernel use a general cache ?
This change from kmalloc() to a dedicated slab cache was made just a
couple of days ago, at the suggestion of Andi Kleen and Nick Piggin, in
order to optimize the task_lock spinlock out of the primary code path
for allocating a page of memory.
Indeed, this email thread is the thread that presented that patch.
By using a dedicated slab cache, I was able to make an unusual use of
Hugh Dickins's SLAB_DESTROY_BY_RCU implementation, and access a variable
inside the cpuset structure safely, even after that cpuset structure
might have been asynchronously freed. What I read from that variable
might well be garbage, but at least the slab would not have freed that
page of memory entirely, inside my rcu_read_lock section.
Since all I needed was to edge trigger on the condition that the
contents of a variable changed since last read, that was sufficient.
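That edge trigger can be modeled in a few lines of plain C (a userspace model, not the kernel code; all names invented for illustration): the racy read only has to detect that the generation number differs from the task's cached copy, and any mismatch sends us down the locked slow path, which rereads everything authoritatively.

```c
#include <stdbool.h>

/* Userspace model of the mems_generation edge trigger; names invented. */
static int cpuset_mems_generation = 1;  /* bumped whenever mems change */
static int task_cached_generation = 1;  /* the task's private copy */
static int slow_updates;                /* counts slow-path refreshes */

/* Returns true if the task had to refresh its memory placement state. */
static bool update_task_memory_state(void)
{
	/* Racy read: the value may be stale or even garbage, as long as
	 * the memory itself stays mapped (SLAB_DESTROY_BY_RCU). */
	int seen = cpuset_mems_generation;

	if (seen == task_cached_generation)
		return false;           /* common case: no change */

	/* Slow path: take the real locks and reread authoritatively. */
	slow_updates++;
	task_cached_generation = cpuset_mems_generation;
	return true;
}
```

A false mismatch caused by reading garbage is harmless: it only costs an extra trip through the locked slow path.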
> On a 32 CPUS machine, a kmem_create() costs a *lot* of ram.
Hmmm ... if 32 is bad, then what does it cost for say 512 CPUs?
And when is that memory required? Many systems will have CONFIG_CPUSET
enabled but will not be using cpusets; on those, kmem_cache_create()
will be called to create cpuset_cache, but -no- kmem_cache_alloc()
calls will be done. On systems using cpusets, there might be one
'struct cpuset' allocated per gigabyte of ram, as a rough idea.
Can you quantify "costs a *lot* of ram" ?
I suppose that I could add a little bit of logic that avoided the
initial kmem_cache_create() until needed by actual cpuset usage on the
system (on the first cpuset_create(), the first time that user code
tries to create a cpuset). In a related optimization, I might be able
to avoid -even- the rcu_read_lock() guards on systems not using
cpusets (never called cpuset_create() since boot), reducing that guard
to a simple comparison of the current task's cpuset pointer with the
pointer to the one statically allocated global cpuset, known as the root
cpuset. Actually, that last optimization would benefit any task still
in the root cpuset, even after other cpusets had been dynamically
created.
Or, if using the slab cache was still too expensive for this use, I
could perhaps make a more conventional use of RCU, to guard the kfree()
myself, instead of making this unusual use of SLAB_DESTROY_BY_RCU. I'd
have to learn more about RCU to know how to do that, or even if it made
sense.
On Tue, 13 Dec 2005, Paul Jackson wrote:
> By using a dedicated slab cache, I was able to make an unusual use of
> Hugh Dicken's SLAB_DESTROY_BY_RCU implementation, and access a variable
> inside the cpuset structure safely, even after that cpuset structure
> might have been asynchronously free'd. What I read from that variable
> might well be garbage, but at least the slab would not have freed that
> page of memory entirely, inside my rcu_read_lock section.
You can accomplish the same thing by using RCU directly without using so
much storage. (and you said so later ...)
> And when is that memory required? On many systems, that will have
> cpusets CONFIG_CPUSET enabled, but that are not using cpusets, just
> the kmem_cache_create() will be called to create cpuset_cache, but
> -no- kmem_cache_alloc() calls done. On those systems using cpusets,
> there might be one 'struct cpuset' allocated per gigabyte of ram, as a
> rough idea.
In this case the slab would allocate one page for the one cpuset. However,
there are lots of control structures allocated for all nodes that would go
unused. The control structures are allocated when the slab is created.
Paul Jackson a écrit :
> By using a dedicated slab cache, I was able to make an unusual use of
> Hugh Dickins's SLAB_DESTROY_BY_RCU implementation, and access a variable
> inside the cpuset structure safely, even after that cpuset structure
> might have been asynchronously freed. What I read from that variable
> might well be garbage, but at least the slab would not have freed that
> page of memory entirely, inside my rcu_read_lock section.
OK, I'm afraid I cannot comment on this, this is too complex for me :)
>
> Since all I needed was to edge trigger on the condition that the
> contents of a variable changed since last read, that was sufficient.
>
>
>>On a 32 CPUS machine, a kmem_create() costs a *lot* of ram.
>
>
> Hmmm ... if 32 is bad, then what does it cost for say 512 CPUs?
You don't want to know :)
struct kmem_cache itself will be about 512*8 + some bytes;
then for each cpu a 'struct array_cache' will be allocated (count 128 bytes
minimum each; it depends on various factors, and sizeof(void*) of course).
So I would say about 80 KBytes at a very minimum.
Thank you for these details.
Eric
Eric wrote:
> I do think we should have defined a special section for very hot (and written)
> spots. It's more easy to locate thos hot spots than 'mostly read and shared by
> all cpus without cache ping pongs' places...
Should we do something like:
1) identify the hot write spots, to arrange them by access pattern,
as Christoph considered, in another reply on this thread.
2) identify the hot read, cold write spots, to bunch them up away from (1)
3) leave the rest as "inert filler" (aka "cannon fodder", in my previous
reply), but unmarked in any case.
4) change the word "__read_mostly" to "__hot_read_cold_write", to more
accurately fit item (2).
Eric wrote:
> struct kmem_cache itself will be about 512*8 + some bytes
> then for each cpu a 'struct array_cache' will be allocated (count 128 bytes
Hmmm ... 'struct array_cache' looks to be about 6 integer words,
so if that is the main per-CPU cost, the minimal cost of a slab
cache (once created, before use) is about 24 bytes per cpu.
But whether it's 24 or 128 bytes per cpu, that's a heavier-weight
hammer than is needed here.
Time for me to learn more about rcu.
Thanks for raising this issue.
On Tue, 13 Dec 2005, Paul Jackson wrote:
> Eric wrote:
> > struct kmem_cache itself will be about 512*8 + some bytes
> > then for each cpu a 'struct array_cache' will be allocated (count 128 bytes
>
> Hmmm ... 'struct array_cache' looks to be about 6 integer words,
> so if that is the main per-CPU cost, the minimal cost of a slab
> cache (once created, before use) is about 24 bytes per cpu.
>
> But whether its 24 or 128 bytes per cpu, that's a heavier weight
> hammer than is needed here.
The main per node costs is struct kmem_list3 on top of the array_cache.
Eric wrote:
> But the initial pointer *should* be in a cache line shared by all cpus to get
> best performance.
No - not if that pointer is seldom used. Then its location does not
directly affect performance at all, and it is better used as "inert
filler" to fill out cache lines of words that are write hot.
pj wrote:
> Time for me to learn more about rcu.
Well, that was easy.
Directly using RCU to guard that task->cpuset pointer instead of
cheating via the RCU hooks in the slab cache was just a few lines of
code.
But, boy oh boy, that synchronize_rcu() call sure takes its time.
My cpuset torture test was creating, destroying and abusing about 2600
cpusets/sec before this change, and now it does about 144 cpusets/sec.
That cost 95% of the performance. This only affects the cost of
attaching a task to a different cpuset (by writing its <pid> to
some other cpuset 'tasks' file.)
Just commenting out the synchronize_rcu() restores the previous speed.
Nothing surprising here - this is just how RCU claims it behaves,
heavily favoring read-side performance. These delays are spent waiting
for the other cpus doing rcu_read_lock() to pass through a quiescent
state (reschedule, idle or return to user code).
Patches coming soon, to remove the cpuset_cache slab cache, and to
use RCU directly instead.
Thanks, Eric.
> But, boy oh boy, that synchronize_rcu() call sure takes it time.
>
> My cpuset torture test was creating, destroying and abusing about 2600
> cpusets/sec before this change, and now it does about 144 cpusets/sec.
>
> That cost 95% of the performance. This only hits on the cost of
> attaching a task to a different cpuset (by writing its <pid> to
> some other cpuset 'tasks' file.)
That is why call_rcu() et al. is a better interface if you want
performance. It runs the freeing batched in the background.
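The two styles being contrasted look roughly like this in 2.6-era kernel code (a sketch only, not the actual cpuset patch; it assumes a struct rcu_head field named rcu embedded in struct cpuset, which the real code would have to add):

```c
/* Synchronous: simple, but the writer blocks for a full grace period. */
static void cpuset_release_sync(struct cpuset *cs)
{
	synchronize_rcu();	/* wait out all current rcu_read_lock() readers */
	kfree(cs);
}

/* Asynchronous: call_rcu() queues the callback and returns at once;
 * the free runs after a grace period, batched in the background. */
static void cpuset_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct cpuset, rcu));
}

static void cpuset_release_async(struct cpuset *cs)
{
	call_rcu(&cs->rcu, cpuset_free_rcu);
}
```

The synchronous form is what costs the attach-path performance measured above; the asynchronous form moves that grace-period wait off the writer's path.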
-Andi
Andi wrote:
> That is why call_rcu.et.al. is a better interface if you want performance.
> It runs the freeing batched in the background.
True.
In this case, my freeing code is non-trivial, and my performance
requirements are very minimal. So I'll take the easier synchronize_rcu()
over the asynchronous call_rcu().
But, yes, that's the tradeoff.
Paul Jackson a écrit :
> Eric wrote:
>
>>struct kmem_cache itself will be about 512*8 + some bytes
>>then for each cpu a 'struct array_cache' will be allocated (count 128 bytes
>
>
> Hmmm ... 'struct array_cache' looks to be about 6 integer words,
> so if that is the main per-CPU cost, the minimal cost of a slab
> cache (once created, before use) is about 24 bytes per cpu.
Nope, because struct array_cache includes a variable-length table of pointers
to hold a cache of available objects per cpu. The 'limit' (the number of
pointers in this cache) depends on the object size.
See enable_cpucache() in mm/slab.c for the 'limit' determination:
if (cachep->objsize > 131072)
limit = 1;
else if (cachep->objsize > PAGE_SIZE)
limit = 8;
else if (cachep->objsize > 1024)
limit = 24;
else if (cachep->objsize > 256)
limit = 54;
else
limit = 120;
Let's take an example :
grep dentry /proc/slabinfo
dentry_cache 157113 425289 224 17 1 : tunables 120 60 8 :
slabdata 25017 25017 0
'limit' is the number following 'tunables' : 120
On a 64-bit machine, 120*sizeof(void*) = 120*8 = 960.
So for small objects (<= 256 bytes), you end up with sizeof(array_cache) =
1024 bytes per cpu.
If 512 CPUs: 512*1024 = 512 Kbytes + all the other kmem_cache structures (if
you have a lot of memory nodes, then it can be very big too).
If you know that no more than 100 objects are used in 99% of setups, then a
dedicated cache is overkill.
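Eric's figures can be checked with a few self-contained lines that reproduce the limit table quoted above (assuming PAGE_SIZE is 4096 and 8-byte pointers):

```c
#include <stddef.h>

/* Reproduce the 2.6-era enable_cpucache() limit table quoted above,
 * assuming PAGE_SIZE == 4096. */
static int slab_limit(size_t objsize)
{
	if (objsize > 131072)
		return 1;
	if (objsize > 4096)
		return 8;
	if (objsize > 1024)
		return 24;
	if (objsize > 256)
		return 54;
	return 120;
}

/* Rough size of the per-cpu array_cache pointer table, in bytes,
 * on a machine with 8-byte pointers. */
static size_t percpu_table_bytes(size_t objsize)
{
	return (size_t)slab_limit(objsize) * 8;
}
```

For a 224-byte object such as dentry_cache, the limit is 120, so the pointer table alone is 960 bytes per cpu; across 512 CPUs that is roughly half a megabyte before counting the per-node kmem_list3 structures.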
> Thanks for raising this issue.
You are welcome.
Eric
Eric wrote:
> So for small objects (<= 256 bytes), you end up with sizeof(array_cache) =
> 1024 bytes per cpu
A kbyte per cpu, for something 98% of systems will use -none- of.
Ouch.
All the more motivation for nuking cpuset_cache.
Patch coming in a minute.