2005-12-14 08:40:51

by Paul Jackson

Subject: [PATCH 01/04] Cpuset: remove rcu slab cache optimization

Revert the two patches that added the slab-based rcu optimization:
cpuset-rcu-slab-cache-optimization
cpuset-rcu-slab-cache-optimization-comment

An improved version, not using the slab cache, will be used
instead. Each slab cache created costs some memory on each cpu
to manage the cache, which is not worth it in this case.

Signed-off-by: Paul Jackson <[email protected]>

---

kernel/cpuset.c | 48 +++++++++++-------------------------------------
1 files changed, 11 insertions(+), 37 deletions(-)

--- 2.6.15-rc3-mm1.orig/kernel/cpuset.c 2005-12-13 16:46:31.446507555 -0800
+++ 2.6.15-rc3-mm1/kernel/cpuset.c 2005-12-13 16:49:01.767509666 -0800
@@ -63,18 +63,6 @@
*/
int number_of_cpusets;

-/*
- * Take cpusets from a slab cache marked SLAB_DESTROY_BY_RCU, so
- * that on the critical code path cpuset_update_task_memory_state(),
- * we can safely read task->cpuset->mems_generation with just
- * an rcu lock, w/o risk of a slab cache free invalidating that
- * memory location. It might not be our tasks cpuset anymore,
- * but at least it will be safe to load from it, which is good
- * enough when checking for a changed mems_generation.
- */
-
-static kmem_cache_t *cpuset_cache;
-
/* See "Frequency meter" comments, below. */

struct fmeter {
@@ -260,10 +248,6 @@ static struct super_block *cpuset_sb;
* a tasks cpuset pointer we use task_lock(), which acts on a spinlock
* (task->alloc_lock) already in the task_struct routinely used for
* such matters.
- *
- * P.S. One more locking exception. See the use of rcu_read_lock
- * when checking a tasks cpuset->mems_generation in the routine
- * cpuset_update_task_memory_state below.
*/

static DECLARE_MUTEX(manage_sem);
@@ -305,7 +289,7 @@ static void cpuset_diput(struct dentry *
if (S_ISDIR(inode->i_mode)) {
struct cpuset *cs = dentry->d_fsdata;
BUG_ON(!(is_removed(cs)));
- kmem_cache_free(cpuset_cache, cs);
+ kfree(cs);
}
iput(inode);
}
@@ -626,17 +610,12 @@ static void guarantee_online_mems(const
* cpuset pointer. This routine also might acquire callback_sem and
* current->mm->mmap_sem during call.
*
- * Reading current->cpuset->mems_generation doesn't need task_lock,
- * because cpusets are on a slab cache marked SLAB_DESTROY_BY_RCU,
- * so the rcu_read_lock() insures the memory read will not yet be
- * unmapped. It's ok if attach_task() replaces our cpuset with
- * another while we are reading mems_generation, and even frees it.
- *
- * We do -not- need to guard the 'cs' pointer dereference within the
- * rcu_read_lock section with rcu_dereference(), because we don't
- * mind getting bogus out-of-order results. New cpuset pointer and
- * old mems_generation is ok - we'll realize that our cpuset memory
- * placement changed the next time through here.
+ * The task_lock() is required to dereference current->cpuset safely.
+ * Without it, we could pick up the pointer value of current->cpuset
+ * in one instruction, and then attach_task could give us a different
+ * cpuset, and then the cpuset we had could be removed and freed,
+ * and then on our next instruction, we could dereference a no longer
+ * valid cpuset pointer to get its mems_generation field.
*
* This routine is needed to update the per-task mems_allowed data,
* within the tasks context, when it is trying to allocate memory
@@ -650,9 +629,9 @@ void cpuset_update_task_memory_state()
struct task_struct *tsk = current;
struct cpuset *cs = tsk->cpuset;

- rcu_read_lock();
+ task_lock(tsk);
my_cpusets_mem_gen = cs->mems_generation;
- rcu_read_unlock();
+ task_unlock(tsk);

if (my_cpusets_mem_gen != tsk->cpuset_mems_generation) {
down(&callback_sem);
@@ -1758,7 +1737,7 @@ static long cpuset_create(struct cpuset
struct cpuset *cs;
int err;

- cs = kmem_cache_alloc(cpuset_cache, GFP_KERNEL);
+ cs = kmalloc(sizeof(*cs), GFP_KERNEL);
if (!cs)
return -ENOMEM;

@@ -1800,7 +1779,7 @@ static long cpuset_create(struct cpuset
err:
list_del(&cs->sibling);
up(&manage_sem);
- kmem_cache_free(cpuset_cache, cs);
+ kfree(cs);
return err;
}

@@ -1878,11 +1857,6 @@ int __init cpuset_init(void)
struct dentry *root;
int err;

- cpuset_cache = kmem_cache_create("cpuset_cache",
- sizeof(struct cpuset), 0,
- SLAB_DESTROY_BY_RCU | SLAB_PANIC,
- NULL, NULL);
-
top_cpuset.cpus_allowed = CPU_MASK_ALL;
top_cpuset.mems_allowed = NODE_MASK_ALL;


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373


2005-12-14 08:40:52

by Paul Jackson

Subject: [PATCH 02/04] Cpuset: use rcu directly optimization

Optimize the cpuset impact on page allocation, the most
performance critical cpuset hook in the kernel.

On each page allocation, the cpuset hook needs to check for a
possible change in the current tasks cpuset. It can now handle
the common case, of no change, without taking any spinlock or
semaphore, thanks to RCU.
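The common case described above is just a generation-number
comparison. A rough userspace sketch of that fast path (all struct
and function names here are hypothetical stand-ins; the real kernel
code guards the pointer read with RCU, as the diff below shows):

```c
#include <assert.h>

/* Userspace sketch of the generation-number fast path.  The
 * structs mimic the task_struct/cpuset fields involved; locking
 * and RCU are elided so the control flow can run standalone. */

struct fake_cpuset {
	int mems_generation;	/* bumped whenever mems_allowed changes */
};

struct fake_task {
	struct fake_cpuset *cpuset;
	int cpuset_mems_generation;	/* generation last seen by this task */
	int slow_path_calls;		/* counts expensive refreshes */
};

static void refresh_mems_allowed(struct fake_task *tsk)
{
	/* stand-in for the locked, expensive update of mems_allowed */
	tsk->cpuset_mems_generation = tsk->cpuset->mems_generation;
	tsk->slow_path_calls++;
}

static void update_task_memory_state(struct fake_task *tsk)
{
	int my_gen = tsk->cpuset->mems_generation;	/* lock-free read */

	if (my_gen != tsk->cpuset_mems_generation)
		refresh_mems_allowed(tsk);		/* rare slow path */
}
```

The point of the patch is that the comparison itself takes no lock
at all; only the rare refresh, when the generation has changed,
takes the semaphore.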

Convert a spinlock on the current task to an rcu_read_lock(),
saving approximately a memory barrier and an atomic op, depending
on architecture.

This is done by adding rcu_assign_pointer() and synchronize_rcu()
calls to the write side of the task->cpuset pointer, in
cpuset.c:attach_task(), to delay freeing up a detached cpuset
until after any critical sections referencing that pointer.

Thanks to Andi Kleen, Nick Piggin and Eric Dumazet for ideas.

Signed-off-by: Paul Jackson <[email protected]>

---

kernel/cpuset.c | 40 ++++++++++++++++++++++++++++++----------
1 files changed, 30 insertions(+), 10 deletions(-)

--- 2.6.15-rc3-mm1.orig/kernel/cpuset.c 2005-12-13 16:49:01.767509666 -0800
+++ 2.6.15-rc3-mm1/kernel/cpuset.c 2005-12-13 17:19:37.989982316 -0800
@@ -39,6 +39,7 @@
#include <linux/namei.h>
#include <linux/pagemap.h>
#include <linux/proc_fs.h>
+#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/seq_file.h>
#include <linux/slab.h>
@@ -248,6 +249,11 @@ static struct super_block *cpuset_sb;
* a tasks cpuset pointer we use task_lock(), which acts on a spinlock
* (task->alloc_lock) already in the task_struct routinely used for
* such matters.
+ *
+ * P.S. One more locking exception. RCU is used to guard the
+ * update of a tasks cpuset pointer by attach_task() and the
+ * access of task->cpuset->mems_generation via that pointer in
+ * the routine cpuset_update_task_memory_state().
*/

static DECLARE_MUTEX(manage_sem);
@@ -610,12 +616,24 @@ static void guarantee_online_mems(const
* cpuset pointer. This routine also might acquire callback_sem and
* current->mm->mmap_sem during call.
*
- * The task_lock() is required to dereference current->cpuset safely.
- * Without it, we could pick up the pointer value of current->cpuset
- * in one instruction, and then attach_task could give us a different
- * cpuset, and then the cpuset we had could be removed and freed,
- * and then on our next instruction, we could dereference a no longer
- * valid cpuset pointer to get its mems_generation field.
+ * Reading current->cpuset->mems_generation doesn't need task_lock
+ * to guard the current->cpuset derefence, because it is guarded
+ * from concurrent freeing of current->cpuset by attach_task(),
+ * using RCU.
+ *
+ * The rcu_dereference() is technically probably not needed,
+ * as I don't actually mind if I see a new cpuset pointer but
+ * an old value of mems_generation. However this really only
+ * matters on alpha systems using cpusets heavily. If I dropped
+ * that rcu_dereference(), it would save them a memory barrier.
+ * For all other arch's, rcu_dereference is a no-op anyway, and for
+ * alpha systems not using cpusets, another planned optimization,
+ * avoiding the rcu critical section for tasks in the root cpuset
+ * which is statically allocated, so can't vanish, will make this
+ * irrelevant. Better to use RCU as intended, than to engage in
+ * some cute trick to save a memory barrier that is impossible to
+ * test, for alpha systems using cpusets heavily, which might not
+ * even exist.
*
* This routine is needed to update the per-task mems_allowed data,
* within the tasks context, when it is trying to allocate memory
@@ -627,11 +645,12 @@ void cpuset_update_task_memory_state()
{
int my_cpusets_mem_gen;
struct task_struct *tsk = current;
- struct cpuset *cs = tsk->cpuset;
+ struct cpuset *cs;

- task_lock(tsk);
+ rcu_read_lock();
+ cs = rcu_dereference(tsk->cpuset);
my_cpusets_mem_gen = cs->mems_generation;
- task_unlock(tsk);
+ rcu_read_unlock();

if (my_cpusets_mem_gen != tsk->cpuset_mems_generation) {
down(&callback_sem);
@@ -1131,7 +1150,7 @@ static int attach_task(struct cpuset *cs
return -ESRCH;
}
atomic_inc(&cs->count);
- tsk->cpuset = cs;
+ rcu_assign_pointer(tsk->cpuset, cs);
task_unlock(tsk);

guarantee_online_cpus(cs, &cpus);
@@ -1151,6 +1170,7 @@ static int attach_task(struct cpuset *cs
if (is_memory_migrate(cs))
do_migrate_pages(tsk->mm, &from, &to, MPOL_MF_MOVE_ALL);
put_task_struct(tsk);
+ synchronize_rcu();
if (atomic_dec_and_test(&oldcs->count))
check_for_release(oldcs, ppathbuf);
return 0;

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2005-12-14 08:41:28

by Paul Jackson

Subject: [PATCH 04/04] Cpuset: skip rcu check if task is in root cpuset

For systems that aren't using cpusets, but have
CONFIG_CPUSET enabled in their kernel (eventually this
may be most distribution kernels), this patch removes
even the minimal rcu_read_lock() from the memory page
allocation path.

Actually, it removes that rcu call for any task that is
in the root cpuset (top_cpuset), which on systems not
actively using cpusets, is all tasks.

We don't need the rcu check for tasks in the top_cpuset,
because the top_cpuset is statically allocated, so at
no risk of being freed out from underneath us.

Signed-off-by: Paul Jackson <[email protected]>

---

kernel/cpuset.c | 13 +++++++++----
1 files changed, 9 insertions(+), 4 deletions(-)

--- 2.6.15-rc3-mm1.orig/kernel/cpuset.c 2005-12-13 18:14:42.529952708 -0800
+++ 2.6.15-rc3-mm1/kernel/cpuset.c 2005-12-13 20:54:26.323911532 -0800
@@ -647,10 +647,15 @@ void cpuset_update_task_memory_state()
struct task_struct *tsk = current;
struct cpuset *cs;

- rcu_read_lock();
- cs = rcu_dereference(tsk->cpuset);
- my_cpusets_mem_gen = cs->mems_generation;
- rcu_read_unlock();
+ if (tsk->cpuset == &top_cpuset) {
+ /* Don't need rcu for top_cpuset. It's never freed. */
+ my_cpusets_mem_gen = top_cpuset.mems_generation;
+ } else {
+ rcu_read_lock();
+ cs = rcu_dereference(tsk->cpuset);
+ my_cpusets_mem_gen = cs->mems_generation;
+ rcu_read_unlock();
+ }

if (my_cpusets_mem_gen != tsk->cpuset_mems_generation) {
down(&callback_sem);

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2005-12-14 08:41:57

by Paul Jackson

Subject: [PATCH 03/04] Cpuset: mark number_of_cpusets read_mostly

Mark cpuset global 'number_of_cpusets' as __read_mostly.

This global is accessed every time a zone is considered
in the zonelist loops beneath __alloc_pages, looking for
a free memory page. If number_of_cpusets is just one,
then we can short-circuit the mems_allowed check.

Since this global is read a lot on a hot path, and written
rarely, it is an excellent candidate for __read_mostly.
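A minimal sketch of the short-circuit this global enables
(simplified, illustrative signatures; the kernel's actual check
operates on a nodemask inside cpuset_zone_allowed()):

```c
#include <assert.h>

/* When only the root cpuset exists, every task's mems_allowed
 * covers all nodes, so the per-zone check can return early
 * without touching the task's nodemask at all. */

int number_of_cpusets = 1;	/* __read_mostly in the kernel */

/* toy nodemask: bit n set means node n is allowed */
static int node_isset(int node, unsigned long mems_allowed)
{
	return (mems_allowed >> node) & 1UL;
}

static int cpuset_zone_allowed(int node, unsigned long task_mems_allowed)
{
	if (number_of_cpusets <= 1)	/* only the root cpuset: allow all */
		return 1;
	return node_isset(node, task_mems_allowed);
}
```

Because this test runs once per zone considered in the allocator's
hot loop, keeping number_of_cpusets on a cache line that is rarely
dirtied is exactly what __read_mostly is for.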

Thanks to Christoph Lameter for the suggestion.

Signed-off-by: Paul Jackson <[email protected]>

---

kernel/cpuset.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)

--- 2.6.15-rc3-mm1.orig/kernel/cpuset.c 2005-12-13 18:08:26.644979252 -0800
+++ 2.6.15-rc3-mm1/kernel/cpuset.c 2005-12-13 18:09:55.658647865 -0800
@@ -62,7 +62,7 @@
* When there is only one cpuset (the root cpuset) we can
* short circuit some hooks.
*/
-int number_of_cpusets;
+int number_of_cpusets __read_mostly;

/* See "Frequency meter" comments, below. */


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2005-12-16 17:51:55

by Paul E. McKenney

Subject: Re: [PATCH 04/04] Cpuset: skip rcu check if task is in root cpuset

On Wed, Dec 14, 2005 at 12:40:49AM -0800, Paul Jackson wrote:
> For systems that aren't using cpusets, but have
> CONFIG_CPUSET enabled in their kernel (eventually this
> may be most distribution kernels), this patch removes
> even the minimal rcu_read_lock() from the memory page
> allocation path.
>
> Actually, it removes that rcu call for any task that is
> in the root cpuset (top_cpuset), which on systems not
> actively using cpusets, is all tasks.
>
> We don't need the rcu check for tasks in the top_cpuset,
> because the top_cpuset is statically allocated, so at
> no risk of being freed out from underneath us.
>
> Signed-off-by: Paul Jackson <[email protected]>
>
> ---
>
> kernel/cpuset.c | 13 +++++++++----
> 1 files changed, 9 insertions(+), 4 deletions(-)
>
> --- 2.6.15-rc3-mm1.orig/kernel/cpuset.c 2005-12-13 18:14:42.529952708 -0800
> +++ 2.6.15-rc3-mm1/kernel/cpuset.c 2005-12-13 20:54:26.323911532 -0800
> @@ -647,10 +647,15 @@ void cpuset_update_task_memory_state()
> struct task_struct *tsk = current;
> struct cpuset *cs;
>
> - rcu_read_lock();
> - cs = rcu_dereference(tsk->cpuset);
> - my_cpusets_mem_gen = cs->mems_generation;
> - rcu_read_unlock();
> + if (tsk->cpuset == &top_cpuset) {
> + /* Don't need rcu for top_cpuset. It's never freed. */
> + my_cpusets_mem_gen = top_cpuset.mems_generation;
> + } else {
> + rcu_read_lock();
> + cs = rcu_dereference(tsk->cpuset);
> + my_cpusets_mem_gen = cs->mems_generation;
> + rcu_read_unlock();
> + }

Hmmm... In non-CONFIG_PREEMPT kernels on non-Alpha CPUs, rcu_read_lock(),
rcu_read_unlock(), and rcu_dereference() do nothing. So in such cases, the
above code will be slower than unconditionally using RCU read side.

In CONFIG_PREEMPT kernels on non-Alpha CPUs, rcu_read_lock() and
rcu_read_unlock() are private non-atomic increment and decrement,
so are likely to be about the same cost as the branch.

In CONFIG_PREEMPT_RT kernels, this optimization would currently buy
you something, but might not once we get rcu_read_lock() and
rcu_read_unlock() more fully optimized.

So I am not convinced that this optimization is worthwhile.

Thanx, Paul

>
> if (my_cpusets_mem_gen != tsk->cpuset_mems_generation) {
> down(&callback_sem);
>
> --
> I won't rest till it's the best ...
> Programmer, Linux Scalability
> Paul Jackson <[email protected]> 1.650.933.1373

2005-12-16 17:55:26

by Paul E. McKenney

Subject: Re: [PATCH 02/04] Cpuset: use rcu directly optimization

On Wed, Dec 14, 2005 at 12:40:37AM -0800, Paul Jackson wrote:
> Optimize the cpuset impact on page allocation, the most
> performance critical cpuset hook in the kernel.
>
> On each page allocation, the cpuset hook needs to check for a
> possible change in the current tasks cpuset. It can now handle
> the common case, of no change, without taking any spinlock or
> semaphore, thanks to RCU.
>
> Convert a spinlock on the current task to an rcu_read_lock(),
> saving approximately a memory barrier and an atomic op, depending
> on architecture.
>
> This is done by adding rcu_assign_pointer() and synchronize_rcu()
> calls to the write side of the task->cpuset pointer, in
> cpuset.c:attach_task(), to delay freeing up a detached cpuset
> until after any critical sections referencing that pointer.
>
> Thanks to Andi Kleen, Nick Piggin and Eric Dumazet for ideas.

Looks good to me from an RCU perspective!

Moving from synchronize_rcu() to call_rcu() would be tricky, since
the check_for_release() function can block in kmalloc(). If updates
become a bottleneck, one approach would be to invoke work queues
from within the RCU callback.
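The handoff Paul suggests, where the non-blocking RCU callback only
queues the blocking release work for process context, can be
sketched in userspace C (every name here is a hypothetical
stand-in; a real implementation would use call_rcu() and
schedule_work()):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal work-queue machinery standing in for schedule_work(). */
struct work_item {
	void (*fn)(void *);
	void *arg;
	struct work_item *next;
};

static struct work_item *work_queue;

static void queue_work_item(struct work_item *w)
{
	w->next = work_queue;
	work_queue = w;
}

static int released;

static void do_release(void *arg)
{
	/* may block (e.g. in kmalloc), so it must run in process
	 * context, never directly inside an RCU callback */
	released = 1;
}

/* The RCU callback itself must not block: it only hands off. */
static void release_rcu_callback(struct work_item *w)
{
	w->fn = do_release;
	queue_work_item(w);
}

/* Stand-in for the worker thread draining the queue. */
static void run_work_queue(void)
{
	while (work_queue) {
		struct work_item *w = work_queue;
		work_queue = w->next;
		w->fn(w->arg);
	}
}
```

The release is thus deferred twice: once for the grace period, and
once more to reach a context where blocking is legal.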

Thanx, Paul

Acked-by: <[email protected]>
> Signed-off-by: Paul Jackson <[email protected]>
>
> ---
>
> kernel/cpuset.c | 40 ++++++++++++++++++++++++++++++----------
> 1 files changed, 30 insertions(+), 10 deletions(-)
>
> --- 2.6.15-rc3-mm1.orig/kernel/cpuset.c 2005-12-13 16:49:01.767509666 -0800
> +++ 2.6.15-rc3-mm1/kernel/cpuset.c 2005-12-13 17:19:37.989982316 -0800
> @@ -39,6 +39,7 @@
> #include <linux/namei.h>
> #include <linux/pagemap.h>
> #include <linux/proc_fs.h>
> +#include <linux/rcupdate.h>
> #include <linux/sched.h>
> #include <linux/seq_file.h>
> #include <linux/slab.h>
> @@ -248,6 +249,11 @@ static struct super_block *cpuset_sb;
> * a tasks cpuset pointer we use task_lock(), which acts on a spinlock
> * (task->alloc_lock) already in the task_struct routinely used for
> * such matters.
> + *
> + * P.S. One more locking exception. RCU is used to guard the
> + * update of a tasks cpuset pointer by attach_task() and the
> + * access of task->cpuset->mems_generation via that pointer in
> + * the routine cpuset_update_task_memory_state().
> */
>
> static DECLARE_MUTEX(manage_sem);
> @@ -610,12 +616,24 @@ static void guarantee_online_mems(const
> * cpuset pointer. This routine also might acquire callback_sem and
> * current->mm->mmap_sem during call.
> *
> - * The task_lock() is required to dereference current->cpuset safely.
> - * Without it, we could pick up the pointer value of current->cpuset
> - * in one instruction, and then attach_task could give us a different
> - * cpuset, and then the cpuset we had could be removed and freed,
> - * and then on our next instruction, we could dereference a no longer
> - * valid cpuset pointer to get its mems_generation field.
> + * Reading current->cpuset->mems_generation doesn't need task_lock
> + * to guard the current->cpuset derefence, because it is guarded
> + * from concurrent freeing of current->cpuset by attach_task(),
> + * using RCU.
> + *
> + * The rcu_dereference() is technically probably not needed,
> + * as I don't actually mind if I see a new cpuset pointer but
> + * an old value of mems_generation. However this really only
> + * matters on alpha systems using cpusets heavily. If I dropped
> + * that rcu_dereference(), it would save them a memory barrier.
> + * For all other arch's, rcu_dereference is a no-op anyway, and for
> + * alpha systems not using cpusets, another planned optimization,
> + * avoiding the rcu critical section for tasks in the root cpuset
> + * which is statically allocated, so can't vanish, will make this
> + * irrelevant. Better to use RCU as intended, than to engage in
> + * some cute trick to save a memory barrier that is impossible to
> + * test, for alpha systems using cpusets heavily, which might not
> + * even exist.
> *
> * This routine is needed to update the per-task mems_allowed data,
> * within the tasks context, when it is trying to allocate memory
> @@ -627,11 +645,12 @@ void cpuset_update_task_memory_state()
> {
> int my_cpusets_mem_gen;
> struct task_struct *tsk = current;
> - struct cpuset *cs = tsk->cpuset;
> + struct cpuset *cs;
>
> - task_lock(tsk);
> + rcu_read_lock();
> + cs = rcu_dereference(tsk->cpuset);
> my_cpusets_mem_gen = cs->mems_generation;
> - task_unlock(tsk);
> + rcu_read_unlock();
>
> if (my_cpusets_mem_gen != tsk->cpuset_mems_generation) {
> down(&callback_sem);
> @@ -1131,7 +1150,7 @@ static int attach_task(struct cpuset *cs
> return -ESRCH;
> }
> atomic_inc(&cs->count);
> - tsk->cpuset = cs;
> + rcu_assign_pointer(tsk->cpuset, cs);
> task_unlock(tsk);
>
> guarantee_online_cpus(cs, &cpus);
> @@ -1151,6 +1170,7 @@ static int attach_task(struct cpuset *cs
> if (is_memory_migrate(cs))
> do_migrate_pages(tsk->mm, &from, &to, MPOL_MF_MOVE_ALL);
> put_task_struct(tsk);
> + synchronize_rcu();
> if (atomic_dec_and_test(&oldcs->count))
> check_for_release(oldcs, ppathbuf);
> return 0;
>
> --
> I won't rest till it's the best ...
> Programmer, Linux Scalability
> Paul Jackson <[email protected]> 1.650.933.1373

2005-12-16 20:07:18

by Paul Jackson

Subject: Re: [PATCH 04/04] Cpuset: skip rcu check if task is in root cpuset

Paul wrote:
> So I am not convinced that this optimization is worthwhile.

Nice analysis - thanks.

I read from your analysis that, except on alpha, we're down to
so little that it's going to be difficult on PREEMPT kernels to
discern a clear difference, and on non-PREEMPT kernels, the
rcu read lock is a no-op, so definitely not worth trying to
optimize away.

On the ia64 sn2_defconfig kernel (which is CONFIG_PREEMPT and
CONFIG_DEBUG_PREEMPT, and is the one I happen to care most about) I see
one short branch with this optimization, versus two calls, to the
add_preempt_count() and sub_preempt_count() routines, if I don't have
this optimization. These two *_preempt_count() routines in
kernel/sched.c generate 172 lines of assembly code, containing
several branches, due to the BUG checks.

So in that case (obviously not one of the cases with a huge installed
base ;) this optimization seems well worth it. One short branch is
cheaper than two subroutine calls to a couple of pages of code.

I can easily accept either way on this one - keeping or removing this
optimization. And it involves tradeoffs that vary by architecture and
configuration, but that aren't (so far as I know) worth custom per-arch
optimized code.

I'd slightly prefer to leave this optimization in, on the grounds that
it makes a worthwhile (albeit modest) improvement in the cases, and is
only trivially worse (an added short branch) in other cases.

What's your recommendation, Paul? You have far more experience here than I.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-12-16 20:22:57

by Paul Jackson

Subject: Re: [PATCH 02/04] Cpuset: use rcu directly optimization

Paul wrote:
> Looks good to me from an RCU perspective!

Thanks for looking at it.

> Moving from synchronize_rcu() to call_rcu() would be tricky,

Yes, likely so.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-12-17 16:47:10

by Paul E. McKenney

Subject: Re: [PATCH 04/04] Cpuset: skip rcu check if task is in root cpuset

On Fri, Dec 16, 2005 at 12:06:51PM -0800, Paul Jackson wrote:
> Paul wrote:
> > So I am not convinced that this optimization is worthwhile.
>
> Nice analysis - thanks.

Glad you liked it. ;-)

> I read from your analysis that, except on alpha, we're down to
> so little that it's going to be difficult on PREEMPT kernels to
> discern a clear difference, and on non-PREEMPT kernels, the
> rcu read lock is a no-op, so definitely not worth trying to
> optimize away.
>
> On the ia64 sn2_defconfig kernel (which is CONFIG_PREEMPT and
> CONFIG_DEBUG_PREEMPT, and is the one I happen to care most about) I see
> one short branch with this optimization, versus two calls, to the
> add_preempt_count() and sub_preempt_count() routines, if I don't have
> this optimization. These two *_preempt_count() routines in
> kernel/sched.c generate 172 lines of assembly code, containing
> several branches, due to the BUG checks.

Hmmmm... Looking at 2.6.14, I see the following:

#define rcu_read_lock() preempt_disable()
#define rcu_read_unlock() preempt_enable()

Chasing down preempt_disable():

#define preempt_disable() \
do { \
inc_preempt_count(); \
barrier(); \
} while (0)

#define inc_preempt_count() add_preempt_count(1)
#define add_preempt_count(val) do {preempt_count() += (val);} while (0)

This reduces to a single increment and a compiler directive.

#define preempt_enable() \
do { \
preempt_enable_no_resched(); \
preempt_check_resched(); \
} while (0)

#define preempt_enable_no_resched() \
do { \
barrier(); \
dec_preempt_count(); \
} while (0)

#define dec_preempt_count() sub_preempt_count(1)
#define sub_preempt_count(val) do {preempt_count() -= (val);} while (0)

#define preempt_check_resched() \
do { \
if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
preempt_schedule(); \
} while (0)

This reduces to a compiler directive, a single decrement, and (in
the common case) a test. I don't see 172 lines of assembly code here.

Ah!!! You have CONFIG_DEBUG_PREEMPT!!!

But that is a debugging configuration. I don't understand the desire
to optimize that configuration.

> So in that case (obviously not one of the cases with a huge installed
> base ;) this optimization seems well worth it. One short branch is
> cheaper than two subroutine calls to a couple of pages of code.

Looking at the debugging versions of add_preempt_count() and
sub_preempt_count() in sched.c, I can indeed believe that the optimization
could provide some benefit. But it seems like you could gain this
benefit and more simply by disabling CONFIG_DEBUG_PREEMPT.

> I can easily accept either way on this one - keeping or removing this
> optimization. And it involves tradeoffs that vary by architecture and
> configuration, but that aren't (so far as I know) worth custom per-arch
> optimized code.
>
> I'd slightly prefer to leave this optimization in, on the grounds that
> it makes a worthwhile (albeit modest) improvement in the cases, and is
> only trivially worse (an added short branch) in other cases.
>
> What's your recommendation, Paul? You have far more experience here than I.

My usual position would be to avoid putting too much effort into optimizing
debug code, but please feel free to educate me on this one!

Thanx, Paul

2005-12-19 14:48:41

by Paul Jackson

Subject: Re: [PATCH 04/04] Cpuset: skip rcu check if task is in root cpuset

Paul wrote:
> But it seems like you could gain this
> benefit and more simply by disabling CONFIG_DEBUG_PREEMPT.

Yup - that would save oodles more than what we're dickering over here.

> My usual position would be to avoid putting too much effort into optimizing
> debug code, but please feel free to educate me on this one!

My position too ;).

I should quit wasting your time (and mine) with further fine tuning
of this test to skip rcu check if task->cpuset == &top_cpuset, and
instead consider removing CONFIG_DEBUG_PREEMPT from at least ia64
(sn2), if not also from the other defconfigs that have it:

collie simpad s390 se7705
lpd7a400 bigsur dreamcast sh03
lpd7a404 microdev systemh mx1ads

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-12-19 16:05:58

by Paul Jackson

Subject: CONFIG_DEBUG_PREEMPT (was: [PATCH 04/04] Cpuset: skip rcu check ...)

pj wrote:
> instead consider removing CONFIG_DEBUG_PREEMPT from at least sn2

Ah - perhaps not so. Adding my SGI colleague Greg Edwards to the cc
list. My email archives suggest that he enabled CONFIG_DEBUG_PREEMPT
in ia64 sn2_defconfig a few months ago, and I presume did so intentionally.

The change enabling CONFIG_DEBUG_PREEMPT was:
user: Greg Edwards <[email protected]>
date: Tue Aug 16 23:38:16 2005 +0011
summary: [IA64] Refresh arch/ia64/configs/sn2_defconfig.

Greg - CONFIG_DEBUG_PREEMPT adds a couple of pages of assembly code
due to various BUG checks beneath rcu_read_lock() on some hot code
paths (which is where rcu is most popular). See the two calls
add_preempt_count() and sub_preempt_count() in kernel/sched.c.

Was enabling CONFIG_DEBUG_PREEMPT in sn2_defconfig intentional?

Other evidence opposing this DEBUG option:

Most other DEBUG options are turned off in the defconfigs.

Other evidence supporting setting this DEBUG option:

We're not the only arch enabling CONFIG_DEBUG_PREEMPT. See also:
collie simpad s390 se7705
lpd7a400 bigsur dreamcast sh03
lpd7a404 microdev systemh mx1ads

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-12-19 16:39:56

by Greg Edwards

Subject: Re: CONFIG_DEBUG_PREEMPT (was: [PATCH 04/04] Cpuset: skip rcu check ...)

On Mon, Dec 19, 2005 at 08:04:01AM -0800, Paul Jackson wrote:
| Greg - CONFIG_DEBUG_PREEMPT adds a couple of pages of assembly code
| due to various BUG checks beneath rcu_read_lock() on some hot code
| paths (which is where rcu is most popular). See the two calls
| add_preempt_count() and sub_preempt_count() in kernel/sched.c.
|
| Was enabling CONFIG_DEBUG_PREEMPT in sn2_defconfig intentional?

It wasn't intentional at the time. I believe it was pulled in
automatically when we refreshed since we had CONFIG_PREEMPT on. That
said, it has proven itself useful in turning up some bugs.

Greg