2006-09-03 16:34:43

by Oleg Nesterov

Subject: [PATCH] simplify/improve rcu batch tuning

I have no idea how to test this patch. It passed rcutorture.ko,
but the "improve" part of the subject is only my speculation, so
please comment.

This patch kills a hard-to-calculate 'rsinterval' boot parameter
and per-cpu rcu_data.last_rs_qlen. Instead, it adds a flag
rcu_ctrlblk.signaled, which records the fact that one of the CPUs
has sent a resched IPI since the last rcu_start_batch().

Roughly speaking, we need two rcu_start_batch()s in order to move
callbacks from ->nxtlist to ->donelist. This means that when ->qlen
exceeds qhimark and continues to grow, we should send a resched IPI,
and then do it again after we have gone through a quiescent state.

On the other hand, if it was already sent, we don't need to do
it again when another CPU detects overflow of the queue.
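
For reference, the two-stage movement looks roughly like this, condensed
from 2.6.18's __rcu_process_callbacks() (locking and the batch-start
details are elided, so treat this as a sketch, not the real code):

	static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
					    struct rcu_data *rdp)
	{
		/* 2nd step: our batch has completed, curlist is runnable */
		if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
			*rdp->donetail = rdp->curlist;
			rdp->donetail = rdp->curtail;
			rdp->curlist = NULL;
			rdp->curtail = &rdp->curlist;
		}

		/* 1st step: newly queued callbacks wait for the next batch */
		if (rdp->nxtlist && !rdp->curlist) {
			rdp->curlist = rdp->nxtlist;
			rdp->curtail = rdp->nxttail;
			rdp->nxtlist = NULL;
			rdp->nxttail = &rdp->nxtlist;
			rdp->batch = rcp->cur + 1;
			/* ... may call rcu_start_batch() here ... */
		}

		rcu_check_quiescent_state(rcp, rdp);
		if (rdp->donelist)
			rcu_do_batch(rdp);
	}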

Signed-off-by: Oleg Nesterov <[email protected]>

--- 2.6.18-rc4/Documentation/kernel-parameters.txt~1_auto 2006-08-19 17:50:56.000000000 +0400
+++ 2.6.18-rc4/Documentation/kernel-parameters.txt 2006-09-03 17:47:38.000000000 +0400
@@ -1342,10 +1342,6 @@ running once the system is up.
rcu.qlowmark= [KNL,BOOT] Set threshold of queued
RCU callbacks below which batch limiting is re-enabled.

- rcu.rsinterval= [KNL,BOOT,SMP] Set the number of additional
- RCU callbacks to queued before forcing reschedule
- on all cpus.
-
rdinit= [KNL]
Format: <full_path>
Run specified binary instead of /init from the ramdisk,
--- 2.6.18-rc4/include/linux/rcupdate.h~1_auto 2006-07-16 01:53:08.000000000 +0400
+++ 2.6.18-rc4/include/linux/rcupdate.h 2006-09-03 18:00:59.000000000 +0400
@@ -66,6 +66,8 @@ struct rcu_ctrlblk {
long completed; /* Number of the last completed batch */
int next_pending; /* Is the next batch already waiting? */

+ int signaled;
+
spinlock_t lock ____cacheline_internodealigned_in_smp;
cpumask_t cpumask; /* CPUs that need to switch in order */
/* for current batch to proceed. */
@@ -106,9 +108,6 @@ struct rcu_data {
long blimit; /* Upper limit on a processed batch */
int cpu;
struct rcu_head barrier;
-#ifdef CONFIG_SMP
- long last_rs_qlen; /* qlen during the last resched */
-#endif
};

DECLARE_PER_CPU(struct rcu_data, rcu_data);
--- 2.6.18-rc4/kernel/rcupdate.c~1_auto 2006-08-19 17:50:56.000000000 +0400
+++ 2.6.18-rc4/kernel/rcupdate.c 2006-09-03 18:01:23.000000000 +0400
@@ -71,9 +71,6 @@ static DEFINE_PER_CPU(struct tasklet_str
static int blimit = 10;
static int qhimark = 10000;
static int qlowmark = 100;
-#ifdef CONFIG_SMP
-static int rsinterval = 1000;
-#endif

static atomic_t rcu_barrier_cpu_count;
static DEFINE_MUTEX(rcu_barrier_mutex);
@@ -86,8 +83,8 @@ static void force_quiescent_state(struct
int cpu;
cpumask_t cpumask;
set_need_resched();
- if (unlikely(rdp->qlen - rdp->last_rs_qlen > rsinterval)) {
- rdp->last_rs_qlen = rdp->qlen;
+ if (unlikely(!rcp->signaled)) {
+ rcp->signaled = 1;
/*
* Don't send IPI to itself. With irqs disabled,
* rdp->cpu is the current cpu.
@@ -297,6 +294,7 @@ static void rcu_start_batch(struct rcu_c
smp_mb();
cpus_andnot(rcp->cpumask, cpu_online_map, nohz_cpu_mask);

+ rcp->signaled = 0;
}
}

@@ -624,9 +622,6 @@ void synchronize_rcu(void)
module_param(blimit, int, 0);
module_param(qhimark, int, 0);
module_param(qlowmark, int, 0);
-#ifdef CONFIG_SMP
-module_param(rsinterval, int, 0);
-#endif
EXPORT_SYMBOL_GPL(rcu_batches_completed);
EXPORT_SYMBOL_GPL(rcu_batches_completed_bh);
EXPORT_SYMBOL_GPL(call_rcu);




2006-09-07 20:36:51

by Paul E. McKenney

Subject: Re: [PATCH] simplify/improve rcu batch tuning

On Sun, Sep 03, 2006 at 08:34:19PM +0400, Oleg Nesterov wrote:
> I have no idea how to test this patch. It passed rcutorture.ko,
> but the "improve" part of the subject is only my speculation, so
> please comment.

I definitely like the fact that it kills a boot parameter!!!

Some thoughts for testing...

1. Modify rcutorture.c to keep all the rcutorture kernel threads
off of at least one CPU. Run a CPU-bound user process on that
CPU (a pinned spin loop like the sketch after this list will do).
Compare the rate at which grace periods progress in
the following three situations:

a. With your patch.

b. With stock kernel.

c. With the function disabled (e.g., use the
not-CONFIG_SMP version of force_quiescent_state()).

You would expect to see fast grace-period progress for (a) and
(b), slow for (c).

2. As above, but have another process generating lots of
RCU callbacks, for example, by opening and closing lots
of files, creating and deleting lots of files with long
randomly selected names, thrashing the route cache, or
whatever.
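
For '1.', the CPU-bound user process can be as simple as the following
(a hypothetical test helper, not part of the patch; it assumes CPU 0
is the one kept free of rcutorture threads):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		cpu_set_t mask;

		CPU_ZERO(&mask);
		CPU_SET(0, &mask);	/* the CPU without rcutorture threads */
		if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
			perror("sched_setaffinity");
			exit(1);
		}
		for (;;)
			;		/* spin in userspace, never sleep */
	}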

> This patch kills a hard-to-calculate 'rsinterval' boot parameter
> and per-cpu rcu_data.last_rs_qlen. Instead, it adds a flag
> rcu_ctrlblk.signaled, which records the fact that one of the CPUs
> has sent a resched IPI since the last rcu_start_batch().
>
> Roughly speaking, we need two rcu_start_batch()s in order to move
> callbacks from ->nxtlist to ->donelist. This means that when ->qlen
> exceeds qhimark and continues to grow, we should send a resched IPI,
> and then do it again after we have gone through a quiescent state.
>
> On the other hand, if it was already sent, we don't need to do
> it again when another CPU detects overflow of the queue.

And it does it exactly once for each set of quiescent states until
the qlen drops -- though it seems to rely on call_rcu() being called
on at least one CPU for each stage of queuing. This is a weakness,
but it is no worse than the current behavior, so I cannot complain too
much.

Some additional comments below, but this patch seems to me to be in all
ways a significant improvement over the current code, so acking as is.

Acked-by: Paul E. McKenney <[email protected]>
> Signed-off-by: Oleg Nesterov <[email protected]>
>
> --- 2.6.18-rc4/Documentation/kernel-parameters.txt~1_auto 2006-08-19 17:50:56.000000000 +0400
> +++ 2.6.18-rc4/Documentation/kernel-parameters.txt 2006-09-03 17:47:38.000000000 +0400
> @@ -1342,10 +1342,6 @@ running once the system is up.
> rcu.qlowmark= [KNL,BOOT] Set threshold of queued
> RCU callbacks below which batch limiting is re-enabled.
>
> - rcu.rsinterval= [KNL,BOOT,SMP] Set the number of additional
> - RCU callbacks to queued before forcing reschedule
> - on all cpus.
> -
> rdinit= [KNL]
> Format: <full_path>
> Run specified binary instead of /init from the ramdisk,
> --- 2.6.18-rc4/include/linux/rcupdate.h~1_auto 2006-07-16 01:53:08.000000000 +0400
> +++ 2.6.18-rc4/include/linux/rcupdate.h 2006-09-03 18:00:59.000000000 +0400
> @@ -66,6 +66,8 @@ struct rcu_ctrlblk {
> long completed; /* Number of the last completed batch */
> int next_pending; /* Is the next batch already waiting? */
>
> + int signaled;
> +
> spinlock_t lock ____cacheline_internodealigned_in_smp;
> cpumask_t cpumask; /* CPUs that need to switch in order */
> /* for current batch to proceed. */
> @@ -106,9 +108,6 @@ struct rcu_data {
> long blimit; /* Upper limit on a processed batch */
> int cpu;
> struct rcu_head barrier;
> -#ifdef CONFIG_SMP
> - long last_rs_qlen; /* qlen during the last resched */
> -#endif
> };
>
> DECLARE_PER_CPU(struct rcu_data, rcu_data);
> --- 2.6.18-rc4/kernel/rcupdate.c~1_auto 2006-08-19 17:50:56.000000000 +0400
> +++ 2.6.18-rc4/kernel/rcupdate.c 2006-09-03 18:01:23.000000000 +0400
> @@ -71,9 +71,6 @@ static DEFINE_PER_CPU(struct tasklet_str
> static int blimit = 10;
> static int qhimark = 10000;
> static int qlowmark = 100;
> -#ifdef CONFIG_SMP
> -static int rsinterval = 1000;
> -#endif
>
> static atomic_t rcu_barrier_cpu_count;
> static DEFINE_MUTEX(rcu_barrier_mutex);
> @@ -86,8 +83,8 @@ static void force_quiescent_state(struct
> int cpu;
> cpumask_t cpumask;
> set_need_resched();

Not that it makes a big difference, but why is the above
set_need_resched() not in the body of the following "if" statement?
It used to be important, because it could prevent additional IPIs in
the same grace period, but since the current code will only send one
IPI per grace period, it seems like it can safely be tucked under the
"if" statement.

> - if (unlikely(rdp->qlen - rdp->last_rs_qlen > rsinterval)) {
> - rdp->last_rs_qlen = rdp->qlen;
> + if (unlikely(!rcp->signaled)) {
> + rcp->signaled = 1;
> /*
> * Don't send IPI to itself. With irqs disabled,
> * rdp->cpu is the current cpu.
> @@ -297,6 +294,7 @@ static void rcu_start_batch(struct rcu_c
> smp_mb();
> cpus_andnot(rcp->cpumask, cpu_online_map, nohz_cpu_mask);
>
> + rcp->signaled = 0;

Would it make sense to invoke force_quiescent_state() here in the
case that rdp->qlen is still large? The disadvantage is that qlen
still counts the number of callbacks that are already slated for
invocation. Another approach would be to check rdp->qlen and
rcp->signaled in rcu_do_batch(), but only once rdp->donelist goes
NULL.

Thoughts?

> }
> }
>
> @@ -624,9 +622,6 @@ void synchronize_rcu(void)
> module_param(blimit, int, 0);
> module_param(qhimark, int, 0);
> module_param(qlowmark, int, 0);
> -#ifdef CONFIG_SMP
> -module_param(rsinterval, int, 0);
> -#endif
> EXPORT_SYMBOL_GPL(rcu_batches_completed);
> EXPORT_SYMBOL_GPL(rcu_batches_completed_bh);
> EXPORT_SYMBOL_GPL(call_rcu);
>

2006-09-08 11:40:01

by Oleg Nesterov

Subject: Re: [PATCH] simplify/improve rcu batch tuning

On 09/07, Paul E. McKenney wrote:
>
> Some thoughts for testing...
>
> 1. Modify rcutorture.c to keep all the rcutorture kernel threads
> off of at least one CPU. Run a CPU-bound user process on that
> CPU. Compare the rate at which grace periods progress in
> the following three situations:
>
> a. With your patch.
>
> b. With stock kernel.
>
> c. With the function disabled (e.g., use the
> not-CONFIG_SMP version of force_quiescent_state()).
>
> You would expect to see fast grace-period progress for (a) and
> (b), slow for (c).
>
> 2. As above, but have another process generating lots of
> RCU callbacks, for example, by opening and closing lots
> of files, creating and deleting lots of files with long
> randomly selected names, thrashing the route cache, or
> whatever.

Thanks for the review and suggestions. I'll try to run these tests next week.
Afaics, it is enough to just do

for (;;) close(open(...))

for '2.'.
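
i.e., as a complete program (a hypothetical test driver; the file
name is arbitrary, any file that can be opened will do):

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		for (;;)
			close(open("/etc/passwd", O_RDONLY));
	}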

> > @@ -86,8 +83,8 @@ static void force_quiescent_state(struct
> > int cpu;
> > cpumask_t cpumask;
> > set_need_resched();
>
> Not that it makes a big difference, but why is the above
> set_need_resched() not in the body of the following "if" statement?
> It used to be important, because it could prevent additional IPIs in
> the same grace period, but since the current code will only send one
> IPI per grace period, it seems like it can safely be tucked under the
> "if" statement.

I think there was another reason to do set_need_resched() unconditionally,
but this is only my guess. We are sending IPIs to speed up the flushing of
callbacks we already have in the queue. But set_need_resched() tries to
stop the current process from adding new callbacks (not that it is perfect,
though). Consider the 'for (;;) close(open(...))' loop.

Actually I think it also makes sense to do tasklet_schedule(rcu_tasklet)
in call_rcu(); this way we can detect that we need to start the next batch
earlier.

> > - if (unlikely(rdp->qlen - rdp->last_rs_qlen > rsinterval)) {
> > - rdp->last_rs_qlen = rdp->qlen;
> > + if (unlikely(!rcp->signaled)) {
> > + rcp->signaled = 1;
> > /*
> > * Don't send IPI to itself. With irqs disabled,
> > * rdp->cpu is the current cpu.
> > @@ -297,6 +294,7 @@ static void rcu_start_batch(struct rcu_c
> > smp_mb();
> > cpus_andnot(rcp->cpumask, cpu_online_map, nohz_cpu_mask);
> >
> > + rcp->signaled = 0;
>
> Would it make sense to invoke force_quiescent_state() here in the
> case that rdp->qlen is still large? The disadvantage is that qlen
> still counts the number of callbacks that are already slated for
> invocation.

This is not easy to do. rcu_start_batch() is "global"; we need
to scan all per-cpu 'struct rcu_data' and check its ->qlen.

> Another approach would be to check rdp->qlen and
> rcp->signaled in rcu_do_batch(), but only once rdp->donelist goes
> NULL.

Agree. Probably we don't need to check !rdp->donelist; it should be
empty after the rcu_do_batch() invocation when ->qlen > qhimark, because
in that case ->blimit == INT_MAX.

But first I'd like to do a couple of other cleanups here; I'll send
the patches over the weekend.

Thanks!

Oleg.

2006-09-08 16:01:50

by Paul E. McKenney

Subject: Re: [PATCH] simplify/improve rcu batch tuning

On Fri, Sep 08, 2006 at 03:39:30PM +0400, Oleg Nesterov wrote:
> On 09/07, Paul E. McKenney wrote:
> >
> > Some thoughts for testing...
> >
> > 1. Modify rcutorture.c to keep all the rcutorture kernel threads
> > off of at least one CPU. Run a CPU-bound user process on that
> > CPU. Compare the rate at which grace periods progress in
> > the following three situations:
> >
> > a. With your patch.
> >
> > b. With stock kernel.
> >
> > c. With the function disabled (e.g., use the
> > not-CONFIG_SMP version of force_quiescent_state()).
> >
> > You would expect to see fast grace-period progress for (a) and
> > (b), slow for (c).
> >
> > 2. As above, but have another process generating lots of
> > RCU callbacks, for example, by opening and closing lots
> > of files, creating and deleting lots of files with long
> > randomly selected names, thrashing the route cache, or
> > whatever.
>
> Thanks for the review and suggestions. I'll try to run these tests next week.
> Afaics, it is enough to just do
>
> for (;;) close(open(...))
>
> for '2.'.

That code sequence certainly has been known to generate very large
numbers of RCU callbacks very quickly in the past... ;-)

> > > @@ -86,8 +83,8 @@ static void force_quiescent_state(struct
> > > int cpu;
> > > cpumask_t cpumask;
> > > set_need_resched();
> >
> > Not that it makes a big difference, but why is the above
> > set_need_resched() not in the body of the following "if" statement?
> > It used to be important, because it could prevent additional IPIs in
> > the same grace period, but since the current code will only send one
> > IPI per grace period, it seems like it can safely be tucked under the
> > "if" statement.
>
> I think there was another reason to do set_need_resched() unconditionally,
> but this is only my guess. We are sending IPIs to speed up the flushing of
> callbacks we already have in the queue. But set_need_resched() tries to
> stop the current process from adding new callbacks (not that it is perfect,
> though). Consider the 'for (;;) close(open(...))' loop.

Good point -- should get things going upon return from the next syscall.

> Actually I think it also makes sense to do tasklet_schedule(rcu_tasklet)
> in call_rcu(); this way we can detect that we need to start the next batch
> earlier.

As long as we don't do this too often... One way to prevent doing this
too often would be to check rcp->completed against rdp->batch similarly
to __rcu_process_callbacks()'s checks. In call_rcu(), perhaps something
like the following inside the ->qlen check:

	if (__rcu_pending(&rcu_ctrlblk, rdp)) {
		tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu));
	}

with similar code in call_rcu_bh().
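
In context, and assuming the 2.6.18 call_rcu() body, it would look
something like this (untested sketch):

	void fastcall call_rcu(struct rcu_head *head,
				void (*func)(struct rcu_head *rcu))
	{
		unsigned long flags;
		struct rcu_data *rdp;

		head->func = func;
		head->next = NULL;
		local_irq_save(flags);
		rdp = &__get_cpu_var(rcu_data);
		*rdp->nxttail = head;
		rdp->nxttail = &head->next;
		if (unlikely(++rdp->qlen > qhimark)) {
			rdp->blimit = INT_MAX;
			force_quiescent_state(rdp, &rcu_ctrlblk);
			/* proposed addition: kick the state machine now
			 * rather than waiting for the next tick */
			if (__rcu_pending(&rcu_ctrlblk, rdp))
				tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu));
		}
		local_irq_restore(flags);
	}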

> > > - if (unlikely(rdp->qlen - rdp->last_rs_qlen > rsinterval)) {
> > > - rdp->last_rs_qlen = rdp->qlen;
> > > + if (unlikely(!rcp->signaled)) {
> > > + rcp->signaled = 1;
> > > /*
> > > * Don't send IPI to itself. With irqs disabled,
> > > * rdp->cpu is the current cpu.
> > > @@ -297,6 +294,7 @@ static void rcu_start_batch(struct rcu_c
> > > smp_mb();
> > > cpus_andnot(rcp->cpumask, cpu_online_map, nohz_cpu_mask);
> > >
> > > + rcp->signaled = 0;
> >
> > Would it make sense to invoke force_quiescent_state() here in the
> > case that rdp->qlen is still large? The disadvantage is that qlen
> > still counts the number of callbacks that are already slated for
> > invocation.
>
> This is not easy to do. rcu_start_batch() is "global"; we need
> to scan all per-cpu 'struct rcu_data' and check its ->qlen.

My thought was that it might make sense to check only this CPU's struct
rcu_data. But I agree that the next approach seems more promising.

> > Another approach would be to check rdp->qlen and
> rcp->signaled in rcu_do_batch(), but only once rdp->donelist goes
> > NULL.
>
> Agree. Probably we don't need to check !rdp->donelist; it should be
> empty after the rcu_do_batch() invocation when ->qlen > qhimark, because
> in that case ->blimit == INT_MAX.

And rcu_do_batch() already checks rdp->donelist anyway, so this code
could just be put into the "else" clause of the existing rdp->donelist
check at the end of rcu_do_batch(). So, yes, this does appear to be
the best approach.
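
Roughly like this (an untested sketch against the tail of 2.6.18's
rcu_do_batch(); note that a real patch would also have to pass the
right rcu_ctrlblk down to this function, and force_quiescent_state()
currently expects to run with irqs disabled):

	if (!rdp->donelist) {
		rdp->donetail = &rdp->donelist;
		/* queue still over the high-water mark with nothing
		 * left to invoke: ask for another round of resched IPIs */
		if (rdp->qlen > qhimark)
			force_quiescent_state(rdp, &rcu_ctrlblk);
	} else
		tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu));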

> But first I'd like to do a couple of other cleanups here; I'll send
> the patches over the weekend.

Look forward to seeing them!

Thanx, Paul

2006-09-08 16:30:36

by Oleg Nesterov

Subject: Re: [PATCH] simplify/improve rcu batch tuning

On 09/08, Paul E. McKenney wrote:
>
> On Fri, Sep 08, 2006 at 03:39:30PM +0400, Oleg Nesterov wrote:
> > >
> > Actually I think it also makes sense to do tasklet_schedule(rcu_tasklet)
> > in call_rcu(); this way we can detect that we need to start the next batch
> > earlier.
>
> As long as we don't do this too often... One way to prevent doing this
> too often would be to check rcp->completed against rdp->batch similarly
> to __rcu_process_callbacks()'s checks. In call_rcu(), perhaps something
> like the following inside the ->qlen check:
>
> if (__rcu_pending(&rcu_ctrlblk, rdp)) {
> tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu));
> }

Yes.

> > > > @@ -297,6 +294,7 @@ static void rcu_start_batch(struct rcu_c
> > > > smp_mb();
> > > > cpus_andnot(rcp->cpumask, cpu_online_map, nohz_cpu_mask);
> > > >
> > > > + rcp->signaled = 0;
> > >
> > > Would it make sense to invoke force_quiescent_state() here in the
> > > case that rdp->qlen is still large? The disadvantage is that qlen
> > > still counts the number of callbacks that are already slated for
> > > invocation.
> >
> > This is not easy to do. rcu_start_batch() is "global", we need
> > to scan all per-cpu 'struct rcu_data' and check it's ->qlen.
>
> My thought was that it might make sense to check only this CPU's struct
> rcu_data. But I agree that the next approach seems more promising.

Yes, I understood this. And I don't say this is "bad"; I just think it is
"not perfect", because the CPU which actually starts a new grace period and
clears ->signaled may have ->qlen == 0, and the caller is cpu_quiet(), which
is a "response" to some other CPU's force_quiescent_state().

Oleg.

2006-09-08 18:58:11

by Paul E. McKenney

Subject: Re: [PATCH] simplify/improve rcu batch tuning

On Fri, Sep 08, 2006 at 08:30:22PM +0400, Oleg Nesterov wrote:
> On 09/08, Paul E. McKenney wrote:
> > On Fri, Sep 08, 2006 at 03:39:30PM +0400, Oleg Nesterov wrote:
> > > > > @@ -297,6 +294,7 @@ static void rcu_start_batch(struct rcu_c
> > > > > smp_mb();
> > > > > cpus_andnot(rcp->cpumask, cpu_online_map, nohz_cpu_mask);
> > > > >
> > > > > + rcp->signaled = 0;
> > > >
> > > > Would it make sense to invoke force_quiescent_state() here in the
> > > > case that rdp->qlen is still large? The disadvantage is that qlen
> > > > still counts the number of callbacks that are already slated for
> > > > invocation.
> > >
> > This is not easy to do. rcu_start_batch() is "global"; we need
> > to scan all per-cpu 'struct rcu_data' and check its ->qlen.
> >
> > My thought was that it might make sense to check only this CPU's struct
> > rcu_data. But I agree that the next approach seems more promising.
>
> Yes, I understood this. And I don't say this is "bad"; I just think it is
> "not perfect", because the CPU which actually starts a new grace period and
> clears ->signaled may have ->qlen == 0, and the caller is cpu_quiet(), which
> is a "response" to some other CPU's force_quiescent_state().

Sounds good!

Thanx, Paul