2004-01-08 11:48:00

by Dipankar Sarma

Subject: [patch] RCU for low latency [0/1]

On akpm's prodding, I have been looking at scheduling latency
issues due to long-running interrupts. Since RCU
callbacks are invoked from softirq context, they were one of the first
things needing investigation. I used a P4 2.4GHz UP system with 256MB
memory and dbench 32 in a loop to generate a large number of RCU
updates, and measured scheduling latency using amlat
(http://www.zipworld.com.au/~akpm/linux/schedlat.html#amlat).
All measurements were done with preemption enabled, of course.

The vanilla 2.6.0-test8 kernel showed a latency of around 800 microseconds,
with RCU callback batches as long as 1800 callbacks. Each of these
callbacks was essentially just a dentry being returned to its slab, and it
turns out that we can return on average about 4 objects per microsecond
on that system. Thus the longest RCU callback handling alone took
over 400 microseconds (1800 callbacks at roughly 4 per microsecond is
about 450 microseconds). Clearly, for sub-500 microsecond latency,
we needed to do something about this.

I have a number of experimental patches that improve latency.
The first approach took advantage of the UP kernel by ensuring
that read-side critical sections of RCU users are not interrupted
by a handler that can update them. This showed good results -
consistently around 400 microseconds of latency. However, since
people seem to care about latency on SMP too, there is another
set of patches that uses per-cpu kernel threads for RCU callback processing,
pretty much like ksoftirqd. Beyond a certain limit, the remaining
RCU callbacks are punted to the kernel thread, which can be preempted
for better latency. I did consider keventd, but it turned out
to be problematic, mainly because it can get starved and other
keventd work can block for a long time, holding up RCU. One other
optimization I did, at akpm's suggestion, was to do the RCU
callback punting only if there is an RT-priority task in
the runqueue of that CPU. This approach too showed good results,
with latency limited to around 400 microseconds.

Of course, all of these measurements were done with one type of workload
aimed at generating RCU callbacks; it would be interesting to see
how they fare with other workloads. Yet another kernel thread
is icky, but I can't see another way to allow RT tasks to run.
I have also been doing instrumentation to see where the rest
of the (non-RCU-related) latency is coming from, particularly
whether there are long-running irqs/softirqs, and apparently that is
not the case. It looks like consecutive interrupts arriving at
inopportune moments are causing the additional latency.

The actual RCU patches for low latency on SMP follow. A
kernel parameter, rcubhlimit, sets the maximum number of callbacks
invoked from softirq context before punting to krcud (default 256).
rcubhlimit=0 means no krcud at all.

These are meant for latency investigations; they are not intended
for inclusion yet.

Thanks
Dipankar


2004-01-08 11:52:09

by Dipankar Sarma

Subject: Re: [patch] RCU for low latency [2/2]


Reduce bh processing time of RCU callbacks by using tunable per-cpu
krcud daemons.


include/linux/rcupdate.h | 4 +
kernel/rcupdate.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 103 insertions(+), 2 deletions(-)

diff -puN include/linux/rcupdate.h~rcu-reduce-bh-time include/linux/rcupdate.h
--- linux-2.6.0-test8-smprcu/include/linux/rcupdate.h~rcu-reduce-bh-time 2003-12-29 22:50:32.000000000 +0530
+++ linux-2.6.0-test8-smprcu-dipankar/include/linux/rcupdate.h 2003-12-29 22:50:32.000000000 +0530
@@ -93,9 +93,11 @@ struct rcu_data {
long qsctr; /* User-mode/idle loop etc. */
long last_qsctr; /* value of qsctr at beginning */
/* of rcu grace period */
+ struct task_struct *krcud;
long batch; /* Batch # for current RCU batch */
struct list_head nxtlist;
struct list_head curlist;
+ struct list_head rcudlist;
};

DECLARE_PER_CPU(struct rcu_data, rcu_data);
@@ -103,9 +105,11 @@ extern struct rcu_ctrlblk rcu_ctrlblk;

#define RCU_qsctr(cpu) (per_cpu(rcu_data, (cpu)).qsctr)
#define RCU_last_qsctr(cpu) (per_cpu(rcu_data, (cpu)).last_qsctr)
+#define RCU_krcud(cpu) (per_cpu(rcu_data, (cpu)).krcud)
#define RCU_batch(cpu) (per_cpu(rcu_data, (cpu)).batch)
#define RCU_nxtlist(cpu) (per_cpu(rcu_data, (cpu)).nxtlist)
#define RCU_curlist(cpu) (per_cpu(rcu_data, (cpu)).curlist)
+#define RCU_rcudlist(cpu) (per_cpu(rcu_data, (cpu)).rcudlist)

#define RCU_QSCTR_INVALID 0

diff -puN kernel/rcupdate.c~rcu-reduce-bh-time kernel/rcupdate.c
--- linux-2.6.0-test8-smprcu/kernel/rcupdate.c~rcu-reduce-bh-time 2003-12-29 22:50:32.000000000 +0530
+++ linux-2.6.0-test8-smprcu-dipankar/kernel/rcupdate.c 2003-12-29 22:50:32.000000000 +0530
@@ -54,6 +54,11 @@ DEFINE_PER_CPU(struct rcu_data, rcu_data
/* Fake initialization required by compiler */
static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
#define RCU_tasklet(cpu) (per_cpu(rcu_tasklet, cpu))
+#ifdef CONFIG_PREEMPT
+static int rcu_max_bh_callbacks = 256;
+#else
+static int rcu_max_bh_callbacks = 0;
+#endif

/**
* call_rcu - Queue an RCU update request.
@@ -79,6 +84,13 @@ void call_rcu(struct rcu_head *head, voi
local_irq_restore(flags);
}

+static inline int rcu_limiting_needed(int cpu)
+{
+ if (in_softirq() && RCU_krcud(cpu))
+ return 1;
+ return 0;
+}
+
/*
* Invoke the completed RCU callbacks. They are expected to be in
* a per-cpu list.
@@ -87,13 +99,24 @@ static void rcu_do_batch(struct list_hea
{
struct list_head *entry;
struct rcu_head *head;
+ int count = 0;
+ int cpu = smp_processor_id();
+ int limit = rcu_limiting_needed(cpu);

while (!list_empty(list)) {
entry = list->next;
list_del(entry);
head = list_entry(entry, struct rcu_head, list);
head->func(head->arg);
+ count++;
+ if (limit && count > rcu_max_bh_callbacks &&
+ rq_has_rt_task(cpu)) {
+ list_splice(list, &RCU_rcudlist(cpu));
+ wake_up_process(RCU_krcud(cpu));
+ break;
+ }
}
+
}

/*
@@ -198,12 +221,67 @@ void rcu_check_callbacks(int cpu, int us
tasklet_schedule(&RCU_tasklet(cpu));
}

+static int krcud(void * __bind_cpu)
+{
+ int cpu = (int) (long) __bind_cpu;
+
+ daemonize("krcud/%d", cpu);
+ set_user_nice(current, -19);
+ current->flags |= PF_IOTHREAD;
+
+ /* Migrate to the right CPU */
+ set_cpus_allowed(current, cpumask_of_cpu(cpu));
+ BUG_ON(smp_processor_id() != cpu);
+
+ __set_current_state(TASK_INTERRUPTIBLE);
+ mb();
+
+ RCU_krcud(cpu) = current;
+
+ for (;;) {
+ LIST_HEAD(list);
+
+ if (list_empty(&RCU_rcudlist(cpu)))
+ schedule();
+
+ __set_current_state(TASK_RUNNING);
+
+ local_bh_disable();
+ while (!list_empty(&RCU_rcudlist(cpu))) {
+ list_splice(&RCU_rcudlist(cpu), &list);
+ INIT_LIST_HEAD(&RCU_rcudlist(cpu));
+ local_bh_enable();
+ rcu_do_batch(&list);
+ cond_resched();
+ local_bh_disable();
+ }
+ local_bh_enable();
+
+ __set_current_state(TASK_INTERRUPTIBLE);
+ }
+}
+
+static int start_krcud(int cpu)
+{
+ if (rcu_max_bh_callbacks) {
+ if (kernel_thread(krcud, (void *)(long)cpu, CLONE_KERNEL) < 0) {
+ printk("krcud for %i failed\n", cpu);
+ return -1;
+ }
+
+ while (!RCU_krcud(cpu))
+ yield();
+ }
+ return 0;
+}
+
static void __devinit rcu_online_cpu(int cpu)
{
memset(&per_cpu(rcu_data, cpu), 0, sizeof(struct rcu_data));
tasklet_init(&RCU_tasklet(cpu), rcu_process_callbacks, 0UL);
INIT_LIST_HEAD(&RCU_nxtlist(cpu));
INIT_LIST_HEAD(&RCU_curlist(cpu));
+ INIT_LIST_HEAD(&RCU_rcudlist(cpu));
}

static int __devinit rcu_cpu_notify(struct notifier_block *self,
@@ -214,6 +292,10 @@ static int __devinit rcu_cpu_notify(stru
case CPU_UP_PREPARE:
rcu_online_cpu(cpu);
break;
+ case CPU_ONLINE:
+ if (start_krcud(cpu) != 0)
+ return NOTIFY_BAD;
+ break;
/* Space reserved for CPU_OFFLINE :) */
default:
break;
@@ -233,12 +315,27 @@ static struct notifier_block __devinitda
*/
void __init rcu_init(void)
{
- rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE,
- (void *)(long)smp_processor_id());
+ rcu_online_cpu(smp_processor_id());
/* Register notifier for non-boot CPUs */
register_cpu_notifier(&rcu_nb);
}

+static int __init rcu_late_init(void)
+{
+ return start_krcud(smp_processor_id());
+}
+
+__initcall(rcu_late_init);
+
+static int __init rcu_bh_limit_setup(char *str)
+{
+ if (get_option(&str, &rcu_max_bh_callbacks) != 1)
+ BUG();
+ return 0;
+}
+
+__setup("rcubhlimit=", rcu_bh_limit_setup);
+

/* Because of FASTCALL declaration of complete, we use this wrapper */
static void wakeme_after_rcu(void *completion)

_

2004-01-08 11:50:02

by Dipankar Sarma

Subject: Re: [patch] RCU for low latency [1/2]


Provide a rq_has_rt_task() interface to detect runqueues with
real time priority tasks. Useful for RCU optimizations.


include/linux/sched.h | 1 +
kernel/sched.c | 31 ++++++++++++++++++++++++++++++-
2 files changed, 31 insertions(+), 1 deletion(-)

diff -puN include/linux/sched.h~rq-has-rt-task include/linux/sched.h
--- linux-2.6.0-test8-smprcu/include/linux/sched.h~rq-has-rt-task 2003-12-29 19:40:46.000000000 +0530
+++ linux-2.6.0-test8-smprcu-dipankar/include/linux/sched.h 2003-12-29 19:40:46.000000000 +0530
@@ -524,6 +524,7 @@ extern int task_prio(task_t *p);
extern int task_nice(task_t *p);
extern int task_curr(task_t *p);
extern int idle_cpu(int cpu);
+extern int rq_has_rt_task(int cpu);

void yield(void);

diff -puN kernel/sched.c~rq-has-rt-task kernel/sched.c
--- linux-2.6.0-test8-smprcu/kernel/sched.c~rq-has-rt-task 2003-12-29 19:40:46.000000000 +0530
+++ linux-2.6.0-test8-smprcu-dipankar/kernel/sched.c 2003-12-29 19:40:46.000000000 +0530
@@ -199,7 +199,7 @@ struct prio_array {
struct runqueue {
spinlock_t lock;
unsigned long nr_running, nr_switches, expired_timestamp,
- nr_uninterruptible;
+ nr_uninterruptible, nr_rt_running;
task_t *curr, *idle;
struct mm_struct *prev_mm;
prio_array_t *active, *expired, arrays[2];
@@ -275,6 +275,19 @@ __init void node_nr_running_init(void)

#endif /* CONFIG_NUMA */

+static inline void nr_rt_running_inc(runqueue_t *rq, struct task_struct *p)
+{
+ if (rt_task(p))
+ rq->nr_rt_running++;
+}
+
+static inline void nr_rt_running_dec(runqueue_t *rq, struct task_struct *p)
+{
+ if (rt_task(p))
+ rq->nr_rt_running--;
+}
+
+
/*
* task_rq_lock - lock the runqueue a given task resides on and disable
* interrupts. Note the ordering: we can safely lookup the task_rq without
@@ -339,6 +352,17 @@ static inline void enqueue_task(struct t
}

/*
+ * rq_has_rt_task - return 1 if the runqueue has any RT task else return 0
+ * It does not lock the runqueue, the caller needs to explicitly lock
+ * the runqueue if it cares about that.
+ */
+int rq_has_rt_task(int cpu)
+{
+ struct runqueue *rq = cpu_rq(cpu);
+ return (rq->nr_rt_running != 0);
+}
+
+/*
* effective_prio - return the priority that is based on the static
* priority but is modified by bonuses/penalties.
*
@@ -376,6 +400,7 @@ static inline void __activate_task(task_
{
enqueue_task(p, rq->active);
nr_running_inc(rq);
+ nr_rt_running_inc(rq, p);
}

static void recalc_task_prio(task_t *p, unsigned long long now)
@@ -498,6 +523,7 @@ static inline void activate_task(task_t
static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
{
nr_running_dec(rq);
+ nr_rt_running_dec(rq, p);
if (p->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
dequeue_task(p, p->array);
@@ -691,6 +717,7 @@ void wake_up_forked_process(task_t * p)
p->array = current->array;
p->array->nr_active++;
nr_running_inc(rq);
+ nr_rt_running_inc(rq, p);
}
task_rq_unlock(rq, &flags);
}
@@ -1117,8 +1144,10 @@ static inline void pull_task(runqueue_t
{
dequeue_task(p, src_array);
nr_running_dec(src_rq);
+ nr_rt_running_dec(src_rq, p);
set_task_cpu(p, this_cpu);
nr_running_inc(this_rq);
+ nr_rt_running_inc(this_rq, p);
enqueue_task(p, this_rq->active);
/*
* Note that idle threads have a prio of MAX_PRIO, for this test

_

2004-01-08 15:13:49

by Nick Piggin

Subject: Re: [patch] RCU for low latency [1/2]



Dipankar Sarma wrote:

>Provide a rq_has_rt_task() interface to detect runqueues with
>real time priority tasks. Useful for RCU optimizations.
>

Can you make rq_has_rt_task() the slow path? Adding things like this
can actually be noticeable on microbenchmarks (e.g. pipe-based context
switching). It's probably cache artifacts that I see, but it wouldn't
hurt to keep the scheduler as tight as possible.

I think this should cover it.

int rq_has_rt_task(int cpu)
{
runqueue_t *rq = cpu_rq(cpu);
return (sched_find_first_bit(rq->active) < MAX_RT_PRIO);
}

Any good?



2004-01-08 19:34:38

by Dipankar Sarma

Subject: Re: [patch] RCU for low latency [1/2]

On Fri, Jan 09, 2004 at 02:12:00AM +1100, Nick Piggin wrote:
> Dipankar Sarma wrote:
>
> >Provide a rq_has_rt_task() interface to detect runqueues with
> >real time priority tasks. Useful for RCU optimizations.
> >
>
> Can you make rq_has_rt_task the slow path? Adding things like this
> can actually be noticable on microbenchmarks (eg. pipe based ctx
> switching). Its probably cache artifacts that I see, but it wouldn't
> hurt to keep the scheduler as tight as possible.
>
> I think this should cover it.
>
> int rq_has_rt_task(int cpu)
> {
> runqueue_t *rq = cpu_rq(cpu);
> return (sched_find_first_bit(rq->active) < MAX_RT_PRIO);
> }
>
> Any good?

Much better, except, well, rq->active->bitmap. Here is the new
rq_has_rt_task() patch.

Thanks
Dipankar



Provide a rq_has_rt_task() interface to detect runqueues with
real time priority tasks. Useful for RCU optimizations.


include/linux/sched.h | 1 +
kernel/sched.c | 6 ++++++
2 files changed, 7 insertions(+)

diff -puN kernel/sched.c~rq-has-rt-task kernel/sched.c
--- linux-2.6.0-smprcu/kernel/sched.c~rq-has-rt-task 2004-01-08 23:16:02.000000000 +0530
+++ linux-2.6.0-smprcu-dipankar/kernel/sched.c 2004-01-08 23:36:03.000000000 +0530
@@ -338,6 +338,12 @@ static inline void enqueue_task(struct t
p->array = array;
}

+int rq_has_rt_task(int cpu)
+{
+ runqueue_t *rq = cpu_rq(cpu);
+ return (sched_find_first_bit(rq->active->bitmap) < MAX_RT_PRIO);
+}
+
/*
* effective_prio - return the priority that is based on the static
* priority but is modified by bonuses/penalties.
diff -puN include/linux/sched.h~rq-has-rt-task include/linux/sched.h
--- linux-2.6.0-smprcu/include/linux/sched.h~rq-has-rt-task 2004-01-08 23:36:27.000000000 +0530
+++ linux-2.6.0-smprcu-dipankar/include/linux/sched.h 2004-01-08 23:36:52.000000000 +0530
@@ -525,6 +525,7 @@ extern int task_prio(task_t *p);
extern int task_nice(task_t *p);
extern int task_curr(task_t *p);
extern int idle_cpu(int cpu);
+extern int rq_has_rt_task(int cpu);

void yield(void);


_

2004-01-09 00:03:29

by Rusty Russell (IBM)

Subject: Re: [patch] RCU for low latency [2/2]

In message <[email protected]> you write:
> @@ -54,6 +54,11 @@ DEFINE_PER_CPU(struct rcu_data, rcu_data
> /* Fake initialization required by compiler */
> static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
> #define RCU_tasklet(cpu) (per_cpu(rcu_tasklet, cpu))
> +#ifdef CONFIG_PREEMPT
> +static int rcu_max_bh_callbacks = 256;
> +#else
> +static int rcu_max_bh_callbacks = 0;
> +#endif
>
> +static inline int rcu_limiting_needed(int cpu)
> +{
> + if (in_softirq() && RCU_krcud(cpu))
> + return 1;
> + return 0;
> +}
> +
> /*
> * Invoke the completed RCU callbacks. They are expected to be in
> * a per-cpu list.
> @@ -87,13 +99,24 @@ static void rcu_do_batch(struct list_hea
> {
> struct list_head *entry;
> struct rcu_head *head;
> + int count = 0;
> + int cpu = smp_processor_id();
> + int limit = rcu_limiting_needed(cpu);
>
> while (!list_empty(list)) {
> entry = list->next;
> list_del(entry);
> head = list_entry(entry, struct rcu_head, list);
> head->func(head->arg);
> + count++;
> + if (limit && count > rcu_max_bh_callbacks &&
> + rq_has_rt_task(cpu)) {

Perhaps you should do this the other way:

static inline unsigned int max_rcu_at_once(int cpu)
{
if (in_softirq() && RCU_krcud(cpu) && rq_has_rt_task(cpu))
return rcu_max_bh_callbacks;
return (unsigned int)-1;
}

....

unsigned int count, limit = max_rcu_at_once(cpu);

...
if (++count > limit)

Ideally you'd create a new workqueue for this, or at the very least
use kthread primitives (once they're in -mm, hopefully soon).
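
[For illustration only, a rough sketch of what the workqueue variant might
look like with the 2.6.0-era workqueue API, where create_workqueue() spawns
one worker thread per CPU and queue_work() queues on the calling CPU's
worker; the names and wiring below are hypothetical and not part of the
posted patches.]

#include <linux/init.h>
#include <linux/percpu.h>
#include <linux/workqueue.h>

static struct workqueue_struct *krcu_wq;
static DEFINE_PER_CPU(struct work_struct, krcu_work);

/* Work function: drain this CPU's punted callback list, much like
 * krcud's inner loop in the patch quoted below. */
static void krcu_drain(void *unused)
{
        /* process RCU_rcudlist(smp_processor_id()) here */
}

static int __init krcu_wq_init(void)
{
        int cpu;

        krcu_wq = create_workqueue("krcud");    /* one worker per CPU */
        if (!krcu_wq)
                return -ENOMEM;
        for (cpu = 0; cpu < NR_CPUS; cpu++)
                INIT_WORK(&per_cpu(krcu_work, cpu), krcu_drain, NULL);
        return 0;
}
__initcall(krcu_wq_init);

rcu_do_batch() would then punt with something like
queue_work(krcu_wq, &per_cpu(krcu_work, smp_processor_id())) instead of
wake_up_process(RCU_krcud(cpu)).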

> +static int start_krcud(int cpu)
> +{
> + if (rcu_max_bh_callbacks) {
> + if (kernel_thread(krcud, (void *)(long)cpu, CLONE_KERNEL) < 0) {
> + printk("krcud for %i failed\n", cpu);
> + return -1;

EPERM seems unusual here. How about:

int pid;
pid = kernel_thread(krcud, (void *)(long)cpu, CLONE_KERNEL);
if (pid < 0) {
printk("krcud for %i failed\n", cpu);
return pid;
}


> +__setup("rcubhlimit=", rcu_bh_limit_setup);

Please use module_param(). If it's a good name for a variable, it's a
good name for a parameter. E.g. just call the variable bhlimit, and
then the boot parameter will be called "rcu.bhlimit", which fits the
calling convention we're trying to encourage.

Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2004-01-14 08:23:49

by Dipankar Sarma

Subject: Re: [patch] RCU for low latency [2/2]

Hello Rusty,

On Fri, Jan 09, 2004 at 10:43:34AM +1100, Rusty Russell wrote:
> In message <[email protected]> you write:
> > + int limit = rcu_limiting_needed(cpu);
> >
> > while (!list_empty(list)) {
> > entry = list->next;
> > list_del(entry);
> > head = list_entry(entry, struct rcu_head, list);
> > head->func(head->arg);
> > + count++;
> > + if (limit && count > rcu_max_bh_callbacks &&
> > + rq_has_rt_task(cpu)) {
>
> Perhaps you should do this the other way:
>
> static inline unsigned int max_rcu_at_once(int cpu)
> {
> if (in_softirq() && RCU_krcud(cpu) && rq_has_rt_task(cpu))
> return rcu_max_bh_callbacks;
> return (unsigned int)-1;
> }

Done, except that once we reach the callback limit, we need to check
for RT tasks after every callback, instead of at the start of the RCU batch.

> Ideally you'd create a new workqueue for this, or at the very least
> use kthread primitives (once they're in -mm, hopefully soon).

I will use kthread primitives once they are available in mainline.

>
> > +static int start_krcud(int cpu)
> > +{
> > + if (rcu_max_bh_callbacks) {
> > + if (kernel_thread(krcud, (void *)(long)cpu, CLONE_KERNEL) < 0) {
> > + printk("krcud for %i failed\n", cpu);
> > + return -1;
>
> EPERM seems unusual here. How about:
>
> int pid;
> pid = kernel_thread(krcud, (void *)(long)cpu, CLONE_KERNEL);
> if (pid < 0) {
> printk(printk("krcud for %i failed\n", cpu);
> return pid;
> }

All these go away with kthreads.

>
> > +__setup("rcubhlimit=", rcu_bh_limit_setup);
>
> Please use module_param(). If it's a good name for a variable, it's a
> good name for a parameter. eg. just call the variable bhlimit, and
> then the boot parameter will be called "rcu.bhlimit" which fits the
> calling conventio we're trying to encourage.

I will clean this up later should we come to a conclusion that
we need the low-latency changes in mainline. I don't see
any non-driver kernel code using module_param() though.

New patch below. It needs the rq-has-rt-task.patch I mailed earlier.
There are more issues that need investigation - can we starve
RCU callbacks, leading to OOMs? Should we compile out the krcuds
based on a config option (CONFIG_PREEMPT?)? Any suggestions?

Thanks
Dipankar


Reduce bh processing time of RCU callbacks by using tunable per-cpu
krcud daemons.


include/linux/rcupdate.h | 4 +
kernel/rcupdate.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 101 insertions(+), 2 deletions(-)

diff -puN include/linux/rcupdate.h~rcu-reduce-bh-time include/linux/rcupdate.h
--- linux-2.6.0-smprcu/include/linux/rcupdate.h~rcu-reduce-bh-time 2004-01-08 23:38:36.000000000 +0530
+++ linux-2.6.0-smprcu-dipankar/include/linux/rcupdate.h 2004-01-12 15:18:53.000000000 +0530
@@ -93,9 +93,11 @@ struct rcu_data {
long qsctr; /* User-mode/idle loop etc. */
long last_qsctr; /* value of qsctr at beginning */
/* of rcu grace period */
+ struct task_struct *krcud;
long batch; /* Batch # for current RCU batch */
struct list_head nxtlist;
struct list_head curlist;
+ struct list_head rcudlist;
};

DECLARE_PER_CPU(struct rcu_data, rcu_data);
@@ -103,9 +105,11 @@ extern struct rcu_ctrlblk rcu_ctrlblk;

#define RCU_qsctr(cpu) (per_cpu(rcu_data, (cpu)).qsctr)
#define RCU_last_qsctr(cpu) (per_cpu(rcu_data, (cpu)).last_qsctr)
+#define RCU_krcud(cpu) (per_cpu(rcu_data, (cpu)).krcud)
#define RCU_batch(cpu) (per_cpu(rcu_data, (cpu)).batch)
#define RCU_nxtlist(cpu) (per_cpu(rcu_data, (cpu)).nxtlist)
#define RCU_curlist(cpu) (per_cpu(rcu_data, (cpu)).curlist)
+#define RCU_rcudlist(cpu) (per_cpu(rcu_data, (cpu)).rcudlist)

#define RCU_QSCTR_INVALID 0

diff -puN kernel/rcupdate.c~rcu-reduce-bh-time kernel/rcupdate.c
--- linux-2.6.0-smprcu/kernel/rcupdate.c~rcu-reduce-bh-time 2004-01-08 23:38:36.000000000 +0530
+++ linux-2.6.0-smprcu-dipankar/kernel/rcupdate.c 2004-01-14 13:37:05.000000000 +0530
@@ -54,6 +54,11 @@ DEFINE_PER_CPU(struct rcu_data, rcu_data
/* Fake initialization required by compiler */
static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
#define RCU_tasklet(cpu) (per_cpu(rcu_tasklet, cpu))
+#ifdef CONFIG_PREEMPT
+static int rcu_max_bh_callbacks = 256;
+#else
+static int rcu_max_bh_callbacks = 0;
+#endif

/**
* call_rcu - Queue an RCU update request.
@@ -79,6 +84,13 @@ void call_rcu(struct rcu_head *head, voi
local_irq_restore(flags);
}

+static inline unsigned int rcu_bh_callback_limit(int cpu)
+{
+ if (in_softirq() && RCU_krcud(cpu))
+ return rcu_max_bh_callbacks;
+ return (unsigned int)-1;
+}
+
/*
* Invoke the completed RCU callbacks. They are expected to be in
* a per-cpu list.
@@ -87,13 +99,22 @@ static void rcu_do_batch(struct list_hea
{
struct list_head *entry;
struct rcu_head *head;
+ unsigned int count = 0;
+ int cpu = smp_processor_id();
+ unsigned int limit = rcu_bh_callback_limit(cpu);

while (!list_empty(list)) {
entry = list->next;
list_del(entry);
head = list_entry(entry, struct rcu_head, list);
head->func(head->arg);
+ if (++count > limit && rq_has_rt_task(cpu)) {
+ list_splice(list, &RCU_rcudlist(cpu));
+ wake_up_process(RCU_krcud(cpu));
+ break;
+ }
}
+
}

/*
@@ -198,12 +219,67 @@ void rcu_check_callbacks(int cpu, int us
tasklet_schedule(&RCU_tasklet(cpu));
}

+static int krcud(void * __bind_cpu)
+{
+ int cpu = (int) (long) __bind_cpu;
+
+ daemonize("krcud/%d", cpu);
+ set_user_nice(current, -19);
+ current->flags |= PF_IOTHREAD;
+
+ /* Migrate to the right CPU */
+ set_cpus_allowed(current, cpumask_of_cpu(cpu));
+ BUG_ON(smp_processor_id() != cpu);
+
+ __set_current_state(TASK_INTERRUPTIBLE);
+ mb();
+
+ RCU_krcud(cpu) = current;
+
+ for (;;) {
+ LIST_HEAD(list);
+
+ if (list_empty(&RCU_rcudlist(cpu)))
+ schedule();
+
+ __set_current_state(TASK_RUNNING);
+
+ local_bh_disable();
+ while (!list_empty(&RCU_rcudlist(cpu))) {
+ list_splice(&RCU_rcudlist(cpu), &list);
+ INIT_LIST_HEAD(&RCU_rcudlist(cpu));
+ local_bh_enable();
+ rcu_do_batch(&list);
+ cond_resched();
+ local_bh_disable();
+ }
+ local_bh_enable();
+
+ __set_current_state(TASK_INTERRUPTIBLE);
+ }
+}
+
+static int start_krcud(int cpu)
+{
+ if (rcu_max_bh_callbacks) {
+ if (kernel_thread(krcud, (void *)(long)cpu, CLONE_KERNEL) < 0) {
+ printk("krcud for %i failed\n", cpu);
+ return -1;
+ }
+
+ while (!RCU_krcud(cpu))
+ yield();
+ }
+ return 0;
+}
+
static void __devinit rcu_online_cpu(int cpu)
{
memset(&per_cpu(rcu_data, cpu), 0, sizeof(struct rcu_data));
tasklet_init(&RCU_tasklet(cpu), rcu_process_callbacks, 0UL);
INIT_LIST_HEAD(&RCU_nxtlist(cpu));
INIT_LIST_HEAD(&RCU_curlist(cpu));
+ INIT_LIST_HEAD(&RCU_rcudlist(cpu));
}

static int __devinit rcu_cpu_notify(struct notifier_block *self,
@@ -214,6 +290,10 @@ static int __devinit rcu_cpu_notify(stru
case CPU_UP_PREPARE:
rcu_online_cpu(cpu);
break;
+ case CPU_ONLINE:
+ if (start_krcud(cpu) != 0)
+ return NOTIFY_BAD;
+ break;
/* Space reserved for CPU_OFFLINE :) */
default:
break;
@@ -233,12 +313,27 @@ static struct notifier_block __devinitda
*/
void __init rcu_init(void)
{
- rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE,
- (void *)(long)smp_processor_id());
+ rcu_online_cpu(smp_processor_id());
/* Register notifier for non-boot CPUs */
register_cpu_notifier(&rcu_nb);
}

+static int __init rcu_late_init(void)
+{
+ return start_krcud(smp_processor_id());
+}
+
+__initcall(rcu_late_init);
+
+static int __init rcu_bh_limit_setup(char *str)
+{
+ if (get_option(&str, &rcu_max_bh_callbacks) != 1)
+ BUG();
+ return 0;
+}
+
+__setup("rcubhlimit=", rcu_bh_limit_setup);
+

/* Because of FASTCALL declaration of complete, we use this wrapper */
static void wakeme_after_rcu(void *completion)

_

2004-01-15 03:39:16

by Rusty Russell (IBM)

Subject: Re: [patch] RCU for low latency [2/2]

On Wed, 14 Jan 2004 13:54:20 +0530
Dipankar Sarma <[email protected]> wrote:
> > static inline unsigned int max_rcu_at_once(int cpu)
> > {
> > if (in_softirq() && RCU_krcud(cpu) && rq_has_rt_task(cpu))
> > return rcu_max_bh_callbacks;
> > return (unsigned int)-1;
> > }
>
> Done, except that once we reach the callback limit, we need to check
> for RT tasks after every callback, instead of at the start of the RCU batch.

AFAICT, if you're in a softirq it can't change. If you're not, there's
no limit anyway.

> > Ideally you'd create a new workqueue for this, or at the very least
> > use kthread primitives (once they're in -mm, hopefully soon).
>
> I will use kthread primitives once they are available in mainline.

But the ulterior motive is to push the kthread primitives by making as much
code depend on them as possible 8)

> I will clean this up later should we come to a conclusion that
> we need the low-latency changes in mainline. I don't see
> any non-driver kernel code using module_param() though.

I'm trying to catch them as new ones get introduced. If the name is
old-style, then there's little point changing (at least for 2.6).

From now on, I'm being more vigilant 8)

> New patch below. Needs rq-has-rt-task.patch I mailed earlier.
> There are more issues that need investigations - can we starve
> RCU callbacks leading to OOMs

You can screw your machine up with RT tasks, yes. This is no new problem,
I think.

> should we compile out krcuds
> based on a config option (CONFIG_PREEMPT?). Any suggestions ?

Depends on the neatness of the code, I think...

Cheers,
Rusty.
--
there are those who do and those who hang on and you don't see too
many doers quoting their contemporaries. -- Larry McVoy

2004-01-15 06:02:49

by Dipankar Sarma

Subject: Re: [patch] RCU for low latency [2/2]

On Thu, Jan 15, 2004 at 10:35:00AM +1100, Rusty Russell wrote:
> On Wed, 14 Jan 2004 13:54:20 +0530
> Dipankar Sarma <[email protected]> wrote:
> > Done, except that once we reach the callback limit, we need to check
> > for RT tasks after every callback, instead of at the start of the RCU batch.
>
> AFAICT, if you're in a softirq it can't change. If you're not, there's
> no limit anyway.

What if a blocked RT task was woken up by an irq that interrupted
RCU callback processing? All of a sudden you now have an RT task
in the runqueue. Isn't this possible?


> > > Ideally you'd create a new workqueue for this, or at the very least
> > > use kthread primitives (once they're in -mm, hopefully soon).
> >
> > I will use kthread primitives once they are available in mainline.
>
> But ulterior motive is to push the kthread primitives by making as much
> code depend on it as possible 8)

hehe. How nefarious :-)

> > I will clean this up later should we come to a conclusion that
> > we need the low-latency changes in mainline. I don't see
> > any non-driver kernel code using module_param() though.
>
> I'm trying to catch them as new ones get introduced. If the name is
> old-style, then there's little point changing (at least for 2.6).

OK, but I am not sure how to do this for non-module code.

> > New patch below. Needs rq-has-rt-task.patch I mailed earlier.
> > There are more issues that need investigations - can we starve
> > RCU callbacks leading to OOMs
>
> You can screw your machine up with RT tasks, yes. This is no new problem,
> I think.

That is another way to look at it :)

> > should we compile out krcuds
> > based on a config option (CONFIG_PREEMPT?). Any suggestions ?
>
> Depends on the neatness of the code, I think...

Well, there seems to be a difference of opinion about whether the
krcuds should be pervasive or not. Some think they should be,
and some think they should not be when we aren't aiming for low
latency (say, CONFIG_PREEMPT=n).

Thanks
Dipankar

2004-01-20 00:51:39

by Rusty Russell (IBM)

Subject: Re: [patch] RCU for low latency [2/2]

In message <[email protected]> you write:
> On Thu, Jan 15, 2004 at 10:35:00AM +1100, Rusty Russell wrote:
> > On Wed, 14 Jan 2004 13:54:20 +0530
> > Dipankar Sarma <[email protected]> wrote:
> > > Done, except that once we reach the callback limit, we need to check
> > > for RT tasks after every callback, instead of at the start of the RCU batch.
> >
> > AFAICT, if you're in a softirq it can't change. If you're not, there's
> > no limit anyway.
>
> What if a blocked RT task was woken up by an irq that interrupted
> RCU callback processing ? All of a sudden you now have a RT task
> in the queue. Isn't this possible ?

Yes, you're absolutely right. You could just handle this by breaking
out of the loop (after processing one rcu) as soon as there's a
runnable rt task, independent of limit.
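
[Concretely, a minimal sketch of that variant, based on the rcu_do_batch()
in the posted patch but modified per the suggestion above - an illustration
only, not a posted patch.]

static void rcu_do_batch(struct list_head *list)
{
        struct list_head *entry;
        struct rcu_head *head;
        int cpu = smp_processor_id();
        int can_punt = in_softirq() && RCU_krcud(cpu);

        while (!list_empty(list)) {
                entry = list->next;
                list_del(entry);
                head = list_entry(entry, struct rcu_head, list);
                head->func(head->arg);
                /* Punt the rest to krcud as soon as an RT task shows up,
                 * independent of any callback count. */
                if (can_punt && !list_empty(list) && rq_has_rt_task(cpu)) {
                        list_splice(list, &RCU_rcudlist(cpu));
                        wake_up_process(RCU_krcud(cpu));
                        break;
                }
        }
}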

> > But ulterior motive is to push the kthread primitives by making as much
> > code depend on it as possible 8)
>
> hehe. How nefarious :-)

Well, you don't get to be a kernel hacker simply by looking good in
Speedos.

> > I'm trying to catch them as new ones get introduced. If the name is
> > old-style, then there's little point changing (at least for 2.6).
>
> OK, but I am not sure how to do this for non-module code.

module_param() works for non-module code (it automatically inserts a
"rcu." prefix in the name though).

> > You can screw your machine up with RT tasks, yes. This is no new problem,
> > I think.
>
> That is another way to look at it :)

I think it's fair, though. If you really absorb all your CPU with RT
tasks, you will starve important things: that's why RT is a root priv.

> > > should we compile out krcuds
> > > based on a config option (CONFIG_PREEMPT?). Any suggestions ?
> >
> > Depends on the neatness of the code, I think...
>
> Well there seems to be a difference in opinion about whether the
> krcuds should be pervasive or not. Some think they should be,
> some thinks they should not be when we aren't aiming for low
> latency (say CONFIG_PREEMPT=n).

Personally I don't like overloading the semantics of CONFIG_PREEMPT.
But I think CONFIG_PREEMPT is a bad idea anyway: it's not really a
question you can ask a user during config.

CONFIG_LOW_LATENCY (Description: Sacrifice some raw performance to
increase latency) makes more sense, and would fit here, too.

Another option is to overload ksoftirqd to do what krcud does: they're
very similar in nature, AFAICT.

Sorry I can't be more helpful 8)
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2004-01-20 01:26:58

by Lincoln Dale

Subject: Re: [patch] RCU for low latency [2/2]

At 10:25 AM 20/01/2004, Rusty Russell wrote:
>CONFIG_LOW_LATENCY (Description: Sacrifice some raw performance to
>increase latency) makes more sense, and would fit here, too.

_high_ latency and _low_ performance.
i want one of those!

:-)


cheers,

lincoln.