2007-09-10 18:30:24

by Paul E. McKenney

Subject: [PATCH RFC 0/9] RCU: Preemptible RCU

Work in progress, still not for inclusion. But code now complete!

This is a respin of the following prior posting:

http://lkml.org/lkml/2007/9/5/268

This release adds a patch that fixes comments and RCU documentation,
and renames one macro. The rcutorture patch has been modified to make
it a bit more vicious to priority boosting (though the current design
relies on -rt latencies for much of the priority-boost torturing
effectiveness in this case -- run the test in the presence of CPU
hotplug operations to get the same effect in -mm). The next step is
rebasing this onto a more recent version of Linux.

Thanx, Paul


2007-09-10 18:32:24

by Paul E. McKenney

Subject: [PATCH RFC 1/9] RCU: Split API to permit multiple RCU implementations

Work in progress, not for inclusion.

This patch reorganizes the RCU code to enable multiple implementations
of RCU. Users of RCU continue to include rcupdate.h and the RCU
interfaces remain the same. This is in preparation for subsequently
merging the preemptible RCU implementation.
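
For illustration only (not part of the patch): RCU users are unaffected
by the split, because rcupdate.h keeps exporting the same API and merely
maps it onto the implementation-specific header. The structure and
function below are hypothetical; only the RCU calls are real.

	#include <linux/rcupdate.h>

	struct foo {
		int data;
		struct rcu_head rcu;
	};
	static struct foo *gbl_foo;	/* assumed published elsewhere */

	static int read_foo_data(void)
	{
		struct foo *p;
		int val = -1;

		rcu_read_lock();		/* now maps to __rcu_read_lock() */
		p = rcu_dereference(gbl_foo);
		if (p != NULL)
			val = p->data;
		rcu_read_unlock();		/* now maps to __rcu_read_unlock() */
		return val;
	}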

Signed-off-by: Dipankar Sarma <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---

include/linux/rcuclassic.h | 149 +++++++++++
include/linux/rcupdate.h | 151 +++---------
kernel/Makefile | 2
kernel/rcuclassic.c | 558 ++++++++++++++++++++++++++++++++++++++++++++
kernel/rcupdate.c | 561 ++-------------------------------------------
5 files changed, 779 insertions(+), 642 deletions(-)

diff -urpNa -X dontdiff linux-2.6.22/include/linux/rcuclassic.h linux-2.6.22-a-splitclassic/include/linux/rcuclassic.h
--- linux-2.6.22/include/linux/rcuclassic.h 1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.22-a-splitclassic/include/linux/rcuclassic.h 2007-08-22 14:42:23.000000000 -0700
@@ -0,0 +1,149 @@
+/*
+ * Read-Copy Update mechanism for mutual exclusion (classic version)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright IBM Corporation, 2001
+ *
+ * Author: Dipankar Sarma <[email protected]>
+ *
+ * Based on the original work by Paul McKenney <[email protected]>
+ * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
+ * Papers:
+ * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
+ * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
+ *
+ * For detailed explanation of Read-Copy Update mechanism see -
+ * Documentation/RCU
+ *
+ */
+
+#ifndef __LINUX_RCUCLASSIC_H
+#define __LINUX_RCUCLASSIC_H
+
+#ifdef __KERNEL__
+
+#include <linux/cache.h>
+#include <linux/spinlock.h>
+#include <linux/threads.h>
+#include <linux/percpu.h>
+#include <linux/cpumask.h>
+#include <linux/seqlock.h>
+
+
+/* Global control variables for rcupdate callback mechanism. */
+struct rcu_ctrlblk {
+ long cur; /* Current batch number. */
+ long completed; /* Number of the last completed batch */
+ int next_pending; /* Is the next batch already waiting? */
+
+ int signaled;
+
+ spinlock_t lock ____cacheline_internodealigned_in_smp;
+ cpumask_t cpumask; /* CPUs that need to switch in order */
+ /* for current batch to proceed. */
+} ____cacheline_internodealigned_in_smp;
+
+/* Is batch a before batch b ? */
+static inline int rcu_batch_before(long a, long b)
+{
+ return (a - b) < 0;
+}
+
+/* Is batch a after batch b ? */
+static inline int rcu_batch_after(long a, long b)
+{
+ return (a - b) > 0;
+}
+
+/*
+ * Per-CPU data for Read-Copy UPdate.
+ * nxtlist - new callbacks are added here
+ * curlist - current batch for which quiescent cycle started if any
+ */
+struct rcu_data {
+ /* 1) quiescent state handling : */
+ long quiescbatch; /* Batch # for grace period */
+ int passed_quiesc; /* User-mode/idle loop etc. */
+ int qs_pending; /* core waits for quiesc state */
+
+ /* 2) batch handling */
+ long batch; /* Batch # for current RCU batch */
+ struct rcu_head *nxtlist;
+ struct rcu_head **nxttail;
+ long qlen; /* # of queued callbacks */
+ struct rcu_head *curlist;
+ struct rcu_head **curtail;
+ struct rcu_head *donelist;
+ struct rcu_head **donetail;
+ long blimit; /* Upper limit on a processed batch */
+ int cpu;
+ struct rcu_head barrier;
+};
+
+DECLARE_PER_CPU(struct rcu_data, rcu_data);
+DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
+
+/*
+ * Increment the quiescent state counter.
+ * The counter is a bit degenerated: We do not need to know
+ * how many quiescent states passed, just if there was at least
+ * one since the start of the grace period. Thus just a flag.
+ */
+static inline void rcu_qsctr_inc(int cpu)
+{
+ struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
+ rdp->passed_quiesc = 1;
+}
+static inline void rcu_bh_qsctr_inc(int cpu)
+{
+ struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
+ rdp->passed_quiesc = 1;
+}
+
+extern int rcu_pending(int cpu);
+extern int rcu_needs_cpu(int cpu);
+
+#define __rcu_read_lock() \
+ do { \
+ preempt_disable(); \
+ __acquire(RCU); \
+ } while (0)
+#define __rcu_read_unlock() \
+ do { \
+ __release(RCU); \
+ preempt_enable(); \
+ } while (0)
+#define __rcu_read_lock_bh() \
+ do { \
+ local_bh_disable(); \
+ __acquire(RCU_BH); \
+ } while (0)
+#define __rcu_read_unlock_bh() \
+ do { \
+ __release(RCU_BH); \
+ local_bh_enable(); \
+ } while (0)
+
+#define __synchronize_sched() synchronize_rcu()
+
+extern void __rcu_init(void);
+extern void rcu_check_callbacks(int cpu, int user);
+extern void rcu_restart_cpu(int cpu);
+extern long rcu_batches_completed(void);
+extern long rcu_batches_completed_bh(void);
+
+#endif /* __KERNEL__ */
+#endif /* __LINUX_RCUCLASSIC_H */
diff -urpNa -X dontdiff linux-2.6.22/include/linux/rcupdate.h linux-2.6.22-a-splitclassic/include/linux/rcupdate.h
--- linux-2.6.22/include/linux/rcupdate.h 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-a-splitclassic/include/linux/rcupdate.h 2007-07-19 14:02:36.000000000 -0700
@@ -15,7 +15,7 @@
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
*
- * Copyright (C) IBM Corporation, 2001
+ * Copyright IBM Corporation, 2001
*
* Author: Dipankar Sarma <[email protected]>
*
@@ -52,6 +52,8 @@ struct rcu_head {
void (*func)(struct rcu_head *head);
};

+#include <linux/rcuclassic.h>
+
#define RCU_HEAD_INIT { .next = NULL, .func = NULL }
#define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT
#define INIT_RCU_HEAD(ptr) do { \
@@ -59,80 +61,6 @@ struct rcu_head {
} while (0)


-
-/* Global control variables for rcupdate callback mechanism. */
-struct rcu_ctrlblk {
- long cur; /* Current batch number. */
- long completed; /* Number of the last completed batch */
- int next_pending; /* Is the next batch already waiting? */
-
- int signaled;
-
- spinlock_t lock ____cacheline_internodealigned_in_smp;
- cpumask_t cpumask; /* CPUs that need to switch in order */
- /* for current batch to proceed. */
-} ____cacheline_internodealigned_in_smp;
-
-/* Is batch a before batch b ? */
-static inline int rcu_batch_before(long a, long b)
-{
- return (a - b) < 0;
-}
-
-/* Is batch a after batch b ? */
-static inline int rcu_batch_after(long a, long b)
-{
- return (a - b) > 0;
-}
-
-/*
- * Per-CPU data for Read-Copy UPdate.
- * nxtlist - new callbacks are added here
- * curlist - current batch for which quiescent cycle started if any
- */
-struct rcu_data {
- /* 1) quiescent state handling : */
- long quiescbatch; /* Batch # for grace period */
- int passed_quiesc; /* User-mode/idle loop etc. */
- int qs_pending; /* core waits for quiesc state */
-
- /* 2) batch handling */
- long batch; /* Batch # for current RCU batch */
- struct rcu_head *nxtlist;
- struct rcu_head **nxttail;
- long qlen; /* # of queued callbacks */
- struct rcu_head *curlist;
- struct rcu_head **curtail;
- struct rcu_head *donelist;
- struct rcu_head **donetail;
- long blimit; /* Upper limit on a processed batch */
- int cpu;
- struct rcu_head barrier;
-};
-
-DECLARE_PER_CPU(struct rcu_data, rcu_data);
-DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
-
-/*
- * Increment the quiescent state counter.
- * The counter is a bit degenerated: We do not need to know
- * how many quiescent states passed, just if there was at least
- * one since the start of the grace period. Thus just a flag.
- */
-static inline void rcu_qsctr_inc(int cpu)
-{
- struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
- rdp->passed_quiesc = 1;
-}
-static inline void rcu_bh_qsctr_inc(int cpu)
-{
- struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
- rdp->passed_quiesc = 1;
-}
-
-extern int rcu_pending(int cpu);
-extern int rcu_needs_cpu(int cpu);
-
/**
* rcu_read_lock - mark the beginning of an RCU read-side critical section.
*
@@ -162,22 +90,14 @@ extern int rcu_needs_cpu(int cpu);
*
* It is illegal to block while in an RCU read-side critical section.
*/
-#define rcu_read_lock() \
- do { \
- preempt_disable(); \
- __acquire(RCU); \
- } while(0)
+#define rcu_read_lock() __rcu_read_lock()

/**
* rcu_read_unlock - marks the end of an RCU read-side critical section.
*
* See rcu_read_lock() for more information.
*/
-#define rcu_read_unlock() \
- do { \
- __release(RCU); \
- preempt_enable(); \
- } while(0)
+#define rcu_read_unlock() __rcu_read_unlock()

/*
* So where is rcu_write_lock()? It does not exist, as there is no
@@ -200,22 +120,14 @@ extern int rcu_needs_cpu(int cpu);
* can use just rcu_read_lock().
*
*/
-#define rcu_read_lock_bh() \
- do { \
- local_bh_disable(); \
- __acquire(RCU_BH); \
- } while(0)
+#define rcu_read_lock_bh() __rcu_read_lock_bh()

/*
* rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
*
* See rcu_read_lock_bh() for more information.
*/
-#define rcu_read_unlock_bh() \
- do { \
- __release(RCU_BH); \
- local_bh_enable(); \
- } while(0)
+#define rcu_read_unlock_bh() __rcu_read_unlock_bh()

/**
* rcu_dereference - fetch an RCU-protected pointer in an
@@ -267,22 +179,49 @@ extern int rcu_needs_cpu(int cpu);
* In "classic RCU", these two guarantees happen to be one and
* the same, but can differ in realtime RCU implementations.
*/
-#define synchronize_sched() synchronize_rcu()
+#define synchronize_sched() __synchronize_sched()

-extern void rcu_init(void);
-extern void rcu_check_callbacks(int cpu, int user);
-extern void rcu_restart_cpu(int cpu);
-extern long rcu_batches_completed(void);
-extern long rcu_batches_completed_bh(void);
-
-/* Exported interfaces */
-extern void FASTCALL(call_rcu(struct rcu_head *head,
- void (*func)(struct rcu_head *head)));
+/**
+ * call_rcu - Queue an RCU callback for invocation after a grace period.
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual update function to be invoked after the grace period
+ *
+ * The update function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * read-side critical sections have completed. RCU read-side critical
+ * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
+ * and may be nested.
+ */
+extern void FASTCALL(call_rcu(struct rcu_head *head,
+ void (*func)(struct rcu_head *head)));
+
+/**
+ * call_rcu_bh - Queue an RCU for invocation after a quicker grace period.
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual update function to be invoked after the grace period
+ *
+ * The update function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * read-side critical sections have completed. call_rcu_bh() assumes
+ * that the read-side critical sections end on completion of a softirq
+ * handler. This means that read-side critical sections in process
+ * context must not be interrupted by softirqs. This interface is to be
+ * used when most of the read-side critical sections are in softirq context.
+ * RCU read-side critical sections are delimited by rcu_read_lock() and
+ * rcu_read_unlock(), if in interrupt context, or rcu_read_lock_bh()
+ * and rcu_read_unlock_bh(), if in process context. These may be nested.
+ */
extern void FASTCALL(call_rcu_bh(struct rcu_head *head,
void (*func)(struct rcu_head *head)));
+
+/* Exported common interfaces */
extern void synchronize_rcu(void);
-void synchronize_idle(void);
extern void rcu_barrier(void);

+/* Internal to kernel */
+extern void rcu_init(void);
+extern void rcu_check_callbacks(int cpu, int user);
+
#endif /* __KERNEL__ */
#endif /* __LINUX_RCUPDATE_H */
diff -urpNa -X dontdiff linux-2.6.22/kernel/Makefile linux-2.6.22-a-splitclassic/kernel/Makefile
--- linux-2.6.22/kernel/Makefile 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-a-splitclassic/kernel/Makefile 2007-07-19 12:16:03.000000000 -0700
@@ -6,7 +6,7 @@ obj-y = sched.o fork.o exec_domain.o
exit.o itimer.o time.o softirq.o resource.o \
sysctl.o capability.o ptrace.o timer.o user.o \
signal.o sys.o kmod.o workqueue.o pid.o \
- rcupdate.o extable.o params.o posix-timers.o \
+ rcupdate.o rcuclassic.o extable.o params.o posix-timers.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
hrtimer.o rwsem.o latency.o nsproxy.o srcu.o die_notifier.o

diff -urpNa -X dontdiff linux-2.6.22/kernel/rcuclassic.c linux-2.6.22-a-splitclassic/kernel/rcuclassic.c
--- linux-2.6.22/kernel/rcuclassic.c 1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.22-a-splitclassic/kernel/rcuclassic.c 2007-08-22 14:47:13.000000000 -0700
@@ -0,0 +1,558 @@
+/*
+ * Read-Copy Update mechanism for mutual exclusion
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright IBM Corporation, 2001
+ *
+ * Authors: Dipankar Sarma <[email protected]>
+ * Manfred Spraul <[email protected]>
+ *
+ * Based on the original work by Paul McKenney <[email protected]>
+ * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
+ * Papers:
+ * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
+ * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
+ *
+ * For detailed explanation of Read-Copy Update mechanism see -
+ * Documentation/RCU
+ *
+ */
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/spinlock.h>
+#include <linux/smp.h>
+#include <linux/rcupdate.h>
+#include <linux/interrupt.h>
+#include <linux/sched.h>
+#include <asm/atomic.h>
+#include <linux/bitops.h>
+#include <linux/module.h>
+#include <linux/completion.h>
+#include <linux/moduleparam.h>
+#include <linux/percpu.h>
+#include <linux/notifier.h>
+/* #include <linux/rcupdate.h> @@@ */
+#include <linux/cpu.h>
+#include <linux/mutex.h>
+
+/* Definition for rcupdate control block. */
+static struct rcu_ctrlblk rcu_ctrlblk = {
+ .cur = -300,
+ .completed = -300,
+ .lock = __SPIN_LOCK_UNLOCKED(&rcu_ctrlblk.lock),
+ .cpumask = CPU_MASK_NONE,
+};
+static struct rcu_ctrlblk rcu_bh_ctrlblk = {
+ .cur = -300,
+ .completed = -300,
+ .lock = __SPIN_LOCK_UNLOCKED(&rcu_bh_ctrlblk.lock),
+ .cpumask = CPU_MASK_NONE,
+};
+
+DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
+DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
+
+/* Fake initialization required by compiler */
+static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
+static int blimit = 10;
+static int qhimark = 10000;
+static int qlowmark = 100;
+
+#ifdef CONFIG_SMP
+static void force_quiescent_state(struct rcu_data *rdp,
+ struct rcu_ctrlblk *rcp)
+{
+ int cpu;
+ cpumask_t cpumask;
+ set_need_resched();
+ if (unlikely(!rcp->signaled)) {
+ rcp->signaled = 1;
+ /*
+ * Don't send IPI to itself. With irqs disabled,
+ * rdp->cpu is the current cpu.
+ */
+ cpumask = rcp->cpumask;
+ cpu_clear(rdp->cpu, cpumask);
+ for_each_cpu_mask(cpu, cpumask)
+ smp_send_reschedule(cpu);
+ }
+}
+#else
+static inline void force_quiescent_state(struct rcu_data *rdp,
+ struct rcu_ctrlblk *rcp)
+{
+ set_need_resched();
+}
+#endif
+
+/**
+ * call_rcu - Queue an RCU callback for invocation after a grace period.
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual update function to be invoked after the grace period
+ *
+ * The update function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * read-side critical sections have completed. RCU read-side critical
+ * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
+ * and may be nested.
+ */
+void fastcall call_rcu(struct rcu_head *head,
+ void (*func)(struct rcu_head *rcu))
+{
+ unsigned long flags;
+ struct rcu_data *rdp;
+
+ head->func = func;
+ head->next = NULL;
+ local_irq_save(flags);
+ rdp = &__get_cpu_var(rcu_data);
+ *rdp->nxttail = head;
+ rdp->nxttail = &head->next;
+ if (unlikely(++rdp->qlen > qhimark)) {
+ rdp->blimit = INT_MAX;
+ force_quiescent_state(rdp, &rcu_ctrlblk);
+ }
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(call_rcu);
+
+/**
+ * call_rcu_bh - Queue an RCU for invocation after a quicker grace period.
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual update function to be invoked after the grace period
+ *
+ * The update function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * read-side critical sections have completed. call_rcu_bh() assumes
+ * that the read-side critical sections end on completion of a softirq
+ * handler. This means that read-side critical sections in process
+ * context must not be interrupted by softirqs. This interface is to be
+ * used when most of the read-side critical sections are in softirq context.
+ * RCU read-side critical sections are delimited by rcu_read_lock() and
+ * rcu_read_unlock(), if in interrupt context, or rcu_read_lock_bh()
+ * and rcu_read_unlock_bh(), if in process context. These may be nested.
+ */
+void fastcall call_rcu_bh(struct rcu_head *head,
+ void (*func)(struct rcu_head *rcu))
+{
+ unsigned long flags;
+ struct rcu_data *rdp;
+
+ head->func = func;
+ head->next = NULL;
+ local_irq_save(flags);
+ rdp = &__get_cpu_var(rcu_bh_data);
+ *rdp->nxttail = head;
+ rdp->nxttail = &head->next;
+
+ if (unlikely(++rdp->qlen > qhimark)) {
+ rdp->blimit = INT_MAX;
+ force_quiescent_state(rdp, &rcu_bh_ctrlblk);
+ }
+
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(call_rcu_bh);
+
+/*
+ * Return the number of RCU batches processed thus far. Useful
+ * for debug and statistics.
+ */
+long rcu_batches_completed(void)
+{
+ return rcu_ctrlblk.completed;
+}
+EXPORT_SYMBOL_GPL(rcu_batches_completed);
+
+/*
+ * Return the number of RCU batches processed thus far. Useful
+ * for debug and statistics.
+ */
+long rcu_batches_completed_bh(void)
+{
+ return rcu_bh_ctrlblk.completed;
+}
+EXPORT_SYMBOL_GPL(rcu_batches_completed_bh);
+
+/*
+ * Invoke the completed RCU callbacks. They are expected to be in
+ * a per-cpu list.
+ */
+static void rcu_do_batch(struct rcu_data *rdp)
+{
+ struct rcu_head *next, *list;
+ int count = 0;
+
+ list = rdp->donelist;
+ while (list) {
+ next = list->next;
+ prefetch(next);
+ list->func(list);
+ list = next;
+ if (++count >= rdp->blimit)
+ break;
+ }
+ rdp->donelist = list;
+
+ local_irq_disable();
+ rdp->qlen -= count;
+ local_irq_enable();
+ if (rdp->blimit == INT_MAX && rdp->qlen <= qlowmark)
+ rdp->blimit = blimit;
+
+ if (!rdp->donelist)
+ rdp->donetail = &rdp->donelist;
+ else
+ tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu));
+}
+
+/*
+ * Grace period handling:
+ * The grace period handling consists out of two steps:
+ * - A new grace period is started.
+ * This is done by rcu_start_batch. The start is not broadcasted to
+ * all cpus, they must pick this up by comparing rcp->cur with
+ * rdp->quiescbatch. All cpus are recorded in the
+ * rcu_ctrlblk.cpumask bitmap.
+ * - All cpus must go through a quiescent state.
+ * Since the start of the grace period is not broadcasted, at least two
+ * calls to rcu_check_quiescent_state are required:
+ * The first call just notices that a new grace period is running. The
+ * following calls check if there was a quiescent state since the beginning
+ * of the grace period. If so, it updates rcu_ctrlblk.cpumask. If
+ * the bitmap is empty, then the grace period is completed.
+ * rcu_check_quiescent_state calls rcu_start_batch(0) to start the next grace
+ * period (if necessary).
+ */
+/*
+ * Register a new batch of callbacks, and start it up if there is currently no
+ * active batch and the batch to be registered has not already occurred.
+ * Caller must hold rcu_ctrlblk.lock.
+ */
+static void rcu_start_batch(struct rcu_ctrlblk *rcp)
+{
+ if (rcp->next_pending &&
+ rcp->completed == rcp->cur) {
+ rcp->next_pending = 0;
+ /*
+ * next_pending == 0 must be visible in
+ * __rcu_process_callbacks() before it can see new value of cur.
+ */
+ smp_wmb();
+ rcp->cur++;
+
+ /*
+ * Accessing nohz_cpu_mask before incrementing rcp->cur needs a
+ * Barrier Otherwise it can cause tickless idle CPUs to be
+ * included in rcp->cpumask, which will extend graceperiods
+ * unnecessarily.
+ */
+ smp_mb();
+ cpus_andnot(rcp->cpumask, cpu_online_map, nohz_cpu_mask);
+
+ rcp->signaled = 0;
+ }
+}
+
+/*
+ * cpu went through a quiescent state since the beginning of the grace period.
+ * Clear it from the cpu mask and complete the grace period if it was the last
+ * cpu. Start another grace period if someone has further entries pending
+ */
+static void cpu_quiet(int cpu, struct rcu_ctrlblk *rcp)
+{
+ cpu_clear(cpu, rcp->cpumask);
+ if (cpus_empty(rcp->cpumask)) {
+ /* batch completed ! */
+ rcp->completed = rcp->cur;
+ rcu_start_batch(rcp);
+ }
+}
+
+/*
+ * Check if the cpu has gone through a quiescent state (say context
+ * switch). If so and if it already hasn't done so in this RCU
+ * quiescent cycle, then indicate that it has done so.
+ */
+static void rcu_check_quiescent_state(struct rcu_ctrlblk *rcp,
+ struct rcu_data *rdp)
+{
+ if (rdp->quiescbatch != rcp->cur) {
+ /* start new grace period: */
+ rdp->qs_pending = 1;
+ rdp->passed_quiesc = 0;
+ rdp->quiescbatch = rcp->cur;
+ return;
+ }
+
+ /* Grace period already completed for this cpu?
+ * qs_pending is checked instead of the actual bitmap to avoid
+ * cacheline trashing.
+ */
+ if (!rdp->qs_pending)
+ return;
+
+ /*
+ * Was there a quiescent state since the beginning of the grace
+ * period? If no, then exit and wait for the next call.
+ */
+ if (!rdp->passed_quiesc)
+ return;
+ rdp->qs_pending = 0;
+
+ spin_lock(&rcp->lock);
+ /*
+ * rdp->quiescbatch/rcp->cur and the cpu bitmap can come out of sync
+ * during cpu startup. Ignore the quiescent state.
+ */
+ if (likely(rdp->quiescbatch == rcp->cur))
+ cpu_quiet(rdp->cpu, rcp);
+
+ spin_unlock(&rcp->lock);
+}
+
+
+#ifdef CONFIG_HOTPLUG_CPU
+
+/* warning! helper for rcu_offline_cpu. do not use elsewhere without reviewing
+ * locking requirements, the list it's pulling from has to belong to a cpu
+ * which is dead and hence not processing interrupts.
+ */
+static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
+ struct rcu_head **tail)
+{
+ local_irq_disable();
+ *this_rdp->nxttail = list;
+ if (list)
+ this_rdp->nxttail = tail;
+ local_irq_enable();
+}
+
+static void __rcu_offline_cpu(struct rcu_data *this_rdp,
+ struct rcu_ctrlblk *rcp, struct rcu_data *rdp)
+{
+ /* if the cpu going offline owns the grace period
+ * we can block indefinitely waiting for it, so flush
+ * it here
+ */
+ spin_lock_bh(&rcp->lock);
+ if (rcp->cur != rcp->completed)
+ cpu_quiet(rdp->cpu, rcp);
+ spin_unlock_bh(&rcp->lock);
+ rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail);
+ rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail);
+ rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail);
+}
+
+static void rcu_offline_cpu(int cpu)
+{
+ struct rcu_data *this_rdp = &get_cpu_var(rcu_data);
+ struct rcu_data *this_bh_rdp = &get_cpu_var(rcu_bh_data);
+
+ __rcu_offline_cpu(this_rdp, &rcu_ctrlblk,
+ &per_cpu(rcu_data, cpu));
+ __rcu_offline_cpu(this_bh_rdp, &rcu_bh_ctrlblk,
+ &per_cpu(rcu_bh_data, cpu));
+ put_cpu_var(rcu_data);
+ put_cpu_var(rcu_bh_data);
+ tasklet_kill_immediate(&per_cpu(rcu_tasklet, cpu), cpu);
+}
+
+#else
+
+static void rcu_offline_cpu(int cpu)
+{
+}
+
+#endif
+
+/*
+ * This does the RCU processing work from tasklet context.
+ */
+static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
+ struct rcu_data *rdp)
+{
+ if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
+ *rdp->donetail = rdp->curlist;
+ rdp->donetail = rdp->curtail;
+ rdp->curlist = NULL;
+ rdp->curtail = &rdp->curlist;
+ }
+
+ if (rdp->nxtlist && !rdp->curlist) {
+ local_irq_disable();
+ rdp->curlist = rdp->nxtlist;
+ rdp->curtail = rdp->nxttail;
+ rdp->nxtlist = NULL;
+ rdp->nxttail = &rdp->nxtlist;
+ local_irq_enable();
+
+ /*
+ * start the next batch of callbacks
+ */
+
+ /* determine batch number */
+ rdp->batch = rcp->cur + 1;
+ /* see the comment and corresponding wmb() in
+ * the rcu_start_batch()
+ */
+ smp_rmb();
+
+ if (!rcp->next_pending) {
+ /* and start it/schedule start if it's a new batch */
+ spin_lock(&rcp->lock);
+ rcp->next_pending = 1;
+ rcu_start_batch(rcp);
+ spin_unlock(&rcp->lock);
+ }
+ }
+
+ rcu_check_quiescent_state(rcp, rdp);
+ if (rdp->donelist)
+ rcu_do_batch(rdp);
+}
+
+static void rcu_process_callbacks(unsigned long unused)
+{
+ __rcu_process_callbacks(&rcu_ctrlblk, &__get_cpu_var(rcu_data));
+ __rcu_process_callbacks(&rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data));
+}
+
+static int __rcu_pending(struct rcu_ctrlblk *rcp, struct rcu_data *rdp)
+{
+ /* This cpu has pending rcu entries and the grace period
+ * for them has completed.
+ */
+ if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch))
+ return 1;
+
+ /* This cpu has no pending entries, but there are new entries */
+ if (!rdp->curlist && rdp->nxtlist)
+ return 1;
+
+ /* This cpu has finished callbacks to invoke */
+ if (rdp->donelist)
+ return 1;
+
+ /* The rcu core waits for a quiescent state from the cpu */
+ if (rdp->quiescbatch != rcp->cur || rdp->qs_pending)
+ return 1;
+
+ /* nothing to do */
+ return 0;
+}
+
+/*
+ * Check to see if there is any immediate RCU-related work to be done
+ * by the current CPU, returning 1 if so. This function is part of the
+ * RCU implementation; it is -not- an exported member of the RCU API.
+ */
+int rcu_pending(int cpu)
+{
+ return __rcu_pending(&rcu_ctrlblk, &per_cpu(rcu_data, cpu)) ||
+ __rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu));
+}
+
+/*
+ * Check to see if any future RCU-related work will need to be done
+ * by the current CPU, even if none need be done immediately, returning
+ * 1 if so. This function is part of the RCU implementation; it is -not-
+ * an exported member of the RCU API.
+ */
+int rcu_needs_cpu(int cpu)
+{
+ struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
+ struct rcu_data *rdp_bh = &per_cpu(rcu_bh_data, cpu);
+
+ return (!!rdp->curlist || !!rdp_bh->curlist || rcu_pending(cpu));
+}
+
+void rcu_check_callbacks(int cpu, int user)
+{
+ if (user ||
+ (idle_cpu(cpu) && !in_softirq() &&
+ hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
+ rcu_qsctr_inc(cpu);
+ rcu_bh_qsctr_inc(cpu);
+ } else if (!in_softirq())
+ rcu_bh_qsctr_inc(cpu);
+ tasklet_schedule(&per_cpu(rcu_tasklet, cpu));
+}
+
+static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
+ struct rcu_data *rdp)
+{
+ memset(rdp, 0, sizeof(*rdp));
+ rdp->curtail = &rdp->curlist;
+ rdp->nxttail = &rdp->nxtlist;
+ rdp->donetail = &rdp->donelist;
+ rdp->quiescbatch = rcp->completed;
+ rdp->qs_pending = 0;
+ rdp->cpu = cpu;
+ rdp->blimit = blimit;
+}
+
+static void __devinit rcu_online_cpu(int cpu)
+{
+ struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
+ struct rcu_data *bh_rdp = &per_cpu(rcu_bh_data, cpu);
+
+ rcu_init_percpu_data(cpu, &rcu_ctrlblk, rdp);
+ rcu_init_percpu_data(cpu, &rcu_bh_ctrlblk, bh_rdp);
+ tasklet_init(&per_cpu(rcu_tasklet, cpu), rcu_process_callbacks, 0UL);
+}
+
+static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
+ unsigned long action, void *hcpu)
+{
+ long cpu = (long)hcpu;
+ switch (action) {
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ rcu_online_cpu(cpu);
+ break;
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ rcu_offline_cpu(cpu);
+ break;
+ default:
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata rcu_nb = {
+ .notifier_call = rcu_cpu_notify,
+};
+
+/*
+ * Initializes rcu mechanism. Assumed to be called early.
+ * That is before local timer(SMP) or jiffie timer (uniproc) is setup.
+ * Note that rcu_qsctr and friends are implicitly
+ * initialized due to the choice of ``0'' for RCU_CTR_INVALID.
+ */
+void __init __rcu_init(void)
+{
+ rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE,
+ (void *)(long)smp_processor_id());
+ /* Register notifier for non-boot CPUs */
+ register_cpu_notifier(&rcu_nb);
+}
+
+module_param(blimit, int, 0);
+module_param(qhimark, int, 0);
+module_param(qlowmark, int, 0);
diff -urpNa -X dontdiff linux-2.6.22/kernel/rcupdate.c linux-2.6.22-a-splitclassic/kernel/rcupdate.c
--- linux-2.6.22/kernel/rcupdate.c 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-a-splitclassic/kernel/rcupdate.c 2007-08-22 14:47:59.000000000 -0700
@@ -15,7 +15,7 @@
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
*
- * Copyright (C) IBM Corporation, 2001
+ * Copyright IBM Corporation, 2001
*
* Authors: Dipankar Sarma <[email protected]>
* Manfred Spraul <[email protected]>
@@ -35,164 +35,65 @@
#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/smp.h>
-#include <linux/rcupdate.h>
#include <linux/interrupt.h>
#include <linux/sched.h>
#include <asm/atomic.h>
#include <linux/bitops.h>
-#include <linux/module.h>
#include <linux/completion.h>
-#include <linux/moduleparam.h>
#include <linux/percpu.h>
#include <linux/notifier.h>
#include <linux/rcupdate.h>
#include <linux/cpu.h>
#include <linux/mutex.h>
+#include <linux/module.h>

-/* Definition for rcupdate control block. */
-static struct rcu_ctrlblk rcu_ctrlblk = {
- .cur = -300,
- .completed = -300,
- .lock = __SPIN_LOCK_UNLOCKED(&rcu_ctrlblk.lock),
- .cpumask = CPU_MASK_NONE,
-};
-static struct rcu_ctrlblk rcu_bh_ctrlblk = {
- .cur = -300,
- .completed = -300,
- .lock = __SPIN_LOCK_UNLOCKED(&rcu_bh_ctrlblk.lock),
- .cpumask = CPU_MASK_NONE,
+struct rcu_synchronize {
+ struct rcu_head head;
+ struct completion completion;
};

-DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
-DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
-
-/* Fake initialization required by compiler */
-static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
-static int blimit = 10;
-static int qhimark = 10000;
-static int qlowmark = 100;
-
+static DEFINE_PER_CPU(struct rcu_head, rcu_barrier_head) = {NULL};
static atomic_t rcu_barrier_cpu_count;
static DEFINE_MUTEX(rcu_barrier_mutex);
static struct completion rcu_barrier_completion;

-#ifdef CONFIG_SMP
-static void force_quiescent_state(struct rcu_data *rdp,
- struct rcu_ctrlblk *rcp)
-{
- int cpu;
- cpumask_t cpumask;
- set_need_resched();
- if (unlikely(!rcp->signaled)) {
- rcp->signaled = 1;
- /*
- * Don't send IPI to itself. With irqs disabled,
- * rdp->cpu is the current cpu.
- */
- cpumask = rcp->cpumask;
- cpu_clear(rdp->cpu, cpumask);
- for_each_cpu_mask(cpu, cpumask)
- smp_send_reschedule(cpu);
- }
-}
-#else
-static inline void force_quiescent_state(struct rcu_data *rdp,
- struct rcu_ctrlblk *rcp)
+/* Because of FASTCALL declaration of complete, we use this wrapper */
+static void wakeme_after_rcu(struct rcu_head *head)
{
- set_need_resched();
+ struct rcu_synchronize *rcu;
+
+ rcu = container_of(head, struct rcu_synchronize, head);
+ complete(&rcu->completion);
}
-#endif

/**
- * call_rcu - Queue an RCU callback for invocation after a grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual update function to be invoked after the grace period
+ * synchronize_rcu - wait until a grace period has elapsed.
*
- * The update function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
+ * Control will return to the caller some time after a full grace
+ * period has elapsed, in other words after all currently executing RCU
* read-side critical sections have completed. RCU read-side critical
* sections are delimited by rcu_read_lock() and rcu_read_unlock(),
* and may be nested.
*/
-void fastcall call_rcu(struct rcu_head *head,
- void (*func)(struct rcu_head *rcu))
-{
- unsigned long flags;
- struct rcu_data *rdp;
-
- head->func = func;
- head->next = NULL;
- local_irq_save(flags);
- rdp = &__get_cpu_var(rcu_data);
- *rdp->nxttail = head;
- rdp->nxttail = &head->next;
- if (unlikely(++rdp->qlen > qhimark)) {
- rdp->blimit = INT_MAX;
- force_quiescent_state(rdp, &rcu_ctrlblk);
- }
- local_irq_restore(flags);
-}
-
-/**
- * call_rcu_bh - Queue an RCU for invocation after a quicker grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual update function to be invoked after the grace period
- *
- * The update function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed. call_rcu_bh() assumes
- * that the read-side critical sections end on completion of a softirq
- * handler. This means that read-side critical sections in process
- * context must not be interrupted by softirqs. This interface is to be
- * used when most of the read-side critical sections are in softirq context.
- * RCU read-side critical sections are delimited by rcu_read_lock() and
- * rcu_read_unlock(), if in interrupt context, or rcu_read_lock_bh()
- * and rcu_read_unlock_bh(), if in process context. These may be nested.
- */
-void fastcall call_rcu_bh(struct rcu_head *head,
- void (*func)(struct rcu_head *rcu))
+void synchronize_rcu(void)
{
- unsigned long flags;
- struct rcu_data *rdp;
-
- head->func = func;
- head->next = NULL;
- local_irq_save(flags);
- rdp = &__get_cpu_var(rcu_bh_data);
- *rdp->nxttail = head;
- rdp->nxttail = &head->next;
-
- if (unlikely(++rdp->qlen > qhimark)) {
- rdp->blimit = INT_MAX;
- force_quiescent_state(rdp, &rcu_bh_ctrlblk);
- }
-
- local_irq_restore(flags);
-}
+ struct rcu_synchronize rcu;

-/*
- * Return the number of RCU batches processed thus far. Useful
- * for debug and statistics.
- */
-long rcu_batches_completed(void)
-{
- return rcu_ctrlblk.completed;
-}
+ init_completion(&rcu.completion);
+ /* Will wake me after RCU finished */
+ call_rcu(&rcu.head, wakeme_after_rcu);

-/*
- * Return the number of RCU batches processed thus far. Useful
- * for debug and statistics.
- */
-long rcu_batches_completed_bh(void)
-{
- return rcu_bh_ctrlblk.completed;
+ /* Wait for it */
+ wait_for_completion(&rcu.completion);
}
+EXPORT_SYMBOL_GPL(synchronize_rcu);

static void rcu_barrier_callback(struct rcu_head *notused)
{
if (atomic_dec_and_test(&rcu_barrier_cpu_count))
complete(&rcu_barrier_completion);
}
+EXPORT_SYMBOL_GPL(rcu_barrier);

/*
* Called with preemption disabled, and from cross-cpu IRQ context.
@@ -200,10 +101,8 @@ static void rcu_barrier_callback(struct
static void rcu_barrier_func(void *notused)
{
int cpu = smp_processor_id();
- struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
- struct rcu_head *head;
+ struct rcu_head *head = &per_cpu(rcu_barrier_head, cpu);

- head = &rdp->barrier;
atomic_inc(&rcu_barrier_cpu_count);
call_rcu(head, rcu_barrier_callback);
}
@@ -222,416 +121,8 @@ void rcu_barrier(void)
wait_for_completion(&rcu_barrier_completion);
mutex_unlock(&rcu_barrier_mutex);
}
-EXPORT_SYMBOL_GPL(rcu_barrier);
-
-/*
- * Invoke the completed RCU callbacks. They are expected to be in
- * a per-cpu list.
- */
-static void rcu_do_batch(struct rcu_data *rdp)
-{
- struct rcu_head *next, *list;
- int count = 0;
-
- list = rdp->donelist;
- while (list) {
- next = list->next;
- prefetch(next);
- list->func(list);
- list = next;
- if (++count >= rdp->blimit)
- break;
- }
- rdp->donelist = list;
-
- local_irq_disable();
- rdp->qlen -= count;
- local_irq_enable();
- if (rdp->blimit == INT_MAX && rdp->qlen <= qlowmark)
- rdp->blimit = blimit;
-
- if (!rdp->donelist)
- rdp->donetail = &rdp->donelist;
- else
- tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu));
-}
-
-/*
- * Grace period handling:
- * The grace period handling consists out of two steps:
- * - A new grace period is started.
- * This is done by rcu_start_batch. The start is not broadcasted to
- * all cpus, they must pick this up by comparing rcp->cur with
- * rdp->quiescbatch. All cpus are recorded in the
- * rcu_ctrlblk.cpumask bitmap.
- * - All cpus must go through a quiescent state.
- * Since the start of the grace period is not broadcasted, at least two
- * calls to rcu_check_quiescent_state are required:
- * The first call just notices that a new grace period is running. The
- * following calls check if there was a quiescent state since the beginning
- * of the grace period. If so, it updates rcu_ctrlblk.cpumask. If
- * the bitmap is empty, then the grace period is completed.
- * rcu_check_quiescent_state calls rcu_start_batch(0) to start the next grace
- * period (if necessary).
- */
-/*
- * Register a new batch of callbacks, and start it up if there is currently no
- * active batch and the batch to be registered has not already occurred.
- * Caller must hold rcu_ctrlblk.lock.
- */
-static void rcu_start_batch(struct rcu_ctrlblk *rcp)
-{
- if (rcp->next_pending &&
- rcp->completed == rcp->cur) {
- rcp->next_pending = 0;
- /*
- * next_pending == 0 must be visible in
- * __rcu_process_callbacks() before it can see new value of cur.
- */
- smp_wmb();
- rcp->cur++;
-
- /*
- * Accessing nohz_cpu_mask before incrementing rcp->cur needs a
- * Barrier Otherwise it can cause tickless idle CPUs to be
- * included in rcp->cpumask, which will extend graceperiods
- * unnecessarily.
- */
- smp_mb();
- cpus_andnot(rcp->cpumask, cpu_online_map, nohz_cpu_mask);
-
- rcp->signaled = 0;
- }
-}
-
-/*
- * cpu went through a quiescent state since the beginning of the grace period.
- * Clear it from the cpu mask and complete the grace period if it was the last
- * cpu. Start another grace period if someone has further entries pending
- */
-static void cpu_quiet(int cpu, struct rcu_ctrlblk *rcp)
-{
- cpu_clear(cpu, rcp->cpumask);
- if (cpus_empty(rcp->cpumask)) {
- /* batch completed ! */
- rcp->completed = rcp->cur;
- rcu_start_batch(rcp);
- }
-}
-
-/*
- * Check if the cpu has gone through a quiescent state (say context
- * switch). If so and if it already hasn't done so in this RCU
- * quiescent cycle, then indicate that it has done so.
- */
-static void rcu_check_quiescent_state(struct rcu_ctrlblk *rcp,
- struct rcu_data *rdp)
-{
- if (rdp->quiescbatch != rcp->cur) {
- /* start new grace period: */
- rdp->qs_pending = 1;
- rdp->passed_quiesc = 0;
- rdp->quiescbatch = rcp->cur;
- return;
- }
-
- /* Grace period already completed for this cpu?
- * qs_pending is checked instead of the actual bitmap to avoid
- * cacheline trashing.
- */
- if (!rdp->qs_pending)
- return;
-
- /*
- * Was there a quiescent state since the beginning of the grace
- * period? If no, then exit and wait for the next call.
- */
- if (!rdp->passed_quiesc)
- return;
- rdp->qs_pending = 0;
-
- spin_lock(&rcp->lock);
- /*
- * rdp->quiescbatch/rcp->cur and the cpu bitmap can come out of sync
- * during cpu startup. Ignore the quiescent state.
- */
- if (likely(rdp->quiescbatch == rcp->cur))
- cpu_quiet(rdp->cpu, rcp);
-
- spin_unlock(&rcp->lock);
-}
-
-
-#ifdef CONFIG_HOTPLUG_CPU
-
-/* warning! helper for rcu_offline_cpu. do not use elsewhere without reviewing
- * locking requirements, the list it's pulling from has to belong to a cpu
- * which is dead and hence not processing interrupts.
- */
-static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
- struct rcu_head **tail)
-{
- local_irq_disable();
- *this_rdp->nxttail = list;
- if (list)
- this_rdp->nxttail = tail;
- local_irq_enable();
-}
-
-static void __rcu_offline_cpu(struct rcu_data *this_rdp,
- struct rcu_ctrlblk *rcp, struct rcu_data *rdp)
-{
- /* if the cpu going offline owns the grace period
- * we can block indefinitely waiting for it, so flush
- * it here
- */
- spin_lock_bh(&rcp->lock);
- if (rcp->cur != rcp->completed)
- cpu_quiet(rdp->cpu, rcp);
- spin_unlock_bh(&rcp->lock);
- rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail);
- rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail);
- rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail);
-}
-
-static void rcu_offline_cpu(int cpu)
-{
- struct rcu_data *this_rdp = &get_cpu_var(rcu_data);
- struct rcu_data *this_bh_rdp = &get_cpu_var(rcu_bh_data);
-
- __rcu_offline_cpu(this_rdp, &rcu_ctrlblk,
- &per_cpu(rcu_data, cpu));
- __rcu_offline_cpu(this_bh_rdp, &rcu_bh_ctrlblk,
- &per_cpu(rcu_bh_data, cpu));
- put_cpu_var(rcu_data);
- put_cpu_var(rcu_bh_data);
- tasklet_kill_immediate(&per_cpu(rcu_tasklet, cpu), cpu);
-}
-
-#else
-
-static void rcu_offline_cpu(int cpu)
-{
-}
-
-#endif
-
-/*
- * This does the RCU processing work from tasklet context.
- */
-static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
- struct rcu_data *rdp)
-{
- if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
- *rdp->donetail = rdp->curlist;
- rdp->donetail = rdp->curtail;
- rdp->curlist = NULL;
- rdp->curtail = &rdp->curlist;
- }
-
- if (rdp->nxtlist && !rdp->curlist) {
- local_irq_disable();
- rdp->curlist = rdp->nxtlist;
- rdp->curtail = rdp->nxttail;
- rdp->nxtlist = NULL;
- rdp->nxttail = &rdp->nxtlist;
- local_irq_enable();
-
- /*
- * start the next batch of callbacks
- */
-
- /* determine batch number */
- rdp->batch = rcp->cur + 1;
- /* see the comment and corresponding wmb() in
- * the rcu_start_batch()
- */
- smp_rmb();
-
- if (!rcp->next_pending) {
- /* and start it/schedule start if it's a new batch */
- spin_lock(&rcp->lock);
- rcp->next_pending = 1;
- rcu_start_batch(rcp);
- spin_unlock(&rcp->lock);
- }
- }
-
- rcu_check_quiescent_state(rcp, rdp);
- if (rdp->donelist)
- rcu_do_batch(rdp);
-}
-
-static void rcu_process_callbacks(unsigned long unused)
-{
- __rcu_process_callbacks(&rcu_ctrlblk, &__get_cpu_var(rcu_data));
- __rcu_process_callbacks(&rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data));
-}
-
-static int __rcu_pending(struct rcu_ctrlblk *rcp, struct rcu_data *rdp)
-{
- /* This cpu has pending rcu entries and the grace period
- * for them has completed.
- */
- if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch))
- return 1;
-
- /* This cpu has no pending entries, but there are new entries */
- if (!rdp->curlist && rdp->nxtlist)
- return 1;
-
- /* This cpu has finished callbacks to invoke */
- if (rdp->donelist)
- return 1;
-
- /* The rcu core waits for a quiescent state from the cpu */
- if (rdp->quiescbatch != rcp->cur || rdp->qs_pending)
- return 1;
-
- /* nothing to do */
- return 0;
-}
-
-/*
- * Check to see if there is any immediate RCU-related work to be done
- * by the current CPU, returning 1 if so. This function is part of the
- * RCU implementation; it is -not- an exported member of the RCU API.
- */
-int rcu_pending(int cpu)
-{
- return __rcu_pending(&rcu_ctrlblk, &per_cpu(rcu_data, cpu)) ||
- __rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu));
-}
-
-/*
- * Check to see if any future RCU-related work will need to be done
- * by the current CPU, even if none need be done immediately, returning
- * 1 if so. This function is part of the RCU implementation; it is -not-
- * an exported member of the RCU API.
- */
-int rcu_needs_cpu(int cpu)
-{
- struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
- struct rcu_data *rdp_bh = &per_cpu(rcu_bh_data, cpu);
-
- return (!!rdp->curlist || !!rdp_bh->curlist || rcu_pending(cpu));
-}
-
-void rcu_check_callbacks(int cpu, int user)
-{
- if (user ||
- (idle_cpu(cpu) && !in_softirq() &&
- hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
- rcu_qsctr_inc(cpu);
- rcu_bh_qsctr_inc(cpu);
- } else if (!in_softirq())
- rcu_bh_qsctr_inc(cpu);
- tasklet_schedule(&per_cpu(rcu_tasklet, cpu));
-}
-
-static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
- struct rcu_data *rdp)
-{
- memset(rdp, 0, sizeof(*rdp));
- rdp->curtail = &rdp->curlist;
- rdp->nxttail = &rdp->nxtlist;
- rdp->donetail = &rdp->donelist;
- rdp->quiescbatch = rcp->completed;
- rdp->qs_pending = 0;
- rdp->cpu = cpu;
- rdp->blimit = blimit;
-}
-
-static void __devinit rcu_online_cpu(int cpu)
-{
- struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
- struct rcu_data *bh_rdp = &per_cpu(rcu_bh_data, cpu);
-
- rcu_init_percpu_data(cpu, &rcu_ctrlblk, rdp);
- rcu_init_percpu_data(cpu, &rcu_bh_ctrlblk, bh_rdp);
- tasklet_init(&per_cpu(rcu_tasklet, cpu), rcu_process_callbacks, 0UL);
-}
-
-static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
- unsigned long action, void *hcpu)
-{
- long cpu = (long)hcpu;
- switch (action) {
- case CPU_UP_PREPARE:
- case CPU_UP_PREPARE_FROZEN:
- rcu_online_cpu(cpu);
- break;
- case CPU_DEAD:
- case CPU_DEAD_FROZEN:
- rcu_offline_cpu(cpu);
- break;
- default:
- break;
- }
- return NOTIFY_OK;
-}

-static struct notifier_block __cpuinitdata rcu_nb = {
- .notifier_call = rcu_cpu_notify,
-};
-
-/*
- * Initializes rcu mechanism. Assumed to be called early.
- * That is before local timer(SMP) or jiffie timer (uniproc) is setup.
- * Note that rcu_qsctr and friends are implicitly
- * initialized due to the choice of ``0'' for RCU_CTR_INVALID.
- */
void __init rcu_init(void)
{
- rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE,
- (void *)(long)smp_processor_id());
- /* Register notifier for non-boot CPUs */
- register_cpu_notifier(&rcu_nb);
-}
-
-struct rcu_synchronize {
- struct rcu_head head;
- struct completion completion;
-};
-
-/* Because of FASTCALL declaration of complete, we use this wrapper */
-static void wakeme_after_rcu(struct rcu_head *head)
-{
- struct rcu_synchronize *rcu;
-
- rcu = container_of(head, struct rcu_synchronize, head);
- complete(&rcu->completion);
-}
-
-/**
- * synchronize_rcu - wait until a grace period has elapsed.
- *
- * Control will return to the caller some time after a full grace
- * period has elapsed, in other words after all currently executing RCU
- * read-side critical sections have completed. RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
- * and may be nested.
- *
- * If your read-side code is not protected by rcu_read_lock(), do -not-
- * use synchronize_rcu().
- */
-void synchronize_rcu(void)
-{
- struct rcu_synchronize rcu;
-
- init_completion(&rcu.completion);
- /* Will wake me after RCU finished */
- call_rcu(&rcu.head, wakeme_after_rcu);
-
- /* Wait for it */
- wait_for_completion(&rcu.completion);
+ __rcu_init();
}
-
-module_param(blimit, int, 0);
-module_param(qhimark, int, 0);
-module_param(qlowmark, int, 0);
-EXPORT_SYMBOL_GPL(rcu_batches_completed);
-EXPORT_SYMBOL_GPL(rcu_batches_completed_bh);
-EXPORT_SYMBOL_GPL(call_rcu);
-EXPORT_SYMBOL_GPL(call_rcu_bh);
-EXPORT_SYMBOL_GPL(synchronize_rcu);

2007-09-10 18:33:21

by Paul E. McKenney

Subject: [PATCH RFC 2/9] RCU: Fix barriers

Work in progress, not for inclusion.

Fix rcu_barrier() to work properly in a preemptible kernel environment.
Also, the ordering of callbacks must be preserved while moving
callbacks to another CPU during CPU hotplug.
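
As background (illustration only, not part of this patch): rcu_barrier()
is typically invoked on module-unload paths to wait until every callback
previously queued with call_rcu() has been invoked, so that the callback
functions cannot run after the module text is gone. The cleanup path
below is hypothetical, reuses the struct foo sketched under patch 1/9,
and assumes the usual <linux/module.h> and <linux/slab.h> includes.

	static void free_foo_rcu(struct rcu_head *head)
	{
		kfree(container_of(head, struct foo, rcu));
	}

	static void __exit foo_exit(void)
	{
		/*
		 * ... unpublish all struct foo instances and hand the
		 * final frees to call_rcu(&p->rcu, free_foo_rcu) ...
		 */
		rcu_barrier();	/* wait for callbacks queued on all CPUs */
	}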

Signed-off-by: Dipankar Sarma <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---

rcuclassic.c | 2 +-
rcupdate.c | 10 ++++++++++
2 files changed, 11 insertions(+), 1 deletion(-)

diff -urpNa -X dontdiff linux-2.6.22-a-splitclassic/kernel/rcuclassic.c linux-2.6.22-b-fixbarriers/kernel/rcuclassic.c
--- linux-2.6.22-a-splitclassic/kernel/rcuclassic.c 2007-07-19 15:03:51.000000000 -0700
+++ linux-2.6.22-b-fixbarriers/kernel/rcuclassic.c 2007-07-19 17:10:46.000000000 -0700
@@ -349,9 +349,9 @@ static void __rcu_offline_cpu(struct rcu
if (rcp->cur != rcp->completed)
cpu_quiet(rdp->cpu, rcp);
spin_unlock_bh(&rcp->lock);
+ rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail);
rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail);
rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail);
- rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail);
}

static void rcu_offline_cpu(int cpu)
diff -urpNa -X dontdiff linux-2.6.22-a-splitclassic/kernel/rcupdate.c linux-2.6.22-b-fixbarriers/kernel/rcupdate.c
--- linux-2.6.22-a-splitclassic/kernel/rcupdate.c 2007-07-19 14:19:03.000000000 -0700
+++ linux-2.6.22-b-fixbarriers/kernel/rcupdate.c 2007-07-19 17:13:31.000000000 -0700
@@ -115,7 +115,17 @@ void rcu_barrier(void)
mutex_lock(&rcu_barrier_mutex);
init_completion(&rcu_barrier_completion);
atomic_set(&rcu_barrier_cpu_count, 0);
+ /*
+ * The queueing of callbacks in all CPUs must be atomic with
+ * respect to RCU, otherwise one CPU may queue a callback,
+ * wait for a grace period, decrement barrier count and call
+ * complete(), while other CPUs have not yet queued anything.
+ * So, we need to make sure that grace periods cannot complete
+ * until all the callbacks are queued.
+ */
+ rcu_read_lock();
on_each_cpu(rcu_barrier_func, NULL, 0, 1);
+ rcu_read_unlock();
wait_for_completion(&rcu_barrier_completion);
mutex_unlock(&rcu_barrier_mutex);
}

2007-09-10 18:34:28

by Paul E. McKenney

Subject: [PATCH RFC 3/9] RCU: Preemptible RCU

Work in progress, not for inclusion.

This patch implements a new version of RCU which allows its read-side
critical sections to be preempted. It uses a set of counter pairs
to keep track of the read-side critical sections, and flips them
once all tasks have exited their read-side critical sections. The
details of this implementation can be found in this paper -

http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf
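
For orientation, a deliberately simplified sketch of the counter-pair
idea (this is not the code in this patch): each reader increments the
"current" counter of the pair on entry and decrements that same counter
on exit, while the grace-period machinery flips which counter new
readers use and then waits for the old counter to drain to zero. The
task_struct fields (example_nesting, example_flip) and function names
below are made up, and this naive version ignores the race between
sampling the flip and incrementing the counter; the real implementation
closes that window with per-CPU counter pairs, memory barriers, and a
multi-stage flip state machine, as described in the paper.

	static atomic_t ctr[2];		/* the counter pair */
	static int flip;		/* which counter new readers use */

	static void example_read_lock(struct task_struct *t)
	{
		if (t->example_nesting++ == 0) {
			t->example_flip = flip;	/* remember our side */
			atomic_inc(&ctr[t->example_flip]);
		}
	}

	static void example_read_unlock(struct task_struct *t)
	{
		if (--t->example_nesting == 0)
			atomic_dec(&ctr[t->example_flip]);
	}

	static void example_wait_for_readers(void)
	{
		int old = flip;

		flip = !flip;	/* new readers use the other counter */
		while (atomic_read(&ctr[old]) != 0)
			schedule_timeout_uninterruptible(1);
	}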

This patch was developed as part of the -rt kernel development and is
meant to provide better latencies, since RCU read-side critical
sections no longer disable preemption. As a consequence of keeping
track of RCU readers, the readers incur a slight overhead (the paper
describes possible optimizations). This implementation co-exists with
the "classic" RCU implementation and can be selected at compile time.

Also includes RCU tracing summarized in debugfs and RCU_SOFTIRQ for
the preemptible variant of RCU.

Signed-off-by: Dipankar Sarma <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]> (for RCU_SOFTIRQ)
Signed-off-by: Paul McKenney <[email protected]>
---

include/linux/interrupt.h | 1
include/linux/rcuclassic.h | 2
include/linux/rcupdate.h | 7
include/linux/rcupreempt.h | 78 +++
include/linux/rcupreempt_trace.h | 100 +++++
include/linux/sched.h | 5
kernel/Kconfig.preempt | 38 +
kernel/Makefile | 7
kernel/fork.c | 4
kernel/rcupreempt.c | 767 +++++++++++++++++++++++++++++++++++++++
kernel/rcupreempt_trace.c | 330 ++++++++++++++++
11 files changed, 1336 insertions(+), 3 deletions(-)

diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/interrupt.h linux-2.6.22-c-preemptrcu/include/linux/interrupt.h
--- linux-2.6.22-b-fixbarriers/include/linux/interrupt.h 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-c-preemptrcu/include/linux/interrupt.h 2007-08-22 15:21:06.000000000 -0700
@@ -269,6 +269,7 @@ enum
#ifdef CONFIG_HIGH_RES_TIMERS
HRTIMER_SOFTIRQ,
#endif
+ RCU_SOFTIRQ, /* Preferable RCU should always be the last softirq */
};

/* softirq mask and active fields moved to irq_cpustat_t in
diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcuclassic.h linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h
--- linux-2.6.22-b-fixbarriers/include/linux/rcuclassic.h 2007-08-22 14:42:23.000000000 -0700
+++ linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h 2007-08-22 15:21:06.000000000 -0700
@@ -142,8 +142,6 @@ extern int rcu_needs_cpu(int cpu);
extern void __rcu_init(void);
extern void rcu_check_callbacks(int cpu, int user);
extern void rcu_restart_cpu(int cpu);
-extern long rcu_batches_completed(void);
-extern long rcu_batches_completed_bh(void);

#endif /* __KERNEL__ */
#endif /* __LINUX_RCUCLASSIC_H */
diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h
--- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 2007-07-19 14:02:36.000000000 -0700
+++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h 2007-08-22 15:21:06.000000000 -0700
@@ -52,7 +52,11 @@ struct rcu_head {
void (*func)(struct rcu_head *head);
};

+#ifdef CONFIG_CLASSIC_RCU
#include <linux/rcuclassic.h>
+#else /* #ifdef CONFIG_CLASSIC_RCU */
+#include <linux/rcupreempt.h>
+#endif /* #else #ifdef CONFIG_CLASSIC_RCU */

#define RCU_HEAD_INIT { .next = NULL, .func = NULL }
#define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT
@@ -218,10 +222,13 @@ extern void FASTCALL(call_rcu_bh(struct
/* Exported common interfaces */
extern void synchronize_rcu(void);
extern void rcu_barrier(void);
+extern long rcu_batches_completed(void);
+extern long rcu_batches_completed_bh(void);

/* Internal to kernel */
extern void rcu_init(void);
extern void rcu_check_callbacks(int cpu, int user);
+extern int rcu_needs_cpu(int cpu);

#endif /* __KERNEL__ */
#endif /* __LINUX_RCUPDATE_H */
diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h
--- linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h 1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h 2007-08-22 15:21:06.000000000 -0700
@@ -0,0 +1,78 @@
+/*
+ * Read-Copy Update mechanism for mutual exclusion (RT implementation)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2006
+ *
+ * Author: Paul McKenney <[email protected]>
+ *
+ * Based on the original work by Paul McKenney <[email protected]>
+ * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
+ * Papers:
+ * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
+ * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
+ *
+ * For detailed explanation of Read-Copy Update mechanism see -
+ * Documentation/RCU
+ *
+ */
+
+#ifndef __LINUX_RCUPREEMPT_H
+#define __LINUX_RCUPREEMPT_H
+
+#ifdef __KERNEL__
+
+#include <linux/cache.h>
+#include <linux/spinlock.h>
+#include <linux/threads.h>
+#include <linux/percpu.h>
+#include <linux/cpumask.h>
+#include <linux/seqlock.h>
+
+#define rcu_qsctr_inc(cpu)
+#define rcu_bh_qsctr_inc(cpu)
+#define call_rcu_bh(head, rcu) call_rcu(head, rcu)
+
+extern void __rcu_read_lock(void);
+extern void __rcu_read_unlock(void);
+extern int rcu_pending(int cpu);
+extern int rcu_needs_cpu(int cpu);
+
+#define __rcu_read_lock_bh() { rcu_read_lock(); local_bh_disable(); }
+#define __rcu_read_unlock_bh() { local_bh_enable(); rcu_read_unlock(); }
+
+#define __rcu_read_lock_nesting() (current->rcu_read_lock_nesting)
+
+extern void __synchronize_sched(void);
+
+extern void __rcu_init(void);
+extern void rcu_check_callbacks(int cpu, int user);
+extern void rcu_restart_cpu(int cpu);
+
+#ifdef CONFIG_RCU_TRACE
+struct rcupreempt_trace;
+extern int *rcupreempt_flipctr(int cpu);
+extern long rcupreempt_data_completed(void);
+extern int rcupreempt_flip_flag(int cpu);
+extern int rcupreempt_mb_flag(int cpu);
+extern char *rcupreempt_try_flip_state_name(void);
+extern struct rcupreempt_trace *rcupreempt_trace_cpu(int cpu);
+#endif
+
+struct softirq_action;
+
+#endif /* __KERNEL__ */
+#endif /* __LINUX_RCUPREEMPT_H */
diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupreempt_trace.h linux-2.6.22-c-preemptrcu/include/linux/rcupreempt_trace.h
--- linux-2.6.22-b-fixbarriers/include/linux/rcupreempt_trace.h 1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.22-c-preemptrcu/include/linux/rcupreempt_trace.h 2007-08-22 15:21:06.000000000 -0700
@@ -0,0 +1,100 @@
+/*
+ * Read-Copy Update mechanism for mutual exclusion (RT implementation)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2006
+ *
+ * Author: Paul McKenney <[email protected]>
+ *
+ * Based on the original work by Paul McKenney <[email protected]>
+ * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
+ * Papers:
+ * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
+ * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
+ *
+ * For detailed explanation of Read-Copy Update mechanism see -
+ * http://lse.sourceforge.net/locking/rcupdate.html
+ *
+ */
+
+#ifndef __LINUX_RCUPREEMPT_TRACE_H
+#define __LINUX_RCUPREEMPT_TRACE_H
+
+#ifdef __KERNEL__
+#include <linux/types.h>
+#include <linux/kernel.h>
+
+#include <asm/atomic.h>
+
+/*
+ * PREEMPT_RCU data structures.
+ */
+
+struct rcupreempt_trace {
+ long next_length;
+ long next_add;
+ long wait_length;
+ long wait_add;
+ long done_length;
+ long done_add;
+ long done_remove;
+ atomic_t done_invoked;
+ long rcu_check_callbacks;
+ atomic_t rcu_try_flip_1;
+ atomic_t rcu_try_flip_e1;
+ long rcu_try_flip_i1;
+ long rcu_try_flip_ie1;
+ long rcu_try_flip_g1;
+ long rcu_try_flip_a1;
+ long rcu_try_flip_ae1;
+ long rcu_try_flip_a2;
+ long rcu_try_flip_z1;
+ long rcu_try_flip_ze1;
+ long rcu_try_flip_z2;
+ long rcu_try_flip_m1;
+ long rcu_try_flip_me1;
+ long rcu_try_flip_m2;
+};
+
+#ifdef CONFIG_RCU_TRACE
+#define RCU_TRACE(fn, arg) fn(arg);
+#else
+#define RCU_TRACE(fn, arg)
+#endif
+
+extern void rcupreempt_trace_move2done(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_move2wait(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_try_flip_1(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_try_flip_e1(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_try_flip_i1(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_try_flip_ie1(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_try_flip_g1(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_try_flip_a1(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_try_flip_ae1(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_try_flip_a2(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_try_flip_z1(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_try_flip_ze1(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_try_flip_z2(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_try_flip_m1(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_try_flip_me1(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_try_flip_m2(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_check_callbacks(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_done_remove(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_invoke(struct rcupreempt_trace *trace);
+extern void rcupreempt_trace_next_add(struct rcupreempt_trace *trace);
+
+#endif /* __KERNEL__ */
+#endif /* __LINUX_RCUPREEMPT_TRACE_H */
diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/sched.h linux-2.6.22-c-preemptrcu/include/linux/sched.h
--- linux-2.6.22-b-fixbarriers/include/linux/sched.h 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-c-preemptrcu/include/linux/sched.h 2007-08-22 15:21:06.000000000 -0700
@@ -850,6 +850,11 @@ struct task_struct {
cpumask_t cpus_allowed;
unsigned int time_slice, first_time_slice;

+#ifdef CONFIG_PREEMPT_RCU
+ int rcu_read_lock_nesting;
+ int rcu_flipctr_idx;
+#endif /* #ifdef CONFIG_PREEMPT_RCU */
+
#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
struct sched_info sched_info;
#endif
diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/kernel/fork.c linux-2.6.22-c-preemptrcu/kernel/fork.c
--- linux-2.6.22-b-fixbarriers/kernel/fork.c 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-c-preemptrcu/kernel/fork.c 2007-08-22 15:21:06.000000000 -0700
@@ -1032,6 +1032,10 @@ static struct task_struct *copy_process(

INIT_LIST_HEAD(&p->children);
INIT_LIST_HEAD(&p->sibling);
+#ifdef CONFIG_PREEMPT_RCU
+ p->rcu_read_lock_nesting = 0;
+ p->rcu_flipctr_idx = 0;
+#endif /* #ifdef CONFIG_PREEMPT_RCU */
p->vfork_done = NULL;
spin_lock_init(&p->alloc_lock);

diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/kernel/Kconfig.preempt linux-2.6.22-c-preemptrcu/kernel/Kconfig.preempt
--- linux-2.6.22-b-fixbarriers/kernel/Kconfig.preempt 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-c-preemptrcu/kernel/Kconfig.preempt 2007-08-22 15:21:06.000000000 -0700
@@ -63,3 +63,41 @@ config PREEMPT_BKL
Say Y here if you are building a kernel for a desktop system.
Say N if you are unsure.

+choice
+ prompt "RCU implementation type:"
+ default CLASSIC_RCU
+
+config CLASSIC_RCU
+ bool "Classic RCU"
+ help
+ This option selects the classic RCU implementation that is
+ designed for best read-side performance on non-realtime
+ systems.
+
+ Say Y if you are unsure.
+
+config PREEMPT_RCU
+ bool "Preemptible RCU"
+ depends on PREEMPT
+ help
+ This option reduces the latency of the kernel by making certain
+ RCU sections preemptible. Normally RCU code is non-preemptible; if
+ this option is selected, then read-only RCU sections become
+ preemptible. This helps latency, but may expose bugs due to
+ now-naive assumptions about each RCU read-side critical section
+ remaining on a given CPU through its execution.
+
+ Say N if you are unsure.
+
+endchoice
+
+config RCU_TRACE
+ bool "Enable tracing for RCU - currently stats in debugfs"
+ select DEBUG_FS
+ default y
+ help
+ This option provides tracing for RCU, which presents stats
+ in debugfs for debugging the RCU implementation.
+
+ Say Y here if you want to enable RCU tracing.
+ Say N if you are unsure.
diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/kernel/Makefile linux-2.6.22-c-preemptrcu/kernel/Makefile
--- linux-2.6.22-b-fixbarriers/kernel/Makefile 2007-07-19 12:16:03.000000000 -0700
+++ linux-2.6.22-c-preemptrcu/kernel/Makefile 2007-08-22 15:21:06.000000000 -0700
@@ -6,7 +6,7 @@ obj-y = sched.o fork.o exec_domain.o
exit.o itimer.o time.o softirq.o resource.o \
sysctl.o capability.o ptrace.o timer.o user.o \
signal.o sys.o kmod.o workqueue.o pid.o \
- rcupdate.o rcuclassic.o extable.o params.o posix-timers.o \
+ rcupdate.o extable.o params.o posix-timers.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
hrtimer.o rwsem.o latency.o nsproxy.o srcu.o die_notifier.o

@@ -46,6 +46,11 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_CLASSIC_RCU) += rcuclassic.o
+obj-$(CONFIG_PREEMPT_RCU) += rcupreempt.o
+ifeq ($(CONFIG_PREEMPT_RCU),y)
+obj-$(CONFIG_RCU_TRACE) += rcupreempt_trace.o
+endif
obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
obj-$(CONFIG_UTS_NS) += utsname.o
diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/kernel/rcupreempt.c linux-2.6.22-c-preemptrcu/kernel/rcupreempt.c
--- linux-2.6.22-b-fixbarriers/kernel/rcupreempt.c 1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.22-c-preemptrcu/kernel/rcupreempt.c 2007-08-22 15:35:19.000000000 -0700
@@ -0,0 +1,767 @@
+/*
+ * Read-Copy Update mechanism for mutual exclusion, realtime implementation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright IBM Corporation, 2006
+ *
+ * Authors: Paul E. McKenney <[email protected]>
+ * With thanks to Esben Nielsen, Bill Huey, and Ingo Molnar
+ * for pushing me away from locks and towards counters, and
+ * to Suparna Bhattacharya for pushing me completely away
+ * from atomic instructions on the read side.
+ *
+ * Papers: http://www.rdrop.com/users/paulmck/RCU
+ *
+ * For detailed explanation of Read-Copy Update mechanism see -
+ * Documentation/RCU/ *.txt
+ *
+ */
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/spinlock.h>
+#include <linux/smp.h>
+#include <linux/rcupdate.h>
+#include <linux/interrupt.h>
+#include <linux/sched.h>
+#include <asm/atomic.h>
+#include <linux/bitops.h>
+#include <linux/module.h>
+#include <linux/completion.h>
+#include <linux/moduleparam.h>
+#include <linux/percpu.h>
+#include <linux/notifier.h>
+#include <linux/rcupdate.h>
+#include <linux/cpu.h>
+#include <linux/random.h>
+#include <linux/delay.h>
+#include <linux/byteorder/swabb.h>
+#include <linux/cpumask.h>
+#include <linux/rcupreempt_trace.h>
+
+/*
+ * PREEMPT_RCU data structures.
+ */
+
+#define GP_STAGES 4
+struct rcu_data {
+ spinlock_t lock; /* Protect rcu_data fields. */
+ long completed; /* Number of last completed batch. */
+ int waitlistcount;
+ struct tasklet_struct rcu_tasklet;
+ struct rcu_head *nextlist;
+ struct rcu_head **nexttail;
+ struct rcu_head *waitlist[GP_STAGES];
+ struct rcu_head **waittail[GP_STAGES];
+ struct rcu_head *donelist;
+ struct rcu_head **donetail;
+#ifdef CONFIG_RCU_TRACE
+ struct rcupreempt_trace trace;
+#endif /* #ifdef CONFIG_RCU_TRACE */
+};
+struct rcu_ctrlblk {
+ spinlock_t fliplock; /* Protect state-machine transitions. */
+ long completed; /* Number of last completed batch. */
+};
+static DEFINE_PER_CPU(struct rcu_data, rcu_data);
+static struct rcu_ctrlblk rcu_ctrlblk = {
+ .fliplock = SPIN_LOCK_UNLOCKED,
+ .completed = 0,
+};
+static DEFINE_PER_CPU(int [2], rcu_flipctr) = { 0, 0 };
+
+/*
+ * States for rcu_try_flip() and friends.
+ */
+
+enum rcu_try_flip_states {
+ rcu_try_flip_idle_state, /* "I" */
+ rcu_try_flip_waitack_state, /* "A" */
+ rcu_try_flip_waitzero_state, /* "Z" */
+ rcu_try_flip_waitmb_state /* "M" */
+};
+static enum rcu_try_flip_states rcu_try_flip_state = rcu_try_flip_idle_state;
+#ifdef CONFIG_RCU_TRACE
+static char *rcu_try_flip_state_names[] =
+ { "idle", "waitack", "waitzero", "waitmb" };
+#endif /* #ifdef CONFIG_RCU_TRACE */
+
+/*
+ * Enum and per-CPU flag to determine when each CPU has seen
+ * the most recent counter flip.
+ */
+
+enum rcu_flip_flag_values {
+ rcu_flip_seen, /* Steady/initial state, last flip seen. */
+ /* Only GP detector can update. */
+ rcu_flipped /* Flip just completed, need confirmation. */
+ /* Only corresponding CPU can update. */
+};
+static DEFINE_PER_CPU(enum rcu_flip_flag_values, rcu_flip_flag) = rcu_flip_seen;
+
+/*
+ * Enum and per-CPU flag to determine when each CPU has executed the
+ * needed memory barrier to fence in memory references from its last RCU
+ * read-side critical section in the just-completed grace period.
+ */
+
+enum rcu_mb_flag_values {
+ rcu_mb_done, /* Steady/initial state, no mb()s required. */
+ /* Only GP detector can update. */
+ rcu_mb_needed /* Flip just completed, need an mb(). */
+ /* Only corresponding CPU can update. */
+};
+static DEFINE_PER_CPU(enum rcu_mb_flag_values, rcu_mb_flag) = rcu_mb_done;
+
+/*
+ * Macro that prevents the compiler from reordering accesses, but does
+ * absolutely -nothing- to prevent CPUs from reordering. This is used
+ * only to mediate communication between mainline code and hardware
+ * interrupt and NMI handlers.
+ */
+#define ORDERED_WRT_IRQ(x) (*(volatile typeof(x) *)&(x))
+
+/*
+ * RCU_DATA_ME: find the current CPU's rcu_data structure.
+ * RCU_DATA_CPU: find the specified CPU's rcu_data structure.
+ */
+#define RCU_DATA_ME() (&__get_cpu_var(rcu_data))
+#define RCU_DATA_CPU(cpu) (&per_cpu(rcu_data, cpu))
+
+/*
+ * Helper macro for tracing when the appropriate rcu_data is not
+ * cached in a local variable, but where the CPU number is so cached.
+ */
+#define RCU_TRACE_CPU(f, cpu) RCU_TRACE(f, &(RCU_DATA_CPU(cpu)->trace));
+
+/*
+ * Helper macro for tracing when the appropriate rcu_data is not
+ * cached in a local variable.
+ */
+#define RCU_TRACE_ME(f) RCU_TRACE(f, &(RCU_DATA_ME()->trace));
+
+/*
+ * Helper macro for tracing when the appropriate rcu_data is pointed
+ * to by a local variable.
+ */
+#define RCU_TRACE_RDP(f, rdp) RCU_TRACE(f, &((rdp)->trace));
+
+/*
+ * Return the number of RCU batches processed thus far. Useful
+ * for debug and statistics.
+ */
+long rcu_batches_completed(void)
+{
+ return rcu_ctrlblk.completed;
+}
+EXPORT_SYMBOL_GPL(rcu_batches_completed);
+
+/*
+ * Return the number of RCU batches processed thus far. Useful for debug
+ * and statistics. The _bh variant is identical to straight RCU.
+ */
+long rcu_batches_completed_bh(void)
+{
+ return rcu_ctrlblk.completed;
+}
+EXPORT_SYMBOL_GPL(rcu_batches_completed_bh);
+
+void __rcu_read_lock(void)
+{
+ int idx;
+ struct task_struct *me = current;
+ int nesting;
+
+ nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
+ if (nesting != 0) {
+
+ /* An earlier rcu_read_lock() covers us, just count it. */
+
+ me->rcu_read_lock_nesting = nesting + 1;
+
+ } else {
+ unsigned long oldirq;
+
+ /*
+ * Disable local interrupts to prevent the grace-period
+ * detection state machine from seeing us half-done.
+ * NMIs can still occur, of course, and might themselves
+ * contain rcu_read_lock().
+ */
+
+ local_irq_save(oldirq);
+
+ /*
+ * Outermost nesting of rcu_read_lock(), so increment
+ * the current counter for the current CPU. Use volatile
+ * casts to prevent the compiler from reordering.
+ */
+
+ idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
+ smp_read_barrier_depends(); /* @@@@ might be unneeded */
+ ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++;
+
+ /*
+ * Now that the per-CPU counter has been incremented, we
+ * are protected from races with rcu_read_lock() invoked
+ * from NMI handlers on this CPU. We can therefore safely
+ * increment the nesting counter, relieving further NMIs
+ * of the need to increment the per-CPU counter.
+ */
+
+ ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1;
+
+ /*
+ * Now that we have prevented any NMIs from storing
+ * to the ->rcu_flipctr_idx, we can safely use it to
+ * remember which counter to decrement in the matching
+ * rcu_read_unlock().
+ */
+
+ ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx;
+ local_irq_restore(oldirq);
+ }
+}
+EXPORT_SYMBOL_GPL(__rcu_read_lock);
+
+void __rcu_read_unlock(void)
+{
+ int idx;
+ struct task_struct *me = current;
+ int nesting;
+
+ nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
+ if (nesting > 1) {
+
+ /*
+ * We are still protected by the enclosing rcu_read_lock(),
+ * so simply decrement the counter.
+ */
+
+ me->rcu_read_lock_nesting = nesting - 1;
+
+ } else {
+ unsigned long oldirq;
+
+ /*
+ * Disable local interrupts to prevent the grace-period
+ * detection state machine from seeing us half-done.
+ * NMIs can still occur, of course, and might themselves
+ * contain rcu_read_lock() and rcu_read_unlock().
+ */
+
+ local_irq_save(oldirq);
+
+ /*
+ * Outermost nesting of rcu_read_unlock(), so we must
+ * decrement the current counter for the current CPU.
+ * This must be done carefully, because NMIs can
+ * occur at any point in this code, and any rcu_read_lock()
+ * and rcu_read_unlock() pairs in the NMI handlers
+ * must interact non-destructively with this code.
+ * Lots of volatile casts, and -very- careful ordering.
+ *
+ * Changes to this code, including this one, must be
+ * inspected, validated, and tested extremely carefully!!!
+ */
+
+ /*
+ * First, pick up the index. Enforce ordering for
+ * DEC Alpha.
+ */
+
+ idx = ORDERED_WRT_IRQ(me->rcu_flipctr_idx);
+ smp_read_barrier_depends(); /* @@@ Needed??? */
+
+ /*
+ * Now that we have fetched the counter index, it is
+ * safe to decrement the per-task RCU nesting counter.
+ * After this, any interrupts or NMIs will increment and
+ * decrement the per-CPU counters.
+ */
+ ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting - 1;
+
+ /*
+ * It is now safe to decrement the per-CPU counter. NMIs
+ * that occur after this point will route their
+ * rcu_read_lock() calls through this "else" clause, and
+ * will thus start incrementing the per-CPU counter on
+ * their own. They will also clobber ->rcu_flipctr_idx,
+ * but that is OK, since we have already fetched it.
+ */
+
+ ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])--;
+ local_irq_restore(oldirq);
+ }
+}
+EXPORT_SYMBOL_GPL(__rcu_read_unlock);
+
+/*
+ * If a global counter flip has occurred since the last time that we
+ * advanced callbacks, advance them. Hardware interrupts must be
+ * disabled when calling this function.
+ */
+static void __rcu_advance_callbacks(struct rcu_data *rdp)
+{
+ int cpu;
+ int i;
+ int wlc = 0;
+
+ if (rdp->completed != rcu_ctrlblk.completed) {
+ if (rdp->waitlist[GP_STAGES - 1] != NULL) {
+ *rdp->donetail = rdp->waitlist[GP_STAGES - 1];
+ rdp->donetail = rdp->waittail[GP_STAGES - 1];
+ RCU_TRACE_RDP(rcupreempt_trace_move2done, rdp);
+ }
+ for (i = GP_STAGES - 2; i >= 0; i--) {
+ if (rdp->waitlist[i] != NULL) {
+ rdp->waitlist[i + 1] = rdp->waitlist[i];
+ rdp->waittail[i + 1] = rdp->waittail[i];
+ wlc++;
+ } else {
+ rdp->waitlist[i + 1] = NULL;
+ rdp->waittail[i + 1] =
+ &rdp->waitlist[i + 1];
+ }
+ }
+ if (rdp->nextlist != NULL) {
+ rdp->waitlist[0] = rdp->nextlist;
+ rdp->waittail[0] = rdp->nexttail;
+ wlc++;
+ rdp->nextlist = NULL;
+ rdp->nexttail = &rdp->nextlist;
+ RCU_TRACE_RDP(rcupreempt_trace_move2wait, rdp);
+ } else {
+ rdp->waitlist[0] = NULL;
+ rdp->waittail[0] = &rdp->waitlist[0];
+ }
+ rdp->waitlistcount = wlc;
+ rdp->completed = rcu_ctrlblk.completed;
+ }
+
+ /*
+ * Check to see if this CPU needs to report that it has seen
+ * the most recent counter flip, thereby declaring that all
+ * subsequent rcu_read_lock() invocations will respect this flip.
+ */
+
+ cpu = raw_smp_processor_id();
+ if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped) {
+ smp_mb(); /* Subsequent counter accesses must see new value */
+ per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;
+ smp_mb(); /* Subsequent RCU read-side critical sections */
+ /* seen -after- acknowledgement. */
+ }
+}
+
+/*
+ * Get here when RCU is idle. Decide whether we need to
+ * move out of idle state, and return non-zero if so.
+ * "Straightforward" approach for the moment, might later
+ * use callback-list lengths, grace-period duration, or
+ * some such to determine when to exit idle state.
+ * Might also need a pre-idle test that does not acquire
+ * the lock, but let's get the simple case working first...
+ */
+
+static int
+rcu_try_flip_idle(void)
+{
+ int cpu;
+
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_i1);
+ if (!rcu_pending(smp_processor_id())) {
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1);
+ return 0;
+ }
+
+ /*
+ * Do the flip.
+ */
+
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_g1);
+ rcu_ctrlblk.completed++; /* stands in for rcu_try_flip_g2 */
+
+ /*
+ * Need a memory barrier so that other CPUs see the new
+ * counter value before they see the subsequent change of all
+ * the rcu_flip_flag instances to rcu_flipped.
+ */
+
+ smp_mb(); /* see above block comment. */
+
+ /* Now ask each CPU for acknowledgement of the flip. */
+
+ for_each_possible_cpu(cpu)
+ per_cpu(rcu_flip_flag, cpu) = rcu_flipped;
+
+ return 1;
+}
+
+/*
+ * Wait for CPUs to acknowledge the flip.
+ */
+
+static int
+rcu_try_flip_waitack(void)
+{
+ int cpu;
+
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_a1);
+ for_each_possible_cpu(cpu)
+ if (per_cpu(rcu_flip_flag, cpu) != rcu_flip_seen) {
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_ae1);
+ return 0;
+ }
+
+ /*
+ * Make sure our checks above don't bleed into subsequent
+ * waiting for the sum of the counters to reach zero.
+ */
+
+ smp_mb(); /* see above block comment. */
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_a2);
+ return 1;
+}
+
+/*
+ * Wait for collective ``last'' counter to reach zero,
+ * then tell all CPUs to do an end-of-grace-period memory barrier.
+ */
+
+static int
+rcu_try_flip_waitzero(void)
+{
+ int cpu;
+ int lastidx = !(rcu_ctrlblk.completed & 0x1);
+ int sum = 0;
+
+ /* Check to see if the sum of the "last" counters is zero. */
+
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_z1);
+ for_each_possible_cpu(cpu)
+ sum += per_cpu(rcu_flipctr, cpu)[lastidx];
+ if (sum != 0) {
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_ze1);
+ return 0;
+ }
+
+ smp_mb(); /* Don't call for memory barriers before we see zero. */
+
+ /* Call for a memory barrier from each CPU. */
+
+ for_each_possible_cpu(cpu)
+ per_cpu(rcu_mb_flag, cpu) = rcu_mb_needed;
+
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_z2);
+ return 1;
+}
+
+/*
+ * Wait for all CPUs to do their end-of-grace-period memory barrier.
+ * Return 1 once all CPUs have done so.
+ */
+
+static int
+rcu_try_flip_waitmb(void)
+{
+ int cpu;
+
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_m1);
+ for_each_possible_cpu(cpu)
+ if (per_cpu(rcu_mb_flag, cpu) != rcu_mb_done) {
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_me1);
+ return 0;
+ }
+
+ smp_mb(); /* Ensure that the above checks precede any following flip. */
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_m2);
+ return 1;
+}
+
+/*
+ * Attempt a single flip of the counters. Remember, a single flip does
+ * -not- constitute a grace period. Instead, the interval between
+ * at least three consecutive flips is a grace period.
+ *
+ * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
+ * on a large SMP, they might want to use a hierarchical organization of
+ * the per-CPU-counter pairs.
+ */
+static void rcu_try_flip(void)
+{
+ unsigned long oldirq;
+
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_1);
+ if (unlikely(!spin_trylock_irqsave(&rcu_ctrlblk.fliplock, oldirq))) {
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_e1);
+ return;
+ }
+
+ /*
+ * Take the next transition(s) through the RCU grace-period
+ * flip-counter state machine.
+ */
+
+ switch (rcu_try_flip_state) {
+ case rcu_try_flip_idle_state:
+ if (rcu_try_flip_idle())
+ rcu_try_flip_state = rcu_try_flip_waitack_state;
+ break;
+ case rcu_try_flip_waitack_state:
+ if (rcu_try_flip_waitack())
+ rcu_try_flip_state = rcu_try_flip_waitzero_state;
+ break;
+ case rcu_try_flip_waitzero_state:
+ if (rcu_try_flip_waitzero())
+ rcu_try_flip_state = rcu_try_flip_waitmb_state;
+ break;
+ case rcu_try_flip_waitmb_state:
+ if (rcu_try_flip_waitmb())
+ rcu_try_flip_state = rcu_try_flip_idle_state;
+ }
+ spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
+}
+
+/*
+ * Check to see if this CPU needs to do a memory barrier in order to
+ * ensure that any prior RCU read-side critical sections have committed
+ * their counter manipulations and critical-section memory references
+ * before declaring the grace period to be completed.
+ */
+static void rcu_check_mb(int cpu)
+{
+ if (per_cpu(rcu_mb_flag, cpu) == rcu_mb_needed) {
+ smp_mb(); /* Ensure RCU read-side accesses are visible. */
+ per_cpu(rcu_mb_flag, cpu) = rcu_mb_done;
+ }
+}
+
+void rcu_check_callbacks(int cpu, int user)
+{
+ unsigned long oldirq;
+ struct rcu_data *rdp = RCU_DATA_CPU(cpu);
+
+ rcu_check_mb(cpu);
+ if (rcu_ctrlblk.completed == rdp->completed)
+ rcu_try_flip();
+ spin_lock_irqsave(&rdp->lock, oldirq);
+ RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp);
+ __rcu_advance_callbacks(rdp);
+ if (rdp->donelist == NULL) {
+ spin_unlock_irqrestore(&rdp->lock, oldirq);
+ } else {
+ spin_unlock_irqrestore(&rdp->lock, oldirq);
+ raise_softirq(RCU_SOFTIRQ);
+ }
+}
+
+/*
+ * Needed by dynticks, to make sure all RCU processing has finished
+ * when we go idle:
+ */
+void rcu_advance_callbacks(int cpu, int user)
+{
+ unsigned long oldirq;
+ struct rcu_data *rdp = RCU_DATA_CPU(cpu);
+
+ if (rcu_ctrlblk.completed == rdp->completed) {
+ rcu_try_flip();
+ if (rcu_ctrlblk.completed == rdp->completed)
+ return;
+ }
+ spin_lock_irqsave(&rdp->lock, oldirq);
+ RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp);
+ __rcu_advance_callbacks(rdp);
+ spin_unlock_irqrestore(&rdp->lock, oldirq);
+}
+
+static void rcu_process_callbacks(struct softirq_action *unused)
+{
+ unsigned long flags;
+ struct rcu_head *next, *list;
+ struct rcu_data *rdp = RCU_DATA_ME();
+
+ spin_lock_irqsave(&rdp->lock, flags);
+ list = rdp->donelist;
+ if (list == NULL) {
+ spin_unlock_irqrestore(&rdp->lock, flags);
+ return;
+ }
+ rdp->donelist = NULL;
+ rdp->donetail = &rdp->donelist;
+ RCU_TRACE_RDP(rcupreempt_trace_done_remove, rdp);
+ spin_unlock_irqrestore(&rdp->lock, flags);
+ while (list) {
+ next = list->next;
+ list->func(list);
+ list = next;
+ RCU_TRACE_ME(rcupreempt_trace_invoke);
+ }
+}
+
+void fastcall call_rcu(struct rcu_head *head,
+ void (*func)(struct rcu_head *rcu))
+{
+ unsigned long oldirq;
+ struct rcu_data *rdp;
+
+ head->func = func;
+ head->next = NULL;
+ local_irq_save(oldirq);
+ rdp = RCU_DATA_ME();
+ spin_lock(&rdp->lock);
+ __rcu_advance_callbacks(rdp);
+ *rdp->nexttail = head;
+ rdp->nexttail = &head->next;
+ RCU_TRACE_RDP(rcupreempt_trace_next_add, rdp);
+ spin_unlock(&rdp->lock);
+ local_irq_restore(oldirq);
+}
+EXPORT_SYMBOL_GPL(call_rcu);
+
+/*
+ * Wait until all currently running preempt_disable() code segments
+ * (including hardware-irq-disable segments) complete. Note that
+ * in -rt this does -not- necessarily result in all currently executing
+ * interrupt -handlers- having completed.
+ */
+void __synchronize_sched(void)
+{
+ cpumask_t oldmask;
+ int cpu;
+
+ if (sched_getaffinity(0, &oldmask) < 0)
+ oldmask = cpu_possible_map;
+ for_each_online_cpu(cpu) {
+ sched_setaffinity(0, cpumask_of_cpu(cpu));
+ schedule();
+ }
+ sched_setaffinity(0, oldmask);
+}
+EXPORT_SYMBOL_GPL(__synchronize_sched);
+
+/*
+ * Check to see if any future RCU-related work will need to be done
+ * by the current CPU, even if none need be done immediately, returning
+ * 1 if so. Assumes that notifiers would take care of handling any
+ * outstanding requests from the RCU core.
+ *
+ * This function is part of the RCU implementation; it is -not-
+ * an exported member of the RCU API.
+ */
+int rcu_needs_cpu(int cpu)
+{
+ struct rcu_data *rdp = RCU_DATA_CPU(cpu);
+
+ return (rdp->donelist != NULL ||
+ !!rdp->waitlistcount ||
+ rdp->nextlist != NULL);
+}
+
+int rcu_pending(int cpu)
+{
+ struct rcu_data *rdp = RCU_DATA_CPU(cpu);
+
+ /* The CPU has at least one callback queued somewhere. */
+
+ if (rdp->donelist != NULL ||
+ !!rdp->waitlistcount ||
+ rdp->nextlist != NULL)
+ return 1;
+
+ /* The RCU core needs an acknowledgement from this CPU. */
+
+ if ((per_cpu(rcu_flip_flag, cpu) == rcu_flipped) ||
+ (per_cpu(rcu_mb_flag, cpu) == rcu_mb_needed))
+ return 1;
+
+ /* This CPU has fallen behind the global grace-period number. */
+
+ if (rdp->completed != rcu_ctrlblk.completed)
+ return 1;
+
+ /* Nothing needed from this CPU. */
+
+ return 0;
+}
+
+void __init __rcu_init(void)
+{
+ int cpu;
+ int i;
+ struct rcu_data *rdp;
+
+/*&&&&*/printk(KERN_NOTICE "WARNING: experimental RCU implementation.\n");
+ for_each_possible_cpu(cpu) {
+ rdp = RCU_DATA_CPU(cpu);
+ spin_lock_init(&rdp->lock);
+ rdp->completed = 0;
+ rdp->waitlistcount = 0;
+ rdp->nextlist = NULL;
+ rdp->nexttail = &rdp->nextlist;
+ for (i = 0; i < GP_STAGES; i++) {
+ rdp->waitlist[i] = NULL;
+ rdp->waittail[i] = &rdp->waitlist[i];
+ }
+ rdp->donelist = NULL;
+ rdp->donetail = &rdp->donelist;
+ }
+ open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL);
+}
+
+/*
+ * Deprecated, use synchronize_rcu() or synchronize_sched() instead.
+ */
+void synchronize_kernel(void)
+{
+ synchronize_rcu();
+}
+
+#ifdef CONFIG_RCU_TRACE
+int *rcupreempt_flipctr(int cpu)
+{
+ return &per_cpu(rcu_flipctr, cpu)[0];
+}
+EXPORT_SYMBOL_GPL(rcupreempt_flipctr);
+
+int rcupreempt_flip_flag(int cpu)
+{
+ return per_cpu(rcu_flip_flag, cpu);
+}
+EXPORT_SYMBOL_GPL(rcupreempt_flip_flag);
+
+int rcupreempt_mb_flag(int cpu)
+{
+ return per_cpu(rcu_mb_flag, cpu);
+}
+EXPORT_SYMBOL_GPL(rcupreempt_mb_flag);
+
+char *rcupreempt_try_flip_state_name(void)
+{
+ return rcu_try_flip_state_names[rcu_try_flip_state];
+}
+EXPORT_SYMBOL_GPL(rcupreempt_try_flip_state_name);
+
+struct rcupreempt_trace *rcupreempt_trace_cpu(int cpu)
+{
+ struct rcu_data *rdp = RCU_DATA_CPU(cpu);
+
+ return &rdp->trace;
+}
+EXPORT_SYMBOL_GPL(rcupreempt_trace_cpu);
+
+#endif /* #ifdef CONFIG_RCU_TRACE */
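
The comments above describe the per-CPU counter pairs and the counter
flip; to build intuition for why a reader that started before a flip
holds up grace-period detection, here is a toy, single-threaded
user-space model of just that piece. It is emphatically not the kernel
code: it drops the memory barriers, NMI handling, per-task nesting, and
preemption that the real implementation exists to handle, and NCPUS and
the toy_* names are invented for the illustration.

#include <stdio.h>

#define NCPUS 4

static long completed;          /* models rcu_ctrlblk.completed */
static int flipctr[NCPUS][2];   /* models per-CPU rcu_flipctr[2] */

static int toy_read_lock(int cpu)
{
        int idx = completed & 0x1;      /* current counter of the pair */

        flipctr[cpu][idx]++;
        return idx;                     /* caller remembers the index */
}

static void toy_read_unlock(int cpu, int idx)
{
        flipctr[cpu][idx]--;            /* decrement the *same* counter */
}

static int toy_last_counters_zero(void)
{
        int lastidx = !(completed & 0x1);
        int cpu, sum = 0;

        /* Only the sum matters: lock and unlock may hit different CPUs. */
        for (cpu = 0; cpu < NCPUS; cpu++)
                sum += flipctr[cpu][lastidx];
        return sum == 0;
}

int main(void)
{
        int idx;

        idx = toy_read_lock(0);         /* reader starts before the flip */
        completed++;                    /* grace-period detector flips */
        printf("last counters zero? %d\n", toy_last_counters_zero());
        toy_read_unlock(0, idx);        /* reader exits, drains old counter */
        printf("last counters zero? %d\n", toy_last_counters_zero());
        return 0;
}

Running it prints 0 then 1: the "last" counters cannot drain until the
pre-flip reader exits, which is exactly the condition that
rcu_try_flip_waitzero() above polls for.
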
diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/kernel/rcupreempt_trace.c linux-2.6.22-c-preemptrcu/kernel/rcupreempt_trace.c
--- linux-2.6.22-b-fixbarriers/kernel/rcupreempt_trace.c 1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.22-c-preemptrcu/kernel/rcupreempt_trace.c 2007-08-22 15:36:12.000000000 -0700
@@ -0,0 +1,330 @@
+/*
+ * Read-Copy Update tracing for realtime implementation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright IBM Corporation, 2006
+ *
+ * Papers: http://www.rdrop.com/users/paulmck/RCU
+ *
+ * For detailed explanation of Read-Copy Update mechanism see -
+ * Documentation/RCU/ *.txt
+ *
+ */
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/spinlock.h>
+#include <linux/smp.h>
+#include <linux/rcupdate.h>
+#include <linux/interrupt.h>
+#include <linux/sched.h>
+#include <asm/atomic.h>
+#include <linux/bitops.h>
+#include <linux/module.h>
+#include <linux/completion.h>
+#include <linux/moduleparam.h>
+#include <linux/percpu.h>
+#include <linux/notifier.h>
+#include <linux/rcupdate.h>
+#include <linux/cpu.h>
+#include <linux/mutex.h>
+#include <linux/rcupreempt_trace.h>
+#include <linux/debugfs.h>
+
+static struct mutex rcupreempt_trace_mutex;
+static char *rcupreempt_trace_buf;
+#define RCUPREEMPT_TRACE_BUF_SIZE 4096
+
+void rcupreempt_trace_move2done(struct rcupreempt_trace *trace)
+{
+ trace->done_length += trace->wait_length;
+ trace->done_add += trace->wait_length;
+ trace->wait_length = 0;
+}
+void rcupreempt_trace_move2wait(struct rcupreempt_trace *trace)
+{
+ trace->wait_length += trace->next_length;
+ trace->wait_add += trace->next_length;
+ trace->next_length = 0;
+}
+void rcupreempt_trace_try_flip_1(struct rcupreempt_trace *trace)
+{
+ atomic_inc(&trace->rcu_try_flip_1);
+}
+void rcupreempt_trace_try_flip_e1(struct rcupreempt_trace *trace)
+{
+ atomic_inc(&trace->rcu_try_flip_e1);
+}
+void rcupreempt_trace_try_flip_i1(struct rcupreempt_trace *trace)
+{
+ trace->rcu_try_flip_i1++;
+}
+void rcupreempt_trace_try_flip_ie1(struct rcupreempt_trace *trace)
+{
+ trace->rcu_try_flip_ie1++;
+}
+void rcupreempt_trace_try_flip_g1(struct rcupreempt_trace *trace)
+{
+ trace->rcu_try_flip_g1++;
+}
+void rcupreempt_trace_try_flip_a1(struct rcupreempt_trace *trace)
+{
+ trace->rcu_try_flip_a1++;
+}
+void rcupreempt_trace_try_flip_ae1(struct rcupreempt_trace *trace)
+{
+ trace->rcu_try_flip_ae1++;
+}
+void rcupreempt_trace_try_flip_a2(struct rcupreempt_trace *trace)
+{
+ trace->rcu_try_flip_a2++;
+}
+void rcupreempt_trace_try_flip_z1(struct rcupreempt_trace *trace)
+{
+ trace->rcu_try_flip_z1++;
+}
+void rcupreempt_trace_try_flip_ze1(struct rcupreempt_trace *trace)
+{
+ trace->rcu_try_flip_ze1++;
+}
+void rcupreempt_trace_try_flip_z2(struct rcupreempt_trace *trace)
+{
+ trace->rcu_try_flip_z2++;
+}
+void rcupreempt_trace_try_flip_m1(struct rcupreempt_trace *trace)
+{
+ trace->rcu_try_flip_m1++;
+}
+void rcupreempt_trace_try_flip_me1(struct rcupreempt_trace *trace)
+{
+ trace->rcu_try_flip_me1++;
+}
+void rcupreempt_trace_try_flip_m2(struct rcupreempt_trace *trace)
+{
+ trace->rcu_try_flip_m2++;
+}
+void rcupreempt_trace_check_callbacks(struct rcupreempt_trace *trace)
+{
+ trace->rcu_check_callbacks++;
+}
+void rcupreempt_trace_done_remove(struct rcupreempt_trace *trace)
+{
+ trace->done_remove += trace->done_length;
+ trace->done_length = 0;
+}
+void rcupreempt_trace_invoke(struct rcupreempt_trace *trace)
+{
+ atomic_inc(&trace->done_invoked);
+}
+void rcupreempt_trace_next_add(struct rcupreempt_trace *trace)
+{
+ trace->next_add++;
+ trace->next_length++;
+}
+
+static void rcupreempt_trace_sum(struct rcupreempt_trace *sp)
+{
+ struct rcupreempt_trace *cp;
+ int cpu;
+
+ memset(sp, 0, sizeof(*sp));
+ for_each_possible_cpu(cpu) {
+ cp = rcupreempt_trace_cpu(cpu);
+ sp->next_length += cp->next_length;
+ sp->next_add += cp->next_add;
+ sp->wait_length += cp->wait_length;
+ sp->wait_add += cp->wait_add;
+ sp->done_length += cp->done_length;
+ sp->done_add += cp->done_add;
+ sp->done_remove += cp->done_remove;
+ atomic_set(&sp->done_invoked, atomic_read(&cp->done_invoked));
+ sp->rcu_check_callbacks += cp->rcu_check_callbacks;
+ atomic_set(&sp->rcu_try_flip_1,
+ atomic_read(&cp->rcu_try_flip_1));
+ atomic_set(&sp->rcu_try_flip_e1,
+ atomic_read(&cp->rcu_try_flip_e1));
+ sp->rcu_try_flip_i1 += cp->rcu_try_flip_i1;
+ sp->rcu_try_flip_ie1 += cp->rcu_try_flip_ie1;
+ sp->rcu_try_flip_g1 += cp->rcu_try_flip_g1;
+ sp->rcu_try_flip_a1 += cp->rcu_try_flip_a1;
+ sp->rcu_try_flip_ae1 += cp->rcu_try_flip_ae1;
+ sp->rcu_try_flip_a2 += cp->rcu_try_flip_a2;
+ sp->rcu_try_flip_z1 += cp->rcu_try_flip_z1;
+ sp->rcu_try_flip_ze1 += cp->rcu_try_flip_ze1;
+ sp->rcu_try_flip_z2 += cp->rcu_try_flip_z2;
+ sp->rcu_try_flip_m1 += cp->rcu_try_flip_m1;
+ sp->rcu_try_flip_me1 += cp->rcu_try_flip_me1;
+ sp->rcu_try_flip_m2 += cp->rcu_try_flip_m2;
+ }
+}
+
+static ssize_t rcustats_read(struct file *filp, char __user *buffer,
+ size_t count, loff_t *ppos)
+{
+ struct rcupreempt_trace trace;
+ ssize_t bcount;
+ int cnt = 0;
+
+ rcupreempt_trace_sum(&trace);
+ mutex_lock(&rcupreempt_trace_mutex);
+ cnt += snprintf(&rcupreempt_trace_buf[cnt], RCUPREEMPT_TRACE_BUF_SIZE - cnt,
+ "ggp=%ld rcc=%ld\n",
+ rcu_batches_completed(),
+ trace.rcu_check_callbacks);
+ snprintf(&rcupreempt_trace_buf[cnt], RCUPREEMPT_TRACE_BUF_SIZE - cnt,
+ "na=%ld nl=%ld wa=%ld wl=%ld da=%ld dl=%ld dr=%ld di=%d\n"
+ "1=%d e1=%d i1=%ld ie1=%ld g1=%ld a1=%ld ae1=%ld a2=%ld\n"
+ "z1=%ld ze1=%ld z2=%ld m1=%ld me1=%ld m2=%ld\n",
+
+ trace.next_add, trace.next_length,
+ trace.wait_add, trace.wait_length,
+ trace.done_add, trace.done_length,
+ trace.done_remove, atomic_read(&trace.done_invoked),
+ atomic_read(&trace.rcu_try_flip_1),
+ atomic_read(&trace.rcu_try_flip_e1),
+ trace.rcu_try_flip_i1, trace.rcu_try_flip_ie1,
+ trace.rcu_try_flip_g1,
+ trace.rcu_try_flip_a1, trace.rcu_try_flip_ae1,
+ trace.rcu_try_flip_a2,
+ trace.rcu_try_flip_z1, trace.rcu_try_flip_ze1,
+ trace.rcu_try_flip_z2,
+ trace.rcu_try_flip_m1, trace.rcu_try_flip_me1,
+ trace.rcu_try_flip_m2);
+ bcount = simple_read_from_buffer(buffer, count, ppos,
+ rcupreempt_trace_buf, strlen(rcupreempt_trace_buf));
+ mutex_unlock(&rcupreempt_trace_mutex);
+ return bcount;
+}
+
+static ssize_t rcugp_read(struct file *filp, char __user *buffer,
+ size_t count, loff_t *ppos)
+{
+ long oldgp = rcu_batches_completed();
+ ssize_t bcount;
+
+ mutex_lock(&rcupreempt_trace_mutex);
+ synchronize_rcu();
+ snprintf(rcupreempt_trace_buf, RCUPREEMPT_TRACE_BUF_SIZE,
+ "oldggp=%ld newggp=%ld\n", oldgp, rcu_batches_completed());
+ bcount = simple_read_from_buffer(buffer, count, ppos,
+ rcupreempt_trace_buf, strlen(rcupreempt_trace_buf));
+ mutex_unlock(&rcupreempt_trace_mutex);
+ return bcount;
+}
+
+static ssize_t rcuctrs_read(struct file *filp, char __user *buffer,
+ size_t count, loff_t *ppos)
+{
+ int cnt = 0;
+ int cpu;
+ int f = rcu_batches_completed() & 0x1;
+ ssize_t bcount;
+
+ mutex_lock(&rcupreempt_trace_mutex);
+
+ cnt += snprintf(&rcupreempt_trace_buf[cnt], RCUPREEMPT_TRACE_BUF_SIZE,
+ "CPU last cur F M\n");
+ for_each_online_cpu(cpu) {
+ int *flipctr = rcupreempt_flipctr(cpu);
+ cnt += snprintf(&rcupreempt_trace_buf[cnt],
+ RCUPREEMPT_TRACE_BUF_SIZE - cnt,
+ "%3d %4d %3d %d %d\n",
+ cpu,
+ flipctr[!f],
+ flipctr[f],
+ rcupreempt_flip_flag(cpu),
+ rcupreempt_mb_flag(cpu));
+ }
+ cnt += snprintf(&rcupreempt_trace_buf[cnt],
+ RCUPREEMPT_TRACE_BUF_SIZE - cnt,
+ "ggp = %ld, state = %s\n",
+ rcu_batches_completed(),
+ rcupreempt_try_flip_state_name());
+ cnt += snprintf(&rcupreempt_trace_buf[cnt],
+ RCUPREEMPT_TRACE_BUF_SIZE - cnt,
+ "\n");
+ bcount = simple_read_from_buffer(buffer, count, ppos,
+ rcupreempt_trace_buf, strlen(rcupreempt_trace_buf));
+ mutex_unlock(&rcupreempt_trace_mutex);
+ return bcount;
+}
+
+static struct file_operations rcustats_fops = {
+ .owner = THIS_MODULE,
+ .read = rcustats_read,
+};
+
+static struct file_operations rcugp_fops = {
+ .owner = THIS_MODULE,
+ .read = rcugp_read,
+};
+
+static struct file_operations rcuctrs_fops = {
+ .owner = THIS_MODULE,
+ .read = rcuctrs_read,
+};
+
+static struct dentry *rcudir, *statdir, *ctrsdir, *gpdir;
+static int rcupreempt_debugfs_init(void)
+{
+ rcudir = debugfs_create_dir("rcu", NULL);
+ if (!rcudir)
+ goto out;
+ statdir = debugfs_create_file("rcustats", 0444, rcudir,
+ NULL, &rcustats_fops);
+ if (!statdir)
+ goto free_out;
+
+ gpdir = debugfs_create_file("rcugp", 0444, rcudir, NULL, &rcugp_fops);
+ if (!gpdir)
+ goto free_out;
+
+ ctrsdir = debugfs_create_file("rcuctrs", 0444, rcudir,
+ NULL, &rcuctrs_fops);
+ if (!ctrsdir)
+ goto free_out;
+ return 0;
+free_out:
+ if (statdir)
+ debugfs_remove(statdir);
+ if (gpdir)
+ debugfs_remove(gpdir);
+ debugfs_remove(rcudir);
+out:
+ return 1;
+}
+
+static int __init rcupreempt_trace_init(void)
+{
+ mutex_init(&rcupreempt_trace_mutex);
+ rcupreempt_trace_buf = kmalloc(RCUPREEMPT_TRACE_BUF_SIZE, GFP_KERNEL);
+ if (!rcupreempt_trace_buf)
+ return 1;
+ return rcupreempt_debugfs_init();
+}
+
+static void __exit rcupreempt_trace_cleanup(void)
+{
+ debugfs_remove(statdir);
+ debugfs_remove(gpdir);
+ debugfs_remove(ctrsdir);
+ debugfs_remove(rcudir);
+ kfree(rcupreempt_trace_buf);
+}
+
+
+module_init(rcupreempt_trace_init);
+module_exit(rcupreempt_trace_cleanup);
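
Once CONFIG_RCU_TRACE is enabled, the counters above show up under the
"rcu" directory wherever debugfs is mounted (commonly /sys/kernel/debug,
but that depends on the system). Reading rcu/rcustats, rcu/rcugp, or
rcu/rcuctrs is all it takes; the trivial reader below does the same
thing in C and assumes the common mount point:

#include <stdio.h>

int main(void)
{
        char buf[4096];
        size_t n;
        FILE *fp = fopen("/sys/kernel/debug/rcu/rcustats", "r");

        if (fp == NULL) {
                perror("rcustats");     /* debugfs not mounted here? */
                return 1;
        }
        n = fread(buf, 1, sizeof(buf) - 1, fp);
        buf[n] = '\0';
        fclose(fp);
        fputs(buf, stdout);             /* e.g. "ggp=... rcc=..." lines */
        return 0;
}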

2007-09-10 18:35:40

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH RFC 4/9] RCU: synchronize_sched() workaround for CPU hotplug

Work in progress, not for inclusion.

The combination of CPU hotplug and PREEMPT_RCU has resulted in deadlocks
due to the migration-based implementation of synchronize_sched() in -rt.
This experimental patch maps synchronize_sched() back onto Classic RCU,
eliminating the migration, thus hopefully also eliminating the deadlocks.
It is not clear that this is a good long-term approach, but it will at
least give people doing CPU hotplug in -rt kernels additional wiggle room
in their design and implementation.

The basic approach is to cause the -rt kernel to incorporate rcuclassic.c
as well as rcupreempt.c, but to #ifdef out the conflicting portions of
rcuclassic.c so that only the code needed to implement synchronize_sched()
remains in a PREEMPT_RT build. Invocations of grace-period detection from
the scheduling-clock interrupt go to rcuclassic.c, which then invokes
the corresponding functions in rcupreempt.c (with _rt suffix added to
keep the linker happy). This patch also converts Classic RCU's callback
processing from a tasklet to RCU_SOFTIRQ. The bulk of this patch just
moves code around, but likely increases scheduling-clock latency.
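
For readers skimming the diffs, the hook pattern boils down to the
following condensed illustration (summarized from the rcuclassic.c and
rcuclassic.h changes below, not additional code):

/* kernel/rcuclassic.c, now built for both CLASSIC_RCU and PREEMPT_RCU. */
void rcu_check_callbacks(int cpu, int user)
{
        /* ... classic quiescent-state bookkeeping ... */
        rcu_check_callbacks_rt(cpu, user);      /* preemptible-RCU hook */
        raise_softirq(RCU_SOFTIRQ);
}

/* include/linux/rcuclassic.h: in CLASSIC_RCU builds the _rt hooks are
 * no-ops, so the calls above vanish; in PREEMPT_RCU builds rcupreempt.h
 * instead declares the real functions implemented in rcupreempt.c. */
#define rcu_check_callbacks_rt(cpu, user)       do { } while (0)
#define rcu_pending_rt(cpu)                     0
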

If this patch does turn out to be the right approach, the #ifdefs in
kernel/rcuclassic.c might be dealt with. ;-) As of this writing, Gautham
Shenoy's most recent CPU-hotplug fixes seem likely to obsolete this patch
(which would be a very good thing indeed!). If this really pans out,
this portion of the patch will vanish during the forward-porting process.

Signed-off-by: Steven Rostedt <[email protected]> (for RCU_SOFTIRQ)
Signed-off-by: Paul E. McKenney <[email protected]>
---

include/linux/rcuclassic.h | 79 +++++--------------------------------
include/linux/rcupdate.h | 30 ++++++++++++--
include/linux/rcupreempt.h | 27 ++++++------
kernel/Makefile | 2
kernel/rcuclassic.c | 95 ++++++++++++++++++++++++++++++++++++---------
kernel/rcupdate.c | 22 ++++++++--
kernel/rcupreempt.c | 50 +++++------------------
7 files changed, 158 insertions(+), 147 deletions(-)

diff -urpNa -X dontdiff linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h linux-2.6.22-d-schedclassic/include/linux/rcuclassic.h
--- linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h 2007-08-22 15:21:06.000000000 -0700
+++ linux-2.6.22-d-schedclassic/include/linux/rcuclassic.h 2007-08-22 17:49:35.000000000 -0700
@@ -42,80 +42,19 @@
#include <linux/cpumask.h>
#include <linux/seqlock.h>

-
-/* Global control variables for rcupdate callback mechanism. */
-struct rcu_ctrlblk {
- long cur; /* Current batch number. */
- long completed; /* Number of the last completed batch */
- int next_pending; /* Is the next batch already waiting? */
-
- int signaled;
-
- spinlock_t lock ____cacheline_internodealigned_in_smp;
- cpumask_t cpumask; /* CPUs that need to switch in order */
- /* for current batch to proceed. */
-} ____cacheline_internodealigned_in_smp;
-
-/* Is batch a before batch b ? */
-static inline int rcu_batch_before(long a, long b)
-{
- return (a - b) < 0;
-}
-
-/* Is batch a after batch b ? */
-static inline int rcu_batch_after(long a, long b)
-{
- return (a - b) > 0;
-}
+DECLARE_PER_CPU(int, rcu_data_bh_passed_quiesc);

/*
- * Per-CPU data for Read-Copy UPdate.
- * nxtlist - new callbacks are added here
- * curlist - current batch for which quiescent cycle started if any
- */
-struct rcu_data {
- /* 1) quiescent state handling : */
- long quiescbatch; /* Batch # for grace period */
- int passed_quiesc; /* User-mode/idle loop etc. */
- int qs_pending; /* core waits for quiesc state */
-
- /* 2) batch handling */
- long batch; /* Batch # for current RCU batch */
- struct rcu_head *nxtlist;
- struct rcu_head **nxttail;
- long qlen; /* # of queued callbacks */
- struct rcu_head *curlist;
- struct rcu_head **curtail;
- struct rcu_head *donelist;
- struct rcu_head **donetail;
- long blimit; /* Upper limit on a processed batch */
- int cpu;
- struct rcu_head barrier;
-};
-
-DECLARE_PER_CPU(struct rcu_data, rcu_data);
-DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
-
-/*
- * Increment the quiescent state counter.
+ * Increment the bottom-half quiescent state counter.
* The counter is a bit degenerated: We do not need to know
* how many quiescent states passed, just if there was at least
* one since the start of the grace period. Thus just a flag.
*/
-static inline void rcu_qsctr_inc(int cpu)
-{
- struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
- rdp->passed_quiesc = 1;
-}
static inline void rcu_bh_qsctr_inc(int cpu)
{
- struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
- rdp->passed_quiesc = 1;
+ per_cpu(rcu_data_bh_passed_quiesc, cpu) = 1;
}

-extern int rcu_pending(int cpu);
-extern int rcu_needs_cpu(int cpu);
-
#define __rcu_read_lock() \
do { \
preempt_disable(); \
@@ -139,9 +78,15 @@ extern int rcu_needs_cpu(int cpu);

#define __synchronize_sched() synchronize_rcu()

-extern void __rcu_init(void);
-extern void rcu_check_callbacks(int cpu, int user);
-extern void rcu_restart_cpu(int cpu);
+#define rcu_advance_callbacks_rt(cpu, user) do { } while (0)
+#define rcu_check_callbacks_rt(cpu, user) do { } while (0)
+#define rcu_init_rt() do { } while (0)
+#define rcu_needs_cpu_rt(cpu) 0
+#define rcu_pending_rt(cpu) 0
+#define rcu_process_callbacks_rt(unused) do { } while (0)
+
+extern void FASTCALL(call_rcu_classic(struct rcu_head *head,
+ void (*func)(struct rcu_head *head)));

#endif /* __KERNEL__ */
#endif /* __LINUX_RCUCLASSIC_H */
diff -urpNa -X dontdiff linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h linux-2.6.22-d-schedclassic/include/linux/rcupdate.h
--- linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h 2007-08-22 15:21:06.000000000 -0700
+++ linux-2.6.22-d-schedclassic/include/linux/rcupdate.h 2007-08-22 15:38:22.000000000 -0700
@@ -197,8 +197,11 @@ struct rcu_head {
* delimited by rcu_read_lock() and rcu_read_unlock(),
* and may be nested.
*/
-extern void FASTCALL(call_rcu(struct rcu_head *head,
- void (*func)(struct rcu_head *head)));
+#ifdef CONFIG_CLASSIC_RCU
+#define call_rcu(head, func) call_rcu_classic(head, func)
+#else /* #ifdef CONFIG_CLASSIC_RCU */
+#define call_rcu(head, func) call_rcu_preempt(head, func)
+#endif /* #else #ifdef CONFIG_CLASSIC_RCU */

/**
* call_rcu_bh - Queue an RCU for invocation after a quicker grace period.
@@ -226,9 +229,28 @@ extern long rcu_batches_completed(void);
extern long rcu_batches_completed_bh(void);

/* Internal to kernel */
-extern void rcu_init(void);
extern void rcu_check_callbacks(int cpu, int user);
-extern int rcu_needs_cpu(int cpu);
+extern long rcu_batches_completed(void);
+extern long rcu_batches_completed_bh(void);
+extern void rcu_check_callbacks(int cpu, int user);
+extern void rcu_init(void);
+extern int rcu_needs_cpu(int cpu);
+extern int rcu_pending(int cpu);
+struct softirq_action;
+extern void rcu_restart_cpu(int cpu);
+
+DECLARE_PER_CPU(int, rcu_data_passed_quiesc);
+
+/*
+ * Increment the quiescent state counter.
+ * The counter is a bit degenerated: We do not need to know
+ * how many quiescent states passed, just if there was at least
+ * one since the start of the grace period. Thus just a flag.
+ */
+static inline void rcu_qsctr_inc(int cpu)
+{
+ per_cpu(rcu_data_passed_quiesc, cpu) = 1;
+}

#endif /* __KERNEL__ */
#endif /* __LINUX_RCUPDATE_H */
diff -urpNa -X dontdiff linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h linux-2.6.22-d-schedclassic/include/linux/rcupreempt.h
--- linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h 2007-08-22 15:21:06.000000000 -0700
+++ linux-2.6.22-d-schedclassic/include/linux/rcupreempt.h 2007-08-22 17:53:25.000000000 -0700
@@ -42,25 +42,26 @@
#include <linux/cpumask.h>
#include <linux/seqlock.h>

-#define rcu_qsctr_inc(cpu)
-#define rcu_bh_qsctr_inc(cpu)
#define call_rcu_bh(head, rcu) call_rcu(head, rcu)
-
-extern void __rcu_read_lock(void);
-extern void __rcu_read_unlock(void);
-extern int rcu_pending(int cpu);
-extern int rcu_needs_cpu(int cpu);
-
+#define rcu_bh_qsctr_inc(cpu) do { } while (0)
#define __rcu_read_lock_bh() { rcu_read_lock(); local_bh_disable(); }
#define __rcu_read_unlock_bh() { local_bh_enable(); rcu_read_unlock(); }
-
#define __rcu_read_lock_nesting() (current->rcu_read_lock_nesting)

+extern void FASTCALL(call_rcu_classic(struct rcu_head *head,
+ void (*func)(struct rcu_head *head)));
+extern void FASTCALL(call_rcu_preempt(struct rcu_head *head,
+ void (*func)(struct rcu_head *head)));
+extern void __rcu_read_lock(void);
+extern void __rcu_read_unlock(void);
extern void __synchronize_sched(void);
-
-extern void __rcu_init(void);
-extern void rcu_check_callbacks(int cpu, int user);
-extern void rcu_restart_cpu(int cpu);
+extern void rcu_advance_callbacks_rt(int cpu, int user);
+extern void rcu_check_callbacks_rt(int cpu, int user);
+extern void rcu_init_rt(void);
+extern int rcu_needs_cpu_rt(int cpu);
+extern int rcu_pending_rt(int cpu);
+struct softirq_action;
+extern void rcu_process_callbacks_rt(struct softirq_action *unused);

#ifdef CONFIG_RCU_TRACE
struct rcupreempt_trace;
diff -urpNa -X dontdiff linux-2.6.22-c-preemptrcu/kernel/Makefile linux-2.6.22-d-schedclassic/kernel/Makefile
--- linux-2.6.22-c-preemptrcu/kernel/Makefile 2007-08-22 15:21:06.000000000 -0700
+++ linux-2.6.22-d-schedclassic/kernel/Makefile 2007-08-22 15:38:22.000000000 -0700
@@ -47,7 +47,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_CLASSIC_RCU) += rcuclassic.o
-obj-$(CONFIG_PREEMPT_RCU) += rcupreempt.o
+obj-$(CONFIG_PREEMPT_RCU) += rcuclassic.o rcupreempt.o
ifeq ($(CONFIG_PREEMPT_RCU),y)
obj-$(CONFIG_RCU_TRACE) += rcupreempt_trace.o
endif
diff -urpNa -X dontdiff linux-2.6.22-c-preemptrcu/kernel/rcuclassic.c linux-2.6.22-d-schedclassic/kernel/rcuclassic.c
--- linux-2.6.22-c-preemptrcu/kernel/rcuclassic.c 2007-08-22 15:18:40.000000000 -0700
+++ linux-2.6.22-d-schedclassic/kernel/rcuclassic.c 2007-08-22 18:00:17.000000000 -0700
@@ -45,10 +45,53 @@
#include <linux/moduleparam.h>
#include <linux/percpu.h>
#include <linux/notifier.h>
-/* #include <linux/rcupdate.h> @@@ */
#include <linux/cpu.h>
#include <linux/mutex.h>

+
+/* Global control variables for rcupdate callback mechanism. */
+struct rcu_ctrlblk {
+ long cur; /* Current batch number. */
+ long completed; /* Number of the last completed batch */
+ int next_pending; /* Is the next batch already waiting? */
+
+ int signaled;
+
+ spinlock_t lock ____cacheline_internodealigned_in_smp;
+ cpumask_t cpumask; /* CPUs that need to switch in order */
+ /* for current batch to proceed. */
+} ____cacheline_internodealigned_in_smp;
+
+/* Is batch a before batch b ? */
+static inline int rcu_batch_before(long a, long b)
+{
+ return (a - b) < 0;
+}
+
+/*
+ * Per-CPU data for Read-Copy Update.
+ * nxtlist - new callbacks are added here
+ * curlist - current batch for which quiescent cycle started if any
+ */
+struct rcu_data {
+ /* 1) quiescent state handling : */
+ long quiescbatch; /* Batch # for grace period */
+ int *passed_quiesc; /* User-mode/idle loop etc. */
+ int qs_pending; /* core waits for quiesc state */
+
+ /* 2) batch handling */
+ long batch; /* Batch # for current RCU batch */
+ struct rcu_head *nxtlist;
+ struct rcu_head **nxttail;
+ long qlen; /* # of queued callbacks */
+ struct rcu_head *curlist;
+ struct rcu_head **curtail;
+ struct rcu_head *donelist;
+ struct rcu_head **donetail;
+ long blimit; /* Upper limit on a processed batch */
+ int cpu;
+};
+
/* Definition for rcupdate control block. */
static struct rcu_ctrlblk rcu_ctrlblk = {
.cur = -300,
@@ -63,11 +106,11 @@ static struct rcu_ctrlblk rcu_bh_ctrlblk
.cpumask = CPU_MASK_NONE,
};

-DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
-DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
+static DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
+static DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
+DEFINE_PER_CPU(int, rcu_data_bh_passed_quiesc);

/* Fake initialization required by compiler */
-static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
static int blimit = 10;
static int qhimark = 10000;
static int qlowmark = 100;
@@ -110,8 +153,8 @@ static inline void force_quiescent_state
* sections are delimited by rcu_read_lock() and rcu_read_unlock(),
* and may be nested.
*/
-void fastcall call_rcu(struct rcu_head *head,
- void (*func)(struct rcu_head *rcu))
+void fastcall call_rcu_classic(struct rcu_head *head,
+ void (*func)(struct rcu_head *rcu))
{
unsigned long flags;
struct rcu_data *rdp;
@@ -128,7 +171,9 @@ void fastcall call_rcu(struct rcu_head *
}
local_irq_restore(flags);
}
-EXPORT_SYMBOL_GPL(call_rcu);
+EXPORT_SYMBOL_GPL(call_rcu_classic);
+
+#ifdef CONFIG_CLASSIC_RCU

/**
* call_rcu_bh - Queue an RCU for invocation after a quicker grace period.
@@ -166,7 +211,9 @@ void fastcall call_rcu_bh(struct rcu_hea

local_irq_restore(flags);
}
+#ifdef CONFIG_CLASSIC_RCU
EXPORT_SYMBOL_GPL(call_rcu_bh);
+#endif /* #ifdef CONFIG_CLASSIC_RCU */

/*
* Return the number of RCU batches processed thus far. Useful
@@ -176,7 +223,9 @@ long rcu_batches_completed(void)
{
return rcu_ctrlblk.completed;
}
+#ifdef CONFIG_CLASSIC_RCU
EXPORT_SYMBOL_GPL(rcu_batches_completed);
+#endif /* #ifdef CONFIG_CLASSIC_RCU */

/*
* Return the number of RCU batches processed thus far. Useful
@@ -186,7 +235,11 @@ long rcu_batches_completed_bh(void)
{
return rcu_bh_ctrlblk.completed;
}
+#ifdef CONFIG_CLASSIC_RCU
EXPORT_SYMBOL_GPL(rcu_batches_completed_bh);
+#endif /* #ifdef CONFIG_CLASSIC_RCU */
+
+#endif /* #ifdef CONFIG_CLASSIC_RCU */

/*
* Invoke the completed RCU callbacks. They are expected to be in
@@ -217,7 +270,7 @@ static void rcu_do_batch(struct rcu_data
if (!rdp->donelist)
rdp->donetail = &rdp->donelist;
else
- tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu));
+ raise_softirq(RCU_SOFTIRQ);
}

/*
@@ -294,7 +347,7 @@ static void rcu_check_quiescent_state(st
if (rdp->quiescbatch != rcp->cur) {
/* start new grace period: */
rdp->qs_pending = 1;
- rdp->passed_quiesc = 0;
+ *rdp->passed_quiesc = 0;
rdp->quiescbatch = rcp->cur;
return;
}
@@ -310,7 +363,7 @@ static void rcu_check_quiescent_state(st
* Was there a quiescent state since the beginning of the grace
* period? If no, then exit and wait for the next call.
*/
- if (!rdp->passed_quiesc)
+ if (!*rdp->passed_quiesc)
return;
rdp->qs_pending = 0;

@@ -369,7 +422,6 @@ static void rcu_offline_cpu(int cpu)
&per_cpu(rcu_bh_data, cpu));
put_cpu_var(rcu_data);
put_cpu_var(rcu_bh_data);
- tasklet_kill_immediate(&per_cpu(rcu_tasklet, cpu), cpu);
}

#else
@@ -381,7 +433,7 @@ static void rcu_offline_cpu(int cpu)
#endif

/*
- * This does the RCU processing work from tasklet context.
+ * This does the RCU processing work from softirq context.
*/
static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
struct rcu_data *rdp)
@@ -426,10 +478,11 @@ static void __rcu_process_callbacks(stru
rcu_do_batch(rdp);
}

-static void rcu_process_callbacks(unsigned long unused)
+static void rcu_process_callbacks(struct softirq_action *unused)
{
__rcu_process_callbacks(&rcu_ctrlblk, &__get_cpu_var(rcu_data));
__rcu_process_callbacks(&rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data));
+ rcu_process_callbacks_rt(unused);
}

static int __rcu_pending(struct rcu_ctrlblk *rcp, struct rcu_data *rdp)
@@ -464,7 +517,8 @@ static int __rcu_pending(struct rcu_ctrl
int rcu_pending(int cpu)
{
return __rcu_pending(&rcu_ctrlblk, &per_cpu(rcu_data, cpu)) ||
- __rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu));
+ __rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu)) ||
+ rcu_pending_rt(cpu);
}

/*
@@ -478,7 +532,8 @@ int rcu_needs_cpu(int cpu)
struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
struct rcu_data *rdp_bh = &per_cpu(rcu_bh_data, cpu);

- return (!!rdp->curlist || !!rdp_bh->curlist || rcu_pending(cpu));
+ return (!!rdp->curlist || !!rdp_bh->curlist || rcu_pending(cpu) ||
+ rcu_needs_cpu_rt(cpu));
}

void rcu_check_callbacks(int cpu, int user)
@@ -490,7 +545,8 @@ void rcu_check_callbacks(int cpu, int us
rcu_bh_qsctr_inc(cpu);
} else if (!in_softirq())
rcu_bh_qsctr_inc(cpu);
- tasklet_schedule(&per_cpu(rcu_tasklet, cpu));
+ rcu_check_callbacks_rt(cpu, user);
+ raise_softirq(RCU_SOFTIRQ);
}

static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
@@ -512,8 +568,9 @@ static void __devinit rcu_online_cpu(int
struct rcu_data *bh_rdp = &per_cpu(rcu_bh_data, cpu);

rcu_init_percpu_data(cpu, &rcu_ctrlblk, rdp);
+ rdp->passed_quiesc = &per_cpu(rcu_data_passed_quiesc, cpu);
rcu_init_percpu_data(cpu, &rcu_bh_ctrlblk, bh_rdp);
- tasklet_init(&per_cpu(rcu_tasklet, cpu), rcu_process_callbacks, 0UL);
+ bh_rdp->passed_quiesc = &per_cpu(rcu_data_bh_passed_quiesc, cpu);
}

static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
@@ -545,12 +602,14 @@ static struct notifier_block __cpuinitda
* Note that rcu_qsctr and friends are implicitly
* initialized due to the choice of ``0'' for RCU_CTR_INVALID.
*/
-void __init __rcu_init(void)
+void __init rcu_init(void)
{
rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE,
(void *)(long)smp_processor_id());
/* Register notifier for non-boot CPUs */
register_cpu_notifier(&rcu_nb);
+ rcu_init_rt();
+ open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL);
}

module_param(blimit, int, 0);
diff -urpNa -X dontdiff linux-2.6.22-c-preemptrcu/kernel/rcupdate.c linux-2.6.22-d-schedclassic/kernel/rcupdate.c
--- linux-2.6.22-c-preemptrcu/kernel/rcupdate.c 2007-08-22 15:18:40.000000000 -0700
+++ linux-2.6.22-d-schedclassic/kernel/rcupdate.c 2007-08-22 15:46:36.000000000 -0700
@@ -52,6 +52,7 @@ struct rcu_synchronize {
struct completion completion;
};

+DEFINE_PER_CPU(int, rcu_data_passed_quiesc);
static DEFINE_PER_CPU(struct rcu_head, rcu_barrier_head) = {NULL};
static atomic_t rcu_barrier_cpu_count;
static DEFINE_MUTEX(rcu_barrier_mutex);
@@ -88,6 +89,22 @@ void synchronize_rcu(void)
}
EXPORT_SYMBOL_GPL(synchronize_rcu);

+#ifdef CONFIG_PREEMPT_RCU
+
+/*
+ * Map synchronize_sched() to the classic RCU implementation.
+ */
+void __synchronize_sched(void)
+{
+ struct rcu_synchronize rcu;
+
+ init_completion(&rcu.completion);
+ call_rcu_classic(&rcu.head, wakeme_after_rcu);
+ wait_for_completion(&rcu.completion);
+}
+EXPORT_SYMBOL_GPL(__synchronize_sched);
+#endif /* #ifdef CONFIG_PREEMPT_RCU */
+
static void rcu_barrier_callback(struct rcu_head *notused)
{
if (atomic_dec_and_test(&rcu_barrier_cpu_count))
@@ -131,8 +148,3 @@ void rcu_barrier(void)
wait_for_completion(&rcu_barrier_completion);
mutex_unlock(&rcu_barrier_mutex);
}
-
-void __init rcu_init(void)
-{
- __rcu_init();
-}
diff -urpNa -X dontdiff linux-2.6.22-c-preemptrcu/kernel/rcupreempt.c linux-2.6.22-d-schedclassic/kernel/rcupreempt.c
--- linux-2.6.22-c-preemptrcu/kernel/rcupreempt.c 2007-08-22 15:35:19.000000000 -0700
+++ linux-2.6.22-d-schedclassic/kernel/rcupreempt.c 2007-08-22 15:45:28.000000000 -0700
@@ -61,7 +61,6 @@ struct rcu_data {
spinlock_t lock; /* Protect rcu_data fields. */
long completed; /* Number of last completed batch. */
int waitlistcount;
- struct tasklet_struct rcu_tasklet;
struct rcu_head *nextlist;
struct rcu_head **nexttail;
struct rcu_head *waitlist[GP_STAGES];
@@ -550,7 +549,7 @@ static void rcu_check_mb(int cpu)
}
}

-void rcu_check_callbacks(int cpu, int user)
+void rcu_check_callbacks_rt(int cpu, int user)
{
unsigned long oldirq;
struct rcu_data *rdp = RCU_DATA_CPU(cpu);
@@ -561,19 +560,14 @@ void rcu_check_callbacks(int cpu, int us
spin_lock_irqsave(&rdp->lock, oldirq);
RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp);
__rcu_advance_callbacks(rdp);
- if (rdp->donelist == NULL) {
- spin_unlock_irqrestore(&rdp->lock, oldirq);
- } else {
- spin_unlock_irqrestore(&rdp->lock, oldirq);
- raise_softirq(RCU_SOFTIRQ);
- }
+ spin_unlock_irqrestore(&rdp->lock, oldirq);
}

/*
* Needed by dynticks, to make sure all RCU processing has finished
- * when we go idle:
+ * when we go idle. (Currently unused, needed?)
*/
-void rcu_advance_callbacks(int cpu, int user)
+void rcu_advance_callbacks_rt(int cpu, int user)
{
unsigned long oldirq;
struct rcu_data *rdp = RCU_DATA_CPU(cpu);
@@ -589,7 +583,7 @@ void rcu_advance_callbacks(int cpu, int
spin_unlock_irqrestore(&rdp->lock, oldirq);
}

-static void rcu_process_callbacks(struct softirq_action *unused)
+void rcu_process_callbacks_rt(struct softirq_action *unused)
{
unsigned long flags;
struct rcu_head *next, *list;
@@ -613,8 +607,8 @@ static void rcu_process_callbacks(struct
}
}

-void fastcall call_rcu(struct rcu_head *head,
- void (*func)(struct rcu_head *rcu))
+void fastcall call_rcu_preempt(struct rcu_head *head,
+ void (*func)(struct rcu_head *rcu))
{
unsigned long oldirq;
struct rcu_data *rdp;
@@ -631,28 +625,7 @@ void fastcall call_rcu(struct rcu_head *
spin_unlock(&rdp->lock);
local_irq_restore(oldirq);
}
-EXPORT_SYMBOL_GPL(call_rcu);
-
-/*
- * Wait until all currently running preempt_disable() code segments
- * (including hardware-irq-disable segments) complete. Note that
- * in -rt this does -not- necessarily result in all currently executing
- * interrupt -handlers- having completed.
- */
-void __synchronize_sched(void)
-{
- cpumask_t oldmask;
- int cpu;
-
- if (sched_getaffinity(0, &oldmask) < 0)
- oldmask = cpu_possible_map;
- for_each_online_cpu(cpu) {
- sched_setaffinity(0, cpumask_of_cpu(cpu));
- schedule();
- }
- sched_setaffinity(0, oldmask);
-}
-EXPORT_SYMBOL_GPL(__synchronize_sched);
+EXPORT_SYMBOL_GPL(call_rcu_preempt);

/*
* Check to see if any future RCU-related work will need to be done
@@ -663,7 +636,7 @@ EXPORT_SYMBOL_GPL(__synchronize_sched);
* This function is part of the RCU implementation; it is -not-
* an exported member of the RCU API.
*/
-int rcu_needs_cpu(int cpu)
+int rcu_needs_cpu_rt(int cpu)
{
struct rcu_data *rdp = RCU_DATA_CPU(cpu);

@@ -672,7 +645,7 @@ int rcu_needs_cpu(int cpu)
rdp->nextlist != NULL);
}

-int rcu_pending(int cpu)
+int rcu_pending_rt(int cpu)
{
struct rcu_data *rdp = RCU_DATA_CPU(cpu);

@@ -699,7 +672,7 @@ int rcu_pending(int cpu)
return 0;
}

-void __init __rcu_init(void)
+void __init rcu_init_rt(void)
{
int cpu;
int i;
@@ -720,7 +693,6 @@ void __init __rcu_init(void)
rdp->donelist = NULL;
rdp->donetail = &rdp->donelist;
}
- open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL);
}

/*

2007-09-10 18:36:36

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH RFC 5/9] RCU: CPU hotplug support for preemptible RCU

Work in progress, not for inclusion.

This patch allows preemptible RCU to tolerate CPU-hotplug operations.
It accomplishes this by maintaining a local copy of a map of online
CPUs, which it accesses under its own lock.
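
A minimal userspace sketch of this idea follows; the names are made up for
illustration and this is not the kernel API, but it shows the pattern of
updating a private online-CPU map only under the RCU code's own lock and
scanning only that map from the grace-period machinery.

/*
 * Rough userspace sketch of the scheme described above -- made-up names,
 * not kernel code.  The RCU code keeps its own copy of the online-CPU
 * map and updates it only under its own lock, so the grace-period state
 * machine never consults a CPU that has been removed from the map.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t rcu_cpu_online_map;	/* local copy of the online CPUs */
static pthread_mutex_t fliplock = PTHREAD_MUTEX_INITIALIZER;

static void rcu_online_cpu_sim(int cpu)
{
	pthread_mutex_lock(&fliplock);
	rcu_cpu_online_map |= 1ULL << cpu;	/* CPU now participates in GPs */
	pthread_mutex_unlock(&fliplock);
}

static void rcu_offline_cpu_sim(int cpu)
{
	pthread_mutex_lock(&fliplock);
	rcu_cpu_online_map &= ~(1ULL << cpu);	/* stop waiting on this CPU */
	pthread_mutex_unlock(&fliplock);
}

/* Grace-period code scans only the CPUs present in the local map. */
static void for_each_rcu_online_cpu(void (*fn)(int cpu))
{
	uint64_t snap;
	int cpu;

	pthread_mutex_lock(&fliplock);
	snap = rcu_cpu_online_map;
	pthread_mutex_unlock(&fliplock);
	for (cpu = 0; cpu < 64; cpu++)
		if (snap & (1ULL << cpu))
			fn(cpu);
}

static void ack_flip(int cpu)
{
	printf("asking CPU %d to acknowledge the flip\n", cpu);
}

int main(void)
{
	rcu_online_cpu_sim(0);
	rcu_online_cpu_sim(1);
	rcu_offline_cpu_sim(1);
	for_each_rcu_online_cpu(ack_flip);	/* visits CPU 0 only */
	return 0;
}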

Signed-off-by: Paul E. McKenney <[email protected]>
---

include/linux/rcuclassic.h | 2
include/linux/rcupreempt.h | 2
kernel/rcuclassic.c | 8 +++
kernel/rcupreempt.c | 93 +++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 100 insertions(+), 5 deletions(-)

diff -urpNa -X dontdiff linux-2.6.22-d-schedclassic/include/linux/rcuclassic.h linux-2.6.22-e-hotplugcpu/include/linux/rcuclassic.h
--- linux-2.6.22-d-schedclassic/include/linux/rcuclassic.h 2007-08-22 15:38:22.000000000 -0700
+++ linux-2.6.22-e-hotplugcpu/include/linux/rcuclassic.h 2007-08-22 15:55:39.000000000 -0700
@@ -143,6 +143,8 @@ extern int rcu_needs_cpu(int cpu);
#define rcu_check_callbacks_rt(cpu, user)
#define rcu_init_rt()
#define rcu_needs_cpu_rt(cpu) 0
+#define rcu_offline_cpu_rt(cpu)
+#define rcu_online_cpu_rt(cpu)
#define rcu_pending_rt(cpu) 0
#define rcu_process_callbacks_rt(unused)

diff -urpNa -X dontdiff linux-2.6.22-d-schedclassic/include/linux/rcupreempt.h linux-2.6.22-e-hotplugcpu/include/linux/rcupreempt.h
--- linux-2.6.22-d-schedclassic/include/linux/rcupreempt.h 2007-08-22 15:38:22.000000000 -0700
+++ linux-2.6.22-e-hotplugcpu/include/linux/rcupreempt.h 2007-08-22 15:55:39.000000000 -0700
@@ -59,6 +59,8 @@ extern void rcu_advance_callbacks_rt(int
extern void rcu_check_callbacks_rt(int cpu, int user);
extern void rcu_init_rt(void);
extern int rcu_needs_cpu_rt(int cpu);
+extern void rcu_offline_cpu_rt(int cpu);
+extern void rcu_online_cpu_rt(int cpu);
extern int rcu_pending_rt(int cpu);
struct softirq_action;
extern void rcu_process_callbacks_rt(struct softirq_action *unused);
diff -urpNa -X dontdiff linux-2.6.22-d-schedclassic/kernel/rcuclassic.c linux-2.6.22-e-hotplugcpu/kernel/rcuclassic.c
--- linux-2.6.22-d-schedclassic/kernel/rcuclassic.c 2007-08-22 15:51:22.000000000 -0700
+++ linux-2.6.22-e-hotplugcpu/kernel/rcuclassic.c 2007-08-22 15:55:39.000000000 -0700
@@ -414,14 +414,19 @@ static void __rcu_offline_cpu(struct rcu
static void rcu_offline_cpu(int cpu)
{
struct rcu_data *this_rdp = &get_cpu_var(rcu_data);
+#ifdef CONFIG_CLASSIC_RCU
struct rcu_data *this_bh_rdp = &get_cpu_var(rcu_bh_data);
+#endif /* #ifdef CONFIG_CLASSIC_RCU */

__rcu_offline_cpu(this_rdp, &rcu_ctrlblk,
&per_cpu(rcu_data, cpu));
+#ifdef CONFIG_CLASSIC_RCU
__rcu_offline_cpu(this_bh_rdp, &rcu_bh_ctrlblk,
&per_cpu(rcu_bh_data, cpu));
- put_cpu_var(rcu_data);
put_cpu_var(rcu_bh_data);
+#endif /* #ifdef CONFIG_CLASSIC_RCU */
+ put_cpu_var(rcu_data);
+ rcu_offline_cpu_rt(cpu);
}

#else
@@ -571,6 +576,7 @@ static void __devinit rcu_online_cpu(int
rdp->passed_quiesc = &per_cpu(rcu_data_passed_quiesc, cpu);
rcu_init_percpu_data(cpu, &rcu_bh_ctrlblk, bh_rdp);
bh_rdp->passed_quiesc = &per_cpu(rcu_data_bh_passed_quiesc, cpu);
+ rcu_online_cpu_rt(cpu);
}

static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
diff -urpNa -X dontdiff linux-2.6.22-d-schedclassic/kernel/rcupreempt.c linux-2.6.22-e-hotplugcpu/kernel/rcupreempt.c
--- linux-2.6.22-d-schedclassic/kernel/rcupreempt.c 2007-08-22 15:45:28.000000000 -0700
+++ linux-2.6.22-e-hotplugcpu/kernel/rcupreempt.c 2007-08-22 15:56:22.000000000 -0700
@@ -125,6 +125,8 @@ enum rcu_mb_flag_values {
};
static DEFINE_PER_CPU(enum rcu_mb_flag_values, rcu_mb_flag) = rcu_mb_done;

+static cpumask_t rcu_cpu_online_map = CPU_MASK_NONE;
+
/*
* Macro that prevents the compiler from reordering accesses, but does
* absolutely -nothing- to prevent CPUs from reordering. This is used
@@ -404,7 +406,7 @@ rcu_try_flip_idle(void)

/* Now ask each CPU for acknowledgement of the flip. */

- for_each_possible_cpu(cpu)
+ for_each_cpu_mask(cpu, rcu_cpu_online_map)
per_cpu(rcu_flip_flag, cpu) = rcu_flipped;

return 1;
@@ -420,7 +422,7 @@ rcu_try_flip_waitack(void)
int cpu;

RCU_TRACE_ME(rcupreempt_trace_try_flip_a1);
- for_each_possible_cpu(cpu)
+ for_each_cpu_mask(cpu, rcu_cpu_online_map)
if (per_cpu(rcu_flip_flag, cpu) != rcu_flip_seen) {
RCU_TRACE_ME(rcupreempt_trace_try_flip_ae1);
return 0;
@@ -462,7 +464,7 @@ rcu_try_flip_waitzero(void)

/* Call for a memory barrier from each CPU. */

- for_each_possible_cpu(cpu)
+ for_each_cpu_mask(cpu, rcu_cpu_online_map)
per_cpu(rcu_mb_flag, cpu) = rcu_mb_needed;

RCU_TRACE_ME(rcupreempt_trace_try_flip_z2);
@@ -480,7 +482,7 @@ rcu_try_flip_waitmb(void)
int cpu;

RCU_TRACE_ME(rcupreempt_trace_try_flip_m1);
- for_each_possible_cpu(cpu)
+ for_each_cpu_mask(cpu, rcu_cpu_online_map)
if (per_cpu(rcu_mb_flag, cpu) != rcu_mb_done) {
RCU_TRACE_ME(rcupreempt_trace_try_flip_me1);
return 0;
@@ -583,6 +585,89 @@ void rcu_advance_callbacks_rt(int cpu, i
spin_unlock_irqrestore(&rdp->lock, oldirq);
}

+#ifdef CONFIG_HOTPLUG_CPU
+
+#define rcu_offline_cpu_rt_enqueue(srclist, srctail, dstlist, dsttail) do { \
+ *dsttail = srclist; \
+ if (srclist != NULL) { \
+ dsttail = srctail; \
+ srclist = NULL; \
+ srctail = &srclist;\
+ } \
+ } while (0)
+
+
+void rcu_offline_cpu_rt(int cpu)
+{
+ int i;
+ struct rcu_head *list = NULL;
+ unsigned long oldirq;
+ struct rcu_data *rdp = RCU_DATA_CPU(cpu);
+ struct rcu_head **tail = &list;
+
+ /* Remove all callbacks from the newly dead CPU, retaining order. */
+
+ spin_lock_irqsave(&rdp->lock, oldirq);
+ rcu_offline_cpu_rt_enqueue(rdp->donelist, rdp->donetail, list, tail);
+ for (i = GP_STAGES - 1; i >= 0; i--)
+ rcu_offline_cpu_rt_enqueue(rdp->waitlist[i], rdp->waittail[i],
+ list, tail);
+ rcu_offline_cpu_rt_enqueue(rdp->nextlist, rdp->nexttail, list, tail);
+ spin_unlock_irqrestore(&rdp->lock, oldirq);
+ rdp->waitlistcount = 0;
+
+ /* Disengage the newly dead CPU from grace-period computation. */
+
+ spin_lock_irqsave(&rcu_ctrlblk.fliplock, oldirq);
+ rcu_check_mb(cpu);
+ if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped) {
+ smp_mb(); /* Subsequent counter accesses must see new value */
+ per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;
+ smp_mb(); /* Subsequent RCU read-side critical sections */
+ /* seen -after- acknowledgement. */
+ }
+ cpu_clear(cpu, rcu_cpu_online_map);
+ spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
+
+ /*
+ * Place the removed callbacks on the current CPU's queue.
+ * Make them all start a new grace period: simple approach,
+ * in theory could starve a given set of callbacks, but
+ * you would need to be doing some serious CPU hotplugging
+ * to make this happen. If this becomes a problem, adding
+ * a synchronize_rcu() to the hotplug path would be a simple
+ * fix.
+ */
+
+ rdp = RCU_DATA_ME();
+ spin_lock_irqsave(&rdp->lock, oldirq);
+ *rdp->nexttail = list;
+ if (list)
+ rdp->nexttail = tail;
+ spin_unlock_irqrestore(&rdp->lock, oldirq);
+}
+
+void __devinit rcu_online_cpu_rt(int cpu)
+{
+ unsigned long oldirq;
+
+ spin_lock_irqsave(&rcu_ctrlblk.fliplock, oldirq);
+ cpu_set(cpu, rcu_cpu_online_map);
+ spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
+}
+
+#else /* #ifdef CONFIG_HOTPLUG_CPU */
+
+void rcu_offline_cpu_rt(int cpu)
+{
+}
+
+void __devinit rcu_online_cpu_rt(int cpu)
+{
+}
+
+#endif /* #else #ifdef CONFIG_HOTPLUG_CPU */
+
void rcu_process_callbacks_rt(struct softirq_action *unused)
{
unsigned long flags;

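The rcu_offline_cpu_rt_enqueue() macro above depends on each callback list
being represented as a head pointer plus a pointer to the final ->next field,
which lets an entire list be moved in O(1) while preserving callback order.
A standalone userspace sketch of that representation (hypothetical names, not
kernel code):

#include <stddef.h>
#include <stdio.h>

struct cb {
	struct cb *next;
	int id;
};

/* A callback list is a head pointer plus a pointer to the last ->next slot. */
struct cb_list {
	struct cb *head;
	struct cb **tail;
};

static void cb_list_init(struct cb_list *l)
{
	l->head = NULL;
	l->tail = &l->head;
}

static void cb_list_add(struct cb_list *l, struct cb *p)
{
	p->next = NULL;
	*l->tail = p;		/* link at the end */
	l->tail = &p->next;	/* tail now points at the new last ->next */
}

/* O(1) splice of src onto dst, preserving order and leaving src empty. */
static void cb_list_splice(struct cb_list *src, struct cb_list *dst)
{
	*dst->tail = src->head;
	if (src->head != NULL) {
		dst->tail = src->tail;
		cb_list_init(src);
	}
}

int main(void)
{
	struct cb a = { .id = 1 }, b = { .id = 2 }, c = { .id = 3 };
	struct cb_list dead, me;
	struct cb *p;

	cb_list_init(&dead);
	cb_list_init(&me);
	cb_list_add(&dead, &a);
	cb_list_add(&dead, &b);
	cb_list_add(&me, &c);
	cb_list_splice(&dead, &me);	/* dead CPU's callbacks follow ours */
	for (p = me.head; p != NULL; p = p->next)
		printf("callback %d\n", p->id);	/* prints 3, 1, 2 */
	return 0;
}
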
2007-09-10 18:39:21

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH RFC 6/9] RCU priority boosting for preemptible RCU

Work in progress, not for inclusion.

RCU priority boosting is needed when running preemptible RCU with a workload
that might include CPU-bound user tasks running at realtime priorities.
In this situation, priority boosting is required to avoid OOM, because
preempted RCU readers could otherwise be starved indefinitely, stalling
grace periods and allowing callbacks to pile up.

Please note that because Classic RCU does not permit RCU read-side
critical sections to be preempted, there is no need to boost the priority
of Classic RCU readers. Boosting the priority of a running process
does not make it run any faster, at least not on any hardware that I am
aware of. ;-)
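
The core of the mechanism is visible in the kernel/rtmutex.c hunk near the
end of this patch: a task's effective priority becomes the most-favored of
its normal priority, any priority inherited from PI-mutex waiters, and the
priority granted by the RCU booster. A minimal userspace sketch of that rule
follows (struct task_sketch and the numbers are made up; MAX_PRIO stands in
for "no boost requested"):

#include <stdio.h>

#define MAX_PRIO 140	/* "no RCU boost requested" sentinel, as in the patch */

struct task_sketch {
	int normal_prio;	/* task's normal scheduling priority */
	int pi_prio;		/* best PI-mutex waiter priority, or MAX_PRIO */
	int rcu_prio;		/* priority granted by the RCU booster, or MAX_PRIO */
};

static int min_prio(int a, int b)
{
	return a < b ? a : b;	/* numerically lower means more favored */
}

/* Userspace analogue of rt_mutex_getprio() after this patch. */
static int effective_prio(const struct task_sketch *t)
{
	return min_prio(t->pi_prio, min_prio(t->normal_prio, t->rcu_prio));
}

int main(void)
{
	/* A non-realtime reader, no PI waiters, boosted into the RT range. */
	struct task_sketch reader = {
		.normal_prio	= 130,
		.pi_prio	= MAX_PRIO,
		.rcu_prio	= 51,
	};

	printf("effective priority: %d\n", effective_prio(&reader));	/* 51 */
	return 0;
}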

Signed-off-by: Paul E. McKenney <[email protected]>
---

include/linux/init_task.h | 13
include/linux/rcupdate.h | 17 +
include/linux/rcupreempt.h | 20 +
include/linux/sched.h | 24 +
init/main.c | 1
kernel/Kconfig.preempt | 14 -
kernel/fork.c | 6
kernel/rcupreempt.c | 608 ++++++++++++++++++++++++++++++++++++++++++---
kernel/rtmutex.c | 7
kernel/sched.c | 5
lib/Kconfig.debug | 34 ++
11 files changed, 703 insertions(+), 46 deletions(-)

diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/include/linux/init_task.h linux-2.6.22-F-boostrcu/include/linux/init_task.h
--- linux-2.6.22-E-hotplug/include/linux/init_task.h 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-F-boostrcu/include/linux/init_task.h 2007-08-31 14:09:02.000000000 -0700
@@ -87,6 +87,17 @@ extern struct nsproxy init_nsproxy;
.signalfd_list = LIST_HEAD_INIT(sighand.signalfd_list), \
}

+#ifdef CONFIG_PREEMPT_RCU_BOOST
+#define INIT_RCU_BOOST_PRIO .rcu_prio = MAX_PRIO,
+#define INIT_PREEMPT_RCU_BOOST(tsk) \
+ .rcub_rbdp = NULL, \
+ .rcub_state = RCU_BOOST_IDLE, \
+ .rcub_entry = LIST_HEAD_INIT(tsk.rcub_entry),
+#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
+#define INIT_RCU_BOOST_PRIO
+#define INIT_PREEMPT_RCU_BOOST(tsk)
+#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST */
+
extern struct group_info init_groups;

#define INIT_STRUCT_PID { \
@@ -125,6 +136,7 @@ extern struct group_info init_groups;
.prio = MAX_PRIO-20, \
.static_prio = MAX_PRIO-20, \
.normal_prio = MAX_PRIO-20, \
+ INIT_RCU_BOOST_PRIO \
.policy = SCHED_NORMAL, \
.cpus_allowed = CPU_MASK_ALL, \
.mm = NULL, \
@@ -169,6 +181,7 @@ extern struct group_info init_groups;
}, \
INIT_TRACE_IRQFLAGS \
INIT_LOCKDEP \
+ INIT_PREEMPT_RCU_BOOST(tsk) \
}


diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/include/linux/rcupdate.h linux-2.6.22-F-boostrcu/include/linux/rcupdate.h
--- linux-2.6.22-E-hotplug/include/linux/rcupdate.h 2007-08-24 11:03:22.000000000 -0700
+++ linux-2.6.22-F-boostrcu/include/linux/rcupdate.h 2007-08-24 17:04:14.000000000 -0700
@@ -252,5 +252,22 @@ static inline void rcu_qsctr_inc(int cpu
per_cpu(rcu_data_passed_quiesc, cpu) = 1;
}

+#ifdef CONFIG_PREEMPT_RCU_BOOST
+extern void init_rcu_boost_late(void);
+extern void __rcu_preempt_boost(void);
+#define rcu_preempt_boost() /* cpp to avoid #include hell. */ \
+ do { \
+ if (unlikely(current->rcu_read_lock_nesting > 0)) \
+ __rcu_preempt_boost(); \
+ } while (0)
+#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
+static inline void init_rcu_boost_late(void)
+{
+}
+static inline void rcu_preempt_boost(void)
+{
+}
+#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST */
+
#endif /* __KERNEL__ */
#endif /* __LINUX_RCUPDATE_H */
diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/include/linux/rcupreempt.h linux-2.6.22-F-boostrcu/include/linux/rcupreempt.h
--- linux-2.6.22-E-hotplug/include/linux/rcupreempt.h 2007-08-24 11:20:32.000000000 -0700
+++ linux-2.6.22-F-boostrcu/include/linux/rcupreempt.h 2007-08-24 11:24:59.000000000 -0700
@@ -42,6 +42,26 @@
#include <linux/cpumask.h>
#include <linux/seqlock.h>

+#ifdef CONFIG_PREEMPT_RCU_BOOST
+/*
+ * Task state with respect to being RCU-boosted. This state is changed
+ * by the task itself in response to the following events:
+ * 1. Preemption (or block on lock) while in RCU read-side critical section.
+ * 2. Outermost rcu_read_unlock() for blocked RCU read-side critical section.
+ *
+ * The RCU-boost task also updates the state when boosting priority.
+ */
+enum rcu_boost_state {
+ RCU_BOOST_IDLE = 0, /* Not yet blocked if in RCU read-side. */
+ RCU_BOOST_BLOCKED = 1, /* Blocked from RCU read-side. */
+ RCU_BOOSTED = 2, /* Boosting complete. */
+ RCU_BOOST_INVALID = 3, /* For bogus state sightings. */
+};
+
+#define N_RCU_BOOST_STATE (RCU_BOOST_INVALID + 1)
+
+#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
+
#define call_rcu_bh(head, rcu) call_rcu(head, rcu)
#define rcu_bh_qsctr_inc(cpu) do { } while (0)
#define __rcu_read_lock_bh() { rcu_read_lock(); local_bh_disable(); }
diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/include/linux/sched.h linux-2.6.22-F-boostrcu/include/linux/sched.h
--- linux-2.6.22-E-hotplug/include/linux/sched.h 2007-08-24 11:00:39.000000000 -0700
+++ linux-2.6.22-F-boostrcu/include/linux/sched.h 2007-08-24 17:07:01.000000000 -0700
@@ -546,6 +546,22 @@ struct signal_struct {
#define is_rt_policy(p) ((p) != SCHED_NORMAL && (p) != SCHED_BATCH)
#define has_rt_policy(p) unlikely(is_rt_policy((p)->policy))

+#ifdef CONFIG_PREEMPT_RCU_BOOST
+#define set_rcu_prio(p, prio) /* cpp to avoid #include hell */ \
+ do { \
+ (p)->rcu_prio = (prio); \
+ } while (0)
+#define get_rcu_prio(p) (p)->rcu_prio /* cpp to avoid #include hell */
+#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
+static inline void set_rcu_prio(struct task_struct *p, int prio)
+{
+}
+static inline int get_rcu_prio(struct task_struct *p)
+{
+ return MAX_PRIO;
+}
+#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST */
+
/*
* Some day this will be a full-fledged user tracking system..
*/
@@ -834,6 +850,9 @@ struct task_struct {
#endif
int load_weight; /* for niceness load balancing purposes */
int prio, static_prio, normal_prio;
+#ifdef CONFIG_PREEMPT_RCU_BOOST
+ int rcu_prio;
+#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
struct list_head run_list;
struct prio_array *array;

@@ -858,6 +877,11 @@ struct task_struct {
#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
struct sched_info sched_info;
#endif
+#ifdef CONFIG_PREEMPT_RCU_BOOST
+ struct rcu_boost_dat *rcub_rbdp;
+ enum rcu_boost_state rcub_state;
+ struct list_head rcub_entry;
+#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST */

struct list_head tasks;
/*
diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/init/main.c linux-2.6.22-F-boostrcu/init/main.c
--- linux-2.6.22-E-hotplug/init/main.c 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-F-boostrcu/init/main.c 2007-08-24 11:24:59.000000000 -0700
@@ -722,6 +722,7 @@ static void __init do_basic_setup(void)
driver_init();
init_irq_proc();
do_initcalls();
+ init_rcu_boost_late();
}

static void __init do_pre_smp_initcalls(void)
diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/kernel/fork.c linux-2.6.22-F-boostrcu/kernel/fork.c
--- linux-2.6.22-E-hotplug/kernel/fork.c 2007-08-24 11:00:39.000000000 -0700
+++ linux-2.6.22-F-boostrcu/kernel/fork.c 2007-08-24 11:24:59.000000000 -0700
@@ -1036,6 +1036,12 @@ static struct task_struct *copy_process(
p->rcu_read_lock_nesting = 0;
p->rcu_flipctr_idx = 0;
#endif /* #ifdef CONFIG_PREEMPT_RCU */
+#ifdef CONFIG_PREEMPT_RCU_BOOST
+ p->rcu_prio = MAX_PRIO;
+ p->rcub_rbdp = NULL;
+ p->rcub_state = RCU_BOOST_IDLE;
+ INIT_LIST_HEAD(&p->rcub_entry);
+#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
p->vfork_done = NULL;
spin_lock_init(&p->alloc_lock);

diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/kernel/Kconfig.preempt linux-2.6.22-F-boostrcu/kernel/Kconfig.preempt
--- linux-2.6.22-E-hotplug/kernel/Kconfig.preempt 2007-08-24 11:00:39.000000000 -0700
+++ linux-2.6.22-F-boostrcu/kernel/Kconfig.preempt 2007-08-24 11:24:59.000000000 -0700
@@ -91,13 +91,13 @@ config PREEMPT_RCU

endchoice

-config RCU_TRACE
- bool "Enable tracing for RCU - currently stats in debugfs"
- select DEBUG_FS
- default y
+config PREEMPT_RCU_BOOST
+ bool "Enable priority boosting of RCU read-side critical sections"
+ depends on PREEMPT_RCU
+ default n
help
- This option provides tracing in RCU which presents stats
- in debugfs for debugging RCU implementation.
+ This option permits priority boosting of RCU read-side critical
+ sections that have been preempted in order to prevent indefinite
+ delay of grace periods in the face of runaway non-realtime processes.

- Say Y here if you want to enable RCU tracing
Say N if you are unsure.
diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/kernel/rcupreempt.c linux-2.6.22-F-boostrcu/kernel/rcupreempt.c
--- linux-2.6.22-E-hotplug/kernel/rcupreempt.c 2007-08-24 11:20:32.000000000 -0700
+++ linux-2.6.22-F-boostrcu/kernel/rcupreempt.c 2007-08-31 14:06:49.000000000 -0700
@@ -51,6 +51,15 @@
#include <linux/byteorder/swabb.h>
#include <linux/cpumask.h>
#include <linux/rcupreempt_trace.h>
+#include <linux/kthread.h>
+
+/*
+ * Macro that prevents the compiler from reordering accesses, but does
+ * absolutely -nothing- to prevent CPUs from reordering. This is used
+ * only to mediate communication between mainline code and hardware
+ * interrupt and NMI handlers.
+ */
+#define ORDERED_WRT_IRQ(x) (*(volatile typeof(x) *)&(x))

/*
* PREEMPT_RCU data structures.
@@ -82,6 +91,531 @@ static struct rcu_ctrlblk rcu_ctrlblk =
};
static DEFINE_PER_CPU(int [2], rcu_flipctr) = { 0, 0 };

+#ifndef CONFIG_PREEMPT_RCU_BOOST
+static inline void init_rcu_boost_early(void) { }
+static inline void rcu_read_unlock_unboost(void) { }
+#else /* #ifndef CONFIG_PREEMPT_RCU_BOOST */
+
+/* Defines possible event indices for ->rbs_stats[] (first index). */
+
+#define RCU_BOOST_DAT_BLOCK 0
+#define RCU_BOOST_DAT_BOOST 1
+#define RCU_BOOST_DAT_UNLOCK 2
+#define N_RCU_BOOST_DAT_EVENTS 3
+
+/* RCU-boost per-CPU array element. */
+
+struct rcu_boost_dat {
+ spinlock_t rbs_lock; /* Protects state/CPU slice of structures. */
+ struct list_head rbs_toboost;
+ struct list_head rbs_boosted;
+ unsigned long rbs_blocked;
+ unsigned long rbs_boost_attempt;
+ unsigned long rbs_boost;
+ unsigned long rbs_unlock;
+ unsigned long rbs_unboosted;
+#ifdef CONFIG_PREEMPT_RCU_BOOST_STATS
+ unsigned long rbs_stats[N_RCU_BOOST_DAT_EVENTS][N_RCU_BOOST_STATE];
+#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST_STATS */
+};
+#define RCU_BOOST_ELEMENTS 4
+
+static int rcu_boost_idx = -1; /* invalid value for early RCU use. */
+static DEFINE_PER_CPU(struct rcu_boost_dat, rcu_boost_dat[RCU_BOOST_ELEMENTS]);
+static struct task_struct *rcu_boost_task;
+
+#ifdef CONFIG_PREEMPT_RCU_BOOST_STATS
+
+/*
+ * Function to increment indicated ->rbs_stats[] element.
+ */
+static inline void rcu_boost_dat_stat(struct rcu_boost_dat *rbdp,
+ int event,
+ enum rcu_boost_state oldstate)
+{
+ if (oldstate >= RCU_BOOST_IDLE && oldstate <= RCU_BOOSTED) {
+ rbdp->rbs_stats[event][oldstate]++;
+ } else {
+ rbdp->rbs_stats[event][RCU_BOOST_INVALID]++;
+ }
+}
+
+static inline void rcu_boost_dat_stat_block(struct rcu_boost_dat *rbdp,
+ enum rcu_boost_state oldstate)
+{
+ rcu_boost_dat_stat(rbdp, RCU_BOOST_DAT_BLOCK, oldstate);
+}
+
+static inline void rcu_boost_dat_stat_boost(struct rcu_boost_dat *rbdp,
+ enum rcu_boost_state oldstate)
+{
+ rcu_boost_dat_stat(rbdp, RCU_BOOST_DAT_BOOST, oldstate);
+}
+
+static inline void rcu_boost_dat_stat_unlock(struct rcu_boost_dat *rbdp,
+ enum rcu_boost_state oldstate)
+{
+ rcu_boost_dat_stat(rbdp, RCU_BOOST_DAT_UNLOCK, oldstate);
+}
+
+/*
+ * Prefix for kprint() strings for periodic statistics messages.
+ */
+static char *rcu_boost_state_event[] = {
+ "block: ",
+ "boost: ",
+ "unlock: ",
+};
+
+/*
+ * Indicators for numbers in kprint() strings. "!" indicates a state-event
+ * pair that should not happen, while "?" indicates a state that should
+ * not happen.
+ */
+static char *rcu_boost_state_error[] = {
+ /*ibBe*/
+ " ?", /* block */
+ "! ?", /* boost */
+ "? ?", /* unlock */
+};
+
+/*
+ * Print out RCU booster task statistics at the specified interval.
+ */
+static void rcu_boost_dat_stat_print(void)
+{
+ /* Three decimal digits per byte plus spacing per number and line. */
+ char buf[N_RCU_BOOST_STATE * (sizeof(long) * 3 + 2) + 2];
+ int cpu;
+ int event;
+ int i;
+ static time_t lastprint; /* static implies 0 initial value. */
+ struct rcu_boost_dat *rbdp;
+ int state;
+ struct rcu_boost_dat sum;
+
+ /* Wait a graceful interval between printk spamming. */
+ /* Note: time_after() dislikes time_t. */
+
+ if (xtime.tv_sec - lastprint < CONFIG_PREEMPT_RCU_BOOST_STATS_INTERVAL)
+ return;
+
+ /* Sum up the state/event-independent counters. */
+
+ sum.rbs_blocked = 0;
+ sum.rbs_boost_attempt = 0;
+ sum.rbs_boost = 0;
+ sum.rbs_unlock = 0;
+ sum.rbs_unboosted = 0;
+ for_each_possible_cpu(cpu)
+ for (i = 0; i < RCU_BOOST_ELEMENTS; i++) {
+ rbdp = per_cpu(rcu_boost_dat, cpu);
+ sum.rbs_blocked += rbdp[i].rbs_blocked;
+ sum.rbs_boost_attempt += rbdp[i].rbs_boost_attempt;
+ sum.rbs_boost += rbdp[i].rbs_boost;
+ sum.rbs_unlock += rbdp[i].rbs_unlock;
+ sum.rbs_unboosted += rbdp[i].rbs_unboosted;
+ }
+
+ /* Sum up the state/event-dependent counters. */
+
+ for (event = 0; event < N_RCU_BOOST_DAT_EVENTS; event++)
+ for (state = 0; state < N_RCU_BOOST_STATE; state++) {
+ sum.rbs_stats[event][state] = 0;
+ for_each_possible_cpu(cpu) {
+ for (i = 0; i < RCU_BOOST_ELEMENTS; i++)
+ sum.rbs_stats[event][state]
+ += per_cpu(rcu_boost_dat,
+ cpu)[i].rbs_stats[event][state];
+ }
+ }
+
+ /* Print them out! */
+
+ printk(KERN_INFO
+ "rcu_boost_dat: idx=%d "
+ "b=%lu ul=%lu ub=%lu boost: a=%lu b=%lu\n",
+ rcu_boost_idx,
+ sum.rbs_blocked, sum.rbs_unlock, sum.rbs_unboosted,
+ sum.rbs_boost_attempt, sum.rbs_boost);
+ for (event = 0; event < N_RCU_BOOST_DAT_EVENTS; event++) {
+ i = 0;
+ for (state = 0; state < N_RCU_BOOST_STATE; state++)
+ i += sprintf(&buf[i], " %ld%c",
+ sum.rbs_stats[event][state],
+ rcu_boost_state_error[event][state]);
+ printk(KERN_INFO "rcu_boost_dat %s %s\n",
+ rcu_boost_state_event[event], buf);
+ }
+
+ /* Go away and don't come back for awhile. */
+
+ lastprint = xtime.tv_sec;
+}
+
+#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST_STATS */
+
+static inline void rcu_boost_dat_stat_block(struct rcu_boost_dat *rbdp,
+ enum rcu_boost_state oldstate)
+{
+}
+static inline void rcu_boost_dat_stat_boost(struct rcu_boost_dat *rbdp,
+ enum rcu_boost_state oldstate)
+{
+}
+static inline void rcu_boost_dat_stat_unlock(struct rcu_boost_dat *rbdp,
+ enum rcu_boost_state oldstate)
+{
+}
+static void rcu_boost_dat_stat_print(void)
+{
+}
+
+#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST_STATS */
+
+/*
+ * Initialize RCU-boost state. This happens early in the boot process,
+ * when the scheduler does not yet exist. So don't try to use it.
+ */
+static void init_rcu_boost_early(void)
+{
+ struct rcu_boost_dat *rbdp;
+ int cpu;
+ int i;
+
+ for_each_possible_cpu(cpu) {
+ rbdp = per_cpu(rcu_boost_dat, cpu);
+ for (i = 0; i < RCU_BOOST_ELEMENTS; i++) {
+ spin_lock_init(&rbdp[i].rbs_lock);
+ INIT_LIST_HEAD(&rbdp[i].rbs_toboost);
+ INIT_LIST_HEAD(&rbdp[i].rbs_boosted);
+ rbdp[i].rbs_blocked = 0;
+ rbdp[i].rbs_boost_attempt = 0;
+ rbdp[i].rbs_boost = 0;
+ rbdp[i].rbs_unlock = 0;
+ rbdp[i].rbs_unboosted = 0;
+#ifdef CONFIG_PREEMPT_RCU_BOOST_STATS
+ {
+ int j, k;
+
+ for (j = 0; j < N_RCU_BOOST_DAT_EVENTS; j++)
+ for (k = 0; k < N_RCU_BOOST_STATE; k++)
+ rbdp[i].rbs_stats[j][k] = 0;
+ }
+#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST_STATS */
+ }
+ smp_wmb(); /* Make sure readers see above initialization. */
+ rcu_boost_idx = 0; /* Allow readers to access data. */
+ }
+}
+
+/*
+ * Return the list on which the calling task should add itself, or
+ * NULL if too early during initialization.
+ */
+static inline struct rcu_boost_dat *rcu_rbd_new(void)
+{
+ int cpu = smp_processor_id(); /* locks used, so preemption OK. */
+ int idx = ORDERED_WRT_IRQ(rcu_boost_idx);
+
+ if (unlikely(idx < 0))
+ return NULL;
+ return &per_cpu(rcu_boost_dat, cpu)[idx];
+}
+
+/*
+ * Return the list from which to boost target tasks.
+ * May only be invoked by the booster task, so guaranteed to
+ * already be initialized. Use the rcu_boost_dat element that least
+ * recently received tasks blocking in RCU read-side critical sections.
+ */
+static inline struct rcu_boost_dat *rcu_rbd_boosting(int cpu)
+{
+ int idx = (rcu_boost_idx + 1) & (RCU_BOOST_ELEMENTS - 1);
+
+ return &per_cpu(rcu_boost_dat, cpu)[idx];
+}
+
+#define PREEMPT_RCU_BOOSTER_PRIO 49 /* Match curr_irq_prio manually. */
+ /* Administrators can always adjust */
+ /* via the /proc interface. */
+
+/*
+ * Boost the specified task from an RCU viewpoint.
+ * Boost the target task to a priority just a bit less-favored than
+ * that of the RCU-boost task, but boost to a realtime priority even
+ * if the RCU-boost task is running at a non-realtime priority.
+ * We check the priority of the RCU-boost task each time we boost
+ * in case the sysadm manually changes the priority.
+ */
+static void rcu_boost_prio(struct task_struct *taskp)
+{
+ unsigned long flags;
+ int rcuprio;
+
+ spin_lock_irqsave(&current->pi_lock, flags);
+ rcuprio = rt_mutex_getprio(current) + 1;
+ if (rcuprio >= MAX_USER_RT_PRIO)
+ rcuprio = MAX_USER_RT_PRIO - 1;
+ spin_unlock_irqrestore(&current->pi_lock, flags);
+ spin_lock_irqsave(&taskp->pi_lock, flags);
+ if (taskp->rcu_prio != rcuprio) {
+ taskp->rcu_prio = rcuprio;
+ if (taskp->rcu_prio < taskp->prio)
+ rt_mutex_setprio(taskp, taskp->rcu_prio);
+ }
+ spin_unlock_irqrestore(&taskp->pi_lock, flags);
+}
+
+/*
+ * Unboost the specified task from an RCU viewpoint.
+ */
+static void rcu_unboost_prio(struct task_struct *taskp)
+{
+ int nprio;
+ unsigned long flags;
+
+ spin_lock_irqsave(&taskp->pi_lock, flags);
+ taskp->rcu_prio = MAX_PRIO;
+ nprio = rt_mutex_getprio(taskp);
+ if (nprio > taskp->prio)
+ rt_mutex_setprio(taskp, nprio);
+ spin_unlock_irqrestore(&taskp->pi_lock, flags);
+}
+
+/*
+ * Boost all of the RCU-reader tasks on the specified list.
+ */
+static void rcu_boost_one_reader_list(struct rcu_boost_dat *rbdp)
+{
+ LIST_HEAD(list);
+ unsigned long flags;
+ struct task_struct *taskp;
+
+ /*
+ * Splice both lists onto a local list. We will still
+ * need to hold the lock when manipulating the local list
+ * because tasks can remove themselves at any time.
+ * The reason for splicing the rbs_boosted list is that
+ * our priority may have changed, so reboosting may be
+ * required.
+ */
+
+ spin_lock_irqsave(&rbdp->rbs_lock, flags);
+ list_splice_init(&rbdp->rbs_toboost, &list);
+ list_splice_init(&rbdp->rbs_boosted, &list);
+ while (!list_empty(&list)) {
+
+ /*
+ * Pause for a bit before boosting each task.
+ * @@@FIXME: reduce/eliminate pausing in case of OOM.
+ */
+
+ spin_unlock_irqrestore(&rbdp->rbs_lock, flags);
+ schedule_timeout_uninterruptible(1);
+ spin_lock_irqsave(&rbdp->rbs_lock, flags);
+
+ /*
+ * All tasks might have removed themselves while
+ * we were waiting. Recheck list emptiness.
+ */
+
+ if (list_empty(&list))
+ break;
+
+ /* Remove first task in local list, count the attempt. */
+
+ taskp = list_entry(list.next, typeof(*taskp), rcub_entry);
+ list_del_init(&taskp->rcub_entry);
+ rbdp->rbs_boost_attempt++;
+
+ /* Ignore tasks in unexpected states. */
+
+ if (taskp->rcub_state == RCU_BOOST_IDLE) {
+ list_add_tail(&taskp->rcub_entry, &rbdp->rbs_toboost);
+ rcu_boost_dat_stat_boost(rbdp, taskp->rcub_state);
+ continue;
+ }
+
+ /* Boost the task's priority. */
+
+ rcu_boost_prio(taskp);
+ rbdp->rbs_boost++;
+ rcu_boost_dat_stat_boost(rbdp, taskp->rcub_state);
+ taskp->rcub_state = RCU_BOOSTED;
+ list_add_tail(&taskp->rcub_entry, &rbdp->rbs_boosted);
+ }
+ spin_unlock_irqrestore(&rbdp->rbs_lock, flags);
+}
+
+/*
+ * Priority-boost tasks stuck in RCU read-side critical sections as
+ * needed (presumably rarely).
+ */
+static int rcu_booster(void *arg)
+{
+ int cpu;
+ struct sched_param sp = { .sched_priority = PREEMPT_RCU_BOOSTER_PRIO, };
+
+ sched_setscheduler(current, SCHED_RR, &sp);
+ current->flags |= PF_NOFREEZE;
+
+ do {
+
+ /* Advance the lists of tasks. */
+
+ rcu_boost_idx = (rcu_boost_idx + 1) % RCU_BOOST_ELEMENTS;
+ for_each_possible_cpu(cpu) {
+
+ /*
+ * Boost all sufficiently aged readers.
+ * Readers must first be preempted or block
+ * on a mutex in an RCU read-side critical section,
+ * then remain in that critical section for
+ * RCU_BOOST_ELEMENTS-1 time intervals.
+ * So most of the time we should end up doing
+ * nothing.
+ */
+
+ rcu_boost_one_reader_list(rcu_rbd_boosting(cpu));
+
+ /*
+ * Large SMP systems may need to sleep sometimes
+ * in this loop. Or have multiple RCU-boost tasks.
+ */
+ }
+
+ /*
+ * Sleep to allow any unstalled RCU read-side critical
+ * sections to age out of the list. @@@ FIXME: reduce,
+ * adjust, or eliminate in case of OOM.
+ */
+
+ schedule_timeout_uninterruptible(HZ);
+
+ /* Print stats if enough time has passed. */
+
+ rcu_boost_dat_stat_print();
+
+ } while (!kthread_should_stop());
+
+ return 0;
+}
+
+/*
+ * Perform the portions of RCU-boost initialization that require the
+ * scheduler to be up and running.
+ */
+void init_rcu_boost_late(void)
+{
+
+ /* Spawn RCU-boost task. */
+
+ printk(KERN_INFO "Starting RCU priority booster\n");
+ rcu_boost_task = kthread_run(rcu_booster, NULL, "RCU Prio Booster");
+ if (IS_ERR(rcu_boost_task))
+ panic("Unable to create RCU Priority Booster, errno %ld\n",
+ -PTR_ERR(rcu_boost_task));
+}
+
+/*
+ * Update task's RCU-boost state to reflect blocking in RCU read-side
+ * critical section, so that the RCU-boost task can find it in case it
+ * later needs its priority boosted.
+ */
+void __rcu_preempt_boost(void)
+{
+ struct rcu_boost_dat *rbdp;
+ unsigned long flags;
+
+ /* Identify list to place task on for possible later boosting. */
+
+ local_irq_save(flags);
+ rbdp = rcu_rbd_new();
+ if (rbdp == NULL) {
+ local_irq_restore(flags);
+ printk(KERN_INFO
+ "Preempted RCU read-side critical section too early.\n");
+ return;
+ }
+ spin_lock(&rbdp->rbs_lock);
+ rbdp->rbs_blocked++;
+
+ /*
+ * Update state. We hold the lock and aren't yet on the list,
+ * so the booster cannot mess with us yet.
+ */
+
+ rcu_boost_dat_stat_block(rbdp, current->rcub_state);
+ if (current->rcub_state != RCU_BOOST_IDLE) {
+
+ /*
+ * We have been here before, so just update stats.
+ * It may seem strange to do all this work just to
+ * accumulate statistics, but this is such a
+ * low-probability code path that we shouldn't care.
+ * If it becomes a problem, it can be fixed.
+ */
+
+ spin_unlock_irqrestore(&rbdp->rbs_lock, flags);
+ return;
+ }
+ current->rcub_state = RCU_BOOST_BLOCKED;
+
+ /* Now add ourselves to the list so that the booster can find us. */
+
+ list_add_tail(&current->rcub_entry, &rbdp->rbs_toboost);
+ current->rcub_rbdp = rbdp;
+ spin_unlock_irqrestore(&rbdp->rbs_lock, flags);
+}
+
+/*
+ * Do the list-removal and priority-unboosting "heavy lifting" when
+ * required.
+ */
+static void __rcu_read_unlock_unboost(void)
+{
+ unsigned long flags;
+ struct rcu_boost_dat *rbdp;
+
+ /* Identify the list structure and acquire the corresponding lock. */
+
+ rbdp = current->rcub_rbdp;
+ spin_lock_irqsave(&rbdp->rbs_lock, flags);
+
+ /* Remove task from the list it was on. */
+
+ list_del_init(&current->rcub_entry);
+ rbdp->rbs_unlock++;
+ current->rcub_rbdp = NULL;
+
+ /* Record stats, unboost if needed, and update state. */
+
+ rcu_boost_dat_stat_unlock(rbdp, current->rcub_state);
+ if (current->rcub_state == RCU_BOOSTED) {
+ rcu_unboost_prio(current);
+ rbdp->rbs_unboosted++;
+ }
+ current->rcub_state = RCU_BOOST_IDLE;
+ spin_unlock_irqrestore(&rbdp->rbs_lock, flags);
+}
+
+/*
+ * Do any state changes and unboosting needed for rcu_read_unlock().
+ * Pass any complex work on to __rcu_read_unlock_unboost().
+ * The vast majority of the time, no work will be needed, as preemption
+ * and blocking within RCU read-side critical sections is comparatively
+ * rare.
+ */
+static inline void rcu_read_unlock_unboost(void)
+{
+
+ if (unlikely(current->rcub_state != RCU_BOOST_IDLE))
+ __rcu_read_unlock_unboost();
+}
+
+#endif /* #else #ifndef CONFIG_PREEMPT_RCU_BOOST */
+
/*
* States for rcu_try_flip() and friends.
*/
@@ -128,14 +662,6 @@ static DEFINE_PER_CPU(enum rcu_mb_flag_v
static cpumask_t rcu_cpu_online_map = CPU_MASK_NONE;

/*
- * Macro that prevents the compiler from reordering accesses, but does
- * absolutely -nothing- to prevent CPUs from reordering. This is used
- * only to mediate communication between mainline code and hardware
- * interrupt and NMI handlers.
- */
-#define ORDERED_WRT_IRQ(x) (*(volatile typeof(x) *)&(x))
-
-/*
* RCU_DATA_ME: find the current CPU's rcu_data structure.
* RCU_DATA_CPU: find the specified CPU's rcu_data structure.
*/
@@ -194,7 +720,7 @@ void __rcu_read_lock(void)
me->rcu_read_lock_nesting = nesting + 1;

} else {
- unsigned long oldirq;
+ unsigned long flags;

/*
* Disable local interrupts to prevent the grace-period
@@ -203,7 +729,7 @@ void __rcu_read_lock(void)
* contain rcu_read_lock().
*/

- local_irq_save(oldirq);
+ local_irq_save(flags);

/*
* Outermost nesting of rcu_read_lock(), so increment
@@ -233,7 +759,7 @@ void __rcu_read_lock(void)
*/

ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx;
- local_irq_restore(oldirq);
+ local_irq_restore(flags);
}
}
EXPORT_SYMBOL_GPL(__rcu_read_lock);
@@ -255,7 +781,7 @@ void __rcu_read_unlock(void)
me->rcu_read_lock_nesting = nesting - 1;

} else {
- unsigned long oldirq;
+ unsigned long flags;

/*
* Disable local interrupts to prevent the grace-period
@@ -264,7 +790,7 @@ void __rcu_read_unlock(void)
* contain rcu_read_lock() and rcu_read_unlock().
*/

- local_irq_save(oldirq);
+ local_irq_save(flags);

/*
* Outermost nesting of rcu_read_unlock(), so we must
@@ -305,7 +831,10 @@ void __rcu_read_unlock(void)
*/

ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])--;
- local_irq_restore(oldirq);
+
+ rcu_read_unlock_unboost();
+
+ local_irq_restore(flags);
}
}
EXPORT_SYMBOL_GPL(__rcu_read_unlock);
@@ -504,10 +1033,10 @@ rcu_try_flip_waitmb(void)
*/
static void rcu_try_flip(void)
{
- unsigned long oldirq;
+ unsigned long flags;

RCU_TRACE_ME(rcupreempt_trace_try_flip_1);
- if (unlikely(!spin_trylock_irqsave(&rcu_ctrlblk.fliplock, oldirq))) {
+ if (unlikely(!spin_trylock_irqsave(&rcu_ctrlblk.fliplock, flags))) {
RCU_TRACE_ME(rcupreempt_trace_try_flip_e1);
return;
}
@@ -534,7 +1063,7 @@ static void rcu_try_flip(void)
if (rcu_try_flip_waitmb())
rcu_try_flip_state = rcu_try_flip_idle_state;
}
- spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
+ spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, flags);
}

/*
@@ -553,16 +1082,16 @@ static void rcu_check_mb(int cpu)

void rcu_check_callbacks_rt(int cpu, int user)
{
- unsigned long oldirq;
+ unsigned long flags;
struct rcu_data *rdp = RCU_DATA_CPU(cpu);

rcu_check_mb(cpu);
if (rcu_ctrlblk.completed == rdp->completed)
rcu_try_flip();
- spin_lock_irqsave(&rdp->lock, oldirq);
+ spin_lock_irqsave(&rdp->lock, flags);
RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp);
__rcu_advance_callbacks(rdp);
- spin_unlock_irqrestore(&rdp->lock, oldirq);
+ spin_unlock_irqrestore(&rdp->lock, flags);
}

/*
@@ -571,18 +1100,19 @@ void rcu_check_callbacks_rt(int cpu, int
*/
void rcu_advance_callbacks_rt(int cpu, int user)
{
- unsigned long oldirq;
+ unsigned long flags;
struct rcu_data *rdp = RCU_DATA_CPU(cpu);

if (rcu_ctrlblk.completed == rdp->completed) {
rcu_try_flip();
if (rcu_ctrlblk.completed == rdp->completed)
return;
+ rcu_read_unlock_unboost();
}
- spin_lock_irqsave(&rdp->lock, oldirq);
+ spin_lock_irqsave(&rdp->lock, flags);
RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp);
__rcu_advance_callbacks(rdp);
- spin_unlock_irqrestore(&rdp->lock, oldirq);
+ spin_unlock_irqrestore(&rdp->lock, flags);
}

#ifdef CONFIG_HOTPLUG_CPU
@@ -601,24 +1131,24 @@ void rcu_offline_cpu_rt(int cpu)
{
int i;
struct rcu_head *list = NULL;
- unsigned long oldirq;
+ unsigned long flags;
struct rcu_data *rdp = RCU_DATA_CPU(cpu);
struct rcu_head **tail = &list;

/* Remove all callbacks from the newly dead CPU, retaining order. */

- spin_lock_irqsave(&rdp->lock, oldirq);
+ spin_lock_irqsave(&rdp->lock, flags);
rcu_offline_cpu_rt_enqueue(rdp->donelist, rdp->donetail, list, tail);
for (i = GP_STAGES - 1; i >= 0; i--)
rcu_offline_cpu_rt_enqueue(rdp->waitlist[i], rdp->waittail[i],
list, tail);
rcu_offline_cpu_rt_enqueue(rdp->nextlist, rdp->nexttail, list, tail);
- spin_unlock_irqrestore(&rdp->lock, oldirq);
+ spin_unlock_irqrestore(&rdp->lock, flags);
rdp->waitlistcount = 0;

/* Disengage the newly dead CPU from grace-period computation. */

- spin_lock_irqsave(&rcu_ctrlblk.fliplock, oldirq);
+ spin_lock_irqsave(&rcu_ctrlblk.fliplock, flags);
rcu_check_mb(cpu);
if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped) {
smp_mb(); /* Subsequent counter accesses must see new value */
@@ -627,7 +1157,7 @@ void rcu_offline_cpu_rt(int cpu)
/* seen -after- acknowledgement. */
}
cpu_clear(cpu, rcu_cpu_online_map);
- spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
+ spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, flags);

/*
* Place the removed callbacks on the current CPU's queue.
@@ -640,20 +1170,20 @@ void rcu_offline_cpu_rt(int cpu)
*/

rdp = RCU_DATA_ME();
- spin_lock_irqsave(&rdp->lock, oldirq);
+ spin_lock_irqsave(&rdp->lock, flags);
*rdp->nexttail = list;
if (list)
rdp->nexttail = tail;
- spin_unlock_irqrestore(&rdp->lock, oldirq);
+ spin_unlock_irqrestore(&rdp->lock, flags);
}

void __devinit rcu_online_cpu_rt(int cpu)
{
- unsigned long oldirq;
+ unsigned long flags;

- spin_lock_irqsave(&rcu_ctrlblk.fliplock, oldirq);
+ spin_lock_irqsave(&rcu_ctrlblk.fliplock, flags);
cpu_set(cpu, rcu_cpu_online_map);
- spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
+ spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, flags);
}

#else /* #ifdef CONFIG_HOTPLUG_CPU */
@@ -695,12 +1225,12 @@ void rcu_process_callbacks_rt(struct sof
void fastcall call_rcu_preempt(struct rcu_head *head,
void (*func)(struct rcu_head *rcu))
{
- unsigned long oldirq;
+ unsigned long flags;
struct rcu_data *rdp;

head->func = func;
head->next = NULL;
- local_irq_save(oldirq);
+ local_irq_save(flags);
rdp = RCU_DATA_ME();
spin_lock(&rdp->lock);
__rcu_advance_callbacks(rdp);
@@ -708,7 +1238,7 @@ void fastcall call_rcu_preempt(struct rc
rdp->nexttail = &head->next;
RCU_TRACE_RDP(rcupreempt_trace_next_add, rdp);
spin_unlock(&rdp->lock);
- local_irq_restore(oldirq);
+ local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(call_rcu_preempt);

@@ -757,6 +1287,11 @@ int rcu_pending_rt(int cpu)
return 0;
}

+/*
+ * Initialize RCU. This is called very early in boot, so is restricted
+ * to very simple operations. Don't even think about messing with anything
+ * that involves the scheduler, as it doesn't exist yet.
+ */
void __init rcu_init_rt(void)
{
int cpu;
@@ -778,6 +1313,7 @@ void __init rcu_init_rt(void)
rdp->donelist = NULL;
rdp->donetail = &rdp->donelist;
}
+ init_rcu_boost_early();
}

/*
diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/kernel/rtmutex.c linux-2.6.22-F-boostrcu/kernel/rtmutex.c
--- linux-2.6.22-E-hotplug/kernel/rtmutex.c 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-F-boostrcu/kernel/rtmutex.c 2007-08-24 11:24:59.000000000 -0700
@@ -111,11 +111,12 @@ static inline void mark_rt_mutex_waiters
*/
int rt_mutex_getprio(struct task_struct *task)
{
+ int prio = min(task->normal_prio, get_rcu_prio(task));
+
if (likely(!task_has_pi_waiters(task)))
- return task->normal_prio;
+ return prio;

- return min(task_top_pi_waiter(task)->pi_list_entry.prio,
- task->normal_prio);
+ return min(task_top_pi_waiter(task)->pi_list_entry.prio, prio);
}

/*
diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/kernel/sched.c linux-2.6.22-F-boostrcu/kernel/sched.c
--- linux-2.6.22-E-hotplug/kernel/sched.c 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-F-boostrcu/kernel/sched.c 2007-08-24 11:24:59.000000000 -0700
@@ -1702,6 +1702,7 @@ void fastcall sched_fork(struct task_str
* Make sure we do not leak PI boosting priority to the child:
*/
p->prio = current->normal_prio;
+ set_rcu_prio(p, MAX_PRIO);

INIT_LIST_HEAD(&p->run_list);
p->array = NULL;
@@ -1784,6 +1785,7 @@ void fastcall wake_up_new_task(struct ta
else {
p->prio = current->prio;
p->normal_prio = current->normal_prio;
+ set_rcu_prio(p, MAX_PRIO);
list_add_tail(&p->run_list, &current->run_list);
p->array = current->array;
p->array->nr_active++;
@@ -3590,6 +3592,8 @@ asmlinkage void __sched schedule(void)
}
profile_hit(SCHED_PROFILING, __builtin_return_address(0));

+ rcu_preempt_boost();
+
need_resched:
preempt_disable();
prev = current;
@@ -5060,6 +5064,7 @@ void __cpuinit init_idle(struct task_str
idle->sleep_avg = 0;
idle->array = NULL;
idle->prio = idle->normal_prio = MAX_PRIO;
+ set_rcu_prio(idle, MAX_PRIO);
idle->state = TASK_RUNNING;
idle->cpus_allowed = cpumask_of_cpu(cpu);
set_task_cpu(idle, cpu);
diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/lib/Kconfig.debug linux-2.6.22-F-boostrcu/lib/Kconfig.debug
--- linux-2.6.22-E-hotplug/lib/Kconfig.debug 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-F-boostrcu/lib/Kconfig.debug 2007-08-24 11:24:59.000000000 -0700
@@ -391,6 +391,40 @@ config RCU_TORTURE_TEST
Say M if you want the RCU torture tests to build as a module.
Say N if you are unsure.

+config RCU_TRACE
+ bool "Enable tracing for RCU - currently stats in debugfs"
+ select DEBUG_FS
+ depends on DEBUG_KERNEL
+ default y
+ help
+ This option provides tracing in RCU which presents stats
+ in debugfs for debugging RCU implementation.
+
+ Say Y here if you want to enable RCU tracing
+ Say N if you are unsure.
+
+config PREEMPT_RCU_BOOST_STATS
+ bool "Enable RCU priority-boosting statistic printing"
+ depends on PREEMPT_RCU_BOOST
+ depends on DEBUG_KERNEL
+ default n
+ help
+ This option enables debug printk()s of RCU boost statistics,
+ which are normally only used to debug RCU priority boost
+ implementations.
+
+ Say N if you are unsure.
+
+config PREEMPT_RCU_BOOST_STATS_INTERVAL
+ int "RCU priority-boosting statistic printing interval (seconds)"
+ depends on PREEMPT_RCU_BOOST_STATS
+ default 100
+ range 10 86400
+ help
+ This option controls the timing of debug printk()s of RCU boost
+ statistics, which are normally only used to debug RCU priority
+ boost implementations.
+
config LKDTM
tristate "Linux Kernel Dump Test Tool Module"
depends on DEBUG_KERNEL

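A small standalone sketch (illustrative only) of the aging scheme implemented
above by rcu_rbd_new() and rcu_rbd_boosting(): newly blocked readers are filed
under the current index, while the booster works on the slot one past it,
which is the slot least recently used for new arrivals. A reader therefore
sits for roughly RCU_BOOST_ELEMENTS-1 booster passes before being considered
for boosting.

#include <stdio.h>

#define RCU_BOOST_ELEMENTS 4	/* power of two, as required by the masking below */

static int rcu_boost_idx;	/* advanced once per booster pass */

/* Newly blocked readers are filed under the current index... */
static int rcu_rbd_new_idx(void)
{
	return rcu_boost_idx;
}

/* ...while the booster scans the slot one past it, the oldest one. */
static int rcu_rbd_boosting_idx(void)
{
	return (rcu_boost_idx + 1) & (RCU_BOOST_ELEMENTS - 1);
}

int main(void)
{
	int pass;

	for (pass = 0; pass < 6; pass++) {
		rcu_boost_idx = (rcu_boost_idx + 1) % RCU_BOOST_ELEMENTS;
		printf("pass %d: new blockers -> slot %d, boosting slot %d\n",
		       pass, rcu_rbd_new_idx(), rcu_rbd_boosting_idx());
	}
	return 0;
}
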
2007-09-10 18:40:01

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH RFC 7/9] RCU: rcutorture testing for RCU priority boosting

Work in progress, not for inclusion. Still uses xtime because this
patch is still against 2.6.22.

This patch modifies rcutorture to also torture RCU priority boosting.
The torturing involves forcing the RCU read-side critical sections (already
exercised as part of rcutorture's normal operation) to run for extremely
long periods, increasing the probability that they will be preempted and
thus need priority boosting. The fact that rcutorture's "nreaders"
module parameter defaults to twice the number of CPUs helps ensure lots
of the needed preemption.

To make the torturing fully effective in -mm, run it in the presence
of CPU-hotplug operations.
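
As a rough standalone illustration of the check performed by the
rcu_torture_preempt() kthread added below: record the grace-period count,
hog a CPU for up to ten seconds the way the realtime task does, and count
an error if the count never advanced. The completed_grace_periods() stub
here is only a stand-in for rcu_torture_completed(), so this sketch always
reports a stall.

#include <stdio.h>
#include <time.h>

/* Stand-in for rcu_torture_completed(); a stub that never advances. */
static long completed_grace_periods(void)
{
	return 0;
}

static unsigned long rcu_torture_preempt_errors;

static void preempt_stall_check(void)
{
	long start = completed_grace_periods();
	time_t deadline = time(NULL) + 10;	/* same ten-second window as the patch */

	while (time(NULL) < deadline && completed_grace_periods() == start)
		;	/* burn CPU, standing in for the CPU-bound realtime task */
	if (completed_grace_periods() == start)
		rcu_torture_preempt_errors++;	/* grace periods failed to advance */
}

int main(void)
{
	preempt_stall_check();
	printf("Preemption stalls: %lu\n", rcu_torture_preempt_errors);
	return 0;
}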

Signed-off-by: Paul E. McKenney <[email protected]>
---

rcutorture.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 77 insertions(+), 14 deletions(-)

diff -urpNa -X dontdiff linux-2.6.22-f-boostrcu/kernel/rcutorture.c linux-2.6.22-g-boosttorture/kernel/rcutorture.c
--- linux-2.6.22-f-boostrcu/kernel/rcutorture.c 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-g-boosttorture/kernel/rcutorture.c 2007-09-09 16:58:35.000000000 -0700
@@ -58,6 +58,7 @@ static int stat_interval; /* Interval be
static int verbose; /* Print more debug info. */
static int test_no_idle_hz; /* Test RCU's support for tickless idle CPUs. */
static int shuffle_interval = 5; /* Interval between shuffles (in sec)*/
+static int preempt_torture; /* Realtime task preempts torture readers. */
static char *torture_type = "rcu"; /* What RCU implementation to torture. */

module_param(nreaders, int, 0444);
@@ -72,6 +73,8 @@ module_param(test_no_idle_hz, bool, 0444
MODULE_PARM_DESC(test_no_idle_hz, "Test support for tickless idle CPUs");
module_param(shuffle_interval, int, 0444);
MODULE_PARM_DESC(shuffle_interval, "Number of seconds between shuffles");
+module_param(preempt_torture, bool, 0444);
+MODULE_PARM_DESC(preempt_torture, "Enable realtime preemption torture");
module_param(torture_type, charp, 0444);
MODULE_PARM_DESC(torture_type, "Type of RCU to torture (rcu, rcu_bh, srcu)");

@@ -194,6 +197,8 @@ struct rcu_torture_ops {
int (*completed)(void);
void (*deferredfree)(struct rcu_torture *p);
void (*sync)(void);
+ long (*preemptstart)(void);
+ void (*preemptend)(void);
int (*stats)(char *page);
char *name;
};
@@ -258,16 +263,75 @@ static void rcu_torture_deferred_free(st
call_rcu(&p->rtort_rcu, rcu_torture_cb);
}

+static struct task_struct *rcu_preeempt_task;
+static unsigned long rcu_torture_preempt_errors;
+
+static int rcu_torture_preempt(void *arg)
+{
+ int completedstart;
+ int err;
+ time_t gcstart;
+ struct sched_param sp;
+
+ sp.sched_priority = MAX_RT_PRIO - 1;
+ err = sched_setscheduler(current, SCHED_RR, &sp);
+ if (err != 0)
+ printk(KERN_ALERT "rcu_torture_preempt() priority err: %d\n",
+ err);
+ current->flags |= PF_NOFREEZE;
+
+ do {
+ completedstart = rcu_torture_completed();
+ gcstart = xtime.tv_sec;
+ while ((xtime.tv_sec - gcstart < 10) &&
+ (rcu_torture_completed() == completedstart))
+ cond_resched();
+ if (rcu_torture_completed() == completedstart)
+ rcu_torture_preempt_errors++;
+ schedule_timeout_interruptible(1);
+ } while (!kthread_should_stop());
+ return 0;
+}
+
+static long rcu_preempt_start(void)
+{
+ long retval = 0;
+
+ rcu_preeempt_task = kthread_run(rcu_torture_preempt, NULL,
+ "rcu_torture_preempt");
+ if (IS_ERR(rcu_preeempt_task)) {
+ VERBOSE_PRINTK_ERRSTRING("Failed to create preempter");
+ retval = PTR_ERR(rcu_preeempt_task);
+ rcu_preeempt_task = NULL;
+ }
+ return retval;
+}
+
+static void rcu_preempt_end(void)
+{
+ if (rcu_preeempt_task != NULL) {
+ VERBOSE_PRINTK_STRING("Stopping rcu_preempt task");
+ kthread_stop(rcu_preeempt_task);
+ }
+ rcu_preeempt_task = NULL;
+}
+
+static int rcu_preempt_stats(char *page)
+{
+ return sprintf(page,
+ "Preemption stalls: %lu\n", rcu_torture_preempt_errors);
+}
+
static struct rcu_torture_ops rcu_ops = {
- .init = NULL,
- .cleanup = NULL,
.readlock = rcu_torture_read_lock,
.readdelay = rcu_read_delay,
.readunlock = rcu_torture_read_unlock,
.completed = rcu_torture_completed,
.deferredfree = rcu_torture_deferred_free,
.sync = synchronize_rcu,
- .stats = NULL,
+ .preemptstart = rcu_preempt_start,
+ .preemptend = rcu_preempt_end,
+ .stats = rcu_preempt_stats,
.name = "rcu"
};

@@ -299,14 +363,12 @@ static void rcu_sync_torture_init(void)

static struct rcu_torture_ops rcu_sync_ops = {
.init = rcu_sync_torture_init,
- .cleanup = NULL,
.readlock = rcu_torture_read_lock,
.readdelay = rcu_read_delay,
.readunlock = rcu_torture_read_unlock,
.completed = rcu_torture_completed,
.deferredfree = rcu_sync_torture_deferred_free,
.sync = synchronize_rcu,
- .stats = NULL,
.name = "rcu_sync"
};

@@ -358,28 +420,23 @@ static void rcu_bh_torture_synchronize(v
}

static struct rcu_torture_ops rcu_bh_ops = {
- .init = NULL,
- .cleanup = NULL,
.readlock = rcu_bh_torture_read_lock,
.readdelay = rcu_read_delay, /* just reuse rcu's version. */
.readunlock = rcu_bh_torture_read_unlock,
.completed = rcu_bh_torture_completed,
.deferredfree = rcu_bh_torture_deferred_free,
.sync = rcu_bh_torture_synchronize,
- .stats = NULL,
.name = "rcu_bh"
};

static struct rcu_torture_ops rcu_bh_sync_ops = {
.init = rcu_sync_torture_init,
- .cleanup = NULL,
.readlock = rcu_bh_torture_read_lock,
.readdelay = rcu_read_delay, /* just reuse rcu's version. */
.readunlock = rcu_bh_torture_read_unlock,
.completed = rcu_bh_torture_completed,
.deferredfree = rcu_sync_torture_deferred_free,
.sync = rcu_bh_torture_synchronize,
- .stats = NULL,
.name = "rcu_bh_sync"
};

@@ -491,14 +548,12 @@ static void sched_torture_synchronize(vo

static struct rcu_torture_ops sched_ops = {
.init = rcu_sync_torture_init,
- .cleanup = NULL,
.readlock = sched_torture_read_lock,
.readdelay = rcu_read_delay, /* just reuse rcu's version. */
.readunlock = sched_torture_read_unlock,
.completed = sched_torture_completed,
.deferredfree = rcu_sync_torture_deferred_free,
.sync = sched_torture_synchronize,
- .stats = NULL,
.name = "sched"
};

@@ -793,9 +848,10 @@ rcu_torture_print_module_parms(char *tag
printk(KERN_ALERT "%s" TORTURE_FLAG
"--- %s: nreaders=%d nfakewriters=%d "
"stat_interval=%d verbose=%d test_no_idle_hz=%d "
- "shuffle_interval = %d\n",
+ "shuffle_interval=%d preempt_torture=%d\n",
torture_type, tag, nrealreaders, nfakewriters,
- stat_interval, verbose, test_no_idle_hz, shuffle_interval);
+ stat_interval, verbose, test_no_idle_hz, shuffle_interval,
+ preempt_torture);
}

static void
@@ -848,6 +904,8 @@ rcu_torture_cleanup(void)
kthread_stop(stats_task);
}
stats_task = NULL;
+ if (preempt_torture && (cur_ops->preemptend != NULL))
+ cur_ops->preemptend();

/* Wait for all RCU callbacks to fire. */
rcu_barrier();
@@ -990,6 +1048,11 @@ rcu_torture_init(void)
goto unwind;
}
}
+ if (preempt_torture && (cur_ops->preemptstart != NULL)) {
+ firsterr = cur_ops->preemptstart();
+ if (firsterr != 0)
+ goto unwind;
+ }
return 0;

unwind:

2007-09-10 18:42:09

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH RFC 8/9] RCU: Make RCU priority boosting consume less power

Work in progress, not for inclusion.

This patch modifies the RCU priority booster to explicitly sleep when
there are no RCU readers in need of priority boosting. This should be
a power-consumption improvement over the one-second polling cycle in
the underlying RCU priority-boosting patch.
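
A pthread-based userspace sketch of the hysteresis this patch introduces
follows (illustrative only, not the kernel implementation): the booster goes
through a "drowsy" pass before actually sleeping on a wait queue, and a newly
blocked reader wakes it back up.

#include <pthread.h>

enum rcu_booster_state_sim {
	RCU_BOOSTER_ACTIVE,	/* booster actively scanning */
	RCU_BOOSTER_DROWSY,	/* one empty scan seen, considering sleep */
	RCU_BOOSTER_SLEEPING,	/* booster asleep on the wait queue */
};

static enum rcu_booster_state_sim state = RCU_BOOSTER_ACTIVE;
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wakeup = PTHREAD_COND_INITIALIZER;

/* Called by the booster after each scan; found_work != 0 if readers needed boosting. */
static void booster_try_sleep(int found_work)
{
	pthread_mutex_lock(&state_lock);
	if (found_work) {
		state = RCU_BOOSTER_ACTIVE;	/* stay busy while there is work */
	} else if (state == RCU_BOOSTER_ACTIVE) {
		state = RCU_BOOSTER_DROWSY;	/* first empty scan: get drowsy */
	} else if (state == RCU_BOOSTER_DROWSY) {
		state = RCU_BOOSTER_SLEEPING;	/* second empty scan: really sleep */
		while (state == RCU_BOOSTER_SLEEPING)
			pthread_cond_wait(&wakeup, &state_lock);
	} else {
		state = RCU_BOOSTER_ACTIVE;	/* should not see ourselves asleep; recover */
	}
	pthread_mutex_unlock(&state_lock);
}

/* Called when a reader blocks within an RCU read-side critical section. */
static void reader_blocked(void)
{
	pthread_mutex_lock(&state_lock);
	if (state != RCU_BOOSTER_ACTIVE) {
		state = RCU_BOOSTER_ACTIVE;
		pthread_cond_broadcast(&wakeup);	/* wake the booster if asleep */
	}
	pthread_mutex_unlock(&state_lock);
}

int main(void)
{
	booster_try_sleep(1);	/* work found: stay ACTIVE */
	booster_try_sleep(0);	/* first empty scan: DROWSY */
	reader_blocked();	/* new blocked reader: back to ACTIVE */
	booster_try_sleep(0);	/* empty again: DROWSY, no sleep yet */
	return 0;
}

The second consecutive empty scan is what distinguishes a momentary lull from
a genuinely idle booster, so the task no longer wakes up once per second when
there is nothing to boost.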

Signed-off-by: Paul E. McKenney <[email protected]>
---

include/linux/rcupreempt.h | 15 ++++++
kernel/rcupreempt.c | 102 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 115 insertions(+), 2 deletions(-)

diff -urpNa -X dontdiff linux-2.6.22-G-boosttorture/include/linux/rcupreempt.h linux-2.6.22-H-boostsleep/include/linux/rcupreempt.h
--- linux-2.6.22-G-boosttorture/include/linux/rcupreempt.h 2007-08-24 11:24:59.000000000 -0700
+++ linux-2.6.22-H-boostsleep/include/linux/rcupreempt.h 2007-08-24 18:12:41.000000000 -0700
@@ -60,6 +60,21 @@ enum rcu_boost_state {

#define N_RCU_BOOST_STATE (RCU_BOOST_INVALID + 1)

+/*
+ * RCU-booster state with respect to sleeping. The RCU booster
+ * sleeps when no task has recently been seen sleeping in an RCU
+ * read-side critical section, and is awakened when a new sleeper
+ * appears.
+ */
+enum rcu_booster_state {
+ RCU_BOOSTER_ACTIVE = 0, /* RCU booster actively scanning. */
+ RCU_BOOSTER_DROWSY = 1, /* RCU booster is considering sleeping. */
+ RCU_BOOSTER_SLEEPING = 2, /* RCU booster is asleep. */
+ RCU_BOOSTER_INVALID = 3, /* For bogus state sightings. */
+};
+
+#define N_RCU_BOOSTER_STATE (RCU_BOOSTER_INVALID + 1)
+
#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST */

#define call_rcu_bh(head, rcu) call_rcu(head, rcu)
diff -urpNa -X dontdiff linux-2.6.22-G-boosttorture/kernel/rcupreempt.c linux-2.6.22-H-boostsleep/kernel/rcupreempt.c
--- linux-2.6.22-G-boosttorture/kernel/rcupreempt.c 2007-08-27 15:42:57.000000000 -0700
+++ linux-2.6.22-H-boostsleep/kernel/rcupreempt.c 2007-08-27 15:42:37.000000000 -0700
@@ -108,6 +108,7 @@ struct rcu_boost_dat {
unsigned long rbs_unboosted;
#ifdef CONFIG_PREEMPT_RCU_BOOST_STATS
unsigned long rbs_stats[N_RCU_BOOST_DAT_EVENTS][N_RCU_BOOST_STATE];
+ unsigned long rbs_qw_stats[N_RCU_BOOSTER_STATE];
#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST_STATS */
};
#define RCU_BOOST_ELEMENTS 4
@@ -115,6 +116,10 @@ struct rcu_boost_dat {
static int rcu_boost_idx = -1; /* invalid value for early RCU use. */
static DEFINE_PER_CPU(struct rcu_boost_dat, rcu_boost_dat[RCU_BOOST_ELEMENTS]);
static struct task_struct *rcu_boost_task;
+static DEFINE_SPINLOCK(rcu_boost_quiesce_lock);
+static enum rcu_booster_state rcu_booster_quiesce_state = RCU_BOOSTER_ACTIVE;
+static unsigned long rbs_qs_stats[2][N_RCU_BOOSTER_STATE];
+wait_queue_head_t rcu_booster_quiesce_wq;

#ifdef CONFIG_PREEMPT_RCU_BOOST_STATS

@@ -171,6 +176,15 @@ static char *rcu_boost_state_error[] = {
"? ?", /* unlock */
};

+/* Labels for RCU booster state printout. */
+
+static char *rcu_booster_state_label[] = {
+ "Active",
+ "Drowsy",
+ "Sleeping",
+ "???",
+};
+
/*
* Print out RCU booster task statistics at the specified interval.
*/
@@ -221,6 +235,14 @@ static void rcu_boost_dat_stat_print(voi
cpu)[i].rbs_stats[event][state];
}
}
+ for (state = 0; state < N_RCU_BOOSTER_STATE; state++) {
+ sum.rbs_qw_stats[state] = 0;
+ for_each_possible_cpu(cpu)
+ for (i = 0; i < RCU_BOOST_ELEMENTS; i++)
+ sum.rbs_qw_stats[state] +=
+ per_cpu(rcu_boost_dat,
+ cpu)[i].rbs_qw_stats[state];
+ }

/* Print them out! */

@@ -240,6 +262,24 @@ static void rcu_boost_dat_stat_print(voi
rcu_boost_state_event[event], buf);
}

+ printk(KERN_INFO "RCU booster state: %s\n",
+ rcu_booster_quiesce_state >= 0 &&
+ rcu_booster_quiesce_state < N_RCU_BOOSTER_STATE
+ ? rcu_booster_state_label[rcu_booster_quiesce_state]
+ : "???");
+ i = 0;
+ for (state = 0; state < N_RCU_BOOSTER_STATE; state++)
+ i += sprintf(&buf[i], " %ld", rbs_qs_stats[0][state]);
+ printk(KERN_INFO "No tasks found: %s\n", buf);
+ i = 0;
+ for (state = 0; state < N_RCU_BOOSTER_STATE; state++)
+ i += sprintf(&buf[i], " %ld", rbs_qs_stats[1][state]);
+ printk(KERN_INFO "Tasks found: %s\n", buf);
+ i = 0;
+ for (state = 0; state < N_RCU_BOOSTER_STATE; state++)
+ i += sprintf(&buf[i], " %ld", sum.rbs_qw_stats[state]);
+ printk(KERN_INFO "Awaken opportunities: %s\n", buf);
+
/* Go away and don't come back for awhile. */

lastprint = xtime.tv_sec;
@@ -293,6 +333,8 @@ static void init_rcu_boost_early(void)
for (j = 0; j < N_RCU_BOOST_DAT_EVENTS; j++)
for (k = 0; k < N_RCU_BOOST_STATE; k++)
rbdp[i].rbs_stats[j][k] = 0;
+ for (j = 0; j < N_RCU_BOOSTER_STATE; j++)
+ rbdp[i].rbs_qw_stats[j] = 0;
}
#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST_STATS */
}
@@ -378,10 +420,11 @@ static void rcu_unboost_prio(struct task
/*
* Boost all of the RCU-reader tasks on the specified list.
*/
-static void rcu_boost_one_reader_list(struct rcu_boost_dat *rbdp)
+static int rcu_boost_one_reader_list(struct rcu_boost_dat *rbdp)
{
LIST_HEAD(list);
unsigned long flags;
+ int retval = 0;
struct task_struct *taskp;

/*
@@ -397,6 +440,7 @@ static void rcu_boost_one_reader_list(st
list_splice_init(&rbdp->rbs_toboost, &list);
list_splice_init(&rbdp->rbs_boosted, &list);
while (!list_empty(&list)) {
+ retval = 1;

/*
* Pause for a bit before boosting each task.
@@ -438,6 +482,36 @@ static void rcu_boost_one_reader_list(st
list_add_tail(&taskp->rcub_entry, &rbdp->rbs_boosted);
}
spin_unlock_irqrestore(&rbdp->rbs_lock, flags);
+ return retval;
+}
+
+/*
+ * Examine state to see if it is time to sleep.
+ */
+static void rcu_booster_try_sleep(int yo)
+{
+ spin_lock(&rcu_boost_quiesce_lock);
+ if (rcu_booster_quiesce_state < 0 ||
+ rcu_booster_quiesce_state >= N_RCU_BOOSTER_STATE)
+ rcu_booster_quiesce_state = RCU_BOOSTER_INVALID;
+ rbs_qs_stats[yo != 0][rcu_booster_quiesce_state]++;
+ if (yo != 0) {
+ rcu_booster_quiesce_state = RCU_BOOSTER_ACTIVE;
+ } else {
+ if (rcu_booster_quiesce_state == RCU_BOOSTER_ACTIVE) {
+ rcu_booster_quiesce_state = RCU_BOOSTER_DROWSY;
+ } else if (rcu_booster_quiesce_state == RCU_BOOSTER_DROWSY) {
+ rcu_booster_quiesce_state = RCU_BOOSTER_SLEEPING;
+ spin_unlock(&rcu_boost_quiesce_lock);
+ __wait_event(rcu_booster_quiesce_wq,
+ rcu_booster_quiesce_state ==
+ RCU_BOOSTER_ACTIVE);
+ spin_lock(&rcu_boost_quiesce_lock);
+ } else {
+ rcu_booster_quiesce_state = RCU_BOOSTER_ACTIVE;
+ }
+ }
+ spin_unlock(&rcu_boost_quiesce_lock);
}

/*
@@ -448,15 +522,21 @@ static int rcu_booster(void *arg)
{
int cpu;
struct sched_param sp = { .sched_priority = PREEMPT_RCU_BOOSTER_PRIO, };
+ int yo = 0;

sched_setscheduler(current, SCHED_RR, &sp);
current->flags |= PF_NOFREEZE;
+ init_waitqueue_head(&rcu_booster_quiesce_wq);

do {

/* Advance the lists of tasks. */

rcu_boost_idx = (rcu_boost_idx + 1) % RCU_BOOST_ELEMENTS;
+ if (rcu_boost_idx == 0) {
+ rcu_booster_try_sleep(yo);
+ yo = 0;
+ }
for_each_possible_cpu(cpu) {

/*
@@ -469,7 +549,7 @@ static int rcu_booster(void *arg)
* nothing.
*/

- rcu_boost_one_reader_list(rcu_rbd_boosting(cpu));
+ yo += rcu_boost_one_reader_list(rcu_rbd_boosting(cpu));

/*
* Large SMP systems may need to sleep sometimes
@@ -511,6 +591,23 @@ void init_rcu_boost_late(void)
}

/*
+ * Awaken the RCU priority booster if necessary.
+ */
+static void rcu_preempt_wake(struct rcu_boost_dat *rbdp)
+{
+ spin_lock(&rcu_boost_quiesce_lock);
+ if (rcu_booster_quiesce_state >= N_RCU_BOOSTER_STATE)
+ rcu_booster_quiesce_state = RCU_BOOSTER_INVALID;
+ rbdp->rbs_qw_stats[rcu_booster_quiesce_state]++;
+ if (rcu_booster_quiesce_state == RCU_BOOSTER_SLEEPING) {
+ rcu_booster_quiesce_state = RCU_BOOSTER_ACTIVE;
+ wake_up(&rcu_booster_quiesce_wq);
+ } else if (rcu_booster_quiesce_state != RCU_BOOSTER_ACTIVE)
+ rcu_booster_quiesce_state = RCU_BOOSTER_ACTIVE;
+ spin_unlock(&rcu_boost_quiesce_lock);
+}
+
+/*
* Update task's RCU-boost state to reflect blocking in RCU read-side
* critical section, so that the RCU-boost task can find it in case it
* later needs its priority boosted.
@@ -532,6 +629,7 @@ void __rcu_preempt_boost(void)
}
spin_lock(&rbdp->rbs_lock);
rbdp->rbs_blocked++;
+ rcu_preempt_wake(rbdp);

/*
* Update state. We hold the lock and aren't yet on the list,

2007-09-10 18:42:57

by Paul E. McKenney

[permalink] [raw]
Subject: [PATCH RFC 9/9] RCU: preemptible documentation and comment cleanups

Work in progress, not for inclusion.

This patch updates the RCU documentation to reflect preemptible RCU as
well as recent publications. It also fixes an incorrect comment in the
code and renames ORDERED_WRT_IRQ() to ACCESS_ONCE() to better describe
that macro's function.
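
As background on the renamed macro: it is a volatile cast that keeps the
compiler from refetching, caching, or reordering the marked access (it
does nothing about CPU-level reordering, as the comment above it in the
code notes). A minimal stand-alone illustration follows; the need_flip
flag and wait_for_flip() helper are made up for this example, and only
the macro itself comes from the patch:

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))   /* GCC typeof, as in the patch */

static int need_flip;           /* imagine this being set from an interrupt handler */

/*
 * Without ACCESS_ONCE() the compiler may legally hoist the load of
 * need_flip out of the loop and spin forever on a stale register copy;
 * with it, every iteration performs a fresh load from memory.
 */
void wait_for_flip(void)
{
        while (!ACCESS_ONCE(need_flip))
                ;       /* a real kernel loop would also cpu_relax() */
}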

Signed-off-by: Paul E. McKenney <[email protected]>
---

Documentation/RCU/RTFP.txt | 234 ++++++++++++++++++++++++++++++++++++++++--
Documentation/RCU/rcu.txt | 20 +++
Documentation/RCU/torture.txt | 44 ++++++-
kernel/Kconfig.preempt | 1
kernel/rcupreempt.c | 24 ++--
5 files changed, 292 insertions(+), 31 deletions(-)

diff -urpNa -X dontdiff linux-2.6.22-h-boostsleep/Documentation/RCU/rcu.txt linux-2.6.22-i-cleanups/Documentation/RCU/rcu.txt
--- linux-2.6.22-h-boostsleep/Documentation/RCU/rcu.txt 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-i-cleanups/Documentation/RCU/rcu.txt 2007-09-08 16:54:17.000000000 -0700
@@ -36,6 +36,14 @@ o How can the updater tell when a grace
executed in user mode, or executed in the idle loop, we can
safely free up that item.

+ Preemptible variants of RCU (CONFIG_PREEMPT_RCU) get the
+ same effect, but require that the readers manipulate CPU-local
+ counters. These counters allow limited types of blocking
+ within RCU read-side critical sections. SRCU also uses
+ CPU-local counters, and permits general blocking within
+ RCU read-side critical sections. These two variants of
+ RCU detect grace periods by sampling these counters.
+
o If I am running on a uniprocessor kernel, which can only do one
thing at a time, why should I wait for a grace period?

@@ -46,7 +54,10 @@ o How can I see where RCU is currently u
Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu",
"rcu_read_lock_bh", "rcu_read_unlock_bh", "call_rcu_bh",
"srcu_read_lock", "srcu_read_unlock", "synchronize_rcu",
- "synchronize_net", and "synchronize_srcu".
+ "synchronize_net", "synchronize_srcu", and the other RCU
+ primitives. Or grab one of the cscope databases from:
+
+ http://www.rdrop.com/users/paulmck/RCU/linuxusage/rculocktab.html

o What guidelines should I follow when writing code that uses RCU?

@@ -67,7 +78,12 @@ o I hear that RCU is patented? What is

o I hear that RCU needs work in order to support realtime kernels?

- Yes, work in progress.
+ This work is largely completed. Realtime-friendly RCU can be
+ enabled via the CONFIG_PREEMPT_RCU kernel configuration parameter.
+ In addition, the CONFIG_PREEMPT_RCU_BOOST kernel configuration
+ parameter enables priority boosting of preempted RCU read-side
+ critical sections, though this is only needed if you have
+ CPU-bound realtime threads.

o Where can I find more information on RCU?

diff -urpNa -X dontdiff linux-2.6.22-h-boostsleep/Documentation/RCU/RTFP.txt linux-2.6.22-i-cleanups/Documentation/RCU/RTFP.txt
--- linux-2.6.22-h-boostsleep/Documentation/RCU/RTFP.txt 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-i-cleanups/Documentation/RCU/RTFP.txt 2007-09-08 16:41:31.000000000 -0700
@@ -9,8 +9,8 @@ The first thing resembling RCU was publi
[Kung80] recommended use of a garbage collector to defer destruction
of nodes in a parallel binary search tree in order to simplify its
implementation. This works well in environments that have garbage
-collectors, but current production garbage collectors incur significant
-read-side overhead.
+collectors, but most production garbage collectors incur significant
+overhead.

In 1982, Manber and Ladner [Manber82,Manber84] recommended deferring
destruction until all threads running at that time have terminated, again
@@ -99,16 +99,25 @@ locking, reduces contention, reduces mem
parallelizes pipeline stalls and memory latency for writers. However,
these techniques still impose significant read-side overhead in the
form of memory barriers. Researchers at Sun worked along similar lines
-in the same timeframe [HerlihyLM02,HerlihyLMS03]. These techniques
-can be thought of as inside-out reference counts, where the count is
-represented by the number of hazard pointers referencing a given data
-structure (rather than the more conventional counter field within the
-data structure itself).
+in the same timeframe [HerlihyLM02]. These techniques can be thought
+of as inside-out reference counts, where the count is represented by the
+number of hazard pointers referencing a given data structure (rather than
+the more conventional counter field within the data structure itself).
+
+By the same token, RCU can be thought of as a "bulk reference count",
+where some form of reference counter covers all references by a given CPU
+or thread during a set timeframe. This timeframe is related to, but
+not necessarily exactly the same as, an RCU grace period. In classic
+RCU, the reference counter is the per-CPU bit in the "bitmask" field,
+and each such bit covers all references that might have been made by
+the corresponding CPU during the prior grace period. Of course, RCU
+can be thought of in other terms as well.

In 2003, the K42 group described how RCU could be used to create
-hot-pluggable implementations of operating-system functions. Later that
-year saw a paper describing an RCU implementation of System V IPC
-[Arcangeli03], and an introduction to RCU in Linux Journal [McKenney03a].
+hot-pluggable implementations of operating-system functions [Appavoo03a].
+Later that year saw a paper describing an RCU implementation of System
+V IPC [Arcangeli03], and an introduction to RCU in Linux Journal
+[McKenney03a].

2004 has seen a Linux-Journal article on use of RCU in dcache
[McKenney04a], a performance comparison of locking to RCU on several
@@ -117,10 +126,27 @@ number of operating-system kernels [Paul
describing how to make RCU safe for soft-realtime applications [Sarma04c],
and a paper describing SELinux performance with RCU [JamesMorris04b].

-2005 has seen further adaptation of RCU to realtime use, permitting
+2005 brought further adaptation of RCU to realtime use, permitting
preemption of RCU realtime critical sections [PaulMcKenney05a,
PaulMcKenney05b].

+2006 saw the first best-paper award for an RCU paper [ThomasEHart2006a],
+as well as further work on efficient implementations of preemptible
+RCU [PaulEMcKenney2006b], but priority-boosting of RCU read-side critical
+sections proved elusive. An RCU implementation permitting general
+blocking in read-side critical sections appeared [PaulEMcKenney2006c],
+and Robert Olsson described an RCU-protected trie-hash combination
+[RobertOlsson2006a].
+
+In 2007, the RCU priority-boosting problem finally was solved
+[PaulEMcKenney2007BoostRCU], and an RCU paper was first accepted into
+an academic journal [ThomasEHart2007a]. An LWN article on the use of
+Promela and spin to validate parallel algorithms [PaulEMcKenney2007QRCUspin]
+also described Oleg Nesterov's QRCU, the first RCU implementation that
+can boast deep sub-microsecond grace periods in the absence of readers,
+with read-side overhead roughly that of a global reference count.
+
+
Bibtex Entries

@article{Kung80
@@ -203,6 +229,41 @@ Bibtex Entries
,Address="New Orleans, LA"
}

+@conference{Pu95a,
+Author = "Calton Pu and Tito Autrey and Andrew Black and Charles Consel and
+Crispin Cowan and Jon Inouye and Lakshmi Kethana and Jonathan Walpole and
+Ke Zhang",
+Title = "Optimistic Incremental Specialization: Streamlining a Commercial
+Operating System",
+Booktitle = "15\textsuperscript{th} ACM Symposium on
+Operating Systems Principles (SOSP'95)",
+address = "Copper Mountain, CO",
+month="December",
+year="1995",
+pages="314-321",
+annotation="
+ Uses a replugger, but with a flag to signal when people are
+ using the resource at hand. Only one reader at a time.
+"
+}
+
+@conference{Cowan96a,
+Author = "Crispin Cowan and Tito Autrey and Charles Krasic and
+Calton Pu and Jonathan Walpole",
+Title = "Fast Concurrent Dynamic Linking for an Adaptive Operating System",
+Booktitle = "International Conference on Configurable Distributed Systems
+(ICCDS'96)",
+address = "Annapolis, MD",
+month="May",
+year="1996",
+pages="108",
+isbn="0-8186-7395-8",
+annotation="
+ Uses a replugger, but with a counter to signal when people are
+ using the resource at hand. Allows multiple readers.
+"
+}
+
@techreport{Slingwine95
,author="John D. Slingwine and Paul E. McKenney"
,title="Apparatus and Method for Achieving Reduced Overhead Mutual
@@ -312,6 +373,49 @@ Andrea Arcangeli and Andi Kleen and Orra
[Viewed June 23, 2004]"
}

+@conference{Michael02a
+,author="Maged M. Michael"
+,title="Safe Memory Reclamation for Dynamic Lock-Free Objects Using Atomic
+Reads and Writes"
+,Year="2002"
+,Month="August"
+,booktitle="{Proceedings of the 21\textsuperscript{st} Annual ACM
+Symposium on Principles of Distributed Computing}"
+,pages="21-30"
+,annotation="
+ Each thread keeps an array of pointers to items that it is
+ currently referencing. Sort of an inside-out garbage collection
+ mechanism, but one that requires the accessing code to explicitly
+ state its needs. Also requires read-side memory barriers on
+ most architectures.
+"
+}
+
+@conference{Michael02b
+,author="Maged M. Michael"
+,title="High Performance Dynamic Lock-Free Hash Tables and List-Based Sets"
+,Year="2002"
+,Month="August"
+,booktitle="{Proceedings of the 14\textsuperscript{th} Annual ACM
+Symposium on Parallel
+Algorithms and Architecture}"
+,pages="73-82"
+,annotation="
+ Like the title says...
+"
+}
+
+@InProceedings{HerlihyLM02
+,author={Maurice Herlihy and Victor Luchangco and Mark Moir}
+,title="The Repeat Offender Problem: A Mechanism for Supporting Dynamic-Sized,
+Lock-Free Data Structures"
+,booktitle={Proceedings of 16\textsuperscript{th} International
+Symposium on Distributed Computing}
+,year=2002
+,month="October"
+,pages="339-353"
+}
+
@article{Appavoo03a
,author="J. Appavoo and K. Hui and C. A. N. Soules and R. W. Wisniewski and
D. M. {Da Silva} and O. Krieger and M. A. Auslander and D. J. Edelsohn and
@@ -447,3 +551,111 @@ Oregon Health and Sciences University"
Realtime turns into making RCU yet more realtime friendly.
"
}
+
+@conference{ThomasEHart2006a
+,Author="Thomas E. Hart and Paul E. McKenney and Angela Demke Brown"
+,Title="Making Lockless Synchronization Fast: Performance Implications
+of Memory Reclamation"
+,Booktitle="20\textsuperscript{th} {IEEE} International Parallel and
+Distributed Processing Symposium"
+,month="April"
+,year="2006"
+,day="25-29"
+,address="Rhodes, Greece"
+,annotation="
+ Compares QSBR (AKA "classic RCU"), HPBR, EBR, and lock-free
+ reference counting.
+"
+}
+
+@Conference{PaulEMcKenney2006b
+,Author="Paul E. McKenney and Dipankar Sarma and Ingo Molnar and
+Suparna Bhattacharya"
+,Title="Extending RCU for Realtime and Embedded Workloads"
+,Booktitle="{Ottawa Linux Symposium}"
+,Month="July"
+,Year="2006"
+,pages="v2 123-138"
+,note="Available:
+\url{http://www.linuxsymposium.org/2006/view_abstract.php?content_key=184}
+\url{http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf}
+[Viewed January 1, 2007]"
+,annotation="
+ Described how to improve the -rt implementation of realtime RCU.
+"
+}
+
+@unpublished{PaulEMcKenney2006c
+,Author="Paul E. McKenney"
+,Title="Sleepable {RCU}"
+,month="October"
+,day="9"
+,year="2006"
+,note="Available:
+\url{http://lwn.net/Articles/202847/}
+Revised:
+\url{http://www.rdrop.com/users/paulmck/RCU/srcu.2007.01.14a.pdf}
+[Viewed August 21, 2006]"
+,annotation="
+ LWN article introducing SRCU.
+"
+}
+
+@unpublished{RobertOlsson2006a
+,Author="Robert Olsson and Stefan Nilsson"
+,Title="{TRASH}: A dynamic {LC}-trie and hash data structure"
+,month="August"
+,day="18"
+,year="2006"
+,note="Available:
+\url{http://www.nada.kth.se/~snilsson/public/papers/trash/trash.pdf}
+[Viewed February 24, 2007]"
+,annotation="
+ RCU-protected dynamic trie-hash combination.
+"
+}
+
+@unpublished{PaulEMcKenney2007BoostRCU
+,Author="Paul E. McKenney"
+,Title="Priority-Boosting {RCU} Read-Side Critical Sections"
+,month="February"
+,day="5"
+,year="2007"
+,note="Available:
+\url{http://lwn.net/Articles/220677/}
+Revised:
+\url{http://www.rdrop.com/users/paulmck/RCU/RCUbooststate.2007.04.16a.pdf}
+[Viewed September 7, 2007]"
+,annotation="
+ LWN article introducing RCU priority boosting.
+"
+}
+
+@unpublished{ThomasEHart2007a
+,Author="Thomas E. Hart and Paul E. McKenney and Angela Demke Brown and Jonathan Walpole"
+,Title="Performance of memory reclamation for lockless synchronization"
+,journal="J. Parallel Distrib. Comput."
+,year="2007"
+,note="To appear in J. Parallel Distrib. Comput.
+ \url{doi=10.1016/j.jpdc.2007.04.010}"
+,annotation={
+ Compares QSBR (AKA "classic RCU"), HPBR, EBR, and lock-free
+ reference counting. Journal version of ThomasEHart2006a.
+}
+}
+
+@unpublished{PaulEMcKenney2007QRCUspin
+,Author="Paul E. McKenney"
+,Title="Using Promela and Spin to verify parallel algorithms"
+,month="August"
+,day="1"
+,year="2007"
+,note="Available:
+\url{http://lwn.net/Articles/243851/}
+[Viewed September 8, 2007]"
+,annotation="
+ LWN article describing Promela and spin, and also using Oleg
+ Nesterov's QRCU as an example (with Paul McKenney's fastpath).
+"
+}
+
diff -urpNa -X dontdiff linux-2.6.22-h-boostsleep/Documentation/RCU/torture.txt linux-2.6.22-i-cleanups/Documentation/RCU/torture.txt
--- linux-2.6.22-h-boostsleep/Documentation/RCU/torture.txt 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-i-cleanups/Documentation/RCU/torture.txt 2007-09-09 17:02:58.000000000 -0700
@@ -37,6 +37,24 @@ nfakewriters This is the number of RCU f
to trigger special cases caused by multiple writers, such as
the synchronize_srcu() early return optimization.

+preempt_torture Specifies that torturing of preemptible RCU is to be
+ undertaken, defaults to no such testing. This test
+ creates a kernel thread that runs at the lowest possible
+ realtime priority, alternating between ten seconds
+ of spinning and a short sleep period. The goal is
+ to preempt lower-priority RCU readers. Note that this
+ currently does not fail the full test, but instead simply
+ counts the number of times that a ten-second CPU burst
+ coincides with a stall in grace-period detection.
+
+ Of course, if the grace period advances during a CPU burst,
+ that indicates that no RCU reader was preempted, so the
+ burst ends early in that case.
+
+ Note that such stalls are expected behavior in preemptible
+ RCU implementations when RCU priority boosting is not
+ enabled (PREEMPT_RCU_BOOST=n).
+
stat_interval The number of seconds between output of torture
statistics (via printk()). Regardless of the interval,
statistics are printed when the module is unloaded.
@@ -46,12 +64,13 @@ stat_interval The number of seconds betw

shuffle_interval
The number of seconds to keep the test threads affinitied
- to a particular subset of the CPUs. Used in conjunction
- with test_no_idle_hz.
+ to a particular subset of the CPUs, defaults to 5 seconds.
+ Used in conjunction with test_no_idle_hz.

test_no_idle_hz Whether or not to test the ability of RCU to operate in
a kernel that disables the scheduling-clock interrupt to
idle CPUs. Boolean parameter, "1" to test, "0" otherwise.
+ Defaults to omitting this test.

torture_type The type of RCU to test: "rcu" for the rcu_read_lock() API,
"rcu_sync" for rcu_read_lock() with synchronous reclamation,
@@ -82,8 +101,6 @@ be evident. ;-)

The entries are as follows:

-o "ggp": The number of counter flips (or batches) since boot.
-
o "rtc": The hexadecimal address of the structure currently visible
to readers.

@@ -117,8 +134,8 @@ o "Reader Pipe": Histogram of "ages" of
o "Reader Batch": Another histogram of "ages" of structures seen
by readers, but in terms of counter flips (or batches) rather
than in terms of grace periods. The legal number of non-zero
- entries is again two. The reason for this separate view is
- that it is easier to get the third entry to show up in the
+ entries is again two. The reason for this separate view is that
+ it is sometimes easier to get the third entry to show up in the
"Reader Batch" list than in the "Reader Pipe" list.

o "Free-Block Circulation": Shows the number of torture structures
@@ -145,6 +162,21 @@ of the "old" and "current" counters for
"idx" value maps the "old" and "current" values to the underlying array,
and is useful for debugging.

+In addition, preemptible RCU rcutorture runs will report preemption
+stalls:
+
+rcu-torture: rtc: ffffffff88005a40 ver: 17041 tfle: 1 rta: 17041 rtaf: 7904 rtf: 16941 rtmbe: 0
+rcu-torture: Reader Pipe: 975332139 34406 0 0 0 0 0 0 0 0 0
+rcu-torture: Reader Batch: 975349310 17234 0 0 0 0 0 0 0 0 0
+rcu-torture: Free-Block Circulation: 17040 17030 17028 17022 17009 16994 16982 16969 16955 16941 0
+Preemption stalls: 0
+
+The first four lines are as before, and the last line records the number
+of times that grace-period processing stalled during a realtime CPU burst.
+Note that a non-zero value does not -prove- that RCU priority boosting is
+broken, because there are other things that can stall RCU grace-period
+processing. Here is hoping that someone comes up with a better test!
+

USAGE

diff -urpNa -X dontdiff linux-2.6.22-h-boostsleep/kernel/Kconfig.preempt linux-2.6.22-i-cleanups/kernel/Kconfig.preempt
--- linux-2.6.22-h-boostsleep/kernel/Kconfig.preempt 2007-09-08 14:02:29.000000000 -0700
+++ linux-2.6.22-i-cleanups/kernel/Kconfig.preempt 2007-09-08 14:08:31.000000000 -0700
@@ -79,6 +79,7 @@ config CLASSIC_RCU
config PREEMPT_RCU
bool "Preemptible RCU"
depends on PREEMPT
+ depends on EXPERIMENTAL
help
This option reduces the latency of the kernel by making certain
RCU sections preemptible. Normally RCU code is non-preemptible, if
diff -urpNa -X dontdiff linux-2.6.22-h-boostsleep/kernel/rcupreempt.c linux-2.6.22-i-cleanups/kernel/rcupreempt.c
--- linux-2.6.22-h-boostsleep/kernel/rcupreempt.c 2007-09-08 14:06:27.000000000 -0700
+++ linux-2.6.22-i-cleanups/kernel/rcupreempt.c 2007-09-08 14:08:31.000000000 -0700
@@ -59,7 +59,7 @@
* only to mediate communication between mainline code and hardware
* interrupt and NMI handlers.
*/
-#define ORDERED_WRT_IRQ(x) (*(volatile typeof(x) *)&(x))
+#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

/*
* PREEMPT_RCU data structures.
@@ -358,7 +358,7 @@ static void init_rcu_boost_early(void)
static inline struct rcu_boost_dat *rcu_rbd_new(void)
{
int cpu = smp_processor_id(); /* locks used, so preemption OK. */
- int idx = ORDERED_WRT_IRQ(rcu_boost_idx);
+ int idx = ACCESS_ONCE(rcu_boost_idx);

if (unlikely(idx < 0))
return NULL;
@@ -810,7 +810,7 @@ void __rcu_read_lock(void)
struct task_struct *me = current;
int nesting;

- nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
+ nesting = ACCESS_ONCE(me->rcu_read_lock_nesting);
if (nesting != 0) {

/* An earlier rcu_read_lock() covers us, just count it. */
@@ -835,9 +835,9 @@ void __rcu_read_lock(void)
* casts to prevent the compiler from reordering.
*/

- idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
+ idx = ACCESS_ONCE(rcu_ctrlblk.completed) & 0x1;
smp_read_barrier_depends(); /* @@@@ might be unneeded */
- ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++;
+ ACCESS_ONCE(__get_cpu_var(rcu_flipctr)[idx])++;

/*
* Now that the per-CPU counter has been incremented, we
@@ -847,7 +847,7 @@ void __rcu_read_lock(void)
* of the need to increment the per-CPU counter.
*/

- ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1;
+ ACCESS_ONCE(me->rcu_read_lock_nesting) = nesting + 1;

/*
* Now that we have preventing any NMIs from storing
@@ -856,7 +856,7 @@ void __rcu_read_lock(void)
* rcu_read_unlock().
*/

- ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx;
+ ACCESS_ONCE(me->rcu_flipctr_idx) = idx;
local_irq_restore(flags);
}
}
@@ -868,7 +868,7 @@ void __rcu_read_unlock(void)
struct task_struct *me = current;
int nesting;

- nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
+ nesting = ACCESS_ONCE(me->rcu_read_lock_nesting);
if (nesting > 1) {

/*
@@ -908,7 +908,7 @@ void __rcu_read_unlock(void)
* DEC Alpha.
*/

- idx = ORDERED_WRT_IRQ(me->rcu_flipctr_idx);
+ idx = ACCESS_ONCE(me->rcu_flipctr_idx);
smp_read_barrier_depends(); /* @@@ Needed??? */

/*
@@ -917,7 +917,7 @@ void __rcu_read_unlock(void)
* After this, any interrupts or NMIs will increment and
* decrement the per-CPU counters.
*/
- ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting - 1;
+ ACCESS_ONCE(me->rcu_read_lock_nesting) = nesting - 1;

/*
* It is now safe to decrement this task's nesting count.
@@ -928,7 +928,7 @@ void __rcu_read_unlock(void)
* but that is OK, since we have already fetched it.
*/

- ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])--;
+ ACCESS_ONCE(__get_cpu_var(rcu_flipctr)[idx])--;

rcu_read_unlock_unboost();

@@ -1123,7 +1123,7 @@ rcu_try_flip_waitmb(void)
/*
* Attempt a single flip of the counters. Remember, a single flip does
* -not- constitute a grace period. Instead, the interval between
- * at least three consecutive flips is a grace period.
+ * at least GP_STAGES+2 consecutive flips is a grace period.
*
* If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
* on a large SMP, they might want to use a hierarchical organization of

2007-09-10 18:46:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC 0/9] RCU: Preemptible RCU


* Paul E. McKenney <[email protected]> wrote:

> Work in progress, still not for inclusion. But code now complete!

cool! Now after 2 years of development and testing i think this is one
of the most mature patchsets on lkml - so i'd like to see this
designated for potential upstream inclusion. I.e. everyone who can see
some bug, please holler now.

Ingo

2007-09-21 04:15:18

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC 1/9] RCU: Split API to permit multiple RCU implementations

On Mon, Sep 10, 2007 at 11:32:08AM -0700, Paul E. McKenney wrote:

[nitpick and two-part mail]

>
> diff -urpNa -X dontdiff linux-2.6.22/include/linux/rcuclassic.h linux-2.6.22-a-splitclassic/include/linux/rcuclassic.h
> --- linux-2.6.22/include/linux/rcuclassic.h 1969-12-31 16:00:00.000000000 -0800
> +++ linux-2.6.22-a-splitclassic/include/linux/rcuclassic.h 2007-08-22 14:42:23.000000000 -0700
> @@ -0,0 +1,149 @@

[snip]

> + local_bh_enable(); \
> + } while (0)
> +
> +#define __synchronize_sched() synchronize_rcu()
> +
> +extern void __rcu_init(void);
> +extern void rcu_check_callbacks(int cpu, int user);
> +extern void rcu_restart_cpu(int cpu);
> +extern long rcu_batches_completed(void);
> +extern long rcu_batches_completed_bh(void);
> +

> +#endif /* __KERNEL__ */
> +#endif /* __LINUX_RCUCLASSIC_H */
> diff -urpNa -X dontdiff linux-2.6.22/include/linux/rcupdate.h linux-2.6.22-a-splitclassic/include/linux/rcupdate.h
> --- linux-2.6.22/include/linux/rcupdate.h 2007-07-08 16:32:17.000000000 -0700
> +++ linux-2.6.22-a-splitclassic/include/linux/rcupdate.h 2007-07-19 14:02:36.000000000 -0700

[snip]

> */
> -#define synchronize_sched() synchronize_rcu()
> +#define synchronize_sched() __synchronize_sched()
>
> -extern void rcu_init(void);
> -extern void rcu_check_callbacks(int cpu, int user);
> -extern void rcu_restart_cpu(int cpu);
> -extern long rcu_batches_completed(void);
> -extern long rcu_batches_completed_bh(void);

Why are rcu_batches_completed and rcu_batches_completed_bh moved from
rcupdate.h to rcuclassic.h?

[ continued ...]

-- Steve

2007-09-21 04:17:34

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

[ continued here from comment on patch 1]

On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
> /* softirq mask and active fields moved to irq_cpustat_t in
> diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcuclassic.h linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h
> --- linux-2.6.22-b-fixbarriers/include/linux/rcuclassic.h 2007-08-22 14:42:23.000000000 -0700
> +++ linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h 2007-08-22 15:21:06.000000000 -0700
> @@ -142,8 +142,6 @@ extern int rcu_needs_cpu(int cpu);
> extern void __rcu_init(void);
> extern void rcu_check_callbacks(int cpu, int user);
> extern void rcu_restart_cpu(int cpu);
> -extern long rcu_batches_completed(void);
> -extern long rcu_batches_completed_bh(void);
>
> #endif /* __KERNEL__ */
> #endif /* __LINUX_RCUCLASSIC_H */
> diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h
> --- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 2007-07-19 14:02:36.000000000 -0700
> +++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h 2007-08-22 15:21:06.000000000 -0700
> @@ -52,7 +52,11 @@ struct rcu_head {
> void (*func)(struct rcu_head *head);
> };
>
> +#ifdef CONFIG_CLASSIC_RCU
> #include <linux/rcuclassic.h>
> +#else /* #ifdef CONFIG_CLASSIC_RCU */
> +#include <linux/rcupreempt.h>
> +#endif /* #else #ifdef CONFIG_CLASSIC_RCU */
>
> #define RCU_HEAD_INIT { .next = NULL, .func = NULL }
> #define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT
> @@ -218,10 +222,13 @@ extern void FASTCALL(call_rcu_bh(struct
> /* Exported common interfaces */
> extern void synchronize_rcu(void);
> extern void rcu_barrier(void);
> +extern long rcu_batches_completed(void);
> +extern long rcu_batches_completed_bh(void);
>

And here we put back rcu_batches_completed and rcu_batches_completed_bh
from rcuclassic.h to rcupdate.h ;-)

-- Steve

2007-09-21 05:51:16

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Fri, Sep 21, 2007 at 12:17:21AM -0400, Steven Rostedt wrote:
> [ continued here from comment on patch 1]
>
> On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
> > /* softirq mask and active fields moved to irq_cpustat_t in
> > diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcuclassic.h linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h
> > --- linux-2.6.22-b-fixbarriers/include/linux/rcuclassic.h 2007-08-22 14:42:23.000000000 -0700
> > +++ linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h 2007-08-22 15:21:06.000000000 -0700
> > @@ -142,8 +142,6 @@ extern int rcu_needs_cpu(int cpu);
> > extern void __rcu_init(void);
> > extern void rcu_check_callbacks(int cpu, int user);
> > extern void rcu_restart_cpu(int cpu);
> > -extern long rcu_batches_completed(void);
> > -extern long rcu_batches_completed_bh(void);
> >
> > #endif /* __KERNEL__ */
> > #endif /* __LINUX_RCUCLASSIC_H */
> > diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h
> > --- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 2007-07-19 14:02:36.000000000 -0700
> > +++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h 2007-08-22 15:21:06.000000000 -0700
> > @@ -52,7 +52,11 @@ struct rcu_head {
> > void (*func)(struct rcu_head *head);
> > };
> >
> > +#ifdef CONFIG_CLASSIC_RCU
> > #include <linux/rcuclassic.h>
> > +#else /* #ifdef CONFIG_CLASSIC_RCU */
> > +#include <linux/rcupreempt.h>
> > +#endif /* #else #ifdef CONFIG_CLASSIC_RCU */
> >
> > #define RCU_HEAD_INIT { .next = NULL, .func = NULL }
> > #define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT
> > @@ -218,10 +222,13 @@ extern void FASTCALL(call_rcu_bh(struct
> > /* Exported common interfaces */
> > extern void synchronize_rcu(void);
> > extern void rcu_barrier(void);
> > +extern long rcu_batches_completed(void);
> > +extern long rcu_batches_completed_bh(void);
> >
>
> And here we put back rcu_batches_completed and rcu_batches_completed_bh
> from rcuclassic.h to rcupdate.h ;-)

Hmmm... Good point!!! I guess it would be OK to just leave them
in rcupdate.h throughout. ;-)

Will fix. And good eyes!

Thanx, Paul
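
For anyone following the header reshuffling, the fix being agreed to
here amounts to keeping the two accessors declared once in the common
rcupdate.h and letting whichever implementation is configured supply the
definitions. A rough sketch, where CLASSIC stands in for
CONFIG_CLASSIC_RCU and the counter variables are placeholders rather
than the real control blocks:

/* Declared once in the common header (include/linux/rcupdate.h). */
extern long rcu_batches_completed(void);
extern long rcu_batches_completed_bh(void);

#ifdef CLASSIC
/* kernel/rcuclassic.c would provide these. */
static long classic_completed;

long rcu_batches_completed(void)
{
        return classic_completed;
}

long rcu_batches_completed_bh(void)
{
        return classic_completed;       /* placeholder; classic RCU really uses a separate bh control block */
}
#else
/* kernel/rcupreempt.c would provide these. */
static long preempt_completed;

long rcu_batches_completed(void)
{
        return preempt_completed;
}

long rcu_batches_completed_bh(void)
{
        return preempt_completed;       /* the _bh variant is identical for preemptible RCU */
}
#endif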

2007-09-21 05:57:41

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Fri, Sep 21, 2007 at 12:17:21AM -0400, Steven Rostedt wrote:
> [ continued here from comment on patch 1]
>
> On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
> > /* softirq mask and active fields moved to irq_cpustat_t in
> > diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcuclassic.h linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h
> > --- linux-2.6.22-b-fixbarriers/include/linux/rcuclassic.h 2007-08-22 14:42:23.000000000 -0700
> > +++ linux-2.6.22-c-preemptrcu/include/linux/rcuclassic.h 2007-08-22 15:21:06.000000000 -0700
> > @@ -142,8 +142,6 @@ extern int rcu_needs_cpu(int cpu);
> > #define RCU_HEAD_INIT { .next = NULL, .func = NULL }
> > #define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT
> > @@ -218,10 +222,13 @@ extern void FASTCALL(call_rcu_bh(struct
> > /* Exported common interfaces */
> > extern void synchronize_rcu(void);
> > extern void rcu_barrier(void);
> > +extern long rcu_batches_completed(void);
> > +extern long rcu_batches_completed_bh(void);
> >
>
> And here we put back rcu_batches_completed and rcu_batches_completed_bh
> from rcuclassic.h to rcupdate.h ;-)

Good questions :) I can't remember why I did this - probably because
I was breaking RCU up into classic and preemptible implementations in
incremental patches, with the goal that the break-up patch could be merged
before the rcu-preempt patches. IIRC, I had to make *batches_completed*()
a common RCU API later on.

Thanks
Dipankar

2007-09-21 14:40:30

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
>
> #endif /* __KERNEL__ */
> #endif /* __LINUX_RCUCLASSIC_H */
> diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h
> --- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 2007-07-19 14:02:36.000000000 -0700
> +++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h 2007-08-22 15:21:06.000000000 -0700
> @@ -52,7 +52,11 @@ struct rcu_head {
> void (*func)(struct rcu_head *head);
> };
>
> +#ifdef CONFIG_CLASSIC_RCU
> #include <linux/rcuclassic.h>
> +#else /* #ifdef CONFIG_CLASSIC_RCU */
> +#include <linux/rcupreempt.h>
> +#endif /* #else #ifdef CONFIG_CLASSIC_RCU */

A bit extreme on the comments here.


>
> #define RCU_HEAD_INIT { .next = NULL, .func = NULL }
> #define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT
> @@ -218,10 +222,13 @@ extern void FASTCALL(call_rcu_bh(struct
> /* Exported common interfaces */
> extern void synchronize_rcu(void);
> extern void rcu_barrier(void);
> +extern long rcu_batches_completed(void);
> +extern long rcu_batches_completed_bh(void);
>
> /* Internal to kernel */
> extern void rcu_init(void);
> extern void rcu_check_callbacks(int cpu, int user);
> +extern int rcu_needs_cpu(int cpu);
>
> #endif /* __KERNEL__ */
> #endif /* __LINUX_RCUPDATE_H */
> diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h
> --- linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h 1969-12-31 16:00:00.000000000 -0800
> +++ linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h 2007-08-22 15:21:06.000000000 -0700
> @@ -0,0 +1,78 @@
> +/*
> + * Read-Copy Update mechanism for mutual exclusion (RT implementation)
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + *
> + * Copyright (C) IBM Corporation, 2006
> + *
> + * Author: Paul McKenney <[email protected]>
> + *
> + * Based on the original work by Paul McKenney <[email protected]>
> + * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
> + * Papers:
> + * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
> + * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
> + *
> + * For detailed explanation of Read-Copy Update mechanism see -
> + * Documentation/RCU
> + *
> + */
> +
> +#ifndef __LINUX_RCUPREEMPT_H
> +#define __LINUX_RCUPREEMPT_H
> +
> +#ifdef __KERNEL__
> +
> +#include <linux/cache.h>
> +#include <linux/spinlock.h>
> +#include <linux/threads.h>
> +#include <linux/percpu.h>
> +#include <linux/cpumask.h>
> +#include <linux/seqlock.h>
> +
> +#define rcu_qsctr_inc(cpu)
> +#define rcu_bh_qsctr_inc(cpu)
> +#define call_rcu_bh(head, rcu) call_rcu(head, rcu)
> +
> +extern void __rcu_read_lock(void);
> +extern void __rcu_read_unlock(void);
> +extern int rcu_pending(int cpu);
> +extern int rcu_needs_cpu(int cpu);
> +
> +#define __rcu_read_lock_bh() { rcu_read_lock(); local_bh_disable(); }
> +#define __rcu_read_unlock_bh() { local_bh_enable(); rcu_read_unlock(); }
> +
> +#define __rcu_read_lock_nesting() (current->rcu_read_lock_nesting)
> +
> +extern void __synchronize_sched(void);
> +
> +extern void __rcu_init(void);
> +extern void rcu_check_callbacks(int cpu, int user);
> +extern void rcu_restart_cpu(int cpu);
> +
> +#ifdef CONFIG_RCU_TRACE
> +struct rcupreempt_trace;
> +extern int *rcupreempt_flipctr(int cpu);
> +extern long rcupreempt_data_completed(void);
> +extern int rcupreempt_flip_flag(int cpu);
> +extern int rcupreempt_mb_flag(int cpu);
> +extern char *rcupreempt_try_flip_state_name(void);
> +extern struct rcupreempt_trace *rcupreempt_trace_cpu(int cpu);
> +#endif
> +
> +struct softirq_action;
> +
> +#endif /* __KERNEL__ */
> +#endif /* __LINUX_RCUPREEMPT_H */
> diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupreempt_trace.h linux-2.6.22-c-preemptrcu/include/linux/rcupreempt_trace.h
> --- linux-2.6.22-b-fixbarriers/include/linux/rcupreempt_trace.h 1969-12-31 16:00:00.000000000 -0800
> +++ linux-2.6.22-c-preemptrcu/include/linux/rcupreempt_trace.h 2007-08-22 15:21:06.000000000 -0700
> @@ -0,0 +1,100 @@
> +/*
> + * Read-Copy Update mechanism for mutual exclusion (RT implementation)
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + *
> + * Copyright (C) IBM Corporation, 2006
> + *
> + * Author: Paul McKenney <[email protected]>
> + *
> + * Based on the original work by Paul McKenney <[email protected]>
> + * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
> + * Papers:
> + * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
> + * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
> + *
> + * For detailed explanation of Read-Copy Update mechanism see -
> + * http://lse.sourceforge.net/locking/rcupdate.html
> + *
> + */
> +
> +#ifndef __LINUX_RCUPREEMPT_TRACE_H
> +#define __LINUX_RCUPREEMPT_TRACE_H
> +
> +#ifdef __KERNEL__
> +#include <linux/types.h>
> +#include <linux/kernel.h>
> +
> +#include <asm/atomic.h>
> +
> +/*
> + * PREEMPT_RCU data structures.
> + */
> +
> +struct rcupreempt_trace {
> + long next_length;
> + long next_add;
> + long wait_length;
> + long wait_add;
> + long done_length;
> + long done_add;
> + long done_remove;
> + atomic_t done_invoked;
> + long rcu_check_callbacks;
> + atomic_t rcu_try_flip_1;
> + atomic_t rcu_try_flip_e1;
> + long rcu_try_flip_i1;
> + long rcu_try_flip_ie1;
> + long rcu_try_flip_g1;
> + long rcu_try_flip_a1;
> + long rcu_try_flip_ae1;
> + long rcu_try_flip_a2;
> + long rcu_try_flip_z1;
> + long rcu_try_flip_ze1;
> + long rcu_try_flip_z2;
> + long rcu_try_flip_m1;
> + long rcu_try_flip_me1;
> + long rcu_try_flip_m2;
> +};
> +
> +#ifdef CONFIG_RCU_TRACE
> +#define RCU_TRACE(fn, arg) fn(arg);
> +#else
> +#define RCU_TRACE(fn, arg)
> +#endif
> +
> +extern void rcupreempt_trace_move2done(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_move2wait(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_try_flip_1(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_try_flip_e1(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_try_flip_i1(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_try_flip_ie1(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_try_flip_g1(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_try_flip_a1(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_try_flip_ae1(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_try_flip_a2(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_try_flip_z1(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_try_flip_ze1(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_try_flip_z2(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_try_flip_m1(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_try_flip_me1(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_try_flip_m2(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_check_callbacks(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_done_remove(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_invoke(struct rcupreempt_trace *trace);
> +extern void rcupreempt_trace_next_add(struct rcupreempt_trace *trace);
> +
> +#endif /* __KERNEL__ */
> +#endif /* __LINUX_RCUPREEMPT_TRACE_H */
> diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/sched.h linux-2.6.22-c-preemptrcu/include/linux/sched.h
> --- linux-2.6.22-b-fixbarriers/include/linux/sched.h 2007-07-08 16:32:17.000000000 -0700
> +++ linux-2.6.22-c-preemptrcu/include/linux/sched.h 2007-08-22 15:21:06.000000000 -0700
> @@ -850,6 +850,11 @@ struct task_struct {
> cpumask_t cpus_allowed;
> unsigned int time_slice, first_time_slice;
>
> +#ifdef CONFIG_PREEMPT_RCU
> + int rcu_read_lock_nesting;
> + int rcu_flipctr_idx;
> +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> +
> #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
> struct sched_info sched_info;
> #endif
> diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/kernel/fork.c linux-2.6.22-c-preemptrcu/kernel/fork.c
> --- linux-2.6.22-b-fixbarriers/kernel/fork.c 2007-07-08 16:32:17.000000000 -0700
> +++ linux-2.6.22-c-preemptrcu/kernel/fork.c 2007-08-22 15:21:06.000000000 -0700
> @@ -1032,6 +1032,10 @@ static struct task_struct *copy_process(
>
> INIT_LIST_HEAD(&p->children);
> INIT_LIST_HEAD(&p->sibling);
> +#ifdef CONFIG_PREEMPT_RCU
> + p->rcu_read_lock_nesting = 0;
> + p->rcu_flipctr_idx = 0;
> +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> p->vfork_done = NULL;
> spin_lock_init(&p->alloc_lock);
>
> diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/kernel/Kconfig.preempt linux-2.6.22-c-preemptrcu/kernel/Kconfig.preempt
> --- linux-2.6.22-b-fixbarriers/kernel/Kconfig.preempt 2007-07-08 16:32:17.000000000 -0700
> +++ linux-2.6.22-c-preemptrcu/kernel/Kconfig.preempt 2007-08-22 15:21:06.000000000 -0700
> @@ -63,3 +63,41 @@ config PREEMPT_BKL
> Say Y here if you are building a kernel for a desktop system.
> Say N if you are unsure.
>
> +choice
> + prompt "RCU implementation type:"
> + default CLASSIC_RCU
> +
> +config CLASSIC_RCU
> + bool "Classic RCU"
> + help
> + This option selects the classic RCU implementation that is
> + designed for best read-side performance on non-realtime
> + systems.
> +
> + Say Y if you are unsure.
> +
> +config PREEMPT_RCU
> + bool "Preemptible RCU"
> + depends on PREEMPT
> + help
> + This option reduces the latency of the kernel by making certain
> + RCU sections preemptible. Normally RCU code is non-preemptible, if
> + this option is selected then read-only RCU sections become
> + preemptible. This helps latency, but may expose bugs due to
> + now-naive assumptions about each RCU read-side critical section
> + remaining on a given CPU through its execution.
> +
> + Say N if you are unsure.
> +
> +endchoice
> +
> +config RCU_TRACE
> + bool "Enable tracing for RCU - currently stats in debugfs"
> + select DEBUG_FS
> + default y
> + help
> + This option provides tracing in RCU which presents stats
> + in debugfs for debugging RCU implementation.
> +
> + Say Y here if you want to enable RCU tracing
> + Say N if you are unsure.
> diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/kernel/Makefile linux-2.6.22-c-preemptrcu/kernel/Makefile
> --- linux-2.6.22-b-fixbarriers/kernel/Makefile 2007-07-19 12:16:03.000000000 -0700
> +++ linux-2.6.22-c-preemptrcu/kernel/Makefile 2007-08-22 15:21:06.000000000 -0700
> @@ -6,7 +6,7 @@ obj-y = sched.o fork.o exec_domain.o
> exit.o itimer.o time.o softirq.o resource.o \
> sysctl.o capability.o ptrace.o timer.o user.o \
> signal.o sys.o kmod.o workqueue.o pid.o \
> - rcupdate.o rcuclassic.o extable.o params.o posix-timers.o \
> + rcupdate.o extable.o params.o posix-timers.o \
> kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
> hrtimer.o rwsem.o latency.o nsproxy.o srcu.o die_notifier.o
>
> @@ -46,6 +46,11 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
> obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
> obj-$(CONFIG_SECCOMP) += seccomp.o
> obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
> +obj-$(CONFIG_CLASSIC_RCU) += rcuclassic.o
> +obj-$(CONFIG_PREEMPT_RCU) += rcupreempt.o
> +ifeq ($(CONFIG_PREEMPT_RCU),y)
> +obj-$(CONFIG_RCU_TRACE) += rcupreempt_trace.o
> +endif
> obj-$(CONFIG_RELAY) += relay.o
> obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
> obj-$(CONFIG_UTS_NS) += utsname.o
> diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/kernel/rcupreempt.c linux-2.6.22-c-preemptrcu/kernel/rcupreempt.c
> --- linux-2.6.22-b-fixbarriers/kernel/rcupreempt.c 1969-12-31 16:00:00.000000000 -0800
> +++ linux-2.6.22-c-preemptrcu/kernel/rcupreempt.c 2007-08-22 15:35:19.000000000 -0700
> @@ -0,0 +1,767 @@
> +/*
> + * Read-Copy Update mechanism for mutual exclusion, realtime implementation
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + *
> + * Copyright IBM Corporation, 2006
> + *
> + * Authors: Paul E. McKenney <[email protected]>
> + * With thanks to Esben Nielsen, Bill Huey, and Ingo Molnar
> + * for pushing me away from locks and towards counters, and
> + * to Suparna Bhattacharya for pushing me completely away
> + * from atomic instructions on the read side.
> + *
> + * Papers: http://www.rdrop.com/users/paulmck/RCU
> + *
> + * For detailed explanation of Read-Copy Update mechanism see -
> + * Documentation/RCU/ *.txt
> + *
> + */
> +#include <linux/types.h>
> +#include <linux/kernel.h>
> +#include <linux/init.h>
> +#include <linux/spinlock.h>
> +#include <linux/smp.h>
> +#include <linux/rcupdate.h>
> +#include <linux/interrupt.h>
> +#include <linux/sched.h>
> +#include <asm/atomic.h>
> +#include <linux/bitops.h>
> +#include <linux/module.h>
> +#include <linux/completion.h>
> +#include <linux/moduleparam.h>
> +#include <linux/percpu.h>
> +#include <linux/notifier.h>
> +#include <linux/rcupdate.h>
> +#include <linux/cpu.h>
> +#include <linux/random.h>
> +#include <linux/delay.h>
> +#include <linux/byteorder/swabb.h>
> +#include <linux/cpumask.h>
> +#include <linux/rcupreempt_trace.h>
> +
> +/*
> + * PREEMPT_RCU data structures.
> + */
> +
> +#define GP_STAGES 4

I take it that GP stands for "grace period". Might want to state that
here. /* Grace period stages */ When I was looking at this code at 1am,
I kept asking myself "what's this GP?" (General Protection??). But
that's what happens when looking at code like this after midnight ;-)

> +struct rcu_data {
> + spinlock_t lock; /* Protect rcu_data fields. */
> + long completed; /* Number of last completed batch. */
> + int waitlistcount;
> + struct tasklet_struct rcu_tasklet;
> + struct rcu_head *nextlist;
> + struct rcu_head **nexttail;
> + struct rcu_head *waitlist[GP_STAGES];
> + struct rcu_head **waittail[GP_STAGES];
> + struct rcu_head *donelist;
> + struct rcu_head **donetail;
> +#ifdef CONFIG_RCU_TRACE
> + struct rcupreempt_trace trace;
> +#endif /* #ifdef CONFIG_RCU_TRACE */
> +};
> +struct rcu_ctrlblk {
> + spinlock_t fliplock; /* Protect state-machine transitions. */
> + long completed; /* Number of last completed batch. */
> +};
> +static DEFINE_PER_CPU(struct rcu_data, rcu_data);
> +static struct rcu_ctrlblk rcu_ctrlblk = {
> + .fliplock = SPIN_LOCK_UNLOCKED,
> + .completed = 0,
> +};
> +static DEFINE_PER_CPU(int [2], rcu_flipctr) = { 0, 0 };
> +
> +/*
> + * States for rcu_try_flip() and friends.

Can you have a pointer somewhere that explains these states? And not a
"it's in this paper or directory". Either have a short description here,
or specify where exactly to find the information (perhaps a
Documentation/RCU/preemptible_states.txt?).

Trying to understand these states has caused me the most agony in
reviewing these patches.

> + */
> +
> +enum rcu_try_flip_states {
> + rcu_try_flip_idle_state, /* "I" */
> + rcu_try_flip_waitack_state, /* "A" */
> + rcu_try_flip_waitzero_state, /* "Z" */
> + rcu_try_flip_waitmb_state /* "M" */
> +};
> +static enum rcu_try_flip_states rcu_try_flip_state = rcu_try_flip_idle_state;
> +#ifdef CONFIG_RCU_TRACE
> +static char *rcu_try_flip_state_names[] =
> + { "idle", "waitack", "waitzero", "waitmb" };
> +#endif /* #ifdef CONFIG_RCU_TRACE */
> +
> +/*
> + * Enum and per-CPU flag to determine when each CPU has seen
> + * the most recent counter flip.
> + */
> +
> +enum rcu_flip_flag_values {
> + rcu_flip_seen, /* Steady/initial state, last flip seen. */
> + /* Only GP detector can update. */
> + rcu_flipped /* Flip just completed, need confirmation. */
> + /* Only corresponding CPU can update. */
> +};
> +static DEFINE_PER_CPU(enum rcu_flip_flag_values, rcu_flip_flag) = rcu_flip_seen;
> +
> +/*
> + * Enum and per-CPU flag to determine when each CPU has executed the
> + * needed memory barrier to fence in memory references from its last RCU
> + * read-side critical section in the just-completed grace period.
> + */
> +
> +enum rcu_mb_flag_values {
> + rcu_mb_done, /* Steady/initial state, no mb()s required. */
> + /* Only GP detector can update. */
> + rcu_mb_needed /* Flip just completed, need an mb(). */
> + /* Only corresponding CPU can update. */
> +};
> +static DEFINE_PER_CPU(enum rcu_mb_flag_values, rcu_mb_flag) = rcu_mb_done;
> +
> +/*
> + * Macro that prevents the compiler from reordering accesses, but does
> + * absolutely -nothing- to prevent CPUs from reordering. This is used
> + * only to mediate communication between mainline code and hardware
> + * interrupt and NMI handlers.
> + */
> +#define ORDERED_WRT_IRQ(x) (*(volatile typeof(x) *)&(x))
> +
> +/*
> + * RCU_DATA_ME: find the current CPU's rcu_data structure.
> + * RCU_DATA_CPU: find the specified CPU's rcu_data structure.
> + */
> +#define RCU_DATA_ME() (&__get_cpu_var(rcu_data))
> +#define RCU_DATA_CPU(cpu) (&per_cpu(rcu_data, cpu))
> +
> +/*
> + * Helper macro for tracing when the appropriate rcu_data is not
> + * cached in a local variable, but where the CPU number is so cached.
> + */
> +#define RCU_TRACE_CPU(f, cpu) RCU_TRACE(f, &(RCU_DATA_CPU(cpu)->trace));
> +
> +/*
> + * Helper macro for tracing when the appropriate rcu_data is not
> + * cached in a local variable.
> + */
> +#define RCU_TRACE_ME(f) RCU_TRACE(f, &(RCU_DATA_ME()->trace));
> +
> +/*
> + * Helper macro for tracing when the appropriate rcu_data is pointed
> + * to by a local variable.
> + */
> +#define RCU_TRACE_RDP(f, rdp) RCU_TRACE(f, &((rdp)->trace));
> +
> +/*
> + * Return the number of RCU batches processed thus far. Useful
> + * for debug and statistics.
> + */
> +long rcu_batches_completed(void)
> +{
> + return rcu_ctrlblk.completed;
> +}
> +EXPORT_SYMBOL_GPL(rcu_batches_completed);
> +
> +/*
> + * Return the number of RCU batches processed thus far. Useful for debug
> + * and statistics. The _bh variant is identical to straight RCU.
> + */

If they are identical, then why the separation?

> +long rcu_batches_completed_bh(void)
> +{
> + return rcu_ctrlblk.completed;
> +}
> +EXPORT_SYMBOL_GPL(rcu_batches_completed_bh);
> +
> +void __rcu_read_lock(void)
> +{
> + int idx;
> + struct task_struct *me = current;

Nitpick, but other places in the kernel usually use "t" or "p" as a
variable to assign current to. It's just that "me" thows me off a
little while reviewing this. But this is just a nitpick, so do as you
will.

> + int nesting;
> +
> + nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
> + if (nesting != 0) {
> +
> + /* An earlier rcu_read_lock() covers us, just count it. */
> +
> + me->rcu_read_lock_nesting = nesting + 1;
> +
> + } else {
> + unsigned long oldirq;

Nitpick, "flags" is usually used for saving irq state.

> +
> + /*
> + * Disable local interrupts to prevent the grace-period
> + * detection state machine from seeing us half-done.
> + * NMIs can still occur, of course, and might themselves
> + * contain rcu_read_lock().
> + */
> +
> + local_irq_save(oldirq);

Isn't the GP detection done via a tasklet/softirq? So wouldn't a
local_bh_disable be sufficient here? You already cover NMIs, which would
also handle normal interrupts.

> +
> + /*
> + * Outermost nesting of rcu_read_lock(), so increment
> + * the current counter for the current CPU. Use volatile
> + * casts to prevent the compiler from reordering.
> + */
> +
> + idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
> + smp_read_barrier_depends(); /* @@@@ might be unneeded */
> + ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++;
> +
> + /*
> + * Now that the per-CPU counter has been incremented, we
> + * are protected from races with rcu_read_lock() invoked
> + * from NMI handlers on this CPU. We can therefore safely
> + * increment the nesting counter, relieving further NMIs
> + * of the need to increment the per-CPU counter.
> + */
> +
> + ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1;
> +
> + /*
> + * Now that we have preventing any NMIs from storing
> + * to the ->rcu_flipctr_idx, we can safely use it to
> + * remember which counter to decrement in the matching
> + * rcu_read_unlock().
> + */
> +
> + ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx;
> + local_irq_restore(oldirq);
> + }
> +}
> +EXPORT_SYMBOL_GPL(__rcu_read_lock);
> +
> +void __rcu_read_unlock(void)
> +{
> + int idx;
> + struct task_struct *me = current;
> + int nesting;
> +
> + nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
> + if (nesting > 1) {
> +
> + /*
> + * We are still protected by the enclosing rcu_read_lock(),
> + * so simply decrement the counter.
> + */
> +
> + me->rcu_read_lock_nesting = nesting - 1;
> +
> + } else {
> + unsigned long oldirq;
> +
> + /*
> + * Disable local interrupts to prevent the grace-period
> + * detection state machine from seeing us half-done.
> + * NMIs can still occur, of course, and might themselves
> + * contain rcu_read_lock() and rcu_read_unlock().
> + */
> +
> + local_irq_save(oldirq);
> +
> + /*
> + * Outermost nesting of rcu_read_unlock(), so we must
> + * decrement the current counter for the current CPU.
> + * This must be done carefully, because NMIs can
> + * occur at any point in this code, and any rcu_read_lock()
> + * and rcu_read_unlock() pairs in the NMI handlers
> + * must interact non-destructively with this code.
> + * Lots of volatile casts, and -very- careful ordering.
> + *
> + * Changes to this code, including this one, must be
> + * inspected, validated, and tested extremely carefully!!!
> + */
> +
> + /*
> + * First, pick up the index. Enforce ordering for
> + * DEC Alpha.
> + */
> +
> + idx = ORDERED_WRT_IRQ(me->rcu_flipctr_idx);
> + smp_read_barrier_depends(); /* @@@ Needed??? */
> +
> + /*
> + * Now that we have fetched the counter index, it is
> + * safe to decrement the per-task RCU nesting counter.
> + * After this, any interrupts or NMIs will increment and
> + * decrement the per-CPU counters.
> + */
> + ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting - 1;
> +
> + /*
> + * It is now safe to decrement this task's nesting count.
> + * NMIs that occur after this statement will route their
> + * rcu_read_lock() calls through this "else" clause, and
> + * will thus start incrementing the per-CPU coutner on

s/coutner/counter/

> + * their own. They will also clobber ->rcu_flipctr_idx,
> + * but that is OK, since we have already fetched it.
> + */
> +
> + ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])--;
> + local_irq_restore(oldirq);
> + }
> +}
> +EXPORT_SYMBOL_GPL(__rcu_read_unlock);
> +
> +/*
> + * If a global counter flip has occurred since the last time that we
> + * advanced callbacks, advance them. Hardware interrupts must be
> + * disabled when calling this function.
> + */
> +static void __rcu_advance_callbacks(struct rcu_data *rdp)
> +{
> + int cpu;
> + int i;
> + int wlc = 0;
> +
> + if (rdp->completed != rcu_ctrlblk.completed) {
> + if (rdp->waitlist[GP_STAGES - 1] != NULL) {
> + *rdp->donetail = rdp->waitlist[GP_STAGES - 1];
> + rdp->donetail = rdp->waittail[GP_STAGES - 1];
> + RCU_TRACE_RDP(rcupreempt_trace_move2done, rdp);
> + }
> + for (i = GP_STAGES - 2; i >= 0; i--) {
> + if (rdp->waitlist[i] != NULL) {
> + rdp->waitlist[i + 1] = rdp->waitlist[i];
> + rdp->waittail[i + 1] = rdp->waittail[i];
> + wlc++;
> + } else {
> + rdp->waitlist[i + 1] = NULL;
> + rdp->waittail[i + 1] =
> + &rdp->waitlist[i + 1];
> + }
> + }
> + if (rdp->nextlist != NULL) {
> + rdp->waitlist[0] = rdp->nextlist;
> + rdp->waittail[0] = rdp->nexttail;
> + wlc++;
> + rdp->nextlist = NULL;
> + rdp->nexttail = &rdp->nextlist;
> + RCU_TRACE_RDP(rcupreempt_trace_move2wait, rdp);
> + } else {
> + rdp->waitlist[0] = NULL;
> + rdp->waittail[0] = &rdp->waitlist[0];
> + }
> + rdp->waitlistcount = wlc;
> + rdp->completed = rcu_ctrlblk.completed;
> + }
> +
> + /*
> + * Check to see if this CPU needs to report that it has seen
> + * the most recent counter flip, thereby declaring that all
> + * subsequent rcu_read_lock() invocations will respect this flip.
> + */
> +
> + cpu = raw_smp_processor_id();
> + if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped) {
> + smp_mb(); /* Subsequent counter accesses must see new value */
> + per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;
> + smp_mb(); /* Subsequent RCU read-side critical sections */
> + /* seen -after- acknowledgement. */
> + }
> +}
> +
> +/*
> + * Get here when RCU is idle. Decide whether we need to
> + * move out of idle state, and return non-zero if so.
> + * "Straightforward" approach for the moment, might later
> + * use callback-list lengths, grace-period duration, or
> + * some such to determine when to exit idle state.
> + * Might also need a pre-idle test that does not acquire
> + * the lock, but let's get the simple case working first...
> + */
> +
> +static int
> +rcu_try_flip_idle(void)
> +{
> + int cpu;
> +
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_i1);
> + if (!rcu_pending(smp_processor_id())) {
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1);
> + return 0;
> + }
> +
> + /*
> + * Do the flip.
> + */
> +
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_g1);
> + rcu_ctrlblk.completed++; /* stands in for rcu_try_flip_g2 */
> +
> + /*
> + * Need a memory barrier so that other CPUs see the new
> + * counter value before they see the subsequent change of all
> + * the rcu_flip_flag instances to rcu_flipped.
> + */
> +
> + smp_mb(); /* see above block comment. */
> +
> + /* Now ask each CPU for acknowledgement of the flip. */
> +
> + for_each_possible_cpu(cpu)
> + per_cpu(rcu_flip_flag, cpu) = rcu_flipped;
> +
> + return 1;
> +}
> +
> +/*
> + * Wait for CPUs to acknowledge the flip.
> + */
> +
> +static int
> +rcu_try_flip_waitack(void)
> +{
> + int cpu;
> +
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_a1);
> + for_each_possible_cpu(cpu)
> + if (per_cpu(rcu_flip_flag, cpu) != rcu_flip_seen) {
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_ae1);
> + return 0;
> + }
> +
> + /*
> + * Make sure our checks above don't bleed into subsequent
> + * waiting for the sum of the counters to reach zero.
> + */
> +
> + smp_mb(); /* see above block comment. */
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_a2);
> + return 1;
> +}
> +
> +/*
> + * Wait for collective ``last'' counter to reach zero,
> + * then tell all CPUs to do an end-of-grace-period memory barrier.
> + */
> +
> +static int
> +rcu_try_flip_waitzero(void)
> +{
> + int cpu;
> + int lastidx = !(rcu_ctrlblk.completed & 0x1);
> + int sum = 0;
> +
> + /* Check to see if the sum of the "last" counters is zero. */
> +
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_z1);
> + for_each_possible_cpu(cpu)
> + sum += per_cpu(rcu_flipctr, cpu)[lastidx];
> + if (sum != 0) {
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_ze1);
> + return 0;
> + }
> +
> + smp_mb(); /* Don't call for memory barriers before we see zero. */
> +
> + /* Call for a memory barrier from each CPU. */
> +
> + for_each_possible_cpu(cpu)
> + per_cpu(rcu_mb_flag, cpu) = rcu_mb_needed;
> +
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_z2);
> + return 1;
> +}
> +
> +/*
> + * Wait for all CPUs to do their end-of-grace-period memory barrier.
> + * Return 0 once all CPUs have done so.
> + */
> +
> +static int
> +rcu_try_flip_waitmb(void)
> +{
> + int cpu;
> +
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_m1);
> + for_each_possible_cpu(cpu)
> + if (per_cpu(rcu_mb_flag, cpu) != rcu_mb_done) {
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_me1);
> + return 0;
> + }
> +
> + smp_mb(); /* Ensure that the above checks precede any following flip. */
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_m2);
> + return 1;
> +}
> +
> +/*
> + * Attempt a single flip of the counters. Remember, a single flip does
> + * -not- constitute a grace period. Instead, the interval between
> + * at least three consecutive flips is a grace period.
> + *
> + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation

Oh, come now! It's not "nuts" to use this ;-)

> + * on a large SMP, they might want to use a hierarchical organization of
> + * the per-CPU-counter pairs.
> + */
> +static void rcu_try_flip(void)
> +{
> + unsigned long oldirq;
> +
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_1);
> + if (unlikely(!spin_trylock_irqsave(&rcu_ctrlblk.fliplock, oldirq))) {
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_e1);
> + return;
> + }
> +
> + /*
> + * Take the next transition(s) through the RCU grace-period
> + * flip-counter state machine.
> + */
> +
> + switch (rcu_try_flip_state) {
> + case rcu_try_flip_idle_state:
> + if (rcu_try_flip_idle())
> + rcu_try_flip_state = rcu_try_flip_waitack_state;

Just trying to understand all this. Here at flip_idle, only a CPU with
pending RCU work will flip it. Then all the CPUs' flags will be set
to rcu_flipped, and the rcu_ctrlblk.completed counter is incremented.

When a CPU calls process_callbacks, it would then move all its
callbacks up the lists (next -> wait[GP...] -> done), and set its
own completed counter to rcu_ctrlblk.completed. So no more moving up
the lists will be done. It also sets its per-CPU flag to rcu_flip_seen.
Even if the CPU doesn't have any RCU callbacks, the state is now
rcu_try_flip_waitack_state, so we go to the next switch case on the
next rcu_try_flip call.

Also, the flip counter has been flipped, so all new rcu_read_lock()s
will increment the CPU's "other" index. We just need to wait for the
counters under the previous index to sum to zero.

> + break;
> + case rcu_try_flip_waitack_state:
> + if (rcu_try_flip_waitack())

Now, we just wait to see if all the cpus have called process_callbacks
and now have the flag rcu_flip_seen set. So all the cpus have pushed
their callbacks up to the next queue.

> + rcu_try_flip_state = rcu_try_flip_waitzero_state;
> + break;
> + case rcu_try_flip_waitzero_state:
> + if (rcu_try_flip_waitzero())

Now this is where we wait for the sum of all the CPUs counters to reach
zero. The reason for the sum is that the task may have been preempted
and migrated to another CPU and decremented that counter. So the one CPU
counter would be 1 and the other would be -1. But the sum is still zero.
Is there a chance that overflow of a counter (although probably very
very unlikely) would cause any problems?

Also, all the CPUs have their "check_mb" set.

> + rcu_try_flip_state = rcu_try_flip_waitmb_state;
> + break;
> + case rcu_try_flip_waitmb_state:
> + if (rcu_try_flip_waitmb())

I have to admit that this seems a bit of an overkill, but I guess you
know what you are doing. After going through three states, we still
need to do a memory barrier on each CPU?

> + rcu_try_flip_state = rcu_try_flip_idle_state;

Finally we are back to the original state and we start the process all
over.

Is this analysis correct?

> + }
> + spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
> +}
> +
> +/*
> + * Check to see if this CPU needs to do a memory barrier in order to
> + * ensure that any prior RCU read-side critical sections have committed
> + * their counter manipulations and critical-section memory references
> + * before declaring the grace period to be completed.
> + */
> +static void rcu_check_mb(int cpu)
> +{
> + if (per_cpu(rcu_mb_flag, cpu) == rcu_mb_needed) {
> + smp_mb(); /* Ensure RCU read-side accesses are visible. */
> + per_cpu(rcu_mb_flag, cpu) = rcu_mb_done;
> + }
> +}
> +
> +void rcu_check_callbacks(int cpu, int user)
> +{
> + unsigned long oldirq;
> + struct rcu_data *rdp = RCU_DATA_CPU(cpu);
> +
> + rcu_check_mb(cpu);
> + if (rcu_ctrlblk.completed == rdp->completed)
> + rcu_try_flip();
> + spin_lock_irqsave(&rdp->lock, oldirq);
> + RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp);
> + __rcu_advance_callbacks(rdp);
> + if (rdp->donelist == NULL) {
> + spin_unlock_irqrestore(&rdp->lock, oldirq);
> + } else {
> + spin_unlock_irqrestore(&rdp->lock, oldirq);
> + raise_softirq(RCU_SOFTIRQ);
> + }
> +}
> +
> +/*
> + * Needed by dynticks, to make sure all RCU processing has finished
> + * when we go idle:

Didn't we have a discussion that this is no longer true? Or was it taken
out of Dynamic ticks before? I don't see any users of it.

> + */
> +void rcu_advance_callbacks(int cpu, int user)
> +{
> + unsigned long oldirq;
> + struct rcu_data *rdp = RCU_DATA_CPU(cpu);
> +
> + if (rcu_ctrlblk.completed == rdp->completed) {
> + rcu_try_flip();
> + if (rcu_ctrlblk.completed == rdp->completed)
> + return;
> + }
> + spin_lock_irqsave(&rdp->lock, oldirq);
> + RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp);
> + __rcu_advance_callbacks(rdp);
> + spin_unlock_irqrestore(&rdp->lock, oldirq);
> +}

OK, that's all I have on this patch (will take a bit of a break before
reviewing your other patches). But I will say that RCU has grown quite
a bit, and is looking very good.

Basically, what I'm saying is "Great work, Paul!". This is looking
good. Seems that we just need a little better explanation for those of
us who are not up at your IQ level. I can write something up after
this all gets finalized. Sort of an rcu-design.txt that really tries to
explain it to simpletons like me ;-)

-- Steve

2007-09-21 15:21:04

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
> +
> +/*
> + * PREEMPT_RCU data structures.
> + */
> +
> +#define GP_STAGES 4
> +struct rcu_data {
> + spinlock_t lock; /* Protect rcu_data fields. */
> + long completed; /* Number of last completed batch. */
> + int waitlistcount;
> + struct tasklet_struct rcu_tasklet;
> + struct rcu_head *nextlist;
> + struct rcu_head **nexttail;
> + struct rcu_head *waitlist[GP_STAGES];
> + struct rcu_head **waittail[GP_STAGES];
> + struct rcu_head *donelist;
> + struct rcu_head **donetail;
> +#ifdef CONFIG_RCU_TRACE
> + struct rcupreempt_trace trace;
> +#endif /* #ifdef CONFIG_RCU_TRACE */
> +};
> +struct rcu_ctrlblk {
> + spinlock_t fliplock; /* Protect state-machine transitions. */
> + long completed; /* Number of last completed batch. */
> +};
> +static DEFINE_PER_CPU(struct rcu_data, rcu_data);
> +static struct rcu_ctrlblk rcu_ctrlblk = {
> + .fliplock = SPIN_LOCK_UNLOCKED,
> + .completed = 0,
> +};
> +static DEFINE_PER_CPU(int [2], rcu_flipctr) = { 0, 0 };
> +
> +/*
> + * States for rcu_try_flip() and friends.
> + */
> +
> +enum rcu_try_flip_states {
> + rcu_try_flip_idle_state, /* "I" */
> + rcu_try_flip_waitack_state, /* "A" */
> + rcu_try_flip_waitzero_state, /* "Z" */
> + rcu_try_flip_waitmb_state /* "M" */
> +};
> +static enum rcu_try_flip_states rcu_try_flip_state = rcu_try_flip_idle_state;
> +#ifdef CONFIG_RCU_TRACE
> +static char *rcu_try_flip_state_names[] =
> + { "idle", "waitack", "waitzero", "waitmb" };
> +#endif /* #ifdef CONFIG_RCU_TRACE */

[snip]

> +/*
> + * If a global counter flip has occurred since the last time that we
> + * advanced callbacks, advance them. Hardware interrupts must be
> + * disabled when calling this function.
> + */
> +static void __rcu_advance_callbacks(struct rcu_data *rdp)
> +{
> + int cpu;
> + int i;
> + int wlc = 0;
> +
> + if (rdp->completed != rcu_ctrlblk.completed) {
> + if (rdp->waitlist[GP_STAGES - 1] != NULL) {
> + *rdp->donetail = rdp->waitlist[GP_STAGES - 1];
> + rdp->donetail = rdp->waittail[GP_STAGES - 1];
> + RCU_TRACE_RDP(rcupreempt_trace_move2done, rdp);
> + }
> + for (i = GP_STAGES - 2; i >= 0; i--) {
> + if (rdp->waitlist[i] != NULL) {
> + rdp->waitlist[i + 1] = rdp->waitlist[i];
> + rdp->waittail[i + 1] = rdp->waittail[i];
> + wlc++;
> + } else {
> + rdp->waitlist[i + 1] = NULL;
> + rdp->waittail[i + 1] =
> + &rdp->waitlist[i + 1];
> + }
> + }
> + if (rdp->nextlist != NULL) {
> + rdp->waitlist[0] = rdp->nextlist;
> + rdp->waittail[0] = rdp->nexttail;
> + wlc++;
> + rdp->nextlist = NULL;
> + rdp->nexttail = &rdp->nextlist;
> + RCU_TRACE_RDP(rcupreempt_trace_move2wait, rdp);
> + } else {
> + rdp->waitlist[0] = NULL;
> + rdp->waittail[0] = &rdp->waitlist[0];
> + }
> + rdp->waitlistcount = wlc;
> + rdp->completed = rcu_ctrlblk.completed;
> + }
> +
> + /*
> + * Check to see if this CPU needs to report that it has seen
> + * the most recent counter flip, thereby declaring that all
> + * subsequent rcu_read_lock() invocations will respect this flip.
> + */
> +
> + cpu = raw_smp_processor_id();
> + if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped) {
> + smp_mb(); /* Subsequent counter accesses must see new value */
> + per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;
> + smp_mb(); /* Subsequent RCU read-side critical sections */
> + /* seen -after- acknowledgement. */
> + }
> +}

[snip]

> +/*
> + * Attempt a single flip of the counters. Remember, a single flip does
> + * -not- constitute a grace period. Instead, the interval between
> + * at least three consecutive flips is a grace period.
> + *
> + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
> + * on a large SMP, they might want to use a hierarchical organization of
> + * the per-CPU-counter pairs.
> + */
> +static void rcu_try_flip(void)
> +{
> + unsigned long oldirq;
> +
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_1);
> + if (unlikely(!spin_trylock_irqsave(&rcu_ctrlblk.fliplock, oldirq))) {
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_e1);
> + return;
> + }
> +
> + /*
> + * Take the next transition(s) through the RCU grace-period
> + * flip-counter state machine.
> + */
> +
> + switch (rcu_try_flip_state) {
> + case rcu_try_flip_idle_state:
> + if (rcu_try_flip_idle())
> + rcu_try_flip_state = rcu_try_flip_waitack_state;
> + break;
> + case rcu_try_flip_waitack_state:
> + if (rcu_try_flip_waitack())
> + rcu_try_flip_state = rcu_try_flip_waitzero_state;
> + break;
> + case rcu_try_flip_waitzero_state:
> + if (rcu_try_flip_waitzero())
> + rcu_try_flip_state = rcu_try_flip_waitmb_state;
> + break;
> + case rcu_try_flip_waitmb_state:
> + if (rcu_try_flip_waitmb())
> + rcu_try_flip_state = rcu_try_flip_idle_state;
> + }
> + spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
> +}

Paul,

Looking further into this, I still think this is a bit of overkill. We
go through 20 states from call_rcu to list->func().

On call_rcu we put our stuff on the next list. Before we move stuff from
next to wait, we need to go through 4 states. So we have

next -> 4 states -> wait[0] -> 4 states -> wait[1] -> 4 states ->
wait[2] -> 4 states -> wait[3] -> 4 states -> done.

That's 20 states that we go through from the time we add our function to
the list to the time it actually gets called. Do we really need the 4
wait lists?

Seems a bit overkill to me.

What am I missing?

-- Steve

2007-09-21 15:47:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Fri, 21 Sep 2007 10:40:03 -0400 Steven Rostedt <[email protected]>
wrote:

> On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:


> Can you have a pointer somewhere that explains these states. And not a
> "it's in this paper or directory". Either have a short discription here,
> or specify where exactly to find the information (perhaps a
> Documentation/RCU/preemptible_states.txt?).
>
> Trying to understand these states has caused me the most agony in
> reviewing these patches.
>
> > + */
> > +
> > +enum rcu_try_flip_states {
> > + rcu_try_flip_idle_state, /* "I" */
> > + rcu_try_flip_waitack_state, /* "A" */
> > + rcu_try_flip_waitzero_state, /* "Z" */
> > + rcu_try_flip_waitmb_state /* "M" */
> > +};

I thought the 4 flip states corresponded to the 4 GP stages, but now
you confused me. It seems to indeed progress one stage for every 4 flip
states.

Hmm, now I have to puzzle how these 4 stages are required by the lock
and unlock magic.

> > +/*
> > + * Return the number of RCU batches processed thus far. Useful for debug
> > + * and statistics. The _bh variant is identical to straight RCU.
> > + */
>
> If they are identical, then why the separation?

I guess a smaller RCU domain makes for quicker grace periods.

> > +void __rcu_read_lock(void)
> > +{
> > + int idx;
> > + struct task_struct *me = current;
>
> Nitpick, but other places in the kernel usually use "t" or "p" as a
> variable to assign current to. It's just that "me" thows me off a
> little while reviewing this. But this is just a nitpick, so do as you
> will.

struct task_struct *curr = current;

is also not uncommon.

> > + int nesting;
> > +
> > + nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
> > + if (nesting != 0) {
> > +
> > + /* An earlier rcu_read_lock() covers us, just count it. */
> > +
> > + me->rcu_read_lock_nesting = nesting + 1;
> > +
> > + } else {
> > + unsigned long oldirq;
>
> > +
> > + /*
> > + * Disable local interrupts to prevent the grace-period
> > + * detection state machine from seeing us half-done.
> > + * NMIs can still occur, of course, and might themselves
> > + * contain rcu_read_lock().
> > + */
> > +
> > + local_irq_save(oldirq);
>
> Isn't the GP detection done via a tasklet/softirq. So wouldn't a
> local_bh_disable be sufficient here? You already cover NMIs, which would
> also handle normal interrupts.

This is also my understanding, but I think this disable is an
'optimization' in that it avoids the regular IRQs from jumping through
these hoops outlined below.

> > +
> > + /*
> > + * Outermost nesting of rcu_read_lock(), so increment
> > + * the current counter for the current CPU. Use volatile
> > + * casts to prevent the compiler from reordering.
> > + */
> > +
> > + idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
> > + smp_read_barrier_depends(); /* @@@@ might be unneeded */
> > + ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++;
> > +
> > + /*
> > + * Now that the per-CPU counter has been incremented, we
> > + * are protected from races with rcu_read_lock() invoked
> > + * from NMI handlers on this CPU. We can therefore safely
> > + * increment the nesting counter, relieving further NMIs
> > + * of the need to increment the per-CPU counter.
> > + */
> > +
> > + ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1;
> > +
> > + /*
> > + * Now that we have preventing any NMIs from storing
> > + * to the ->rcu_flipctr_idx, we can safely use it to
> > + * remember which counter to decrement in the matching
> > + * rcu_read_unlock().
> > + */
> > +
> > + ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx;
> > + local_irq_restore(oldirq);
> > + }
> > +}

> > +/*
> > + * Attempt a single flip of the counters. Remember, a single flip does
> > + * -not- constitute a grace period. Instead, the interval between
> > + * at least three consecutive flips is a grace period.
> > + *
> > + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
>
> Oh, come now! It's not "nuts" to use this ;-)
>
> > + * on a large SMP, they might want to use a hierarchical organization of
> > + * the per-CPU-counter pairs.
> > + */

It's the large SMP case that's nuts, and on that I have to agree with
Paul: it's not really large-SMP friendly.


2007-09-21 22:06:51

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Fri, Sep 21, 2007 at 05:46:53PM +0200, Peter Zijlstra wrote:
> On Fri, 21 Sep 2007 10:40:03 -0400 Steven Rostedt <[email protected]>
> wrote:
>
> > On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
>
> > Can you have a pointer somewhere that explains these states. And not a
> > "it's in this paper or directory". Either have a short discription here,
> > or specify where exactly to find the information (perhaps a
> > Documentation/RCU/preemptible_states.txt?).
> >
> > Trying to understand these states has caused me the most agony in
> > reviewing these patches.
> >
> > > + */
> > > +
> > > +enum rcu_try_flip_states {
> > > + rcu_try_flip_idle_state, /* "I" */
> > > + rcu_try_flip_waitack_state, /* "A" */
> > > + rcu_try_flip_waitzero_state, /* "Z" */
> > > + rcu_try_flip_waitmb_state /* "M" */
> > > +};
>
> I thought the 4 flip states corresponded to the 4 GP stages, but now
> you confused me. It seems to indeed progress one stage for every 4 flip
> states.

Yes, four flip states per stage, and four stages per grace period:

rcu_try_flip_idle_state: Stay here if nothing is happening.
Flip the counter if something starts
happening.

rcu_try_flip_waitack_state: Wait here for all CPUs to notice that
the counter has flipped. This prevents
the old set of counters from ever being
incremented once we leave this state,
which in turn is necessary because we
cannot test any individual counter for
zero -- we can only check the sum.

rcu_try_flip_waitzero_state: Wait here for the sum of the old
per-CPU counters to reach zero.

rcu_try_flip_waitmb_state: Wait here for each of the other
CPUs to execute a memory barrier.
This is necessary to ensure that
these other CPUs really have finished
executing their RCU read-side
critical sections, despite their
CPUs wildly reordering memory.

> Hmm, now I have to puzzle how these 4 stages are required by the lock
> and unlock magic.

They are needed to allow memory barriers to be removed from
rcu_read_lock(), to allow for the fact that the stages mean that
grace-period boundaries are now fuzzy, as well as to account for the
fact that each CPU now has its own callback queue. (Aside: classic RCU
doesn't have memory barriers -- or anything else -- in rcu_read_lock()
and rcu_read_unlock(), but from an implementation viewpoint, the read-side
critical section extends forwards and backwards to the next/previous
context switches, which do have lots of memory barriers.)

Taking the reasons for stages one at a time:

1. The first stage is needed for generic RCU. Everyone must
complete any pre-existing RCU read-side critical sections
before the grace period can be permitted to complete. This
stage corresponds to the "waitlist" and "waittail" in earlier
RCU implementations.

2. An additional stage is needed due to the fact that memory reordering
can cause an RCU read-side critical section's rcu_dereference()
to be executed before the rcu_read_lock() -- no memory barriers!

This can result in the following sequence of events:

o rcu_dereference() from CPU 0.

o list_del_rcu() from CPU 1.

o call_rcu() from CPU 1.

o counter flip from CPU 2.

o rcu_read_lock() from CPU 0 (delayed due to memory reordering).

CPU 0 increments the new counter, not the old one that
its rcu_dereference() is associated with.

o All CPUs acknowledge the counter flip.

o CPU 2 discovers that the old counters sum to zero (because
CPU 0 has incremented its new counter).

o All CPUs do their memory barriers.

o CPU 1's callback is invoked, destroying the memory still
being used by CPU 0.

An additional pass through the state machine fixes this problem.

Note that the read-after-write misordering can happen on x86,
as well as on s390.

3. An additional stage is also needed due to the fact that each CPU
has its own set of callback queues. This means that callbacks
that were registered quite some time after the most recent counter
flip can make it into the current round of the state machine,
as follows:

o CPU 0 flips the counter.

o CPU 1 does an rcu_read_lock() followed by an rcu_dereference().
CPU 1 of course uses the new value of the counter.

o CPU 2 does a list_del_rcu() on the element that CPU 1
did the rcu_dereference on.

o CPU 2 does a call_rcu(), placing the callback on
its local queue.

o CPU 2 gets its scheduling-clock interrupt, and acknowledges
the counter flip, also moving its "next" list (which
contains the new callback) to the "wait" list.

o The old counters sum to zero, and everyone does a memory
barrier, and the callback is invoked. Too bad that CPU 0
is still using the data element just freed!

(And yes, I do understand that this scenario is incompatible
with the previous scenario. But the point of these scenarios
is to demonstrate that a given effect can result in failure,
not to show all possible misfortunes that the effect can cause.
If you wish to try to convince me that the two causes underlying
these symptoms cannot be combined to create a failure spanning
two full flip intervals, then the task before you is to prove
that there is no such scenario.)

This scenario can happen even on sequentially consistent
machines.

4. An additional stage is needed due to the fact that different
CPUs can disagree as to when the counter flip happened. This
effect is in some way similar to #3, but can (and did) happen
even when there is only one global callback queue.

I believe that this can happen even on sequentially consistent
machines, but am not 100% certain of this.

So, four mechanisms for going bad, mapped to the four stages of the grace-period
pipeline. It is quite possible that a more-complex state machine would
require fewer stages, but then again it might not end up being any faster
overall. :-/ It is also quite possible that one or another of the
additional stages covers more than one of the above mechanisms. So,
we know just from the above scenarios that at least two stages are
required, and the most that they can possibly require is four stages.
I have looked very hard for other underlying sources of badness, and
am quite sure that I have them covered (famous last words).

I will rerun with GP_STAGES==3 on POWER to double-check -- it is entirely
possible that the last such run preceded some bug removal.

On the other hand, there are most definitely relatively straightforward
ways to speed things up on uniprocessor machines, for example,
skipping the acknowledgement and memory-barrier states, short-circuiting
synchronize_rcu() and friends, and so on. But I need to avoid premature
optimization here -- one thing at a time.
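
To make scenario #2 above a bit more concrete, here is roughly what the
read side looks like from the CPU's point of view. This is only an
illustrative sketch -- "gp", "struct foo", and do_something_with() are
made-up names, and the comments paraphrase the reordering described in
scenario #2:

	struct foo *p;

	rcu_read_lock();	 /* increments rcu_flipctr[idx]; no smp_mb() */
	p = rcu_dereference(gp); /* the CPU may perform this load before the
				  * counter increment above is globally
				  * visible; if a counter flip intervenes,
				  * the increment lands on the post-flip
				  * counter while p references pre-flip
				  * data, so the old counters can sum to
				  * zero while p is still in use below */
	do_something_with(p);
	rcu_read_unlock();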

> > > +/*
> > > + * Return the number of RCU batches processed thus far. Useful for debug
> > > + * and statistics. The _bh variant is identical to straight RCU.
> > > + */
> >
> > If they are identical, then why the separation?
>
> I guess a smaller RCU domain makes for quicker grace periods.

We have to remain API-compatible with classic RCU, where the separate
grace-period types are absolutely required if the networking stack is
to hold up against certain types of DOS attacks. Networking uses
call_rcu_bh(), which has quiescent states at any point in the code
where interrupts are enabled.

It might well be possible to fuse the _bh variant into the classic variant
for CONFIG_PREEMPT (as opposed to CONFIG_PREEMPT_RT) kernels, but as far
as I know, no one has tried this. And there is little motivation to try,
from what I can see.
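
(For reference, the networking-style usage that makes us keep a separate
_bh interface looks roughly like the following. This is only a sketch:
the structure, list, and callback names are made up, but call_rcu_bh()
itself is the real API being discussed.)

	struct conn_ent {			/* hypothetical element */
		struct list_head list;
		struct rcu_head rcu;
	};

	static void conn_ent_free_rcu(struct rcu_head *head)
	{
		kfree(container_of(head, struct conn_ent, rcu));
	}

	static void conn_ent_del(struct conn_ent *ent)
	{
		list_del_rcu(&ent->list);	/* readers run in softirq context */
		call_rcu_bh(&ent->rcu, conn_ent_free_rcu);
	}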

> > > +void __rcu_read_lock(void)
> > > +{
> > > + int idx;
> > > + struct task_struct *me = current;
> >
> > Nitpick, but other places in the kernel usually use "t" or "p" as a
> > variable to assign current to. It's just that "me" thows me off a
> > little while reviewing this. But this is just a nitpick, so do as you
> > will.
>
> struct task_struct *curr = current;
>
> is also not uncommon.

OK, good point either way.

> > > + int nesting;
> > > +
> > > + nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
> > > + if (nesting != 0) {
> > > +
> > > + /* An earlier rcu_read_lock() covers us, just count it. */
> > > +
> > > + me->rcu_read_lock_nesting = nesting + 1;
> > > +
> > > + } else {
> > > + unsigned long oldirq;
> >
> > > +
> > > + /*
> > > + * Disable local interrupts to prevent the grace-period
> > > + * detection state machine from seeing us half-done.
> > > + * NMIs can still occur, of course, and might themselves
> > > + * contain rcu_read_lock().
> > > + */
> > > +
> > > + local_irq_save(oldirq);
> >
> > Isn't the GP detection done via a tasklet/softirq. So wouldn't a
> > local_bh_disable be sufficient here? You already cover NMIs, which would
> > also handle normal interrupts.
>
> This is also my understanding, but I think this disable is an
> 'optimization' in that it avoids the regular IRQs from jumping through
> these hoops outlined below.

Critical portions of the GP protection happen in the scheduler-clock
interrupt, which is a hardirq. For example, the .completed counter
is always incremented in hardirq context, and we cannot tolerate a
.completed increment in this code. Allowing such an increment would
defeat the counter-acknowledge state in the state machine.

There are ways to avoid the interrupt disabling, but all the ones
that I am aware of thus far have grace-period-latency consequences.
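
(The path in question, reading from the patch above, is roughly:

	scheduling-clock interrupt (hardirq)
	  -> rcu_check_callbacks()
	       -> rcu_try_flip()
	            -> rcu_try_flip_idle()
	                 -> rcu_ctrlblk.completed++;
	       -> __rcu_advance_callbacks()
	            -> per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;

so with only BH disabled, that increment and acknowledgement could slip
into the middle of __rcu_read_lock(), between its read of .completed and
its increment of rcu_flipctr[idx].)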

> > > +
> > > + /*
> > > + * Outermost nesting of rcu_read_lock(), so increment
> > > + * the current counter for the current CPU. Use volatile
> > > + * casts to prevent the compiler from reordering.
> > > + */
> > > +
> > > + idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
> > > + smp_read_barrier_depends(); /* @@@@ might be unneeded */
> > > + ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++;
> > > +
> > > + /*
> > > + * Now that the per-CPU counter has been incremented, we
> > > + * are protected from races with rcu_read_lock() invoked
> > > + * from NMI handlers on this CPU. We can therefore safely
> > > + * increment the nesting counter, relieving further NMIs
> > > + * of the need to increment the per-CPU counter.
> > > + */
> > > +
> > > + ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1;
> > > +
> > > + /*
> > > + * Now that we have preventing any NMIs from storing
> > > + * to the ->rcu_flipctr_idx, we can safely use it to
> > > + * remember which counter to decrement in the matching
> > > + * rcu_read_unlock().
> > > + */
> > > +
> > > + ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx;
> > > + local_irq_restore(oldirq);
> > > + }
> > > +}
>
> > > +/*
> > > + * Attempt a single flip of the counters. Remember, a single flip does
> > > + * -not- constitute a grace period. Instead, the interval between
> > > + * at least three consecutive flips is a grace period.
> > > + *
> > > + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
> >
> > Oh, come now! It's not "nuts" to use this ;-)

Hey!!! If it wasn't nuts, I probably wouldn't be working on it!!! ;-)

> > > + * on a large SMP, they might want to use a hierarchical organization of
> > > + * the per-CPU-counter pairs.
> > > + */
>
> Its the large SMP case that's nuts, and on that I have to agree with
> Paul, its not really large SMP friendly.

Yeah, the current implementation of synchronize_sched() is a really bad
idea on a 4096-CPU system, and probably much smaller. The base RCU
implementation is now probably good to several tens of CPUs, perhaps
even to a couple hundred. In contrast, the implementation currently
in -rt might start choking pretty badly much earlier due to the single
global callback queue.

Thanx, Paul

2007-09-21 22:31:32

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Fri, Sep 21, 2007 at 05:46:53PM +0200, Peter Zijlstra wrote:
> On Fri, 21 Sep 2007 10:40:03 -0400 Steven Rostedt <[email protected]>
> wrote:
>
> > On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
>
>
> > Can you have a pointer somewhere that explains these states. And not a
> > "it's in this paper or directory". Either have a short discription here,
> > or specify where exactly to find the information (perhaps a
> > Documentation/RCU/preemptible_states.txt?).
> >
> > Trying to understand these states has caused me the most agony in
> > reviewing these patches.
> >
> > > + */
> > > +
> > > +enum rcu_try_flip_states {
> > > + rcu_try_flip_idle_state, /* "I" */
> > > + rcu_try_flip_waitack_state, /* "A" */
> > > + rcu_try_flip_waitzero_state, /* "Z" */
> > > + rcu_try_flip_waitmb_state /* "M" */
> > > +};
>
> I thought the 4 flip states corresponded to the 4 GP stages, but now
> you confused me. It seems to indeed progress one stage for every 4 flip
> states.

I'm still confused ;-)

>
> Hmm, now I have to puzzle how these 4 stages are required by the lock
> and unlock magic.
>
> > > +/*
> > > + * Return the number of RCU batches processed thus far. Useful for debug
> > > + * and statistics. The _bh variant is identical to straight RCU.
> > > + */
> >
> > If they are identical, then why the separation?
>
> I guess a smaller RCU domain makes for quicker grace periods.

No, I mean that both the rcu_batches_completed and
rcu_batches_completed_bh are identical. Perhaps we can just put in a

#define rcu_batches_completed_bh rcu_batches_completed

in rcupreempt.h. In rcuclassic, they are different. But no need to have
two identical functions in the preempt version. A macro should do.
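
Something like this in rcupreempt.h, say (just a sketch of the
suggestion, not compiled):

	extern long rcu_batches_completed(void);
	#define rcu_batches_completed_bh	rcu_batches_completed

while rcuclassic.h keeps its two separate declarations.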

>
> > > +void __rcu_read_lock(void)
> > > +{
> > > + int idx;
> > > + struct task_struct *me = current;
> >
> > Nitpick, but other places in the kernel usually use "t" or "p" as a
> > variable to assign current to. It's just that "me" thows me off a
> > little while reviewing this. But this is just a nitpick, so do as you
> > will.
>
> struct task_struct *curr = current;
>
> is also not uncommon.

True, but the "me" confused me. Since that task struct is not me ;-)

>
> > > + int nesting;
> > > +
> > > + nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
> > > + if (nesting != 0) {
> > > +
> > > + /* An earlier rcu_read_lock() covers us, just count it. */
> > > +
> > > + me->rcu_read_lock_nesting = nesting + 1;
> > > +
> > > + } else {
> > > + unsigned long oldirq;
> >
> > > +
> > > + /*
> > > + * Disable local interrupts to prevent the grace-period
> > > + * detection state machine from seeing us half-done.
> > > + * NMIs can still occur, of course, and might themselves
> > > + * contain rcu_read_lock().
> > > + */
> > > +
> > > + local_irq_save(oldirq);
> >
> > Isn't the GP detection done via a tasklet/softirq. So wouldn't a
> > local_bh_disable be sufficient here? You already cover NMIs, which would
> > also handle normal interrupts.
>
> This is also my understanding, but I think this disable is an
> 'optimization' in that it avoids the regular IRQs from jumping through
> these hoops outlined below.

But isn't disabling irqs slower than doing a local_bh_disable? So the
majority of times (where irqs will not happen) we have this overhead.

>
> > > +
> > > + /*
> > > + * Outermost nesting of rcu_read_lock(), so increment
> > > + * the current counter for the current CPU. Use volatile
> > > + * casts to prevent the compiler from reordering.
> > > + */
> > > +
> > > + idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
> > > + smp_read_barrier_depends(); /* @@@@ might be unneeded */
> > > + ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++;
> > > +
> > > + /*
> > > + * Now that the per-CPU counter has been incremented, we
> > > + * are protected from races with rcu_read_lock() invoked
> > > + * from NMI handlers on this CPU. We can therefore safely
> > > + * increment the nesting counter, relieving further NMIs
> > > + * of the need to increment the per-CPU counter.
> > > + */
> > > +
> > > + ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1;
> > > +
> > > + /*
> > > + * Now that we have preventing any NMIs from storing
> > > + * to the ->rcu_flipctr_idx, we can safely use it to
> > > + * remember which counter to decrement in the matching
> > > + * rcu_read_unlock().
> > > + */
> > > +
> > > + ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx;
> > > + local_irq_restore(oldirq);
> > > + }
> > > +}
>
> > > +/*
> > > + * Attempt a single flip of the counters. Remember, a single flip does
> > > + * -not- constitute a grace period. Instead, the interval between
> > > + * at least three consecutive flips is a grace period.
> > > + *
> > > + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
> >
> > Oh, come now! It's not "nuts" to use this ;-)
> >
> > > + * on a large SMP, they might want to use a hierarchical organization of
> > > + * the per-CPU-counter pairs.
> > > + */
>
> Its the large SMP case that's nuts, and on that I have to agree with
> Paul, its not really large SMP friendly.

Hmm, that could be true. But on large SMP systems, you usually have
large amounts of memory, so hopefully a really long synchronize_rcu
would not be a problem.

-- Steve

2007-09-21 22:57:04

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Fri, Sep 21, 2007 at 06:31:12PM -0400, Steven Rostedt wrote:
> On Fri, Sep 21, 2007 at 05:46:53PM +0200, Peter Zijlstra wrote:
> > On Fri, 21 Sep 2007 10:40:03 -0400 Steven Rostedt <[email protected]>
> > wrote:
> >
> > > On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
> >
> >
> > > Can you have a pointer somewhere that explains these states. And not a
> > > "it's in this paper or directory". Either have a short discription here,
> > > or specify where exactly to find the information (perhaps a
> > > Documentation/RCU/preemptible_states.txt?).
> > >
> > > Trying to understand these states has caused me the most agony in
> > > reviewing these patches.
> > >
> > > > + */
> > > > +
> > > > +enum rcu_try_flip_states {
> > > > + rcu_try_flip_idle_state, /* "I" */
> > > > + rcu_try_flip_waitack_state, /* "A" */
> > > > + rcu_try_flip_waitzero_state, /* "Z" */
> > > > + rcu_try_flip_waitmb_state /* "M" */
> > > > +};
> >
> > I thought the 4 flip states corresponded to the 4 GP stages, but now
> > you confused me. It seems to indeed progress one stage for every 4 flip
> > states.
>
> I'm still confused ;-)

If you do a synchronize_rcu() it might well have to wait through the
following sequence of states:

Stage 0: (might have to wait through part of this to get out of "next" queue)
rcu_try_flip_idle_state, /* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state, /* "Z" */
rcu_try_flip_waitmb_state /* "M" */
Stage 1:
rcu_try_flip_idle_state, /* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state, /* "Z" */
rcu_try_flip_waitmb_state /* "M" */
Stage 2:
rcu_try_flip_idle_state, /* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state, /* "Z" */
rcu_try_flip_waitmb_state /* "M" */
Stage 3:
rcu_try_flip_idle_state, /* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state, /* "Z" */
rcu_try_flip_waitmb_state /* "M" */
Stage 4:
rcu_try_flip_idle_state, /* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state, /* "Z" */
rcu_try_flip_waitmb_state /* "M" */

So yes, grace periods do indeed have some latency.

> > Hmm, now I have to puzzle how these 4 stages are required by the lock
> > and unlock magic.
> >
> > > > +/*
> > > > + * Return the number of RCU batches processed thus far. Useful for debug
> > > > + * and statistics. The _bh variant is identical to straight RCU.
> > > > + */
> > >
> > > If they are identical, then why the separation?
> >
> > I guess a smaller RCU domain makes for quicker grace periods.
>
> No, I mean that both the rcu_batches_completed and
> rcu_batches_completed_bh are identical. Perhaps we can just put in a
>
> #define rcu_batches_completed_bh rcu_batches_completed
>
> in rcupreempt.h. In rcuclassic, they are different. But no need to have
> two identical functions in the preempt version. A macro should do.

Ah!!! Good point, #define does make sense here.

> > > > +void __rcu_read_lock(void)
> > > > +{
> > > > + int idx;
> > > > + struct task_struct *me = current;
> > >
> > > Nitpick, but other places in the kernel usually use "t" or "p" as a
> > > variable to assign current to. It's just that "me" thows me off a
> > > little while reviewing this. But this is just a nitpick, so do as you
> > > will.
> >
> > struct task_struct *curr = current;
> >
> > is also not uncommon.
>
> True, but the "me" confused me. Since that task struct is not me ;-)

Well, who is it, then? ;-)

> > > > + int nesting;
> > > > +
> > > > + nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
> > > > + if (nesting != 0) {
> > > > +
> > > > + /* An earlier rcu_read_lock() covers us, just count it. */
> > > > +
> > > > + me->rcu_read_lock_nesting = nesting + 1;
> > > > +
> > > > + } else {
> > > > + unsigned long oldirq;
> > >
> > > > +
> > > > + /*
> > > > + * Disable local interrupts to prevent the grace-period
> > > > + * detection state machine from seeing us half-done.
> > > > + * NMIs can still occur, of course, and might themselves
> > > > + * contain rcu_read_lock().
> > > > + */
> > > > +
> > > > + local_irq_save(oldirq);
> > >
> > > Isn't the GP detection done via a tasklet/softirq. So wouldn't a
> > > local_bh_disable be sufficient here? You already cover NMIs, which would
> > > also handle normal interrupts.
> >
> > This is also my understanding, but I think this disable is an
> > 'optimization' in that it avoids the regular IRQs from jumping through
> > these hoops outlined below.
>
> But isn't disabling irqs slower than doing a local_bh_disable? So the
> majority of times (where irqs will not happen) we have this overhead.

The current code absolutely must exclude the scheduling-clock hardirq
handler.

> > > > + /*
> > > > + * Outermost nesting of rcu_read_lock(), so increment
> > > > + * the current counter for the current CPU. Use volatile
> > > > + * casts to prevent the compiler from reordering.
> > > > + */
> > > > +
> > > > + idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
> > > > + smp_read_barrier_depends(); /* @@@@ might be unneeded */
> > > > + ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++;
> > > > +
> > > > + /*
> > > > + * Now that the per-CPU counter has been incremented, we
> > > > + * are protected from races with rcu_read_lock() invoked
> > > > + * from NMI handlers on this CPU. We can therefore safely
> > > > + * increment the nesting counter, relieving further NMIs
> > > > + * of the need to increment the per-CPU counter.
> > > > + */
> > > > +
> > > > + ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1;
> > > > +
> > > > + /*
> > > > + * Now that we have preventing any NMIs from storing
> > > > + * to the ->rcu_flipctr_idx, we can safely use it to
> > > > + * remember which counter to decrement in the matching
> > > > + * rcu_read_unlock().
> > > > + */
> > > > +
> > > > + ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx;
> > > > + local_irq_restore(oldirq);
> > > > + }
> > > > +}
> >
> > > > +/*
> > > > + * Attempt a single flip of the counters. Remember, a single flip does
> > > > + * -not- constitute a grace period. Instead, the interval between
> > > > + * at least three consecutive flips is a grace period.
> > > > + *
> > > > + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
> > >
> > > Oh, come now! It's not "nuts" to use this ;-)
> > >
> > > > + * on a large SMP, they might want to use a hierarchical organization of
> > > > + * the per-CPU-counter pairs.
> > > > + */
> >
> > Its the large SMP case that's nuts, and on that I have to agree with
> > Paul, its not really large SMP friendly.
>
> Hmm, that could be true. But on large SMP systems, you usually have a
> large amounts of memory, so hopefully a really long synchronize_rcu
> would not be a problem.

Somewhere in the range from 64 to a few hundred CPUs, the global lock
protecting the try_flip state machine would start sucking air pretty
badly. But the real problem is synchronize_sched(), which loops through
all the CPUs -- this would likely cause problems at a few tens of
CPUs, perhaps as early as 10-20.

Thanx, Paul

2007-09-21 23:04:12

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Fri, Sep 21, 2007 at 11:20:48AM -0400, Steven Rostedt wrote:
> On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
> > +
> > +/*
> > + * PREEMPT_RCU data structures.
> > + */
> > +
> > +#define GP_STAGES 4
> > +struct rcu_data {
> > + spinlock_t lock; /* Protect rcu_data fields. */
> > + long completed; /* Number of last completed batch. */
> > + int waitlistcount;
> > + struct tasklet_struct rcu_tasklet;
> > + struct rcu_head *nextlist;
> > + struct rcu_head **nexttail;
> > + struct rcu_head *waitlist[GP_STAGES];
> > + struct rcu_head **waittail[GP_STAGES];
> > + struct rcu_head *donelist;
> > + struct rcu_head **donetail;
> > +#ifdef CONFIG_RCU_TRACE
> > + struct rcupreempt_trace trace;
> > +#endif /* #ifdef CONFIG_RCU_TRACE */
> > +};
> > +struct rcu_ctrlblk {
> > + spinlock_t fliplock; /* Protect state-machine transitions. */
> > + long completed; /* Number of last completed batch. */
> > +};
> > +static DEFINE_PER_CPU(struct rcu_data, rcu_data);
> > +static struct rcu_ctrlblk rcu_ctrlblk = {
> > + .fliplock = SPIN_LOCK_UNLOCKED,
> > + .completed = 0,
> > +};
> > +static DEFINE_PER_CPU(int [2], rcu_flipctr) = { 0, 0 };
> > +
> > +/*
> > + * States for rcu_try_flip() and friends.
> > + */
> > +
> > +enum rcu_try_flip_states {
> > + rcu_try_flip_idle_state, /* "I" */
> > + rcu_try_flip_waitack_state, /* "A" */
> > + rcu_try_flip_waitzero_state, /* "Z" */
> > + rcu_try_flip_waitmb_state /* "M" */
> > +};
> > +static enum rcu_try_flip_states rcu_try_flip_state = rcu_try_flip_idle_state;
> > +#ifdef CONFIG_RCU_TRACE
> > +static char *rcu_try_flip_state_names[] =
> > + { "idle", "waitack", "waitzero", "waitmb" };
> > +#endif /* #ifdef CONFIG_RCU_TRACE */
>
> [snip]
>
> > +/*
> > + * If a global counter flip has occurred since the last time that we
> > + * advanced callbacks, advance them. Hardware interrupts must be
> > + * disabled when calling this function.
> > + */
> > +static void __rcu_advance_callbacks(struct rcu_data *rdp)
> > +{
> > + int cpu;
> > + int i;
> > + int wlc = 0;
> > +
> > + if (rdp->completed != rcu_ctrlblk.completed) {
> > + if (rdp->waitlist[GP_STAGES - 1] != NULL) {
> > + *rdp->donetail = rdp->waitlist[GP_STAGES - 1];
> > + rdp->donetail = rdp->waittail[GP_STAGES - 1];
> > + RCU_TRACE_RDP(rcupreempt_trace_move2done, rdp);
> > + }
> > + for (i = GP_STAGES - 2; i >= 0; i--) {
> > + if (rdp->waitlist[i] != NULL) {
> > + rdp->waitlist[i + 1] = rdp->waitlist[i];
> > + rdp->waittail[i + 1] = rdp->waittail[i];
> > + wlc++;
> > + } else {
> > + rdp->waitlist[i + 1] = NULL;
> > + rdp->waittail[i + 1] =
> > + &rdp->waitlist[i + 1];
> > + }
> > + }
> > + if (rdp->nextlist != NULL) {
> > + rdp->waitlist[0] = rdp->nextlist;
> > + rdp->waittail[0] = rdp->nexttail;
> > + wlc++;
> > + rdp->nextlist = NULL;
> > + rdp->nexttail = &rdp->nextlist;
> > + RCU_TRACE_RDP(rcupreempt_trace_move2wait, rdp);
> > + } else {
> > + rdp->waitlist[0] = NULL;
> > + rdp->waittail[0] = &rdp->waitlist[0];
> > + }
> > + rdp->waitlistcount = wlc;
> > + rdp->completed = rcu_ctrlblk.completed;
> > + }
> > +
> > + /*
> > + * Check to see if this CPU needs to report that it has seen
> > + * the most recent counter flip, thereby declaring that all
> > + * subsequent rcu_read_lock() invocations will respect this flip.
> > + */
> > +
> > + cpu = raw_smp_processor_id();
> > + if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped) {
> > + smp_mb(); /* Subsequent counter accesses must see new value */
> > + per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;
> > + smp_mb(); /* Subsequent RCU read-side critical sections */
> > + /* seen -after- acknowledgement. */
> > + }
> > +}
>
> [snip]
>
> > +/*
> > + * Attempt a single flip of the counters. Remember, a single flip does
> > + * -not- constitute a grace period. Instead, the interval between
> > + * at least three consecutive flips is a grace period.
> > + *
> > + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
> > + * on a large SMP, they might want to use a hierarchical organization of
> > + * the per-CPU-counter pairs.
> > + */
> > +static void rcu_try_flip(void)
> > +{
> > + unsigned long oldirq;
> > +
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_1);
> > + if (unlikely(!spin_trylock_irqsave(&rcu_ctrlblk.fliplock, oldirq))) {
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_e1);
> > + return;
> > + }
> > +
> > + /*
> > + * Take the next transition(s) through the RCU grace-period
> > + * flip-counter state machine.
> > + */
> > +
> > + switch (rcu_try_flip_state) {
> > + case rcu_try_flip_idle_state:
> > + if (rcu_try_flip_idle())
> > + rcu_try_flip_state = rcu_try_flip_waitack_state;
> > + break;
> > + case rcu_try_flip_waitack_state:
> > + if (rcu_try_flip_waitack())
> > + rcu_try_flip_state = rcu_try_flip_waitzero_state;
> > + break;
> > + case rcu_try_flip_waitzero_state:
> > + if (rcu_try_flip_waitzero())
> > + rcu_try_flip_state = rcu_try_flip_waitmb_state;
> > + break;
> > + case rcu_try_flip_waitmb_state:
> > + if (rcu_try_flip_waitmb())
> > + rcu_try_flip_state = rcu_try_flip_idle_state;
> > + }
> > + spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
> > +}
>
> Paul,
>
> Looking further into this, I still think this is a bit of overkill. We
> go through 20 states from call_rcu to list->func().
>
> On call_rcu we put our stuff on the next list. Before we move stuff from
> next to wait, we need to go through 4 states. So we have
>
> next -> 4 states -> wait[0] -> 4 states -> wait[1] -> 4 states ->
> wait[2] -> 4 states -> wait[3] -> 4 states -> done.
>
> That's 20 states that we go through from the time we add our function to
> the list to the time it actually gets called. Do we really need the 4
> wait lists?
>
> Seems a bit overkill to me.
>
> What am I missing?

"Nothing kills like overkill!!!" ;-)

Seriously, I do expect to be able to squeeze this down over time, but
feel the need to be a bit on the cowardly side at the moment.

In any case, I will be looking at the scenarios more carefully. If
it turns out that GP_STAGES can indeed be cranked down a bit, well,
that is an easy change! I just fired off a POWER run with GP_STAGES
set to 3, will let you know how it goes.
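
(For anyone experimenting along, the knob is just the compile-time
constant from the patch; an illustrative hunk would be:

	-#define GP_STAGES 4
	+#define GP_STAGES 3

though whether three stages actually suffice is exactly what the test is
meant to show.)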

Thanx, Paul

2007-09-21 23:23:25

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU



--
On Fri, 21 Sep 2007, Paul E. McKenney wrote:

>
> If you do a synchronize_rcu() it might well have to wait through the
> following sequence of states:
>
> Stage 0: (might have to wait through part of this to get out of "next" queue)
> rcu_try_flip_idle_state, /* "I" */
> rcu_try_flip_waitack_state, /* "A" */
> rcu_try_flip_waitzero_state, /* "Z" */
> rcu_try_flip_waitmb_state /* "M" */
> Stage 1:
> rcu_try_flip_idle_state, /* "I" */
> rcu_try_flip_waitack_state, /* "A" */
> rcu_try_flip_waitzero_state, /* "Z" */
> rcu_try_flip_waitmb_state /* "M" */
> Stage 2:
> rcu_try_flip_idle_state, /* "I" */
> rcu_try_flip_waitack_state, /* "A" */
> rcu_try_flip_waitzero_state, /* "Z" */
> rcu_try_flip_waitmb_state /* "M" */
> Stage 3:
> rcu_try_flip_idle_state, /* "I" */
> rcu_try_flip_waitack_state, /* "A" */
> rcu_try_flip_waitzero_state, /* "Z" */
> rcu_try_flip_waitmb_state /* "M" */
> Stage 4:
> rcu_try_flip_idle_state, /* "I" */
> rcu_try_flip_waitack_state, /* "A" */
> rcu_try_flip_waitzero_state, /* "Z" */
> rcu_try_flip_waitmb_state /* "M" */
>
> So yes, grace periods do indeed have some latency.

Yes they do. I'm now at the point that I'm just "trusting" you that you
understand that each of these stages is needed. My IQ level only lets me
understand next -> wait -> done, but not the extra 3 shifts in wait.

;-)

> >
> > True, but the "me" confused me. Since that task struct is not me ;-)
>
> Well, who is it, then? ;-)

It's the app I watch sitting there waiting its turn for its callback to
run.

> > > >
> > > > Isn't the GP detection done via a tasklet/softirq. So wouldn't a
> > > > local_bh_disable be sufficient here? You already cover NMIs, which would
> > > > also handle normal interrupts.
> > >
> > > This is also my understanding, but I think this disable is an
> > > 'optimization' in that it avoids the regular IRQs from jumping through
> > > these hoops outlined below.
> >
> > But isn't disabling irqs slower than doing a local_bh_disable? So the
> > majority of times (where irqs will not happen) we have this overhead.
>
> The current code absolutely must exclude the scheduling-clock hardirq
> handler.

ACKed,
The reasoning you gave in Peter's reply most certainly makes sense.


> > > > > + *
> > > > > + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
> > > >
> > > > Oh, come now! It's not "nuts" to use this ;-)
> > > >
> > > > > + * on a large SMP, they might want to use a hierarchical organization of
> > > > > + * the per-CPU-counter pairs.
> > > > > + */
> > >
> > > Its the large SMP case that's nuts, and on that I have to agree with
> > > Paul, its not really large SMP friendly.
> >
> > Hmm, that could be true. But on large SMP systems, you usually have a
> > large amounts of memory, so hopefully a really long synchronize_rcu
> > would not be a problem.
>
> Somewhere in the range from 64 to a few hundred CPUs, the global lock
> protecting the try_flip state machine would start sucking air pretty
> badly. But the real problem is synchronize_sched(), which loops through
> all the CPUs -- this would likely cause problems at a few tens of
> CPUs, perhaps as early as 10-20.

hehe, From someone whose largest box is 4 CPUs, to me 16 CPUs is large.
But I can see how hundreds, let alone thousands, of CPUs would bring
things like synchronize_sched to a grinding halt. God, imagine if all
CPUs did that at approximately the same time. The system would show
huge jitter.

-- Steve

2007-09-21 23:45:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Fri, Sep 21, 2007 at 07:23:09PM -0400, Steven Rostedt wrote:
> --
> On Fri, 21 Sep 2007, Paul E. McKenney wrote:
>
> > If you do a synchronize_rcu() it might well have to wait through the
> > following sequence of states:
> >
> > Stage 0: (might have to wait through part of this to get out of "next" queue)
> > rcu_try_flip_idle_state, /* "I" */
> > rcu_try_flip_waitack_state, /* "A" */
> > rcu_try_flip_waitzero_state, /* "Z" */
> > rcu_try_flip_waitmb_state /* "M" */
> > Stage 1:
> > rcu_try_flip_idle_state, /* "I" */
> > rcu_try_flip_waitack_state, /* "A" */
> > rcu_try_flip_waitzero_state, /* "Z" */
> > rcu_try_flip_waitmb_state /* "M" */
> > Stage 2:
> > rcu_try_flip_idle_state, /* "I" */
> > rcu_try_flip_waitack_state, /* "A" */
> > rcu_try_flip_waitzero_state, /* "Z" */
> > rcu_try_flip_waitmb_state /* "M" */
> > Stage 3:
> > rcu_try_flip_idle_state, /* "I" */
> > rcu_try_flip_waitack_state, /* "A" */
> > rcu_try_flip_waitzero_state, /* "Z" */
> > rcu_try_flip_waitmb_state /* "M" */
> > Stage 4:
> > rcu_try_flip_idle_state, /* "I" */
> > rcu_try_flip_waitack_state, /* "A" */
> > rcu_try_flip_waitzero_state, /* "Z" */
> > rcu_try_flip_waitmb_state /* "M" */
> >
> > So yes, grace periods do indeed have some latency.
>
> Yes they do. I'm now at the point that I'm just "trusting" you that you
> understand that each of these stages is needed. My IQ level only lets me
> understand next -> wait -> done, but not the extra 3 shifts in wait.
>
> ;-)

In the spirit of full disclosure, I am not -absolutely- certain that
they are needed, only that they are sufficient. Just color me paranoid.

> > > True, but the "me" confused me. Since that task struct is not me ;-)
> >
> > Well, who is it, then? ;-)
>
> It's the app I watch sitting there waiting its turn for its callback to
> run.

:-)

> > > > > Isn't the GP detection done via a tasklet/softirq. So wouldn't a
> > > > > local_bh_disable be sufficient here? You already cover NMIs, which would
> > > > > also handle normal interrupts.
> > > >
> > > > This is also my understanding, but I think this disable is an
> > > > 'optimization' in that it avoids the regular IRQs from jumping through
> > > > these hoops outlined below.
> > >
> > > But isn't disabling irqs slower than doing a local_bh_disable? So the
> > > majority of times (where irqs will not happen) we have this overhead.
> >
> > The current code absolutely must exclude the scheduling-clock hardirq
> > handler.
>
> ACKed,
> The reasoning you gave in Peter's reply most certainly makes sense.
>
> > > > > > + *
> > > > > > + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
> > > > >
> > > > > Oh, come now! It's not "nuts" to use this ;-)
> > > > >
> > > > > > + * on a large SMP, they might want to use a hierarchical organization of
> > > > > > + * the per-CPU-counter pairs.
> > > > > > + */
> > > >
> > > > Its the large SMP case that's nuts, and on that I have to agree with
> > > > Paul, its not really large SMP friendly.
> > >
> > > Hmm, that could be true. But on large SMP systems, you usually have a
> > > large amounts of memory, so hopefully a really long synchronize_rcu
> > > would not be a problem.
> >
> > Somewhere in the range from 64 to a few hundred CPUs, the global lock
> > protecting the try_flip state machine would start sucking air pretty
> > badly. But the real problem is synchronize_sched(), which loops through
> > all the CPUs -- this would likely cause problems at a few tens of
> > CPUs, perhaps as early as 10-20.
>
> hehe, from someone whose largest box is 4 CPUs, to me 16 CPUs is large.
> But I can see how hundreds, let alone thousands, of CPUs would bring
> things like synchronize_sched to a grinding halt. God, imagine if all
> CPUs did that at approximately the same time. The system would show
> huge jitter.

Well, the first time the SGI guys tried to boot a 1024-CPU Altix, I got
an email complaining about RCU overheads. ;-) Manfred Spraul fixed
things up for them, though.
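
For reference, the __synchronize_sched() loop I am referring to above is
roughly the following; this is a sketch of the migrate-to-each-CPU
approach, not necessarily the exact code in the patch:

void __synchronize_sched(void)
{
        cpumask_t oldmask;
        int cpu;

        /* Remember our affinity so we can restore it afterwards. */
        if (sched_getaffinity(0, &oldmask) < 0)
                oldmask = cpu_possible_map;
        for_each_online_cpu(cpu) {
                /* Migrate onto each CPU in turn... */
                sched_setaffinity(0, cpumask_of_cpu(cpu));
                /* ...forcing that CPU through a context switch. */
                schedule();
        }
        sched_setaffinity(0, oldmask);
}

Every online CPU gets visited in turn, so the cost grows linearly with
the number of CPUs, which is exactly the scalability concern above.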

Thanx, Paul

2007-09-22 00:26:22

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Fri, Sep 21, 2007 at 10:40:03AM -0400, Steven Rostedt wrote:
> On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:

Covering the pieces that weren't in Peter's reply. ;-)

And thank you -very- much for the careful and thorough review!!!

> > #endif /* __KERNEL__ */
> > #endif /* __LINUX_RCUCLASSIC_H */
> > diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h
> > --- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 2007-07-19 14:02:36.000000000 -0700
> > +++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h 2007-08-22 15:21:06.000000000 -0700
> > @@ -52,7 +52,11 @@ struct rcu_head {
> > void (*func)(struct rcu_head *head);
> > };
> >
> > +#ifdef CONFIG_CLASSIC_RCU
> > #include <linux/rcuclassic.h>
> > +#else /* #ifdef CONFIG_CLASSIC_RCU */
> > +#include <linux/rcupreempt.h>
> > +#endif /* #else #ifdef CONFIG_CLASSIC_RCU */
>
> A bit extreme on the comments here.

My fingers do this without any help from the rest of me, but I suppose
it is a bit of overkill in this case.

> > #define RCU_HEAD_INIT { .next = NULL, .func = NULL }
> > #define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT
> > @@ -218,10 +222,13 @@ extern void FASTCALL(call_rcu_bh(struct
> > /* Exported common interfaces */
> > extern void synchronize_rcu(void);
> > extern void rcu_barrier(void);
> > +extern long rcu_batches_completed(void);
> > +extern long rcu_batches_completed_bh(void);
> >
> > /* Internal to kernel */
> > extern void rcu_init(void);
> > extern void rcu_check_callbacks(int cpu, int user);
> > +extern int rcu_needs_cpu(int cpu);
> >
> > #endif /* __KERNEL__ */
> > #endif /* __LINUX_RCUPDATE_H */
> > diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h
> > --- linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h 1969-12-31 16:00:00.000000000 -0800
> > +++ linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h 2007-08-22 15:21:06.000000000 -0700
> > @@ -0,0 +1,78 @@
> > +/*
> > + * Read-Copy Update mechanism for mutual exclusion (RT implementation)
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> > + *
> > + * Copyright (C) IBM Corporation, 2006
> > + *
> > + * Author: Paul McKenney <[email protected]>
> > + *
> > + * Based on the original work by Paul McKenney <[email protected]>
> > + * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
> > + * Papers:
> > + * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
> > + * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
> > + *
> > + * For detailed explanation of Read-Copy Update mechanism see -
> > + * Documentation/RCU
> > + *
> > + */
> > +
> > +#ifndef __LINUX_RCUPREEMPT_H
> > +#define __LINUX_RCUPREEMPT_H
> > +
> > +#ifdef __KERNEL__
> > +
> > +#include <linux/cache.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/threads.h>
> > +#include <linux/percpu.h>
> > +#include <linux/cpumask.h>
> > +#include <linux/seqlock.h>
> > +
> > +#define rcu_qsctr_inc(cpu)
> > +#define rcu_bh_qsctr_inc(cpu)
> > +#define call_rcu_bh(head, rcu) call_rcu(head, rcu)
> > +
> > +extern void __rcu_read_lock(void);
> > +extern void __rcu_read_unlock(void);
> > +extern int rcu_pending(int cpu);
> > +extern int rcu_needs_cpu(int cpu);
> > +
> > +#define __rcu_read_lock_bh() { rcu_read_lock(); local_bh_disable(); }
> > +#define __rcu_read_unlock_bh() { local_bh_enable(); rcu_read_unlock(); }
> > +
> > +#define __rcu_read_lock_nesting() (current->rcu_read_lock_nesting)
> > +
> > +extern void __synchronize_sched(void);
> > +
> > +extern void __rcu_init(void);
> > +extern void rcu_check_callbacks(int cpu, int user);
> > +extern void rcu_restart_cpu(int cpu);
> > +
> > +#ifdef CONFIG_RCU_TRACE
> > +struct rcupreempt_trace;
> > +extern int *rcupreempt_flipctr(int cpu);
> > +extern long rcupreempt_data_completed(void);
> > +extern int rcupreempt_flip_flag(int cpu);
> > +extern int rcupreempt_mb_flag(int cpu);
> > +extern char *rcupreempt_try_flip_state_name(void);
> > +extern struct rcupreempt_trace *rcupreempt_trace_cpu(int cpu);
> > +#endif
> > +
> > +struct softirq_action;
> > +
> > +#endif /* __KERNEL__ */
> > +#endif /* __LINUX_RCUPREEMPT_H */
> > diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupreempt_trace.h linux-2.6.22-c-preemptrcu/include/linux/rcupreempt_trace.h
> > --- linux-2.6.22-b-fixbarriers/include/linux/rcupreempt_trace.h 1969-12-31 16:00:00.000000000 -0800
> > +++ linux-2.6.22-c-preemptrcu/include/linux/rcupreempt_trace.h 2007-08-22 15:21:06.000000000 -0700
> > @@ -0,0 +1,100 @@
> > +/*
> > + * Read-Copy Update mechanism for mutual exclusion (RT implementation)
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> > + *
> > + * Copyright (C) IBM Corporation, 2006
> > + *
> > + * Author: Paul McKenney <[email protected]>
> > + *
> > + * Based on the original work by Paul McKenney <[email protected]>
> > + * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
> > + * Papers:
> > + * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
> > + * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
> > + *
> > + * For detailed explanation of Read-Copy Update mechanism see -
> > + * http://lse.sourceforge.net/locking/rcupdate.html
> > + *
> > + */
> > +
> > +#ifndef __LINUX_RCUPREEMPT_TRACE_H
> > +#define __LINUX_RCUPREEMPT_TRACE_H
> > +
> > +#ifdef __KERNEL__
> > +#include <linux/types.h>
> > +#include <linux/kernel.h>
> > +
> > +#include <asm/atomic.h>
> > +
> > +/*
> > + * PREEMPT_RCU data structures.
> > + */
> > +
> > +struct rcupreempt_trace {
> > + long next_length;
> > + long next_add;
> > + long wait_length;
> > + long wait_add;
> > + long done_length;
> > + long done_add;
> > + long done_remove;
> > + atomic_t done_invoked;
> > + long rcu_check_callbacks;
> > + atomic_t rcu_try_flip_1;
> > + atomic_t rcu_try_flip_e1;
> > + long rcu_try_flip_i1;
> > + long rcu_try_flip_ie1;
> > + long rcu_try_flip_g1;
> > + long rcu_try_flip_a1;
> > + long rcu_try_flip_ae1;
> > + long rcu_try_flip_a2;
> > + long rcu_try_flip_z1;
> > + long rcu_try_flip_ze1;
> > + long rcu_try_flip_z2;
> > + long rcu_try_flip_m1;
> > + long rcu_try_flip_me1;
> > + long rcu_try_flip_m2;
> > +};
> > +
> > +#ifdef CONFIG_RCU_TRACE
> > +#define RCU_TRACE(fn, arg) fn(arg);
> > +#else
> > +#define RCU_TRACE(fn, arg)
> > +#endif
> > +
> > +extern void rcupreempt_trace_move2done(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_move2wait(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_try_flip_1(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_try_flip_e1(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_try_flip_i1(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_try_flip_ie1(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_try_flip_g1(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_try_flip_a1(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_try_flip_ae1(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_try_flip_a2(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_try_flip_z1(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_try_flip_ze1(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_try_flip_z2(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_try_flip_m1(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_try_flip_me1(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_try_flip_m2(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_check_callbacks(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_done_remove(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_invoke(struct rcupreempt_trace *trace);
> > +extern void rcupreempt_trace_next_add(struct rcupreempt_trace *trace);
> > +
> > +#endif /* __KERNEL__ */
> > +#endif /* __LINUX_RCUPREEMPT_TRACE_H */
> > diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/sched.h linux-2.6.22-c-preemptrcu/include/linux/sched.h
> > --- linux-2.6.22-b-fixbarriers/include/linux/sched.h 2007-07-08 16:32:17.000000000 -0700
> > +++ linux-2.6.22-c-preemptrcu/include/linux/sched.h 2007-08-22 15:21:06.000000000 -0700
> > @@ -850,6 +850,11 @@ struct task_struct {
> > cpumask_t cpus_allowed;
> > unsigned int time_slice, first_time_slice;
> >
> > +#ifdef CONFIG_PREEMPT_RCU
> > + int rcu_read_lock_nesting;
> > + int rcu_flipctr_idx;
> > +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> > +
> > #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
> > struct sched_info sched_info;
> > #endif
> > diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/kernel/fork.c linux-2.6.22-c-preemptrcu/kernel/fork.c
> > --- linux-2.6.22-b-fixbarriers/kernel/fork.c 2007-07-08 16:32:17.000000000 -0700
> > +++ linux-2.6.22-c-preemptrcu/kernel/fork.c 2007-08-22 15:21:06.000000000 -0700
> > @@ -1032,6 +1032,10 @@ static struct task_struct *copy_process(
> >
> > INIT_LIST_HEAD(&p->children);
> > INIT_LIST_HEAD(&p->sibling);
> > +#ifdef CONFIG_PREEMPT_RCU
> > + p->rcu_read_lock_nesting = 0;
> > + p->rcu_flipctr_idx = 0;
> > +#endif /* #ifdef CONFIG_PREEMPT_RCU */
> > p->vfork_done = NULL;
> > spin_lock_init(&p->alloc_lock);
> >
> > diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/kernel/Kconfig.preempt linux-2.6.22-c-preemptrcu/kernel/Kconfig.preempt
> > --- linux-2.6.22-b-fixbarriers/kernel/Kconfig.preempt 2007-07-08 16:32:17.000000000 -0700
> > +++ linux-2.6.22-c-preemptrcu/kernel/Kconfig.preempt 2007-08-22 15:21:06.000000000 -0700
> > @@ -63,3 +63,41 @@ config PREEMPT_BKL
> > Say Y here if you are building a kernel for a desktop system.
> > Say N if you are unsure.
> >
> > +choice
> > + prompt "RCU implementation type:"
> > + default CLASSIC_RCU
> > +
> > +config CLASSIC_RCU
> > + bool "Classic RCU"
> > + help
> > + This option selects the classic RCU implementation that is
> > + designed for best read-side performance on non-realtime
> > + systems.
> > +
> > + Say Y if you are unsure.
> > +
> > +config PREEMPT_RCU
> > + bool "Preemptible RCU"
> > + depends on PREEMPT
> > + help
> > + This option reduces the latency of the kernel by making certain
> > + RCU sections preemptible. Normally RCU code is non-preemptible, if
> > + this option is selected then read-only RCU sections become
> > + preemptible. This helps latency, but may expose bugs due to
> > + now-naive assumptions about each RCU read-side critical section
> > + remaining on a given CPU through its execution.
> > +
> > + Say N if you are unsure.
> > +
> > +endchoice
> > +
> > +config RCU_TRACE
> > + bool "Enable tracing for RCU - currently stats in debugfs"
> > + select DEBUG_FS
> > + default y
> > + help
> > + This option provides tracing in RCU which presents stats
> > + in debugfs for debugging RCU implementation.
> > +
> > + Say Y here if you want to enable RCU tracing
> > + Say N if you are unsure.
> > diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/kernel/Makefile linux-2.6.22-c-preemptrcu/kernel/Makefile
> > --- linux-2.6.22-b-fixbarriers/kernel/Makefile 2007-07-19 12:16:03.000000000 -0700
> > +++ linux-2.6.22-c-preemptrcu/kernel/Makefile 2007-08-22 15:21:06.000000000 -0700
> > @@ -6,7 +6,7 @@ obj-y = sched.o fork.o exec_domain.o
> > exit.o itimer.o time.o softirq.o resource.o \
> > sysctl.o capability.o ptrace.o timer.o user.o \
> > signal.o sys.o kmod.o workqueue.o pid.o \
> > - rcupdate.o rcuclassic.o extable.o params.o posix-timers.o \
> > + rcupdate.o extable.o params.o posix-timers.o \
> > kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
> > hrtimer.o rwsem.o latency.o nsproxy.o srcu.o die_notifier.o
> >
> > @@ -46,6 +46,11 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
> > obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
> > obj-$(CONFIG_SECCOMP) += seccomp.o
> > obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
> > +obj-$(CONFIG_CLASSIC_RCU) += rcuclassic.o
> > +obj-$(CONFIG_PREEMPT_RCU) += rcupreempt.o
> > +ifeq ($(CONFIG_PREEMPT_RCU),y)
> > +obj-$(CONFIG_RCU_TRACE) += rcupreempt_trace.o
> > +endif
> > obj-$(CONFIG_RELAY) += relay.o
> > obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
> > obj-$(CONFIG_UTS_NS) += utsname.o
> > diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/kernel/rcupreempt.c linux-2.6.22-c-preemptrcu/kernel/rcupreempt.c
> > --- linux-2.6.22-b-fixbarriers/kernel/rcupreempt.c 1969-12-31 16:00:00.000000000 -0800
> > +++ linux-2.6.22-c-preemptrcu/kernel/rcupreempt.c 2007-08-22 15:35:19.000000000 -0700
> > @@ -0,0 +1,767 @@
> > +/*
> > + * Read-Copy Update mechanism for mutual exclusion, realtime implementation
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> > + *
> > + * Copyright IBM Corporation, 2006
> > + *
> > + * Authors: Paul E. McKenney <[email protected]>
> > + * With thanks to Esben Nielsen, Bill Huey, and Ingo Molnar
> > + * for pushing me away from locks and towards counters, and
> > + * to Suparna Bhattacharya for pushing me completely away
> > + * from atomic instructions on the read side.
> > + *
> > + * Papers: http://www.rdrop.com/users/paulmck/RCU
> > + *
> > + * For detailed explanation of Read-Copy Update mechanism see -
> > + * Documentation/RCU/ *.txt
> > + *
> > + */
> > +#include <linux/types.h>
> > +#include <linux/kernel.h>
> > +#include <linux/init.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/smp.h>
> > +#include <linux/rcupdate.h>
> > +#include <linux/interrupt.h>
> > +#include <linux/sched.h>
> > +#include <asm/atomic.h>
> > +#include <linux/bitops.h>
> > +#include <linux/module.h>
> > +#include <linux/completion.h>
> > +#include <linux/moduleparam.h>
> > +#include <linux/percpu.h>
> > +#include <linux/notifier.h>
> > +#include <linux/rcupdate.h>
> > +#include <linux/cpu.h>
> > +#include <linux/random.h>
> > +#include <linux/delay.h>
> > +#include <linux/byteorder/swabb.h>
> > +#include <linux/cpumask.h>
> > +#include <linux/rcupreempt_trace.h>
> > +
> > +/*
> > + * PREEMPT_RCU data structures.
> > + */
> > +
> > +#define GP_STAGES 4
>
> I take it that GP stand for "grace period". Might want to state that
> here. /* Grace period stages */ When I was looking at this code at 1am,
> I kept asking myself "what's this GP?" (General Protection??). But
> that's what happens when looking at code like this after midnight ;-)

Good point, will add a comment. You did get it right, "grace period".
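
Something as small as the following would probably do; a sketch of the
comment, not final wording:

#define GP_STAGES 4 /* Number of grace-period ("GP") wait stages. */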

> > +struct rcu_data {
> > + spinlock_t lock; /* Protect rcu_data fields. */
> > + long completed; /* Number of last completed batch. */
> > + int waitlistcount;
> > + struct tasklet_struct rcu_tasklet;
> > + struct rcu_head *nextlist;
> > + struct rcu_head **nexttail;
> > + struct rcu_head *waitlist[GP_STAGES];
> > + struct rcu_head **waittail[GP_STAGES];
> > + struct rcu_head *donelist;
> > + struct rcu_head **donetail;
> > +#ifdef CONFIG_RCU_TRACE
> > + struct rcupreempt_trace trace;
> > +#endif /* #ifdef CONFIG_RCU_TRACE */
> > +};
> > +struct rcu_ctrlblk {
> > + spinlock_t fliplock; /* Protect state-machine transitions. */
> > + long completed; /* Number of last completed batch. */
> > +};
> > +static DEFINE_PER_CPU(struct rcu_data, rcu_data);
> > +static struct rcu_ctrlblk rcu_ctrlblk = {
> > + .fliplock = SPIN_LOCK_UNLOCKED,
> > + .completed = 0,
> > +};
> > +static DEFINE_PER_CPU(int [2], rcu_flipctr) = { 0, 0 };
> > +
> > +/*
> > + * States for rcu_try_flip() and friends.
>
> Can you have a pointer somewhere that explains these states. And not a
> "it's in this paper or directory". Either have a short discription here,
> or specify where exactly to find the information (perhaps a
> Documentation/RCU/preemptible_states.txt?).
>
> Trying to understand these states has caused me the most agony in
> reviewing these patches.

Good point, perhaps a comment block above the enum giving a short
description of the purpose of each state. Maybe more detail in
Documentation/RCU as well, as you suggest above.
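
A sketch of what such a comment block might look like, with the wording
pulled from the discussion elsewhere in this thread (illustrative, not
final):

/*
 * States for the rcu_try_flip() grace-period state machine:
 *
 * rcu_try_flip_idle_state ("I"): no flip in progress.  Leave this state
 *      once at least one RCU callback is pending, incrementing
 *      rcu_ctrlblk.completed and asking each CPU to acknowledge the flip.
 * rcu_try_flip_waitack_state ("A"): wait for each CPU to acknowledge the
 *      counter flip, after which no CPU can still be incrementing the
 *      "old" set of per-CPU counters.
 * rcu_try_flip_waitzero_state ("Z"): wait for the sum of the "old"
 *      per-CPU counters to reach zero, then ask each CPU to execute an
 *      end-of-grace-period memory barrier.
 * rcu_try_flip_waitmb_state ("M"): wait for each CPU to execute its
 *      memory barrier, then return to idle.
 */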

> > + */
> > +
> > +enum rcu_try_flip_states {
> > + rcu_try_flip_idle_state, /* "I" */
> > + rcu_try_flip_waitack_state, /* "A" */
> > + rcu_try_flip_waitzero_state, /* "Z" */
> > + rcu_try_flip_waitmb_state /* "M" */
> > +};
> > +static enum rcu_try_flip_states rcu_try_flip_state = rcu_try_flip_idle_state;
> > +#ifdef CONFIG_RCU_TRACE
> > +static char *rcu_try_flip_state_names[] =
> > + { "idle", "waitack", "waitzero", "waitmb" };
> > +#endif /* #ifdef CONFIG_RCU_TRACE */
> > +
> > +/*
> > + * Enum and per-CPU flag to determine when each CPU has seen
> > + * the most recent counter flip.
> > + */
> > +
> > +enum rcu_flip_flag_values {
> > + rcu_flip_seen, /* Steady/initial state, last flip seen. */
> > + /* Only GP detector can update. */
> > + rcu_flipped /* Flip just completed, need confirmation. */
> > + /* Only corresponding CPU can update. */
> > +};
> > +static DEFINE_PER_CPU(enum rcu_flip_flag_values, rcu_flip_flag) = rcu_flip_seen;
> > +
> > +/*
> > + * Enum and per-CPU flag to determine when each CPU has executed the
> > + * needed memory barrier to fence in memory references from its last RCU
> > + * read-side critical section in the just-completed grace period.
> > + */
> > +
> > +enum rcu_mb_flag_values {
> > + rcu_mb_done, /* Steady/initial state, no mb()s required. */
> > + /* Only GP detector can update. */
> > + rcu_mb_needed /* Flip just completed, need an mb(). */
> > + /* Only corresponding CPU can update. */
> > +};
> > +static DEFINE_PER_CPU(enum rcu_mb_flag_values, rcu_mb_flag) = rcu_mb_done;
> > +
> > +/*
> > + * Macro that prevents the compiler from reordering accesses, but does
> > + * absolutely -nothing- to prevent CPUs from reordering. This is used
> > + * only to mediate communication between mainline code and hardware
> > + * interrupt and NMI handlers.
> > + */
> > +#define ORDERED_WRT_IRQ(x) (*(volatile typeof(x) *)&(x))
> > +
> > +/*
> > + * RCU_DATA_ME: find the current CPU's rcu_data structure.
> > + * RCU_DATA_CPU: find the specified CPU's rcu_data structure.
> > + */
> > +#define RCU_DATA_ME() (&__get_cpu_var(rcu_data))
> > +#define RCU_DATA_CPU(cpu) (&per_cpu(rcu_data, cpu))
> > +
> > +/*
> > + * Helper macro for tracing when the appropriate rcu_data is not
> > + * cached in a local variable, but where the CPU number is so cached.
> > + */
> > +#define RCU_TRACE_CPU(f, cpu) RCU_TRACE(f, &(RCU_DATA_CPU(cpu)->trace));
> > +
> > +/*
> > + * Helper macro for tracing when the appropriate rcu_data is not
> > + * cached in a local variable.
> > + */
> > +#define RCU_TRACE_ME(f) RCU_TRACE(f, &(RCU_DATA_ME()->trace));
> > +
> > +/*
> > + * Helper macro for tracing when the appropriate rcu_data is pointed
> > + * to by a local variable.
> > + */
> > +#define RCU_TRACE_RDP(f, rdp) RCU_TRACE(f, &((rdp)->trace));
> > +
> > +/*
> > + * Return the number of RCU batches processed thus far. Useful
> > + * for debug and statistics.
> > + */
> > +long rcu_batches_completed(void)
> > +{
> > + return rcu_ctrlblk.completed;
> > +}
> > +EXPORT_SYMBOL_GPL(rcu_batches_completed);
> > +
> > +/*
> > + * Return the number of RCU batches processed thus far. Useful for debug
> > + * and statistics. The _bh variant is identical to straight RCU.
> > + */
>
> If they are identical, then why the separation?

I apologize for the repetition in this email.

I apologize for the repetition in this email.

I apologize for the repetition in this email.

Yep, will fix with either #define or static inline, as you suggested
in a later email.
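
For concreteness, the static-inline version might look something like
this (a sketch; exactly where it ends up living is a separate question):

static inline long rcu_batches_completed_bh(void)
{
        return rcu_batches_completed();
}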

> > +long rcu_batches_completed_bh(void)
> > +{
> > + return rcu_ctrlblk.completed;
> > +}
> > +EXPORT_SYMBOL_GPL(rcu_batches_completed_bh);
> > +
> > +void __rcu_read_lock(void)
> > +{
> > + int idx;
> > + struct task_struct *me = current;
>
> Nitpick, but other places in the kernel usually use "t" or "p" as a
> variable to assign current to. It's just that "me" thows me off a
> little while reviewing this. But this is just a nitpick, so do as you
> will.

Fair enough, as discussed earlier.

> > + int nesting;
> > +
> > + nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
> > + if (nesting != 0) {
> > +
> > + /* An earlier rcu_read_lock() covers us, just count it. */
> > +
> > + me->rcu_read_lock_nesting = nesting + 1;
> > +
> > + } else {
> > + unsigned long oldirq;
>
> Nitpick, "flags" is usually used for saving irq state.

A later patch in the series fixes these -- I believe I got all of them.
(The priority-boost patch, IIRC.)

> > +
> > + /*
> > + * Disable local interrupts to prevent the grace-period
> > + * detection state machine from seeing us half-done.
> > + * NMIs can still occur, of course, and might themselves
> > + * contain rcu_read_lock().
> > + */
> > +
> > + local_irq_save(oldirq);
>
> Isn't the GP detection done via a tasklet/softirq. So wouldn't a
> local_bh_disable be sufficient here? You already cover NMIs, which would
> also handle normal interrupts.

We beat this into the ground in other email.

> > +
> > + /*
> > + * Outermost nesting of rcu_read_lock(), so increment
> > + * the current counter for the current CPU. Use volatile
> > + * casts to prevent the compiler from reordering.
> > + */
> > +
> > + idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
> > + smp_read_barrier_depends(); /* @@@@ might be unneeded */
> > + ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++;
> > +
> > + /*
> > + * Now that the per-CPU counter has been incremented, we
> > + * are protected from races with rcu_read_lock() invoked
> > + * from NMI handlers on this CPU. We can therefore safely
> > + * increment the nesting counter, relieving further NMIs
> > + * of the need to increment the per-CPU counter.
> > + */
> > +
> > + ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1;
> > +
> > + /*
> > + * Now that we have preventing any NMIs from storing
> > + * to the ->rcu_flipctr_idx, we can safely use it to
> > + * remember which counter to decrement in the matching
> > + * rcu_read_unlock().
> > + */
> > +
> > + ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx;
> > + local_irq_restore(oldirq);
> > + }
> > +}
> > +EXPORT_SYMBOL_GPL(__rcu_read_lock);
> > +
> > +void __rcu_read_unlock(void)
> > +{
> > + int idx;
> > + struct task_struct *me = current;
> > + int nesting;
> > +
> > + nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
> > + if (nesting > 1) {
> > +
> > + /*
> > + * We are still protected by the enclosing rcu_read_lock(),
> > + * so simply decrement the counter.
> > + */
> > +
> > + me->rcu_read_lock_nesting = nesting - 1;
> > +
> > + } else {
> > + unsigned long oldirq;
> > +
> > + /*
> > + * Disable local interrupts to prevent the grace-period
> > + * detection state machine from seeing us half-done.
> > + * NMIs can still occur, of course, and might themselves
> > + * contain rcu_read_lock() and rcu_read_unlock().
> > + */
> > +
> > + local_irq_save(oldirq);
> > +
> > + /*
> > + * Outermost nesting of rcu_read_unlock(), so we must
> > + * decrement the current counter for the current CPU.
> > + * This must be done carefully, because NMIs can
> > + * occur at any point in this code, and any rcu_read_lock()
> > + * and rcu_read_unlock() pairs in the NMI handlers
> > + * must interact non-destructively with this code.
> > + * Lots of volatile casts, and -very- careful ordering.
> > + *
> > + * Changes to this code, including this one, must be
> > + * inspected, validated, and tested extremely carefully!!!
> > + */
> > +
> > + /*
> > + * First, pick up the index. Enforce ordering for
> > + * DEC Alpha.
> > + */
> > +
> > + idx = ORDERED_WRT_IRQ(me->rcu_flipctr_idx);
> > + smp_read_barrier_depends(); /* @@@ Needed??? */
> > +
> > + /*
> > + * Now that we have fetched the counter index, it is
> > + * safe to decrement the per-task RCU nesting counter.
> > + * After this, any interrupts or NMIs will increment and
> > + * decrement the per-CPU counters.
> > + */
> > + ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting - 1;
> > +
> > + /*
> > + * It is now safe to decrement this task's nesting count.
> > + * NMIs that occur after this statement will route their
> > + * rcu_read_lock() calls through this "else" clause, and
> > + * will thus start incrementing the per-CPU coutner on
>
> s/coutner/counter/

wlli fxi!!!

> > + * their own. They will also clobber ->rcu_flipctr_idx,
> > + * but that is OK, since we have already fetched it.
> > + */
> > +
> > + ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])--;
> > + local_irq_restore(oldirq);
> > + }
> > +}
> > +EXPORT_SYMBOL_GPL(__rcu_read_unlock);
> > +
> > +/*
> > + * If a global counter flip has occurred since the last time that we
> > + * advanced callbacks, advance them. Hardware interrupts must be
> > + * disabled when calling this function.
> > + */
> > +static void __rcu_advance_callbacks(struct rcu_data *rdp)
> > +{
> > + int cpu;
> > + int i;
> > + int wlc = 0;
> > +
> > + if (rdp->completed != rcu_ctrlblk.completed) {
> > + if (rdp->waitlist[GP_STAGES - 1] != NULL) {
> > + *rdp->donetail = rdp->waitlist[GP_STAGES - 1];
> > + rdp->donetail = rdp->waittail[GP_STAGES - 1];
> > + RCU_TRACE_RDP(rcupreempt_trace_move2done, rdp);
> > + }
> > + for (i = GP_STAGES - 2; i >= 0; i--) {
> > + if (rdp->waitlist[i] != NULL) {
> > + rdp->waitlist[i + 1] = rdp->waitlist[i];
> > + rdp->waittail[i + 1] = rdp->waittail[i];
> > + wlc++;
> > + } else {
> > + rdp->waitlist[i + 1] = NULL;
> > + rdp->waittail[i + 1] =
> > + &rdp->waitlist[i + 1];
> > + }
> > + }
> > + if (rdp->nextlist != NULL) {
> > + rdp->waitlist[0] = rdp->nextlist;
> > + rdp->waittail[0] = rdp->nexttail;
> > + wlc++;
> > + rdp->nextlist = NULL;
> > + rdp->nexttail = &rdp->nextlist;
> > + RCU_TRACE_RDP(rcupreempt_trace_move2wait, rdp);
> > + } else {
> > + rdp->waitlist[0] = NULL;
> > + rdp->waittail[0] = &rdp->waitlist[0];
> > + }
> > + rdp->waitlistcount = wlc;
> > + rdp->completed = rcu_ctrlblk.completed;
> > + }
> > +
> > + /*
> > + * Check to see if this CPU needs to report that it has seen
> > + * the most recent counter flip, thereby declaring that all
> > + * subsequent rcu_read_lock() invocations will respect this flip.
> > + */
> > +
> > + cpu = raw_smp_processor_id();
> > + if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped) {
> > + smp_mb(); /* Subsequent counter accesses must see new value */
> > + per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;
> > + smp_mb(); /* Subsequent RCU read-side critical sections */
> > + /* seen -after- acknowledgement. */
> > + }
> > +}
> > +
> > +/*
> > + * Get here when RCU is idle. Decide whether we need to
> > + * move out of idle state, and return non-zero if so.
> > + * "Straightforward" approach for the moment, might later
> > + * use callback-list lengths, grace-period duration, or
> > + * some such to determine when to exit idle state.
> > + * Might also need a pre-idle test that does not acquire
> > + * the lock, but let's get the simple case working first...
> > + */
> > +
> > +static int
> > +rcu_try_flip_idle(void)
> > +{
> > + int cpu;
> > +
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_i1);
> > + if (!rcu_pending(smp_processor_id())) {
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1);
> > + return 0;
> > + }
> > +
> > + /*
> > + * Do the flip.
> > + */
> > +
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_g1);
> > + rcu_ctrlblk.completed++; /* stands in for rcu_try_flip_g2 */
> > +
> > + /*
> > + * Need a memory barrier so that other CPUs see the new
> > + * counter value before they see the subsequent change of all
> > + * the rcu_flip_flag instances to rcu_flipped.
> > + */
> > +
> > + smp_mb(); /* see above block comment. */
> > +
> > + /* Now ask each CPU for acknowledgement of the flip. */
> > +
> > + for_each_possible_cpu(cpu)
> > + per_cpu(rcu_flip_flag, cpu) = rcu_flipped;
> > +
> > + return 1;
> > +}
> > +
> > +/*
> > + * Wait for CPUs to acknowledge the flip.
> > + */
> > +
> > +static int
> > +rcu_try_flip_waitack(void)
> > +{
> > + int cpu;
> > +
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_a1);
> > + for_each_possible_cpu(cpu)
> > + if (per_cpu(rcu_flip_flag, cpu) != rcu_flip_seen) {
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_ae1);
> > + return 0;
> > + }
> > +
> > + /*
> > + * Make sure our checks above don't bleed into subsequent
> > + * waiting for the sum of the counters to reach zero.
> > + */
> > +
> > + smp_mb(); /* see above block comment. */
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_a2);
> > + return 1;
> > +}
> > +
> > +/*
> > + * Wait for collective ``last'' counter to reach zero,
> > + * then tell all CPUs to do an end-of-grace-period memory barrier.
> > + */
> > +
> > +static int
> > +rcu_try_flip_waitzero(void)
> > +{
> > + int cpu;
> > + int lastidx = !(rcu_ctrlblk.completed & 0x1);
> > + int sum = 0;
> > +
> > + /* Check to see if the sum of the "last" counters is zero. */
> > +
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_z1);
> > + for_each_possible_cpu(cpu)
> > + sum += per_cpu(rcu_flipctr, cpu)[lastidx];
> > + if (sum != 0) {
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_ze1);
> > + return 0;
> > + }
> > +
> > + smp_mb(); /* Don't call for memory barriers before we see zero. */
> > +
> > + /* Call for a memory barrier from each CPU. */
> > +
> > + for_each_possible_cpu(cpu)
> > + per_cpu(rcu_mb_flag, cpu) = rcu_mb_needed;
> > +
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_z2);
> > + return 1;
> > +}
> > +
> > +/*
> > + * Wait for all CPUs to do their end-of-grace-period memory barrier.
> > + * Return 0 once all CPUs have done so.
> > + */
> > +
> > +static int
> > +rcu_try_flip_waitmb(void)
> > +{
> > + int cpu;
> > +
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_m1);
> > + for_each_possible_cpu(cpu)
> > + if (per_cpu(rcu_mb_flag, cpu) != rcu_mb_done) {
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_me1);
> > + return 0;
> > + }
> > +
> > + smp_mb(); /* Ensure that the above checks precede any following flip. */
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_m2);
> > + return 1;
> > +}
> > +
> > +/*
> > + * Attempt a single flip of the counters. Remember, a single flip does
> > + * -not- constitute a grace period. Instead, the interval between
> > + * at least three consecutive flips is a grace period.
> > + *
> > + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
>
> Oh, come now! It's not "nuts" to use this ;-)

;-)

> > + * on a large SMP, they might want to use a hierarchical organization of
> > + * the per-CPU-counter pairs.
> > + */
> > +static void rcu_try_flip(void)
> > +{
> > + unsigned long oldirq;
> > +
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_1);
> > + if (unlikely(!spin_trylock_irqsave(&rcu_ctrlblk.fliplock, oldirq))) {
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_e1);
> > + return;
> > + }
> > +
> > + /*
> > + * Take the next transition(s) through the RCU grace-period
> > + * flip-counter state machine.
> > + */
> > +
> > + switch (rcu_try_flip_state) {
> > + case rcu_try_flip_idle_state:
> > + if (rcu_try_flip_idle())
> > + rcu_try_flip_state = rcu_try_flip_waitack_state;
>
> Just trying to understand all this. Here at flip_idle, only a CPU with
> no pending RCU calls will flip it. Then all the cpus flags will be set
> to rcu_flipped, and the ctrl.completed counter is incremented.

s/no pending RCU calls/at least one pending RCU call/, but otherwise
spot on.

So if the RCU grace-period machinery is idle, the first CPU to take
a scheduling-clock interrupt after having posted an RCU callback will
get things going.

> When a CPU calls process_callbacks, it would then move all it's
> callbacks to the next list (next -> wait[GP...] -> done), and set it's
> unique completed counter to completed. So no more moving up the lists
> will be done. It also will set it's state flag to rcu_flip_seen. Even if
> the cpu doesn't have any RCU callbacks, the state would be in
> rcu_try_flip_waitack_state so we go to the next switch case for a
> rcu_try_flip call.

Yep. The wait-for-acknowledgement cycle is there to handle rcu_read_lock()
invocations on other CPUs that executed concurrently with the counter
increment in rcu_try_flip_idle().

> Also the filp counter has been flipped so that all new rcu_read_locks
> will increment the cpus "other" index. We just need to wait for the
> index of the previous counters to reach zero.

Yep. Note that rcu_read_lock() might well be able to use its local
.completed counter rather than the global one in rcu_ctrlblk, but
(1) it is not clear that the reduction in cache misses would outweigh
the per-CPU access overhead given that rcu_read_lock() happens a -lot-
more often than does a counter flip and (2) doing this could well require
cranking up GP_STAGES.
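
For concreteness, that alternative would amount to something like the
following inside __rcu_read_lock(); a sketch only, not a proposal:

        /* Use this CPU's cached ->completed instead of the global value. */
        idx = ORDERED_WRT_IRQ(RCU_DATA_ME()->completed) & 0x1;

rather than reading rcu_ctrlblk.completed directly.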

> > + break;
> > + case rcu_try_flip_waitack_state:
> > + if (rcu_try_flip_waitack())
>
> Now, we just wait to see if all the cpus have called process_callbacks
> and now have the flag rcu_flip_seen set. So all the cpus have pushed
> their callbacks up to the next queue.

Yep! More importantly, we know that no CPU will ever increment the
"old" set of counters.

> > + rcu_try_flip_state = rcu_try_flip_waitzero_state;
> > + break;
> > + case rcu_try_flip_waitzero_state:
> > + if (rcu_try_flip_waitzero())
>
> Now this is where we wait for the sum of all the CPUs counters to reach
> zero. The reason for the sum is that the task may have been preempted
> and migrated to another CPU and decremented that counter. So the one CPU
> counter would be 1 and the other would be -1. But the sum is still zero.

Yep!

> Is there a chance that overflow of a counter (although probably very
> very unlikely) would cause any problems?

The only way it could cause a problem would be if there was ever
more than 4,294,967,296 outstanding rcu_read_lock() calls. I believe
that lockdep screams if it sees more than 30 nested locks within a
single task, so for systems that support no more than 100M tasks, we
should be OK. It might sometime be necessary to make this be a long
rather than an int. Should we just do that now and be done with it?
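
If we did, the change would be roughly the following; a sketch, assuming
nothing else depends on the counters being ints:

static DEFINE_PER_CPU(long [2], rcu_flipctr) = { 0, 0 };

plus making the "sum" local variable in rcu_try_flip_waitzero() a long
to match.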

> Also, all the CPUs have their "check_mb" set.
>
> > + rcu_try_flip_state = rcu_try_flip_waitmb_state;
> > + break;
> > + case rcu_try_flip_waitmb_state:
> > + if (rcu_try_flip_waitmb())
>
> I have to admit that this seems a bit of an overkill, but I guess you
> know what you are doing. After going through three states, we still
> need to do a memory barrier on each CPU?

Yep. Because there are no memory barriers in rcu_read_unlock(), the
CPU is free to reorder the contents of the RCU read-side critical section
to follow the counter decrement. This means that this CPU would still
be referencing RCU-protected data after it had told the world that it
was no longer doing so. Forcing a memory barrier on each CPU guarantees
that if we see the memory-barrier acknowledge, we also see any prior
RCU read-side critical section.
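
To make the hazard concrete, here is a sketch using a made-up reader and
a hypothetical RCU-protected pointer gp:

struct foo { int a; };
struct foo *gp;                 /* hypothetical RCU-protected pointer */

int reader(void)                /* hypothetical reader */
{
        struct foo *p;
        int x;

        rcu_read_lock();
        p = rcu_dereference(gp);
        x = p->a;               /* read-side memory reference */
        rcu_read_unlock();      /* decrements the counter, but no smp_mb() */
        return x;
}

Because rcu_read_unlock() contains no memory barrier, the CPU is free to
let the load of p->a drift past the counter decrement, so the forced
smp_mb() in rcu_check_mb() is what closes that window before the grace
period is declared complete.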

> > + rcu_try_flip_state = rcu_try_flip_idle_state;
>
> Finally we are back to the original state and we start the process all
> over.

Yep!

> Is this analysis correct?

Other than noted above, yes.

> > + }
> > + spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
> > +}
> > +
> > +/*
> > + * Check to see if this CPU needs to do a memory barrier in order to
> > + * ensure that any prior RCU read-side critical sections have committed
> > + * their counter manipulations and critical-section memory references
> > + * before declaring the grace period to be completed.
> > + */
> > +static void rcu_check_mb(int cpu)
> > +{
> > + if (per_cpu(rcu_mb_flag, cpu) == rcu_mb_needed) {
> > + smp_mb(); /* Ensure RCU read-side accesses are visible. */
> > + per_cpu(rcu_mb_flag, cpu) = rcu_mb_done;
> > + }
> > +}
> > +
> > +void rcu_check_callbacks(int cpu, int user)
> > +{
> > + unsigned long oldirq;
> > + struct rcu_data *rdp = RCU_DATA_CPU(cpu);
> > +
> > + rcu_check_mb(cpu);
> > + if (rcu_ctrlblk.completed == rdp->completed)
> > + rcu_try_flip();
> > + spin_lock_irqsave(&rdp->lock, oldirq);
> > + RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp);
> > + __rcu_advance_callbacks(rdp);
> > + if (rdp->donelist == NULL) {
> > + spin_unlock_irqrestore(&rdp->lock, oldirq);
> > + } else {
> > + spin_unlock_irqrestore(&rdp->lock, oldirq);
> > + raise_softirq(RCU_SOFTIRQ);
> > + }
> > +}
> > +
> > +/*
> > + * Needed by dynticks, to make sure all RCU processing has finished
> > + * when we go idle:
>
> Didn't we have a discussion that this is no longer true? Or was it taken
> out of Dynamic ticks before. I don't see any users of it.

Might be safe to ditch it, then. I was afraid that there was a patch
out there somewhere still relying on it.

> > + */
> > +void rcu_advance_callbacks(int cpu, int user)
> > +{
> > + unsigned long oldirq;
> > + struct rcu_data *rdp = RCU_DATA_CPU(cpu);
> > +
> > + if (rcu_ctrlblk.completed == rdp->completed) {
> > + rcu_try_flip();
> > + if (rcu_ctrlblk.completed == rdp->completed)
> > + return;
> > + }
> > + spin_lock_irqsave(&rdp->lock, oldirq);
> > + RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp);
> > + __rcu_advance_callbacks(rdp);
> > + spin_unlock_irqrestore(&rdp->lock, oldirq);
> > +}
>
> OK, that's all I have on this patch (will take a bit of a break before
> reviewing your other patches). But I will say that RCU has grown quite
> a bit, and is looking very good.

Glad you like it, and thank you again for the careful and thorough review.

> Basically, what I'm saying is "Great work, Paul!". This is looking
> good. Seems that we just need a little bit better explanation for those
> that are not up at the IQ level of you. I can write something up after
> this all gets finalized. Sort of a rcu-design.txt, that really tries to
> explain it to the simpleton's like me ;-)

I do greatly appreciate the compliments, especially coming from someone
like yourself, but it is also true that I have been implementing and
using RCU in various forms for longer than some Linux-community members
(not many, but a few) have been alive, and programming since 1972 or so.
Lots and lots of practice!

Hmmm... I started programming about the same time that I started
jogging consistently. Never realized that before.

I am thinking in terms of getting an improved discussion of RCU design and
use out there -- after all, the fifth anniversary of RCU's addition to
the kernel is coming right up. This does deserve better documentation,
especially given that for several depressing weeks near the beginning
of 2005 I believed that a realtime-friendly RCU might not be possible.

Thanx, Paul

2007-09-22 00:32:48

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Fri, Sep 21, 2007 at 04:03:43PM -0700, Paul E. McKenney wrote:
> On Fri, Sep 21, 2007 at 11:20:48AM -0400, Steven Rostedt wrote:
> > On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:

[ . . . ]

> > Paul,
> >
> > Looking further into this, I still think this is a bit of overkill. We
> > go through 20 states from call_rcu to list->func().
> >
> > On call_rcu we put our stuff on the next list. Before we move stuff from
> > next to wait, we need to go through 4 states. So we have
> >
> > next -> 4 states -> wait[0] -> 4 states -> wait[1] -> 4 states ->
> > wait[2] -> 4 states -> wait[3] -> 4 states -> done.
> >
> > That's 20 states that we go through from the time we add our function to
> > the list to the time it actually gets called. Do we really need the 4
> > wait lists?
> >
> > Seems a bit overkill to me.
> >
> > What am I missing?
>
> "Nothing kills like overkill!!!" ;-)
>
> Seriously, I do expect to be able to squeeze this down over time, but
> feel the need to be a bit on the cowardly side at the moment.
>
> In any case, I will be looking at the scenarios more carefully. If
> it turns out that GP_STAGES can indeed be cranked down a bit, well,
> that is an easy change! I just fired off a POWER run with GP_STAGES
> set to 3, will let you know how it goes.

The first attempt blew up during boot badly enough that ABAT was unable
to recover the machine (sorry, grahal!!!). Just for grins, I am trying
it again on a machine that ABAT has had a better record of reviving...

Thanx, Paul

2007-09-22 01:15:27

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU


On Fri, 21 Sep 2007, Paul E. McKenney wrote:

> On Fri, Sep 21, 2007 at 10:40:03AM -0400, Steven Rostedt wrote:
> > On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
>
> Covering the pieces that weren't in Peter's reply. ;-)
>
> And thank you -very- much for the careful and thorough review!!!
>
> > > #endif /* __KERNEL__ */
> > > #endif /* __LINUX_RCUCLASSIC_H */
> > > diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h
> > > --- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 2007-07-19 14:02:36.000000000 -0700
> > > +++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h 2007-08-22 15:21:06.000000000 -0700
> > > @@ -52,7 +52,11 @@ struct rcu_head {
> > > void (*func)(struct rcu_head *head);
> > > };
> > >
> > > +#ifdef CONFIG_CLASSIC_RCU
> > > #include <linux/rcuclassic.h>
> > > +#else /* #ifdef CONFIG_CLASSIC_RCU */
> > > +#include <linux/rcupreempt.h>
> > > +#endif /* #else #ifdef CONFIG_CLASSIC_RCU */
> >
> > A bit extreme on the comments here.
>
> My fingers do this without any help from the rest of me, but I suppose
> it is a bit of overkill in this case.

Heck, why stop the overkill here, the whole patch is overkill ;-)

> > > +
> > > +#define GP_STAGES 4
> >
> > I take it that GP stand for "grace period". Might want to state that
> > here. /* Grace period stages */ When I was looking at this code at 1am,
> > I kept asking myself "what's this GP?" (General Protection??). But
> > that's what happens when looking at code like this after midnight ;-)
>
> Good point, will add a comment. You did get it right, "grace period".

Thanks, so many places in the kernel have acronyms that are just supposed
to be "obvious". I hate them, because they make me feel so stupid when I
don't know what they are. After I find out, I usually slap my forehead and
say "duh!". My mind is set on reading code, not deciphering TLAs.


> >
> > Can you have a pointer somewhere that explains these states. And not a
> > "it's in this paper or directory". Either have a short discription here,
> > or specify where exactly to find the information (perhaps a
> > Documentation/RCU/preemptible_states.txt?).
> >
> > Trying to understand these states has caused me the most agony in
> > reviewing these patches.
>
> Good point, perhaps a comment block above the enum giving a short
> description of the purpose of each state. Maybe more detail in
> Documentation/RCU as well, as you suggest above.

That would be great.

> > > +
> > > +/*
> > > + * Return the number of RCU batches processed thus far. Useful for debug
> > > + * and statistics. The _bh variant is identical to straight RCU.
> > > + */
> >
> > If they are identical, then why the separation?
>
> I apologize for the repetition in this email.
>
> I apologize for the repetition in this email.
>
> I apologize for the repetition in this email.
>
> Yep, will fix with either #define or static inline, as you suggested
> in a later email.

you're starting to sound like me ;-)

> > > + struct task_struct *me = current;
> >
> > Nitpick, but other places in the kernel usually use "t" or "p" as a
> > variable to assign current to. It's just that "me" thows me off a
> > little while reviewing this. But this is just a nitpick, so do as you
> > will.
>
> Fair enough, as discussed earlier.

Who's on first, What's on second, and I-don't-know is on third.

> > > + unsigned long oldirq;
> >
> > Nitpick, "flags" is usually used for saving irq state.
>
> A later patch in the series fixes these -- I believe I got all of them.
> (The priority-boost patch, IIRC.)

OK

>
> > > +
> > > + /*
> > > + * Disable local interrupts to prevent the grace-period
> > > + * detection state machine from seeing us half-done.
> > > + * NMIs can still occur, of course, and might themselves
> > > + * contain rcu_read_lock().
> > > + */
> > > +
> > > + local_irq_save(oldirq);
> >
> > Isn't the GP detection done via a tasklet/softirq. So wouldn't a
> > local_bh_disable be sufficient here? You already cover NMIs, which would
> > also handle normal interrupts.
>
> We beat this into the ground in other email.

Nothing like kicking a dead horse on LKML ;-)

> > > +
> > > + /*
> > > + * It is now safe to decrement this task's nesting count.
> > > + * NMIs that occur after this statement will route their
> > > + * rcu_read_lock() calls through this "else" clause, and
> > > + * will thus start incrementing the per-CPU coutner on
> >
> > s/coutner/counter/
>
> wlli fxi!!!

snousd oogd

> > > +
> > > +/*
> > > + * Attempt a single flip of the counters. Remember, a single flip does
> > > + * -not- constitute a grace period. Instead, the interval between
> > > + * at least three consecutive flips is a grace period.
> > > + *
> > > + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
> >
> > Oh, come now! It's not "nuts" to use this ;-)
>
> ;-)

Although working with RCU may drive one nuts.

> > > + /*
> > > + * Take the next transition(s) through the RCU grace-period
> > > + * flip-counter state machine.
> > > + */
> > > +
> > > + switch (rcu_try_flip_state) {
> > > + case rcu_try_flip_idle_state:
> > > + if (rcu_try_flip_idle())
> > > + rcu_try_flip_state = rcu_try_flip_waitack_state;
> >
> > Just trying to understand all this. Here at flip_idle, only a CPU with
> > no pending RCU calls will flip it. Then all the cpus flags will be set
> > to rcu_flipped, and the ctrl.completed counter is incremented.
>
> s/no pending RCU calls/at least one pending RCU call/, but otherwise
> spot on.
>
> So if the RCU grace-period machinery is idle, the first CPU to take
> a scheduling-clock interrupt after having posted an RCU callback will
> get things going.

I said 'no' because of this:

+rcu_try_flip_idle(void)
+{
+ int cpu;
+
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_i1);
+ if (!rcu_pending(smp_processor_id())) {
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1);
+ return 0;
+ }

But now I'm a bit more confused. :-/

Looking at the caller in kernel/timer.c I see

if (rcu_pending(cpu))
rcu_check_callbacks(cpu, user_tick);

And rcu_check_callbacks is the caller of rcu_try_flip. The confusion is
that we call this when we have a pending rcu, but if we have a pending
rcu, we won't flip the counter ??



>
> > When a CPU calls process_callbacks, it would then move all it's
> > callbacks to the next list (next -> wait[GP...] -> done), and set it's
> > unique completed counter to completed. So no more moving up the lists
> > will be done. It also will set it's state flag to rcu_flip_seen. Even if
> > the cpu doesn't have any RCU callbacks, the state would be in
> > rcu_try_flip_waitack_state so we go to the next switch case for a
> > rcu_try_flip call.
>
> Yep. The wait-for-acknowledgement cycle is there to handle rcu_read_lock()
> invocations on other CPUs that executed concurrently with the counter
> increment in rcu_try_flip_idle().
>
> > Also the filp counter has been flipped so that all new rcu_read_locks
> > will increment the cpus "other" index. We just need to wait for the
> > index of the previous counters to reach zero.
>
> Yep. Note that rcu_read_lock() might well be able to use its local
> .completed counter rather than the global one in rcu_ctrlblk, but
> (1) it is not clear that the reduction in cache misses would outweigh
> the per-CPU access overhead given that rcu_read_lock() happens a -lot-
> more often than does a counter flip and (2) doing this could well require
> cranking up GP_STAGES.
>
> > > + break;
> > > + case rcu_try_flip_waitack_state:
> > > + if (rcu_try_flip_waitack())
> >
> > Now, we just wait to see if all the cpus have called process_callbacks
> > and now have the flag rcu_flip_seen set. So all the cpus have pushed
> > their callbacks up to the next queue.
>
> Yep! More importantly, we know that no CPU will ever increment the
> "old" set of counters.
>
> > > + rcu_try_flip_state = rcu_try_flip_waitzero_state;
> > > + break;
> > > + case rcu_try_flip_waitzero_state:
> > > + if (rcu_try_flip_waitzero())
> >
> > Now this is where we wait for the sum of all the CPUs counters to reach
> > zero. The reason for the sum is that the task may have been preempted
> > and migrated to another CPU and decremented that counter. So the one CPU
> > counter would be 1 and the other would be -1. But the sum is still zero.
>
> Yep!
>
> > Is there a chance that overflow of a counter (although probably very
> > very unlikely) would cause any problems?
>
> The only way it could cause a problem would be if there was ever
> more than 4,294,967,296 outstanding rcu_read_lock() calls. I believe
> that lockdep screams if it sees more than 30 nested locks within a
> single task, so for systems that support no more than 100M tasks, we
> should be OK. It might sometime be necessary to make this be a long
> rather than an int. Should we just do that now and be done with it?

Sure, why not. More and more and more overkill!!!

(rostedt hears in his head the Monty Python "Spam" song).

>
> > Also, all the CPUs have their "check_mb" set.
> >
> > > + rcu_try_flip_state = rcu_try_flip_waitmb_state;
> > > + break;
> > > + case rcu_try_flip_waitmb_state:
> > > + if (rcu_try_flip_waitmb())
> >
> > I have to admit that this seems a bit of an overkill, but I guess you
> > know what you are doing. After going through three states, we still
> > need to do a memory barrier on each CPU?
>
> Yep. Because there are no memory barriers in rcu_read_unlock(), the
> CPU is free to reorder the contents of the RCU read-side critical section
> to follow the counter decrement. This means that this CPU would still
> be referencing RCU-protected data after it had told the world that it
> was no longer doing so. Forcing a memory barrier on each CPU guarantees
> that if we see the memory-barrier acknowledge, we also see any prior
> RCU read-side critical section.

And this seems reasonable to me, that this would be enough to satisfy a
grace period. But the CPU can still move the rcu_read_(un)lock()s around.

Are we sure that adding all these grace-period stages is better than just
biting the bullet and putting in a memory barrier?

>
> > > + rcu_try_flip_state = rcu_try_flip_idle_state;
> >
> > Finally we are back to the original state and we start the process all
> > over.
>
> Yep!
>
> > Is this analysis correct?
>
> Other than noted above, yes.
>
> > > +
> > > +/*
> > > + * Needed by dynticks, to make sure all RCU processing has finished
> > > + * when we go idle:
> >
> > Didn't we have a discussion that this is no longer true? Or was it taken
> > out of Dynamic ticks before. I don't see any users of it.

Need to kick tglx or jstultz on this.

>
> Might be safe to ditch it, then. I was afraid that there was a patch
> out there somewhere still relying on it.
>


> >
> > OK, that's all I have on this patch (will take a bit of a break before
> > reviewing your other patches). But I will say that RCU has grown quite
> > a bit, and is looking very good.
>
> Glad you like it, and thank you again for the careful and thorough review.

I'm scared to do the preempt portion %^O

>
> > Basically, what I'm saying is "Great work, Paul!". This is looking
> > good. Seems that we just need a little bit better explanation for those
> > that are not up at the IQ level of you. I can write something up after
> > this all gets finalized. Sort of a rcu-design.txt, that really tries to
> > explain it to the simpletons like me ;-)
>
> I do greatly appreciate the compliments, especially coming from someone
> like yourself, but it is also true that I have been implementing and
> using RCU in various forms for longer than some Linux-community members
> (not many, but a few) have been alive, and programming since 1972 or so.
> Lots and lots of practice!

`72, I was 4.

>
> Hmmm... I started programming about the same time that I started
> jogging consistently. Never realized that before.

Well, I hope you keep doing both for a long time to come.

>
> I am thinking in terms of getting an improved discussion of RCU design and
> use out there -- after all, the fifth anniversary of RCU's addition to
> the kernel is coming right up. This does deserve better documentation,
> especially given that for several depressing weeks near the beginning
> of 2005 I believed that a realtime-friendly RCU might not be possible.

That is definitely an accomplishment. And I know as well as you do that it
happened because of a lot of people sharing ideas. Some good, some bad,
but all helpful for healthy development!

-- Steve

2007-09-22 01:19:37

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU


--
On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> >
> > In any case, I will be looking at the scenarios more carefully. If
> > it turns out that GP_STAGES can indeed be cranked down a bit, well,
> > that is an easy change! I just fired off a POWER run with GP_STAGES
> > set to 3, will let you know how it goes.
>
> The first attempt blew up during boot badly enough that ABAT was unable
> to recover the machine (sorry, grahal!!!). Just for grins, I am trying
> it again on a machine that ABAT has had a better record of reviving...

This still frightens the hell out of me. Going through 15 states and
failing. Seems the CPU is holding off writes for a long long time. That
means we flipped the counter 4 times, and that still wasn't good enough?

Maybe I'll boot up my powerbook to see if it has the same issues.

Well, I'm still finishing up on moving into my new house, so I won't be
available this weekend.

Thanks,

-- Steve

2007-09-22 01:43:56

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Fri, Sep 21, 2007 at 09:19:22PM -0400, Steven Rostedt wrote:
>
> --
> On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> > >
> > > In any case, I will be looking at the scenarios more carefully. If
> > > it turns out that GP_STAGES can indeed be cranked down a bit, well,
> > > that is an easy change! I just fired off a POWER run with GP_STAGES
> > > set to 3, will let you know how it goes.
> >
> > The first attempt blew up during boot badly enough that ABAT was unable
> > to recover the machine (sorry, grahal!!!). Just for grins, I am trying
> > it again on a machine that ABAT has had a better record of reviving...
>
> This still frightens the hell out of me. Going through 15 states and
> failing. Seems the CPU is holding off writes for a long long time. That
> means we flipped the counter 4 times, and that still wasn't good enough?

Might be that the other machine has its 2.6.22 version of .config messed
up. I will try booting it on a stock 2.6.22 kernel when it comes back
to life -- not sure I ever did that before. Besides, the other similar
machine seems to have gone down for the count, but without me torturing
it...

Also, keep in mind that various stages can "record" a memory misordering,
for example, by incrementing the wrong counter.

> Maybe I'll boot up my powerbook to see if it has the same issues.
>
> Well, I'm still finishing up on moving into my new house, so I won't be
> available this weekend.

The other machine not only booted, but has survived several minutes of
rcutorture thus far. I am also trying POWER5 machine as well, as the
one currently running is a POWER4, which is a bit less aggressive about
memory reordering than is the POWER5.

Even if they pass, I refuse to reduce GP_STAGES until proven safe.
Trust me, you -don't- want to be unwittingly making use of a subtly
busted RCU implementation!!!

Thanx, Paul

2007-09-22 01:53:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Fri, Sep 21, 2007 at 09:15:03PM -0400, Steven Rostedt wrote:
> On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> > On Fri, Sep 21, 2007 at 10:40:03AM -0400, Steven Rostedt wrote:
> > > On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:

[ . . . ]

> > > > + /*
> > > > + * Take the next transition(s) through the RCU grace-period
> > > > + * flip-counter state machine.
> > > > + */
> > > > +
> > > > + switch (rcu_try_flip_state) {
> > > > + case rcu_try_flip_idle_state:
> > > > + if (rcu_try_flip_idle())
> > > > + rcu_try_flip_state = rcu_try_flip_waitack_state;
> > >
> > > Just trying to understand all this. Here at flip_idle, only a CPU with
> > > no pending RCU calls will flip it. Then all the cpus flags will be set
> > > to rcu_flipped, and the ctrl.completed counter is incremented.
> >
> > s/no pending RCU calls/at least one pending RCU call/, but otherwise
> > spot on.
> >
> > So if the RCU grace-period machinery is idle, the first CPU to take
> > a scheduling-clock interrupt after having posted an RCU callback will
> > get things going.
>
> I said 'no' because of this:
>
> +rcu_try_flip_idle(void)
> +{
> + int cpu;
> +
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_i1);
> + if (!rcu_pending(smp_processor_id())) {
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1);
> + return 0;
> + }
>
> But now I'm a bit more confused. :-/
>
> Looking at the caller in kernel/timer.c I see
>
> if (rcu_pending(cpu))
> rcu_check_callbacks(cpu, user_tick);
>
> And rcu_check_callbacks is the caller of rcu_try_flip. The confusion is
> that we call this when we have a pending rcu, but if we have a pending
> rcu, we won't flip the counter ??

We don't enter unless there is something for RCU to do (might be a
pending callback, for example, but might also be needing to acknowledge
a counter flip). If, by the time we get to rcu_try_flip_idle(), there
is no longer anything to do (!rcu_pending()), we bail.

So a given CPU kicks the state machine out of idle only if it -still-
has something to do once it gets to rcu_try_flip_idle(), right?
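
In code form, the double check is roughly this (a condensed sketch of the
fragments quoted above, not the patch itself):

	/* scheduling-clock path, kernel/timer.c: */
	if (rcu_pending(cpu))			/* anything at all for RCU to do?    */
		rcu_check_callbacks(cpu, user_tick);

	/* ...which eventually reaches rcu_try_flip_idle(): */
	if (!rcu_pending(smp_processor_id()))	/* work gone by now (e.g. only an    */
		return 0;			/* ack was needed)?  Then stay idle. */
	rcu_ctrlblk.completed++;		/* otherwise, start a new GP: flip.  */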

[ . . . ]

> > > Is there a chance that overflow of a counter (although probably very
> > > very unlikely) would cause any problems?
> >
> > The only way it could cause a problem would be if there was ever
> > more than 4,294,967,296 outstanding rcu_read_lock() calls. I believe
> > that lockdep screams if it sees more than 30 nested locks within a
> > single task, so for systems that support no more than 100M tasks, we
> > should be OK. It might sometime be necessary to make this be a long
> > rather than an int. Should we just do that now and be done with it?
>
> Sure, why not. More and more and more overkill!!!
>
> (rostedt hears in his head the Monty Python "Spam" song).

;-) OK!

> > > Also, all the CPUs have their "check_mb" set.
> > >
> > > > + rcu_try_flip_state = rcu_try_flip_waitmb_state;
> > > > + break;
> > > > + case rcu_try_flip_waitmb_state:
> > > > + if (rcu_try_flip_waitmb())
> > >
> > > I have to admit that this seems a bit of an overkill, but I guess you
> > > know what you are doing. After going through three states, we still
> > > need to do a memory barrier on each CPU?
> >
> > Yep. Because there are no memory barriers in rcu_read_unlock(), the
> > CPU is free to reorder the contents of the RCU read-side critical section
> > to follow the counter decrement. This means that this CPU would still
> > be referencing RCU-protected data after it had told the world that it
> > was no longer doing so. Forcing a memory barrier on each CPU guarantees
> > that if we see the memory-barrier acknowledge, we also see any prior
> > RCU read-side critical section.
>
> And this seems reasonable to me, that this would be enough to satisfy a
> grace period. But the CPU moving the rcu_read_(un)lock()s around still
> bothers me.
>
> Are we sure that adding all these grace-period stages is better than just
> biting the bullet and putting in a memory barrier?

Good question. I believe so, because the extra stages don't require
much additional processing, and because the ratio of rcu_read_lock()
calls to the number of grace periods is extremely high. But, if I
can prove it is safe, I will certainly decrease GP_STAGES or otherwise
optimize the state machine.
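
To make the comparison concrete, a rough sketch of the two options.
Fragment (a) is the hypothetical barrier-per-unlock variant; fragment (b)
paraphrases the per-CPU rcu_mb_flag handshake that the waitmb stage uses
(the flag-value names here are placeholders, so treat this as a sketch):

	/* (a) hypothetical alternative, as in -rt: a barrier in every unlock */
	smp_mb();				/* order the critical section's reads   */
	__get_cpu_var(rcu_flipctr)[idx]--;	/* before announcing that we are done   */

	/* (b) this patch: rcu_read_unlock() stays barrier-free, and instead
	 *     the waitmb stage asks each CPU for one barrier per grace
	 *     period, roughly: */
	if (per_cpu(rcu_mb_flag, cpu) == rcu_mb_needed) {
		smp_mb();			/* one barrier per GP per CPU,          */
		per_cpu(rcu_mb_flag, cpu) = rcu_mb_done;	/* not one per unlock   */
	}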

[ . . . ]

> > > OK, that's all I have on this patch (will take a bit of a break before
> > > reviewing your other patches). But I will say that RCU has grown quite
> > > a bit, and is looking very good.
> >
> > Glad you like it, and thank you again for the careful and thorough review.
>
> I'm scared to do the preempt portion %^O

Ummm... This -was- the preempt portion. ;-)

> > > Basically, what I'm saying is "Great work, Paul!". This is looking
> > > good. Seems that we just need a little bit better explanation for those
> > > that are not up at the IQ level of you. I can write something up after
> > > this all gets finalized. Sort of a rcu-design.txt, that really tries to
> > > explain it to the simpletons like me ;-)
> >
> > I do greatly appreciate the compliments, especially coming from someone
> > like yourself, but it is also true that I have been implementing and
> > using RCU in various forms for longer than some Linux-community members
> > (not many, but a few) have been alive, and programming since 1972 or so.
> > Lots and lots of practice!
>
> `72, I was 4.

What, and you weren't programming yet??? ;-)

> > Hmmm... I started programming about the same time that I started
> > jogging consistently. Never realized that before.
>
> Well, I hope you keep doing both for a long time to come.

Me too! ;-)

> > I am thinking in terms of getting an improved discussion of RCU design and
> > use out there -- after all, the fifth anniversary of RCU's addition to
> > the kernel is coming right up. This does deserve better documentation,
> > especially given that for several depressing weeks near the beginning
> > of 2005 I believed that a realtime-friendly RCU might not be possible.
>
> That is definitely an accomplishment. And I know as well as you do that it
> happened because of a lot of people sharing ideas. Some good, some bad,
> but all helpful for healthy development!

Indeed! The current version is quite a bit different than my early-2005
posting (which relied on locks!), and a -lot- of people had a hand in
making it what it is today.

Thanx, Paul

2007-09-22 02:57:20

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU


[ sneaks away from the family for a bit to answer emails ]

--
On Fri, 21 Sep 2007, Paul E. McKenney wrote:

> On Fri, Sep 21, 2007 at 09:19:22PM -0400, Steven Rostedt wrote:
> >
> > --
> > On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> > > >
> > > > In any case, I will be looking at the scenarios more carefully. If
> > > > it turns out that GP_STAGES can indeed be cranked down a bit, well,
> > > > that is an easy change! I just fired off a POWER run with GP_STAGES
> > > > set to 3, will let you know how it goes.
> > >
> > > The first attempt blew up during boot badly enough that ABAT was unable
> > > to recover the machine (sorry, grahal!!!). Just for grins, I am trying
> > > it again on a machine that ABAT has had a better record of reviving...
> >
> > This still frightens the hell out of me. Going through 15 states and
> > failing. Seems the CPU is holding off writes for a long long time. That
> > means we flipped the counter 4 times, and that still wasn't good enough?
>
> Might be that the other machine has its 2.6.22 version of .config messed
> up. I will try booting it on a stock 2.6.22 kernel when it comes back
> to life -- not sure I ever did that before. Besides, the other similar
> machine seems to have gone down for the count, but without me torturing
> it...
>
> Also, keep in mind that various stages can "record" a memory misordering,
> for example, by incrementing the wrong counter.
>
> > Maybe I'll boot up my powerbook to see if it has the same issues.
> >
> > Well, I'm still finishing up on moving into my new house, so I won't be
> > available this weekend.
>
> The other machine not only booted, but has survived several minutes of
> rcutorture thus far. I am also trying POWER5 machine as well, as the
> one currently running is a POWER4, which is a bit less aggressive about
> memory reordering than is the POWER5.
>
> Even if they pass, I refuse to reduce GP_STAGES until proven safe.
> Trust me, you -don't- want to be unwittingly making use of a subtly
> busted RCU implementation!!!

I totally agree. This is the same reason I want to understand -why- it
fails with 3 stages. To make sure that adding a 4th stage really does fix
it, and doesn't just make the chances for the bug smaller.

I just have that funny feeling that we are band-aiding this for POWER with
extra stages and not really solving the bug.

I could be totally out in left field on this. But the more people have a
good understanding of what is happening (this includes why things fail)
the more people in general can trust this code. Right now I'm thinking
you may be the only one that understands this code enough to trust it. I'm
just wanting you to help people like me to trust the code by understanding
and not just having faith in others.

If you ever decide to give up jogging, we need to make sure that there are
people here that can still fill those running shoes (-:


-- Steve

2007-09-22 03:17:26

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU


[took off Ingo, because he has my ISP blacklisted, and I'm tired of
getting those return mail messages. He can read LKML or you can re-CC
him. Sad since this is a topic he should be reading. ]

--
On Fri, 21 Sep 2007, Paul E. McKenney wrote:

> On Fri, Sep 21, 2007 at 09:15:03PM -0400, Steven Rostedt wrote:
> > On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> > > On Fri, Sep 21, 2007 at 10:40:03AM -0400, Steven Rostedt wrote:
> > > > On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:
>
> [ . . . ]
>
> > > > > + /*
> > > > > + * Take the next transition(s) through the RCU grace-period
> > > > > + * flip-counter state machine.
> > > > > + */
> > > > > +
> > > > > + switch (rcu_try_flip_state) {
> > > > > + case rcu_try_flip_idle_state:
> > > > > + if (rcu_try_flip_idle())
> > > > > + rcu_try_flip_state = rcu_try_flip_waitack_state;
> > > >
> > > > Just trying to understand all this. Here at flip_idle, only a CPU with
> > > > no pending RCU calls will flip it. Then all the cpus flags will be set
> > > > to rcu_flipped, and the ctrl.completed counter is incremented.
> > >
> > > s/no pending RCU calls/at least one pending RCU call/, but otherwise
> > > spot on.
> > >
> > > So if the RCU grace-period machinery is idle, the first CPU to take
> > > a scheduling-clock interrupt after having posted an RCU callback will
> > > get things going.
> >
> > I said 'no' because of this:
> >
> > +rcu_try_flip_idle(void)
> > +{
> > + int cpu;
> > +
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_i1);
> > + if (!rcu_pending(smp_processor_id())) {
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1);
> > + return 0;
> > + }
> >
> > But now I'm a bit more confused. :-/
> >
> > Looking at the caller in kernel/timer.c I see
> >
> > if (rcu_pending(cpu))
> > rcu_check_callbacks(cpu, user_tick);
> >
> > And rcu_check_callbacks is the caller of rcu_try_flip. The confusion is
> > that we call this when we have a pending rcu, but if we have a pending
> > rcu, we won't flip the counter ??
>
> We don't enter unless there is something for RCU to do (might be a
> pending callback, for example, but might also be needing to acknowledge
> a counter flip). If, by the time we get to rcu_try_flip_idle(), there
> is no longer anything to do (!rcu_pending()), we bail.
>
> So a given CPU kicks the state machine out of idle only if it -still-
> has something to do once it gets to rcu_try_flip_idle(), right?
>

Now I can slap my forehead! Duh, I wasn't seeing that ! in front of the
rcu_pending condition in the rcu_try_flip_idle. We only flip if we do
indeed have something pending. I need some sleep. I also need to
re-evaluate some of my analysis of that code. But it doesn't change my
opinion of the stages.

> >
> > Are we sure that adding all these grace-period stages is better than just
> > biting the bullet and putting in a memory barrier?
>
> Good question. I believe so, because the extra stages don't require
> much additional processing, and because the ratio of rcu_read_lock()
> calls to the number of grace periods is extremely high. But, if I
> can prove it is safe, I will certainly decrease GP_STAGES or otherwise
> optimize the state machine.

But until others besides yourself understand that state machine (doesn't
really need to be me) I would be worried about applying it without
barriers. The barriers may add a bit of overhead, but it adds some
confidence in the code. I'm arguing that we have barriers in there until
there's a fine understanding of why we fail with 3 stages and not 4.
Perhaps you don't have a box with enough cpus to fail at 4.

I don't know how the higher ups in the kernel command line feel, but I
think that memory barriers on critical sections are justified. But if you
can show a proof that adding extra stages is sufficient to deal with
CPUs moving memory writes around, then so be it. But I'm still not
convinced that these extra stages are really solving the bug instead of
just making it much less likely to happen.

Ingo praised this code since it had several years of testing in the RT
tree. But that version has barriers, so this new verison without the
barriers has not had that "run it through the grinder" feeling to it.

>
> [ . . . ]
>
> > > > OK, that's all I have on this patch (will take a bit of a break before
> > > > reviewing your other patches). But I will say that RCU has grown quite
> > > > a bit, and is looking very good.
> > >
> > > Glad you like it, and thank you again for the careful and thorough review.
> >
> > I'm scared to do the preempt portion %^O
>
> Ummm... This -was- the preempt portion. ;-)

hehe, I do need sleep. I meant the boosting portion.

>
> > > > Basically, what I'm saying is "Great work, Paul!". This is looking
> > > > good. Seems that we just need a little bit better explanation for those
> > > > that are not up at the IQ level of you. I can write something up after
> > > > this all gets finalized. Sort of a rcu-design.txt, that really tries to
> > > > explain it to the simpletons like me ;-)
> > >
> > > I do greatly appreciate the compliments, especially coming from someone
> > > like yourself, but it is also true that I have been implementing and
> > > using RCU in various forms for longer than some Linux-community members
> > > (not many, but a few) have been alive, and programming since 1972 or so.
> > > Lots and lots of practice!
> >
> > `72, I was 4.
>
> What, and you weren't programming yet??? ;-)

I think I was working a Turing Machine with my A-B-C blocks.

>
> > > Hmmm... I started programming about the same time that I started
> > > jogging consistently. Never realized that before.
> >
> > Well, I hope you keep doing both for a long time to come.
>
> Me too! ;-)
>
> > > I am thinking in terms of getting an improved discussion of RCU design and
> > > use out there -- after all, the fifth anniversary of RCU's addition to
> > > the kernel is coming right up. This does deserve better documentation,
> > > especially given that for several depressing weeks near the beginning
> > > of 2005 I believed that a realtime-friendly RCU might not be possible.
> >
> > That is definitely an accomplishment. And I know as well as you do that it
> > happened because of a lot of people sharing ideas. Some good, some bad,
> > but all helpful for healthy development!
>
> Indeed! The current version is quite a bit different than my early-2005
> posting (which relied on locks!), and a -lot- of people had a hand in
> making it what it is today.

True.

-- Steve

2007-09-22 04:07:30

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Fri, Sep 21, 2007 at 11:15:42PM -0400, Steven Rostedt wrote:
> On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> > On Fri, Sep 21, 2007 at 09:15:03PM -0400, Steven Rostedt wrote:
> > > On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> > > > On Fri, Sep 21, 2007 at 10:40:03AM -0400, Steven Rostedt wrote:
> > > > > On Mon, Sep 10, 2007 at 11:34:12AM -0700, Paul E. McKenney wrote:

[ . . . ]

> > > Are we sure that adding all these grace-period stages is better than just
> > > biting the bullet and putting in a memory barrier?
> >
> > Good question. I believe so, because the extra stages don't require
> > much additional processing, and because the ratio of rcu_read_lock()
> > calls to the number of grace periods is extremely high. But, if I
> > can prove it is safe, I will certainly decrease GP_STAGES or otherwise
> > optimize the state machine.
>
> But until others besides yourself understand that state machine (doesn't
> really need to be me) I would be worried about applying it without
> barriers. The barriers may add a bit of overhead, but it adds some
> confidence in the code. I'm arguing that we have barriers in there until
> there's a fine understanding of why we fail with 3 stages and not 4.
> Perhaps you don't have a box with enough cpus to fail at 4.
>
> I don't know how the higher ups in the kernel command line feel, but I
> think that memory barriers on critical sections are justified. But if you
> can show a proof that adding extra stages is sufficient to deal with
> CPUs moving memory writes around, then so be it. But I'm still not
> convinced that these extra stages are really solving the bug instead of
> just making it much less likely to happen.
>
> Ingo praised this code since it had several years of testing in the RT
> tree. But that version has barriers, so this new version without the
> barriers has not had that "run it through the grinder" feeling to it.

Fair point... Though the -rt variant has its shortcomings as well,
such as being unusable from NMI/SMI handlers.

How about this: I continue running the GP_STAGES=3 run on the pair of
POWER machines (which are both going strong), and I also get a document
together describing the new version (and of course apply the changes we
have discussed, and merge with recent CPU-hotplug changes -- Gautham
Shenoy is currently working on this), work out a good answer to "how
big exactly does GP_STAGES need to be", test whatever that number is,
assuming it is neither 3 nor 4, and figure out why the gekko-lp1 machine
choked on GP_STAGES=3.

Then we can work out the best path forward from wherever that ends up
being.

[ . . . ]

Thanx, Paul

2007-09-22 04:10:59

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Fri, Sep 21, 2007 at 10:56:56PM -0400, Steven Rostedt wrote:
>
> [ sneaks away from the family for a bit to answer emails ]

[ same here, now that you mention it... ]

> --
> On Fri, 21 Sep 2007, Paul E. McKenney wrote:
>
> > On Fri, Sep 21, 2007 at 09:19:22PM -0400, Steven Rostedt wrote:
> > >
> > > --
> > > On Fri, 21 Sep 2007, Paul E. McKenney wrote:
> > > > >
> > > > > In any case, I will be looking at the scenarios more carefully. If
> > > > > it turns out that GP_STAGES can indeed be cranked down a bit, well,
> > > > > that is an easy change! I just fired off a POWER run with GP_STAGES
> > > > > set to 3, will let you know how it goes.
> > > >
> > > > The first attempt blew up during boot badly enough that ABAT was unable
> > > > to recover the machine (sorry, grahal!!!). Just for grins, I am trying
> > > > it again on a machine that ABAT has had a better record of reviving...
> > >
> > > This still frightens the hell out of me. Going through 15 states and
> > > failing. Seems the CPU is holding off writes for a long long time. That
> > > means we flipped the counter 4 times, and that still wasn't good enough?
> >
> > Might be that the other machine has its 2.6.22 version of .config messed
> > up. I will try booting it on a stock 2.6.22 kernel when it comes back
> > to life -- not sure I ever did that before. Besides, the other similar
> > machine seems to have gone down for the count, but without me torturing
> > it...
> >
> > Also, keep in mind that various stages can "record" a memory misordering,
> > for example, by incrementing the wrong counter.
> >
> > > Maybe I'll boot up my powerbook to see if it has the same issues.
> > >
> > > Well, I'm still finishing up on moving into my new house, so I won't be
> > > available this weekend.
> >
> > The other machine not only booted, but has survived several minutes of
> > rcutorture thus far. I am also trying POWER5 machine as well, as the
> > one currently running is a POWER4, which is a bit less aggressive about
> > memory reordering than is the POWER5.
> >
> > Even if they pass, I refuse to reduce GP_STAGES until proven safe.
> > Trust me, you -don't- want to be unwittingly making use of a subtly
> > busted RCU implementation!!!
>
> I totally agree. This is the same reason I want to understand -why- it
> fails with 3 stages. To make sure that adding a 4th stage really does fix
> it, and doesn't just make the chances for the bug smaller.

Or if it really does break, as opposed to my having happened upon a sick
or misconfigured machine.

> I just have that funny feeling that we are band-aiding this for POWER with
> extra stages and not really solving the bug.
>
> I could be totally out in left field on this. But the more people have a
> good understanding of what is happening (this includes why things fail)
> the more people in general can trust this code. Right now I'm thinking
> you may be the only one that understands this code enough to trust it. I'm
> just wanting you to help people like me to trust the code by understanding
> and not just having faith in others.

Agreed. Trusting me is grossly insufficient. For one thing, the Linux
kernel has a reasonable chance of outliving me.

> If you ever decide to give up jogging, we need to make sure that there are
> people here that can still fill those running shoes (-:

Well, I certainly don't jog as fast or as far as I used to! ;-)

Thanx, Paul

2007-09-23 17:34:21

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On 09/10, Paul E. McKenney wrote:
>
> Work in progress, not for inclusion.

Impressive work! a couple of random newbie's questions...

> --- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 2007-07-19 14:02:36.000000000 -0700
> +++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h 2007-08-22 15:21:06.000000000 -0700
>
> extern void rcu_check_callbacks(int cpu, int user);

Also superfluously declared in rcuclassic.h and in rcupreempt.h

> --- linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h 1969-12-31 16:00:00.000000000 -0800
> +++ linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h 2007-08-22 15:21:06.000000000 -0700
> +
> +#define __rcu_read_lock_nesting() (current->rcu_read_lock_nesting)

unused?

> diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/kernel/rcupreempt.c linux-2.6.22-c-preemptrcu/kernel/rcupreempt.c
> --- linux-2.6.22-b-fixbarriers/kernel/rcupreempt.c 1969-12-31 16:00:00.000000000 -0800
> +++ linux-2.6.22-c-preemptrcu/kernel/rcupreempt.c 2007-08-22 15:35:19.000000000 -0700
>
> +static DEFINE_PER_CPU(struct rcu_data, rcu_data);
> +static struct rcu_ctrlblk rcu_ctrlblk = {
> + .fliplock = SPIN_LOCK_UNLOCKED,
> + .completed = 0,
> +};
> static DEFINE_PER_CPU(int [2], rcu_flipctr) = { 0, 0 };

Just curious, any reason why rcu_flipctr can't live in rcu_data ? Similarly,
rcu_try_flip_state can be a member of rcu_ctrlblk, it is even protected by
rcu_ctrlblk.fliplock

Isn't DEFINE_PER_CPU_SHARED_ALIGNED better for rcu_flip_flag and rcu_mb_flag?

> +void __rcu_read_lock(void)
> +{
> + int idx;
> + struct task_struct *me = current;
> + int nesting;
> +
> + nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
> + if (nesting != 0) {
> +
> + /* An earlier rcu_read_lock() covers us, just count it. */
> +
> + me->rcu_read_lock_nesting = nesting + 1;
> +
> + } else {
> + unsigned long oldirq;
> +
> + /*
> + * Disable local interrupts to prevent the grace-period
> + * detection state machine from seeing us half-done.
> + * NMIs can still occur, of course, and might themselves
> + * contain rcu_read_lock().
> + */
> +
> + local_irq_save(oldirq);

Could you please tell more, why do we need this cli?

It can't "protect" rcu_ctrlblk.completed, and the only change which affects
the state machine is rcu_flipctr[idx]++, so I can't understand the "half-done"
above. (of course, we must disable preemption while changing rcu_flipctr).

Hmm... this was already discussed... from another message:

> Critical portions of the GP protection happen in the scheduler-clock
> interrupt, which is a hardirq. For example, the .completed counter
> is always incremented in hardirq context, and we cannot tolerate a
> .completed increment in this code. Allowing such an increment would
> defeat the counter-acknowledge state in the state machine.

Still can't understand, please help! .completed could be incremented by
another CPU, so why is rcu_check_callbacks() running on _this_ CPU so special?

> +
> + /*
> + * Outermost nesting of rcu_read_lock(), so increment
> + * the current counter for the current CPU. Use volatile
> + * casts to prevent the compiler from reordering.
> + */
> +
> + idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
> + smp_read_barrier_depends(); /* @@@@ might be unneeded */
> + ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++;

Isn't it "obvious" that this barrier is not needed? rcu_flipctr could be
change only by this CPU, regardless of the actual value of idx, we can't
read the "wrong" value of rcu_flipctr[idx], no?

OTOH, I can't understand how it works. With or without local_irq_save(),
another CPU can do rcu_try_flip_idle() and increment rcu_ctrlblk.completed
at any time, so we can see the old value... can rcu_try_flip_waitzero() miss us?

OK, GP_STAGES > 1, so rcu_try_flip_waitzero() will actually check both
0 and 1 lastidx's before synchronize_rcu() succeeds... I doubt very much
my understanding is correct. Apart from this why GP_STAGES > 1 ???

Well, I think this code is just too tricky for me! Will try to read again
after sleep ;)

> +static int
> +rcu_try_flip_idle(void)
> +{
> + int cpu;
> +
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_i1);
> + if (!rcu_pending(smp_processor_id())) {
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1);
> + return 0;
> + }
> +
> + /*
> + * Do the flip.
> + */
> +
> + RCU_TRACE_ME(rcupreempt_trace_try_flip_g1);
> + rcu_ctrlblk.completed++; /* stands in for rcu_try_flip_g2 */
> +
> + /*
> + * Need a memory barrier so that other CPUs see the new
> + * counter value before they see the subsequent change of all
> + * the rcu_flip_flag instances to rcu_flipped.
> + */

Why? Any code sequence which relies on that?

For example, rcu_check_callbacks does

if (rcu_ctrlblk.completed == rdp->completed)
rcu_try_flip();

it is possible that the timer interrupt calls rcu_check_callbacks()
exactly because rcu_pending() sees rcu_flipped, but without rmb() in
between we can see the old value of rcu_ctrlblk.completed.

This is not a problem because rcu_try_flip() needs rcu_ctrlblk.fliplock,
so rcu_try_flip() will see the new state != rcu_try_flip_idle_state, but
I can't understand any mb() in rcu_try_flip_xxx().

> +static void rcu_process_callbacks(struct softirq_action *unused)
> +{
> + unsigned long flags;
> + struct rcu_head *next, *list;
> + struct rcu_data *rdp = RCU_DATA_ME();
> +
> + spin_lock_irqsave(&rdp->lock, flags);
> + list = rdp->donelist;
> + if (list == NULL) {
> + spin_unlock_irqrestore(&rdp->lock, flags);
> + return;
> + }

Do we really need this fastpath? It is not needed for correctness,
and this case is very unlikely (in fact I think it is not possible:
rcu_check_callbacks() (triggers RCU_SOFTIRQ) is called with irqs disabled).

> +void fastcall call_rcu(struct rcu_head *head,
> + void (*func)(struct rcu_head *rcu))
> +{
> + unsigned long oldirq;
> + struct rcu_data *rdp;
> +
> + head->func = func;
> + head->next = NULL;
> + local_irq_save(oldirq);
> + rdp = RCU_DATA_ME();
> + spin_lock(&rdp->lock);

This looks a bit strange. Is this optimization? Why not

spin_lock_irqsave(&rdp->lock, flags);
rdp = RCU_DATA_ME();
...

? RCU_DATA_ME() is cheap, but spin_lock() under local_irq_save() spins
without preemption.

Actually, why do we need rcu_data->lock ? Looks like local_irq_save()
should be enough, no? perhaps some -rt reasons ?

> + __rcu_advance_callbacks(rdp);

Any reason this func can't do rcu_check_mb() as well?

If this is possible, can't we move the code doing "s/rcu_flipped/rcu_flip_seen/"
from __rcu_advance_callbacks() to rcu_check_mb() to unify the "acks" ?

> +void __synchronize_sched(void)
> +{
> + cpumask_t oldmask;
> + int cpu;
> +
> + if (sched_getaffinity(0, &oldmask) < 0)
> + oldmask = cpu_possible_map;
> + for_each_online_cpu(cpu) {
> + sched_setaffinity(0, cpumask_of_cpu(cpu));
> + schedule();

This "schedule()" is not needed, any time sched_setaffinity() returns on another
CPU we already forced preemption of the previously active task on that CPU.

> + }
> + sched_setaffinity(0, oldmask);
> +}

Well, this is not correct... but doesn't matter because of the next patch.

But could you explain how it can deadlock (according to the changelog of
the next patch) ?

Thanks!

Oleg.

2007-09-24 00:15:30

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Sun, Sep 23, 2007 at 09:38:07PM +0400, Oleg Nesterov wrote:
> On 09/10, Paul E. McKenney wrote:
> >
> > Work in progress, not for inclusion.
>
> Impressive work! a couple of random newbie's questions...

Thank you for the kind words, and most especially for the careful review!!!

And Oleg, I don't think you qualify as a newbie anymore. ;-)

> > --- linux-2.6.22-b-fixbarriers/include/linux/rcupdate.h 2007-07-19 14:02:36.000000000 -0700
> > +++ linux-2.6.22-c-preemptrcu/include/linux/rcupdate.h 2007-08-22 15:21:06.000000000 -0700
> >
> > extern void rcu_check_callbacks(int cpu, int user);
>
> Also superfluously declared in rcuclassic.h and in rcupreempt.h

Definitely are two of them in rcupdate.h, good eyes! The ones in
rcuclassic.h and rcupreempt.h, by the time all the patches are
applied, are for rcu_check_callbacks_rt(). However, this, along
with the other _rt() cross-calls from rcuclassic.c to rcupreempt.c,
will go away when the merge with upcoming hotplug patches is complete.

> > --- linux-2.6.22-b-fixbarriers/include/linux/rcupreempt.h 1969-12-31 16:00:00.000000000 -0800
> > +++ linux-2.6.22-c-preemptrcu/include/linux/rcupreempt.h 2007-08-22 15:21:06.000000000 -0700
> > +
> > +#define __rcu_read_lock_nesting() (current->rcu_read_lock_nesting)
>
> unused?

It would seem so! Will remove it.

> > diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/kernel/rcupreempt.c linux-2.6.22-c-preemptrcu/kernel/rcupreempt.c
> > --- linux-2.6.22-b-fixbarriers/kernel/rcupreempt.c 1969-12-31 16:00:00.000000000 -0800
> > +++ linux-2.6.22-c-preemptrcu/kernel/rcupreempt.c 2007-08-22 15:35:19.000000000 -0700
> >
> > +static DEFINE_PER_CPU(struct rcu_data, rcu_data);
> > +static struct rcu_ctrlblk rcu_ctrlblk = {
> > + .fliplock = SPIN_LOCK_UNLOCKED,
> > + .completed = 0,
> > +};
> > static DEFINE_PER_CPU(int [2], rcu_flipctr) = { 0, 0 };
>
> Just curious, any reason why rcu_flipctr can't live in rcu_data ? Similarly,
> rcu_try_flip_state can be a member of rcu_ctrlblk, it is even protected by
> rcu_ctrlblk.fliplock

In the case of rcu_flipctr, this is historical -- the implementation in
-rt has only one rcu_data, rather than one per CPU. I cannot think of
a problem with placing it in rcu_data right off-hand, will look it over
carefully and see.

Good point on rcu_try_flip_state!!!

> Isn't DEFINE_PER_CPU_SHARED_ALIGNED better for rcu_flip_flag and rcu_mb_flag?

Looks like it to me, thank you for the tip!

Hmmm... Why not the same for rcu_data? I guess because there is
very little sharing? The only example thus far of sharing would be
if rcu_flipctr were to be moved into rcu_data (if an RCU read-side
critical section were preempted and then started on some other CPU),
though I will need to check more carefully.

Of course, if we start having CPUs access each others' rcu_data structures
to make RCU more power-friendly in dynticks situations, then maybe there
would be more reason to use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_data.

Thoughts?

> > +void __rcu_read_lock(void)
> > +{
> > + int idx;
> > + struct task_struct *me = current;
> > + int nesting;
> > +
> > + nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
> > + if (nesting != 0) {
> > +
> > + /* An earlier rcu_read_lock() covers us, just count it. */
> > +
> > + me->rcu_read_lock_nesting = nesting + 1;
> > +
> > + } else {
> > + unsigned long oldirq;
> > +
> > + /*
> > + * Disable local interrupts to prevent the grace-period
> > + * detection state machine from seeing us half-done.
> > + * NMIs can still occur, of course, and might themselves
> > + * contain rcu_read_lock().
> > + */
> > +
> > + local_irq_save(oldirq);
>
> Could you please tell more, why do we need this cli?
>
> It can't "protect" rcu_ctrlblk.completed, and the only change which affects
> the state machine is rcu_flipctr[idx]++, so I can't understand the "half-done"
> above. (of course, we must disable preemption while changing rcu_flipctr).
>
> Hmm... this was already discussed... from another message:
>
> > Critical portions of the GP protection happen in the scheduler-clock
> > interrupt, which is a hardirq. For example, the .completed counter
> > is always incremented in hardirq context, and we cannot tolerate a
> > .completed increment in this code. Allowing such an increment would
> > defeat the counter-acknowledge state in the state machine.
>
> Still can't understand, please help! .completed could be incremented by
> another CPU, so why is rcu_check_callbacks() running on _this_ CPU so special?

Yeah, I guess my explanation -was- a bit lacking... When I re-read it, it
didn't even help -me- much! ;-)

I am putting together better documentation, but in the meantime, let me
try again...

The problem is not with .completed being incremented, but with
it (1) being incremented (presumably by some other CPU) and
then (2) having this CPU get a scheduling-clock interrupt, which
then causes this CPU to prematurely update the rcu_flip_flag.
This is a problem, because updating rcu_flip_flag is making a
promise that you won't ever again increment the "old" counter set
(at least not until the next counter flip). If the scheduling
clock interrupt were to happen between the time that we pick
up the .completed field and the time that we increment our
counter, we will have broken that promise, and that could cause
someone to prematurely declare the grace period to be finished
(as in before our RCU read-side critical section completes).
The problem occurs if we end up incrementing our counter just
after the grace-period state machine has found the sum of the
now-old counters to be zero. (This probably requires a diagram,
and one is in the works.)

But that is not the only reason for the cli...

The second reason is that cli prohibits preemption. If we were
preempted during this code sequence, we might end up running on
some other CPU, but still having a reference to the original
CPU's counter. If the original CPU happened to be doing another
rcu_read_lock() or rcu_read_unlock() just as we started running,
then someone's increment or decrement might get lost. We could
of course prevent this with preempt_disable() instead of cli.

There might well be another reason or two, but that is what
I can come up with at the moment. ;-)
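
Putting those two reasons directly into the code, the outermost-nesting
path of __rcu_read_lock() (quoted above) looks roughly like this, with the
tail of the function paraphrased, so treat it as a sketch:

	local_irq_save(oldirq);		/* closes both windows below           */

	idx = rcu_ctrlblk.completed & 0x1;	/* sample the current counter set */

	/* Window 1: a scheduling-clock interrupt here could acknowledge a
	 * counter flip (rcu_flip_flag -> rcu_flip_seen), promising that this
	 * CPU will never again touch the "old" counters -- a promise that
	 * the increment below would then break.
	 *
	 * Window 2: preemption here could migrate us to another CPU while we
	 * still reference this CPU's rcu_flipctr, so our non-atomic increment
	 * could race with that CPU's own increments and decrements, losing
	 * one of them. */

	ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++;
	me->rcu_read_lock_nesting = nesting + 1;

	local_irq_restore(oldirq);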

> > +
> > + /*
> > + * Outermost nesting of rcu_read_lock(), so increment
> > + * the current counter for the current CPU. Use volatile
> > + * casts to prevent the compiler from reordering.
> > + */
> > +
> > + idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
> > + smp_read_barrier_depends(); /* @@@@ might be unneeded */
> > + ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++;
>
> Isn't it "obvious" that this barrier is not needed? rcu_flipctr could be
> change only by this CPU, regardless of the actual value of idx, we can't
> read the "wrong" value of rcu_flipctr[idx], no?

I suspect that you are correct.

> OTOH, I can't understand how it works. With or without local_irq_save(),
> another CPU can do rcu_try_flip_idle() and increment rcu_ctrlblk.completed
> at any time, so we can see the old value... can rcu_try_flip_waitzero() miss us?
>
> OK, GP_STAGES > 1, so rcu_try_flip_waitzero() will actually check both
> 0 and 1 lastidx's before synchronize_rcu() succeeds... I doubt very much
> my understanding is correct. Apart from this why GP_STAGES > 1 ???

This is indeed one reason. I am adding you to my list of people to send
early versions of the document to.

> Well, I think this code is just too tricky for me! Will try to read again
> after sleep ;)

Yeah, it needs better documentation...

> > +static int
> > +rcu_try_flip_idle(void)
> > +{
> > + int cpu;
> > +
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_i1);
> > + if (!rcu_pending(smp_processor_id())) {
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1);
> > + return 0;
> > + }
> > +
> > + /*
> > + * Do the flip.
> > + */
> > +
> > + RCU_TRACE_ME(rcupreempt_trace_try_flip_g1);
> > + rcu_ctrlblk.completed++; /* stands in for rcu_try_flip_g2 */
> > +
> > + /*
> > + * Need a memory barrier so that other CPUs see the new
> > + * counter value before they see the subsequent change of all
> > + * the rcu_flip_flag instances to rcu_flipped.
> > + */
>
> Why? Any code sequence which relies on that?

Yep. If a CPU saw the flip flag set to rcu_flipped before it saw
the new value of the counter, it might acknowledge the flip, but
then later use the old value of the counter. It might then end up
incrementing the old counter after some other CPU had concluded
that the sum is zero, which could result in premature detection of
the end of a grace period.
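
As a sketch (the per-CPU loop and the reader-side fragment are paraphrased):

	/* Flip side (grace-period state machine): */
	rcu_ctrlblk.completed++;			/* A: new counter set     */
	smp_mb();			/* A must be visible before B below     */
	for_each_online_cpu(cpu)			/* loop paraphrased       */
		per_cpu(rcu_flip_flag, cpu) = rcu_flipped;	/* B              */

	/* Reader CPU, later: if it sees B and acknowledges...                 */
	if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped)
		per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;	/* the ack        */
	/* ...then, given the barrier above (plus ordering on the reading
	 * side, which the patch gets from its own barriers and locking), it
	 * has also seen A, so its next rcu_read_lock() increments the new
	 * counter set rather than the "old" one it just promised to leave
	 * alone. */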

> For example, rcu_check_callbacks does
>
> if (rcu_ctrlblk.completed == rdp->completed)
> rcu_try_flip();
>
> it is possible that the timer interrupt calls rcu_check_callbacks()
> exactly because rcu_pending() sees rcu_flipped, but without rmb() in
> between we can see the old value of rcu_ctrlblk.completed.
>
> This is not a problem because rcu_try_flip() needs rcu_ctrlblk.fliplock,
> so rcu_try_flip() will see the new state != rcu_try_flip_idle_state, but
> I can't understand any mb() in rcu_try_flip_xxx().

OK, your point is that I am not taking the locks into account when
determining what memory barriers to use. You are quite correct, I was
being conservative given that this is not a fast path of any sort.
Nonetheless, I will review this -- no point in any unneeded memory
barriers. Though a comment will be required in any case, as someone
will no doubt want to remove the locking at some point...

> > +static void rcu_process_callbacks(struct softirq_action *unused)
> > +{
> > + unsigned long flags;
> > + struct rcu_head *next, *list;
> > + struct rcu_data *rdp = RCU_DATA_ME();
> > +
> > + spin_lock_irqsave(&rdp->lock, flags);
> > + list = rdp->donelist;
> > + if (list == NULL) {
> > + spin_unlock_irqrestore(&rdp->lock, flags);
> > + return;
> > + }
>
> Do we really need this fastpath? It is not needed for correctness,
> and this case is very unlikely (in fact I think it is not possible:
> rcu_check_callbacks() (triggers RCU_SOFTIRQ) is called with irqs disabled).

Could happen in -rt, where softirq is in process context. And can't
softirq be pushed to process context in mainline? Used to be, checking...
Yep, there still is a ksoftirqd().

The stats would come out slightly different if this optimization were
removed, but shouldn't be a problem -- easy to put the RCU_TRACE_RDP()
under an "if".

Will think through whether or not this is worth the extra code, good
eyes!

> > +void fastcall call_rcu(struct rcu_head *head,
> > + void (*func)(struct rcu_head *rcu))
> > +{
> > + unsigned long oldirq;
> > + struct rcu_data *rdp;
> > +
> > + head->func = func;
> > + head->next = NULL;
> > + local_irq_save(oldirq);
> > + rdp = RCU_DATA_ME();
> > + spin_lock(&rdp->lock);
>
> This looks a bit strange. Is this optimization? Why not
>
> spin_lock_irqsave(&rdp->lock, flags);
> rdp = RCU_DATA_ME();
> ...

Can't do the above, because I need the pointer before I can lock it. ;-)

> ? RCU_DATA_ME() is cheap, but spin_lock() under local_irq_save() spins
> without preemption.
>
> Actually, why do we need rcu_data->lock ? Looks like local_irq_save()
> should be enough, no? perhaps some -rt reasons ?

We need to retain the ability to manipulate other CPU's rcu_data structures
in order to make dynticks work better and also to handle impending OOM.
So I kept all the rcu_data manipulations under its lock. However, as you
say, there would be no cross-CPU accesses except in cases of low-power
state and OOM, so lock contention should not be an issue.

Seem reasonable?

> > + __rcu_advance_callbacks(rdp);
>
> Any reason this func can't do rcu_check_mb() as well?

Can't think of any, and it might speed up the grace period in some cases.
Though the call in rcu_process_callbacks() needs to stay there as well,
otherwise a CPU that never did call_rcu() might never do the needed
rcu_check_mb().

> If this is possible, can't we move the code doing "s/rcu_flipped/rcu_flip_seen/"
> from __rcu_advance_callbacks() to rcu_check_mb() to unify the "acks" ?

I believe that we cannot safely do this. The rcu_flipped-to-rcu_flip_seen
transition has to be synchronized to the moving of the callbacks --
either that or we need more GP_STAGES.
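
Conceptually (a sketch rather than the patch code, with the list
manipulation reduced to a hypothetical helper):

	int cpu = smp_processor_id();

	spin_lock(&rdp->lock);
	if (rcu_ctrlblk.completed != rdp->completed) {
		__advance_callbacks_one_stage(rdp);	/* hypothetical helper   */
		rdp->completed = rcu_ctrlblk.completed;
		if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped)
			per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;	/* ack   */
	}
	spin_unlock(&rdp->lock);

The point is that the acknowledgement is keyed off the same observation of
->completed, under the same rdp->lock, as the callback advance, so a CPU
never acknowledges a flip that its callbacks have not yet been advanced for.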

> > +void __synchronize_sched(void)
> > +{
> > + cpumask_t oldmask;
> > + int cpu;
> > +
> > + if (sched_getaffinity(0, &oldmask) < 0)
> > + oldmask = cpu_possible_map;
> > + for_each_online_cpu(cpu) {
> > + sched_setaffinity(0, cpumask_of_cpu(cpu));
> > + schedule();
>
> This "schedule()" is not needed, any time sched_setaffinity() returns on another
> CPU we already forced preemption of the previously active task on that CPU.

OK, harmless but added overhead, then.

> > + }
> > + sched_setaffinity(0, oldmask);
> > +}
>
> Well, this is not correct... but doesn't matter because of the next patch.

Well, the next patch is made unnecessary because of upcoming changes
to hotplug. So, what do I have messed up?

> But could you explain how it can deadlock (according to the changelog of
> the next patch) ?

Might not anymore. It used to be that the CPU hotplug path did a
synchronize_sched() holding the hotplug lock, and that sched_setaffinity()
in turn attempted to acquire the hotplug lock. This came to light when
I proposed a similar set of patches for -mm early this year.

And with Gautham Shenoy's new hotplug patches, this problem goes away.
But these are not in 2.6.22, so I had to pull in the classic RCU
implementation to get a working synchronize_sched().

Thanx, Paul

2007-09-26 15:10:11

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On 09/23, Paul E. McKenney wrote:
>
> On Sun, Sep 23, 2007 at 09:38:07PM +0400, Oleg Nesterov wrote:
> > Isn't DEFINE_PER_CPU_SHARED_ALIGNED better for rcu_flip_flag and rcu_mb_flag?
>
> Looks like it to me, thank you for the tip!
>
> Hmmm... Why not the same for rcu_data? I guess because there is
> very little sharing? The only example thus far of sharing would be
> if rcu_flipctr were to be moved into rcu_data (if an RCU read-side
> critical section were preempted and then started on some other CPU),

even if rcu lock/unlock happen on different CPUs, rcu_flipctr is not "shared";
we remember the index, but not the CPU, so we always modify the "local" data.

> > OTOH, I can't understand how it works. With or without local_irq_save(),
> > another CPU can do rcu_try_flip_idle() and increment rcu_ctrlblk.completed
> > at any time, so we can see the old value... can rcu_try_flip_waitzero() miss us?
> >
> > OK, GP_STAGES > 1, so rcu_try_flip_waitzero() will actually check both
> > 0 and 1 lastidx's before synchronize_rcu() succeeds... I doubt very much
> > my understanding is correct. Apart from this why GP_STAGES > 1 ???
>
> This is indeed one reason.

Thanks a lot! Actually, this explains most of my questions. I was greatly
confused even before I started to read the code. Looking at rcu_flipctr[2]
I wrongly assumed that these 2 counters "belong" to different GPs. When
I started to understand the reality, I was confused by GP_STAGES == 4 ;)

> > > + * Disable local interrupts to prevent the grace-period
> > > + * detection state machine from seeing us half-done.
> > > + * NMIs can still occur, of course, and might themselves
> > > + * contain rcu_read_lock().
> > > + */
> > > +
> > > + local_irq_save(oldirq);
> >
> > Could you please tell more, why do we need this cli?
> >
> > It can't "protect" rcu_ctrlblk.completed, and the only change which affects
> > the state machine is rcu_flipctr[idx]++, so I can't understand the "half-done"
> > above. (of course, we must disable preemption while changing rcu_flipctr).
> >
>
> The problem is not with .completed being incremented, but with
> it (1) being incremented (presumably by some other CPU) and
> then (2) having this CPU get a scheduling-clock interrupt, which
> then causes this CPU to prematurely update the rcu_flip_flag.
> This is a problem, because updating rcu_flip_flag is making a
> promise that you won't ever again increment the "old" counter set
> (at least not until the next counter flip). If the scheduling
> clock interrupt were to happen between the time that we pick
> up the .completed field and the time that we increment our
> counter, we will have broken that promise, and that could cause
> someone to prematurely declare the grace period to be finished

Thanks!

> The second reason is that cli prohibits preemption.

yes, this is clear

> > > + rcu_ctrlblk.completed++; /* stands in for rcu_try_flip_g2 */
> > > +
> > > + /*
> > > + * Need a memory barrier so that other CPUs see the new
> > > + * counter value before they see the subsequent change of all
> > > + * the rcu_flip_flag instances to rcu_flipped.
> > > + */
> >
> > Why? Any code sequence which relies on that?
>
> Yep. If a CPU saw the flip flag set to rcu_flipped before it saw
> the new value of the counter, it might acknowledge the flip, but
> then later use the old value of the counter.

Yes, yes, I see now. We really need these barriers, except I think
rcu_try_flip_idle() can use wmb. However, I have a slightly off-topic question:

// rcu_try_flip_waitzero()
if (A == 0) {
mb();
B = 0;
}

Do we really need the mb() in this case? How is it possible for the STORE
to go before the LOAD? "Obviously", the LOAD should be completed first, no?

> > This looks a bit strange. Is this optimization? Why not
> >
> > spin_lock_irqsave(&rdp->lock, flags);
> > rdp = RCU_DATA_ME();
>
> Can't do the above, because I need the pointer before I can lock it. ;-)

Oooh... well... I need to think more about your explanation :)

> > If this is possible, can't we move the code doing "s/rcu_flipped/rcu_flip_seen/"
> > from __rcu_advance_callbacks() to rcu_check_mb() to unify the "acks" ?
>
> I believe that we cannot safely do this. The rcu_flipped-to-rcu_flip_seen
> transition has to be synchronized to the moving of the callbacks --
> either that or we need more GP_STAGES.

Hmm. Still can't understand.

Suppose that we are doing call_rcu(), and __rcu_advance_callbacks() sees
rdp->completed == rcu_ctrlblk.completed but rcu_flip_flag = rcu_flipped
(say, another CPU does rcu_try_flip_idle() in between).

We ack the flip, call_rcu() enables irqs, the timer interrupt calls
__rcu_advance_callbacks() again and moves the callbacks.

So, it is still possible that "move callbacks" and "ack the flip" happen
out of order. But why is this bad?

This can't "speedup" the moving of our callbacks from next to done lists.
Yes, RCU state machine can switch to the next state/stage, but this looks
safe to me.

Help!

The last question, rcu_check_callbacks does

if (rcu_ctrlblk.completed == rdp->completed)
rcu_try_flip();

Could you clarify the check above? Afaics this is just an optimization;
technically it is correct to rcu_try_flip() at any time, even if ->completed
are not synchronized. Most probably in that case rcu_try_flip_waitack() will
fail, but nothing bad can happen, yes?

> > > + }
> > > + sched_setaffinity(0, oldmask);
> > > +}
> >
> > Well, this is not correct... but doesn't matter because of the next patch.
>
> Well, the next patch is made unnecessary because of upcoming changes
> to hotplug. So, what do I have messed up?

oldmask could be obsolete now. Suppose that the admin moves that task to some
cpuset or just changes its ->cpus_allowed while it does synchronize_sched().

I think there is another problem. It would be nice to eliminate taking the global
sched_hotcpu_mutex in sched_setaffinity() (I think without CONFIG_HOTPLUG_CPU
it is not needed right now). In that case sched_setaffinity(0, cpumask_of_cpu(cpu))
can in fact return on the "wrong" CPU != cpu if another thread changes our affinity
in parallel, and this breaks synchronize_sched().


Can't we do something different? Suppose that we changed migration_thread(),
something like (roughly)

- __migrate_task(req->task, cpu, req->dest_cpu);
+ if (req->task)
+ __migrate_task(req->task, cpu, req->dest_cpu);
+ else
+ schedule(); // unneeded, mb() is enough?

complete(&req->done);

Now,
void synchronize_sched(void)
{
struct migration_req req;
int cpu;

req.task = NULL;
init_completion(&req.done);

for_each_online_cpu(cpu) {
struct rq *rq = cpu_rq(cpu);
int online;

spin_lock_irq(&rq->lock);
online = cpu_online(cpu); // HOTPLUG_CPU
if (online) {
list_add(&req.list, &rq->migration_queue);
req.done.done = 0;
}
spin_unlock_irq(&rq->lock);

if (online) {
wake_up_process(rq->migration_thread);
wait_for_completion(&req.done);
}
}
}

Alternatively, we can use schedule_on_each_cpu(), but it has other disadvantages.

Thoughts?

Oleg.

2007-09-27 15:47:17

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Wed, Sep 26, 2007 at 07:13:51PM +0400, Oleg Nesterov wrote:
> On 09/23, Paul E. McKenney wrote:
> >
> > On Sun, Sep 23, 2007 at 09:38:07PM +0400, Oleg Nesterov wrote:
> > > Isn't DEFINE_PER_CPU_SHARED_ALIGNED better for rcu_flip_flag and rcu_mb_flag?
> >
> > Looks like it to me, thank you for the tip!
> >
> > Hmmm... Why not the same for rcu_data? I guess because there is
> > very little sharing? The only example thus far of sharing would be
> > if rcu_flipctr were to be moved into rcu_data (if an RCU read-side
> > critical section were preempted and then started on some other CPU),
>
> even if rcu lock/unlock happen on different CPUs, rcu_flipctr is not "shared";
> we remember the index, but not the CPU, so we always modify the "local" data.

Ah, good point, and good reason to keep rcu_flipctr separate.

> > > OTOH, I can't understand how it works. With or without local_irq_save(),
> > > another CPU can do rcu_try_flip_idle() and increment rcu_ctrlblk.completed
> > > at any time, so we can see the old value... can rcu_try_flip_waitzero() miss us?
> > >
> > > OK, GP_STAGES > 1, so rcu_try_flip_waitzero() will actually check both
> > > 0 and 1 lastidx's before synchronize_rcu() succeeds... I doubt very much
> > > my understanding is correct. Apart from this why GP_STAGES > 1 ???
> >
> > This is indeed one reason.
>
> Thanks a lot! Actually, this explains most of my questions. I was greatly
> confused even before I started to read the code. Looking at rcu_flipctr[2]
> I wrongly assumed that these 2 counters "belong" to different GPs. When
> I started to understand the reality, I was confused by GP_STAGES == 4 ;)

;-)

> > > > + * Disable local interrupts to prevent the grace-period
> > > > + * detection state machine from seeing us half-done.
> > > > + * NMIs can still occur, of course, and might themselves
> > > > + * contain rcu_read_lock().
> > > > + */
> > > > +
> > > > + local_irq_save(oldirq);
> > >
> > > Could you please tell more, why do we need this cli?
> > >
> > > It can't "protect" rcu_ctrlblk.completed, and the only change which affects
> > > the state machine is rcu_flipctr[idx]++, so I can't understand the "half-done"
> > > above. (of course, we must disable preemption while changing rcu_flipctr).
> > >
> >
> > The problem is not with .completed being incremented per se, but
> > rather with (1) it being incremented (presumably by some other CPU)
> > and then (2) this CPU taking a scheduling-clock interrupt, which
> > then causes this CPU to prematurely update the rcu_flip_flag.
> > This is a problem, because updating rcu_flip_flag is making a
> > promise that you won't ever again increment the "old" counter set
> > (at least not until the next counter flip). If the scheduling
> > clock interrupt were to happen between the time that we pick
> > up the .completed field and the time that we increment our
> > counter, we will have broken that promise, and that could cause
> > someone to prematurely declare the grace period to be finished.
>
> Thanks!
>
> > The second reason is that cli prohibits preemption.
>
> yes, this is clear
>
> > > > + rcu_ctrlblk.completed++; /* stands in for rcu_try_flip_g2 */
> > > > +
> > > > + /*
> > > > + * Need a memory barrier so that other CPUs see the new
> > > > + * counter value before they see the subsequent change of all
> > > > + * the rcu_flip_flag instances to rcu_flipped.
> > > > + */
> > >
> > > Why? Any code sequence which relies on that?
> >
> > Yep. If a CPU saw the flip flag set to rcu_flipped before it saw
> > the new value of the counter, it might acknowledge the flip, but
> > then later use the old value of the counter.
>
> Yes, yes, I see now. We really need these barriers, except I think
> rcu_try_flip_idle() can use wmb. However, I have a bit offtopic question,
>
> 	// rcu_try_flip_waitzero()
> 	if (A == 0) {
> 		mb();
> 		B = 0;
> 	}
>
> Do we really need the mb() in this case? How is it possible that the STORE
> goes before the LOAD? "Obviously", the LOAD should be completed first, no?

Suppose that A was most recently stored by a CPU that shares a store
buffer with this CPU. Then it is possible that some other CPU sees
the store to B as happening before the store that "A==0" above is
loading from. This other CPU would naturally conclude that the store
to B must have happened before the load from A.

In more detail, suppose that CPU 0 and 1 share a store buffer, and
that CPU 2 and 3 share a second store buffer. This happens naturally
if CPUs 0 and 1 are really just different hardware threads within a
single core.

So, suppose the cacheline for A is initially owned by CPUs 2 and 3,
and that the cacheline for B is initially owned by CPUs 0 and 1.
Then consider the following sequence of events:

o CPU 0 stores zero to A. This is a cache miss, so the new value
for A is placed in CPU 0's and 1's store buffer.

o CPU 1 executes the above code, first loading A. It sees
the value of A==0 in the store buffer, and therefore
stores zero to B, which hits in the cache. (I am assuming
that we left out the mb() above).

o CPU 2 loads from B, which misses the cache, and gets the
value that CPU 1 stored. Suppose it checks the value,
and based on this check, loads A. The old value of A might
still be in cache, which would lead CPU 2 to conclude that
the store to B by CPU 1 must have happened before the store
to A by CPU 0.

Memory barriers would prevent this confusion. An intro to store buffers
can be found at http://www.cs.utah.edu/mpv/papers/neiger/fmcad2001.pdf,
FYI.
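
To put the above in code form, here is a rough schematic (A and B are just
the stand-ins used above, not anything from the patch; both start out
nonzero):

	/* CPU 0 -- shares a store buffer with CPU 1: */
	A = 0;				/* sits in the shared store buffer */

	/* CPU 1 -- plays the rcu_try_flip_waitzero() role: */
	if (A == 0) {			/* satisfied early via the store buffer */
		smp_mb();		/* without this, the store to B below can
					 * become globally visible before CPU 0's
					 * store to A does */
		B = 0;
	}

	/* CPU 2 -- shares a store buffer with CPU 3 instead: */
	if (B == 0) {
		smp_rmb();
		BUG_ON(A != 0);		/* can fire if CPU 1 omits the smp_mb() */
	}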

> > > This looks a bit strange. Is this optimization? Why not
> > >
> > > spin_lock_irqsave(&rdp->lock, flags);
> > > rdp = RCU_DATA_ME();
> >
> > Can't do the above, because I need the pointer before I can lock it. ;-)
>
> Oooh... well... I need to think more about your explanation :)

;-)

> > > If this is possible, can't we move the code doing "s/rcu_flipped/rcu_flip_seen/"
> > > from __rcu_advance_callbacks() to rcu_check_mb() to unify the "acks" ?
> >
> > I believe that we cannot safely do this. The rcu_flipped-to-rcu_flip_seen
> > transition has to be synchronized to the moving of the callbacks --
> > either that or we need more GP_STAGES.
>
> Hmm. Still can't understand.

Callbacks could then be injected into a grace period after it
had started.

Or are you arguing that as long as interrupts remain disabled between
the two events, no harm done?

> Suppose that we are doing call_rcu(), and __rcu_advance_callbacks() sees
> rdp->completed == rcu_ctrlblk.completed but rcu_flip_flag = rcu_flipped
> (say, another CPU does rcu_try_flip_idle() in between).
>
> We ack the flip, call_rcu() enables irqs, the timer interrupt calls
> __rcu_advance_callbacks() again and moves the callbacks.
>
> So, it is still possible that "move callbacks" and "ack the flip" happen
> out of order. But why is this bad?
>
> This can't "speedup" the moving of our callbacks from next to done lists.
> Yes, RCU state machine can switch to the next state/stage, but this looks
> safe to me.

Ah -- you are in fact arguing that interrupts remain disabled throughout.
I would still rather that the rcu_flip_seen transition be adjacent
to the callback movement in the code. My fear is that the connection
might be lost otherwise... "Oh, but we can just momentarily enable
interrupts here!"

> Help!
>
> The last question, rcu_check_callbacks does
>
> if (rcu_ctrlblk.completed == rdp->completed)
> rcu_try_flip();
>
> Could you clarify the check above? Afaics this is just optimization,
> technically it is correct to rcu_try_flip() at any time, even if ->completed
> are not synchronized. Most probably in that case rcu_try_flip_waitack() will
> fail, but nothing bad can happen, yes?

From a conceptual viewpoint, if this CPU hasn't caught up with the
last grace-period stage, it has no business trying to push forward to
the next stage. So while this might (or might not) happen to work with
this particular implementation, it needs to stay as is. We need this
code to be robust enough to optimize the grace-period latencies, right?

> > > > + }
> > > > + sched_setaffinity(0, oldmask);
> > > > +}
> > >
> > > Well, this is not correct... but doesn't matter because of the next patch.
> >
> > Well, the next patch is made unnecessary because of upcoming changes
> > to hotplug. So, what do I have messed up?
>
> oldmask could be obsolete now. Suppose that the admin moves that task to some
> cpuset or just changes its ->cpus_allowed while it does synchronize_sched().

Ah, good point...

> I think there is another problem. It would be nice to eliminate taking the global
> sched_hotcpu_mutex in sched_setaffinity() (I think without CONFIG_HOTPLUG_CPU
> it is not needed right now). In that case sched_setaffinity(0, cpumask_of_cpu(cpu))
> can in fact return on the "wrong" CPU != cpu if another thread changes our affinity
> in parallel, and this breaks synchronize_sched().
>
>
> Can't we do something different? Suppose that we changed migration_thread(),
> something like (roughly)
>
> - __migrate_task(req->task, cpu, req->dest_cpu);
> + if (req->task)
> + __migrate_task(req->task, cpu, req->dest_cpu);
> + else
> + schedule(); // unneeded, mb() is enough?
>
> complete(&req->done);
>
> Now,
> void synchronize_sched(void)
> {
> struct migration_req req;
>
> req->task = NULL;
> init_completion(&req.done);
>
> for_each_online_cpu(cpu) {
> struct rq *rq = cpu_rq(cpu);
> int online;
>
> spin_lock_irq(&rq->lock);
> online = cpu_online(cpu); // HOTPLUG_CPU
> if (online) {
> list_add(&req->list, &rq->migration_queue);
> req.done.done = 0;
> }
> spin_unlock_irq(&rq->lock);
>
> if (online) {
> wake_up_process(rq->migration_thread);
> wait_for_completion(&req.done);
> }
> }
> }
>
> Alternatively, we can use schedule_on_each_cpu(), but it has other disadvantages.
>
> Thoughts?

Some people are calling for eliminating synchronize_sched() altogether,
but there are a few uses that would be hard to get rid of.

I need to think about your approach above. It looks like you are
leveraging the migration tasks, but I am concerned about concurrent
hotplug events. But either way, I do like the idea of communicating
with other tasks that actually do the context switches on behalf
of synchronize_sched().

Thanx, Paul

2007-09-28 14:43:22

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On 09/27, Paul E. McKenney wrote:
>
> On Wed, Sep 26, 2007 at 07:13:51PM +0400, Oleg Nesterov wrote:
> >
> > Yes, yes, I see now. We really need this barriers, except I think
> > rcu_try_flip_idle() can use wmb. However, I have a bit offtopic question,
> >
> > // rcu_try_flip_waitzero()
> > if (A == 0) {
> > mb();
> > B == 0;
> > }
> >
> > Do we really need the mb() in this case? How it is possible that STORE
> > goes before LOAD? "Obviously", the LOAD should be completed first, no?
>
> Suppose that A was most recently stored by a CPU that shares a store
> buffer with this CPU. Then it is possible that some other CPU sees
> the store to B as happening before the store that "A==0" above is
> loading from. This other CPU would naturally conclude that the store
> to B must have happened before the load from A.

Ah, I was confused by the comment,

smp_mb(); /* Don't call for memory barriers before we see zero. */
^^^^^^^^^^^^^^^^^^
So, in fact, we need this barrier to make sure that _other_ CPUs see these
changes in order, thanks. Of course, _we_ already saw zero.

But in that particular case this doesn't matter, rcu_try_flip_waitzero()
is the only function which reads the "non-local" per_cpu(rcu_flipctr), so
it doesn't really need the barrier? (besides, it is always called under
fliplock).

> In more detail, suppose that CPU 0 and 1 share a store buffer, and
> that CPU 2 and 3 share a second store buffer. This happens naturally
> if CPUs 0 and 1 are really just different hardware threads within a
> single core.
>
> So, suppose the cacheline for A is initially owned by CPUs 2 and 3,
> and that the cacheline for B is initially owned by CPUs 0 and 1.
> Then consider the following sequence of events:
>
> o CPU 0 stores zero to A. This is a cache miss, so the new value
> for A is placed in CPU 0's and 1's store buffer.
>
> o CPU 1 executes the above code, first loading A. It sees
> the value of A==0 in the store buffer, and therefore
> stores zero to B, which hits in the cache. (I am assuming
> that we left out the mb() above).
>
> o CPU 2 loads from B, which misses the cache, and gets the
> value that CPU 1 stored. Suppose it checks the value,
> and based on this check, loads A. The old value of A might
> still be in cache, which would lead CPU 2 to conclude that
> the store to B by CPU 1 must have happened before the store
> to A by CPU 0.
>
> Memory barriers would prevent this confusion.

Thanks a lot!!! This fills another gap in my understanding.

OK, the last (I promise :) off-topic question. When CPU 0 and 1 share a
store buffer, the situation is simple, we can replace "CPU 0 stores" with
"CPU 1 stores". But what if CPU 0 is equally "far" from CPUs 1 and 2?

Suppose that CPU 1 does

wmb();
B = 0

Can we assume that CPU 2 doing

if (B == 0) {
rmb();

must see all invalidations from CPU 0 which were seen by CPU 1 before wmb() ?

> > > > If this is possible, can't we move the code doing "s/rcu_flipped/rcu_flip_seen/"
> > > > from __rcu_advance_callbacks() to rcu_check_mb() to unify the "acks" ?
> > >
> > > I believe that we cannot safely do this. The rcu_flipped-to-rcu_flip_seen
> > > transition has to be synchronized to the moving of the callbacks --
> > > either that or we need more GP_STAGES.
> >
> > Hmm. Still can't understand.
>
> Callbacks would be able to be injected into a grace period after it
> started.

Yes, but this is _exactly_ what the current code does in the scenario below,

> Or are you arguing that as long as interrupts remain disabled between
> the two events, no harm done?

no,

> > Suppose that we are doing call_rcu(), and __rcu_advance_callbacks() sees
> > rdp->completed == rcu_ctrlblk.completed but rcu_flip_flag = rcu_flipped
> > (say, another CPU does rcu_try_flip_idle() in between).
> >
> > We ack the flip, call_rcu() enables irqs, the timer interrupt calls
> > __rcu_advance_callbacks() again and moves the callbacks.
> >
> > So, it is still possible that "move callbacks" and "ack the flip" happen
> > out of order. But why this is bad?

Look, what happens is

	// call_rcu()
	rcu_flip_flag = rcu_flipped
	insert the new callback

	// timer irq
	move the callbacks (the new one goes to wait[0])

But I still can't understand why this is bad,

> > This can't "speedup" the moving of our callbacks from next to done lists.
> > Yes, RCU state machine can switch to the next state/stage, but this looks
> > safe to me.

Before this callback is flushed, we need two further
rdp->completed != rcu_ctrlblk.completed events, so we can't miss an
rcu_read_lock() section, no?

> > Help!

Please :)

> > if (rcu_ctrlblk.completed == rdp->completed)
> > rcu_try_flip();
> >
> > Could you clarify the check above? Afaics this is just optimization,
> > technically it is correct to rcu_try_flip() at any time, even if ->completed
> > are not synchronized. Most probably in that case rcu_try_flip_waitack() will
> > fail, but nothing bad can happen, yes?
>
> From a conceptual viewpoint, if this CPU hasn't caught up with the
> last grace-period stage, it has no business trying to push forward to
> the next stage. So this might (or might not) happen to work with this
> particular implementation, it needs to stay as is. We need this code
> to be robust enough to optimize the grace-period latencies, right?

Yes, yes. I just wanted to be sure I didn't miss some other subtle reason.

> > void synchronize_sched(void)
> > {
> > struct migration_req req;
> >
> > req->task = NULL;
> > init_completion(&req.done);
> >
> > for_each_online_cpu(cpu) {
> > struct rq *rq = cpu_rq(cpu);
> > int online;
> >
> > spin_lock_irq(&rq->lock);
> > online = cpu_online(cpu); // HOTPLUG_CPU
> > if (online) {
> > list_add(&req->list, &rq->migration_queue);
> > req.done.done = 0;
> > }
> > spin_unlock_irq(&rq->lock);
> >
> > if (online) {
> > wake_up_process(rq->migration_thread);
> > wait_for_completion(&req.done);
> > }
> > }
> > }
> >
>
> I need to think about your approach above. It looks like you are
> leveraging the migration tasks, but I am concerned about concurrent
> hotplug events.

I hope this is OK: note that migration_call(CPU_DEAD) flushes ->migration_queue,
so if we take rq->lock after that we must see !cpu_online(cpu). The CPU_UP event
is not interesting for us; we can miss it.

Hmm... but wake_up_process() should be moved under spin_lock().
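
For concreteness, the loop body in the sketch would then become something
like this (still only a rough sketch):

	spin_lock_irq(&rq->lock);
	online = cpu_online(cpu);		// HOTPLUG_CPU
	if (online) {
		list_add(&req.list, &rq->migration_queue);
		req.done.done = 0;
		wake_up_process(rq->migration_thread);	// now under rq->lock,
							// so CPU_DEAD can't flush
							// the queue in between
	}
	spin_unlock_irq(&rq->lock);

	if (online)
		wait_for_completion(&req.done);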

Oleg.

2007-09-28 18:58:18

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Fri, Sep 28, 2007 at 06:47:14PM +0400, Oleg Nesterov wrote:
> On 09/27, Paul E. McKenney wrote:
> >
> > On Wed, Sep 26, 2007 at 07:13:51PM +0400, Oleg Nesterov wrote:
> > >
> > > Yes, yes, I see now. We really need this barriers, except I think
> > > rcu_try_flip_idle() can use wmb. However, I have a bit offtopic question,
> > >
> > > // rcu_try_flip_waitzero()
> > > if (A == 0) {
> > > mb();
> > > B == 0;
> > > }
> > >
> > > Do we really need the mb() in this case? How it is possible that STORE
> > > goes before LOAD? "Obviously", the LOAD should be completed first, no?
> >
> > Suppose that A was most recently stored by a CPU that shares a store
> > buffer with this CPU. Then it is possible that some other CPU sees
> > the store to B as happening before the store that "A==0" above is
> > loading from. This other CPU would naturally conclude that the store
> > to B must have happened before the load from A.
>
> Ah, I was confused by the comment,
>
> smp_mb(); /* Don't call for memory barriers before we see zero. */
> ^^^^^^^^^^^^^^^^^^
> So, in fact, we need this barrier to make sure that _other_ CPUs see these
> changes in order, thanks. Of course, _we_ already saw zero.

Fair point!

Perhaps: "Ensure that all CPUs see their rcu_mb_flag -after- the
rcu_flipctrs sum to zero" or some such?

> But in that particular case this doesn't matter, rcu_try_flip_waitzero()
> is the only function which reads the "non-local" per_cpu(rcu_flipctr), so
> it doesn't really need the barrier? (besides, it is always called under
> fliplock).

The final rcu_read_unlock() that zeroed the sum was -not- under fliplock,
so we cannot necessarily rely on locking to trivialize all of this.

> > In more detail, suppose that CPU 0 and 1 share a store buffer, and
> > that CPU 2 and 3 share a second store buffer. This happens naturally
> > if CPUs 0 and 1 are really just different hardware threads within a
> > single core.
> >
> > So, suppose the cacheline for A is initially owned by CPUs 2 and 3,
> > and that the cacheline for B is initially owned by CPUs 0 and 1.
> > Then consider the following sequence of events:
> >
> > o CPU 0 stores zero to A. This is a cache miss, so the new value
> > for A is placed in CPU 0's and 1's store buffer.
> >
> > o CPU 1 executes the above code, first loading A. It sees
> > the value of A==0 in the store buffer, and therefore
> > stores zero to B, which hits in the cache. (I am assuming
> > that we left out the mb() above).
> >
> > o CPU 2 loads from B, which misses the cache, and gets the
> > value that CPU 1 stored. Suppose it checks the value,
> > and based on this check, loads A. The old value of A might
> > still be in cache, which would lead CPU 2 to conclude that
> > the store to B by CPU 1 must have happened before the store
> > to A by CPU 0.
> >
> > Memory barriers would prevent this confusion.
>
> Thanks a lot!!! This fills another gap in my understanding.

Glad it helped! ;-)

> OK, the last (I promise :) off-topic question. When CPU 0 and 1 share a
> store buffer, the situation is simple, we can replace "CPU 0 stores" with
> "CPU 1 stores". But what if CPU 0 is equally "far" from CPUs 1 and 2?
>
> Suppose that CPU 1 does
>
> wmb();
> B = 0
>
> Can we assume that CPU 2 doing
>
> if (B == 0) {
> rmb();
>
> must see all invalidations from CPU 0 which were seen by CPU 1 before wmb() ?

Yes. CPU 2 saw something following CPU 1's wmb(), so any of CPU 2's
reads following its rmb() must therefore see all of CPU 1's stores
preceding the wmb().

> > > > > If this is possible, can't we move the code doing "s/rcu_flipped/rcu_flip_seen/"
> > > > > from __rcu_advance_callbacks() to rcu_check_mb() to unify the "acks" ?
> > > >
> > > > I believe that we cannot safely do this. The rcu_flipped-to-rcu_flip_seen
> > > > transition has to be synchronized to the moving of the callbacks --
> > > > either that or we need more GP_STAGES.
> > >
> > > Hmm. Still can't understand.
> >
> > Callbacks would be able to be injected into a grace period after it
> > started.
>
> Yes, but this is _exactly_ what the current code does in the scenario below,
>
> > Or are you arguing that as long as interrupts remain disabled between
> > the two events, no harm done?
>
> no,
>
> > > Suppose that we are doing call_rcu(), and __rcu_advance_callbacks() sees
> > > rdp->completed == rcu_ctrlblk.completed but rcu_flip_flag = rcu_flipped
> > > (say, another CPU does rcu_try_flip_idle() in between).
> > >
> > > We ack the flip, call_rcu() enables irqs, the timer interrupt calls
> > > __rcu_advance_callbacks() again and moves the callbacks.
> > >
> > > So, it is still possible that "move callbacks" and "ack the flip" happen
> > > out of order. But why this is bad?
>
> Look, what happens is
>
> // call_rcu()
> rcu_flip_flag = rcu_flipped
> insert the new callback
> // timer irq
> move the callbacks (the new one goes to wait[0])
>
> But I still can't understand why this is bad,
>
> > > This can't "speedup" the moving of our callbacks from next to done lists.
> > > Yes, RCU state machine can switch to the next state/stage, but this looks
> > > safe to me.
>
> Before this callback will be flushed, we need 2 rdp->completed != rcu_ctrlblk.completed
> further events, we can't miss rcu_read_lock() section, no?
>
> > > Help!
>
> Please :)

Quite possibly my paranoia -- need to think about this some more.

Guess I need to expand the code-level portion of the document to cover
the grace-period side. That would force me either to get my explanation
into shape or admit that you are correct, as the case might be. ;-)

> > > if (rcu_ctrlblk.completed == rdp->completed)
> > > rcu_try_flip();
> > >
> > > Could you clarify the check above? Afaics this is just optimization,
> > > technically it is correct to rcu_try_flip() at any time, even if ->completed
> > > are not synchronized. Most probably in that case rcu_try_flip_waitack() will
> > > fail, but nothing bad can happen, yes?
> >
> > From a conceptual viewpoint, if this CPU hasn't caught up with the
> > last grace-period stage, it has no business trying to push forward to
> > the next stage. So this might (or might not) happen to work with this
> > particular implementation, it needs to stay as is. We need this code
> > to be robust enough to optimize the grace-period latencies, right?
>
> Yes, yes. I just wanted to be sure I didn't miss some other subtle reason.
>
> > > void synchronize_sched(void)
> > > {
> > > struct migration_req req;
> > >
> > > req->task = NULL;
> > > init_completion(&req.done);
> > >
> > > for_each_online_cpu(cpu) {
> > > struct rq *rq = cpu_rq(cpu);
> > > int online;
> > >
> > > spin_lock_irq(&rq->lock);
> > > online = cpu_online(cpu); // HOTPLUG_CPU
> > > if (online) {
> > > list_add(&req->list, &rq->migration_queue);
> > > req.done.done = 0;
> > > }
> > > spin_unlock_irq(&rq->lock);
> > >
> > > if (online) {
> > > wake_up_process(rq->migration_thread);
> > > wait_for_completion(&req.done);
> > > }
> > > }
> > > }
> > >
> >
> > I need to think about your approach above. It looks like you are
> > leveraging the migration tasks, but I am concerned about concurrent
> > hotplug events.
>
> I hope this is OK, note that migration_call(CPU_DEAD) flushes ->migration_queue,
> if we take rq->lock after that we must see !cpu_online(cpu). CPU_UP event is not
> interesting for us, we can miss it.
>
> Hmm... but wake_up_process() should be moved under spin_lock().

The other approach would be to simply have a separate thread for this
purpose. Batching would amortize the overhead (a single trip around the
CPUs could satisfy an arbitrarily large number of synchronize_sched()
requests).
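
Very roughly, such a thread might look like the following (names such as
sync_sched_task and sync_sched_waiters are invented for illustration,
hotplug handling is omitted, and the trip around the CPUs is just the
affinity-based approach again, with all the caveats discussed above):

	struct sync_sched_req {
		struct list_head list;
		struct completion done;
	};

	static DEFINE_SPINLOCK(sync_sched_lock);
	static LIST_HEAD(sync_sched_waiters);
	static struct task_struct *sync_sched_task;

	void synchronize_sched(void)
	{
		struct sync_sched_req req;

		init_completion(&req.done);
		spin_lock(&sync_sched_lock);
		list_add_tail(&req.list, &sync_sched_waiters);
		spin_unlock(&sync_sched_lock);
		wake_up_process(sync_sched_task);
		wait_for_completion(&req.done);
	}

	static int sync_sched_thread(void *unused)
	{
		struct sync_sched_req *req, *next;
		LIST_HEAD(batch);
		int cpu;

		for (;;) {
			set_current_state(TASK_INTERRUPTIBLE);
			if (list_empty(&sync_sched_waiters))
				schedule();
			__set_current_state(TASK_RUNNING);

			/* Grab everything queued so far -- one trip around
			   the CPUs satisfies the whole batch. */
			spin_lock(&sync_sched_lock);
			list_splice_init(&sync_sched_waiters, &batch);
			spin_unlock(&sync_sched_lock);

			/* Force a context switch on each online CPU. */
			for_each_online_cpu(cpu) {
				set_cpus_allowed(current, cpumask_of_cpu(cpu));
				schedule();
			}
			set_cpus_allowed(current, CPU_MASK_ALL);

			list_for_each_entry_safe(req, next, &batch, list) {
				list_del(&req->list);
				complete(&req->done);
			}
		}
		return 0;
	}

A kthread_run(sync_sched_thread, NULL, "sync_sched") at init time would be
enough to start it.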

Thanx, Paul

Subject: Re: [PATCH RFC 6/9] RCU priority boosting for preemptible RCU

Hi Paul,

Some silly doubts.

On Mon, Sep 10, 2007 at 11:39:01AM -0700, Paul E. McKenney wrote:
> Work in progress, not for inclusion.
>
> RCU priority boosting is needed when running a workload that might include
> CPU-bound user tasks running at realtime priorities with preemptible RCU.
> In this situation, RCU priority boosting is needed to avoid OOM.
>
> Please note that because Classic RCU does not permit RCU read-side
> critical sections to be preempted, there is no need to boost the priority
> of Classic RCU readers. Boosting the priority of a running process
> does not make it run any faster, at least not on any hardware that I am
> aware of. ;-)
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
>
> include/linux/init_task.h | 13
> include/linux/rcupdate.h | 17 +
> include/linux/rcupreempt.h | 20 +
> include/linux/sched.h | 24 +
> init/main.c | 1
> kernel/Kconfig.preempt | 14 -
> kernel/fork.c | 6
> kernel/rcupreempt.c | 608 ++++++++++++++++++++++++++++++++++++++++++---
> kernel/rtmutex.c | 7
> kernel/sched.c | 5
> lib/Kconfig.debug | 34 ++
> 11 files changed, 703 insertions(+), 46 deletions(-)
>
> diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/include/linux/init_task.h linux-2.6.22-F-boostrcu/include/linux/init_task.h
> --- linux-2.6.22-E-hotplug/include/linux/init_task.h 2007-07-08 16:32:17.000000000 -0700
> +++ linux-2.6.22-F-boostrcu/include/linux/init_task.h 2007-08-31 14:09:02.000000000 -0700
> @@ -87,6 +87,17 @@ extern struct nsproxy init_nsproxy;
> .signalfd_list = LIST_HEAD_INIT(sighand.signalfd_list), \
> }
>
> +#ifdef CONFIG_PREEMPT_RCU_BOOST
> +#define INIT_RCU_BOOST_PRIO .rcu_prio = MAX_PRIO,
> +#define INIT_PREEMPT_RCU_BOOST(tsk) \
> + .rcub_rbdp = NULL, \
> + .rcub_state = RCU_BOOST_IDLE, \
> + .rcub_entry = LIST_HEAD_INIT(tsk.rcub_entry),
> +#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
> +#define INIT_RCU_BOOST_PRIO
> +#define INIT_PREEMPT_RCU_BOOST(tsk)
> +#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST */
> +
> extern struct group_info init_groups;
>
> #define INIT_STRUCT_PID { \
> @@ -125,6 +136,7 @@ extern struct group_info init_groups;
> .prio = MAX_PRIO-20, \
> .static_prio = MAX_PRIO-20, \
> .normal_prio = MAX_PRIO-20, \
> + INIT_RCU_BOOST_PRIO \
> .policy = SCHED_NORMAL, \
> .cpus_allowed = CPU_MASK_ALL, \
> .mm = NULL, \
> @@ -169,6 +181,7 @@ extern struct group_info init_groups;
> }, \
> INIT_TRACE_IRQFLAGS \
> INIT_LOCKDEP \
> + INIT_PREEMPT_RCU_BOOST(tsk) \
> }
>
>
> diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/include/linux/rcupdate.h linux-2.6.22-F-boostrcu/include/linux/rcupdate.h
> --- linux-2.6.22-E-hotplug/include/linux/rcupdate.h 2007-08-24 11:03:22.000000000 -0700
> +++ linux-2.6.22-F-boostrcu/include/linux/rcupdate.h 2007-08-24 17:04:14.000000000 -0700
> @@ -252,5 +252,22 @@ static inline void rcu_qsctr_inc(int cpu
> per_cpu(rcu_data_passed_quiesc, cpu) = 1;
> }
>
> +#ifdef CONFIG_PREEMPT_RCU_BOOST
> +extern void init_rcu_boost_late(void);
> +extern void __rcu_preempt_boost(void);
> +#define rcu_preempt_boost() /* cpp to avoid #include hell. */ \
> + do { \
> + if (unlikely(current->rcu_read_lock_nesting > 0)) \
> + __rcu_preempt_boost(); \
> + } while (0)
> +#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
> +static inline void init_rcu_boost_late(void)
> +{
> +}
> +static inline void rcu_preempt_boost(void)
> +{
> +}
> +#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST */
> +
> #endif /* __KERNEL__ */
> #endif /* __LINUX_RCUPDATE_H */
> diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/include/linux/rcupreempt.h linux-2.6.22-F-boostrcu/include/linux/rcupreempt.h
> --- linux-2.6.22-E-hotplug/include/linux/rcupreempt.h 2007-08-24 11:20:32.000000000 -0700
> +++ linux-2.6.22-F-boostrcu/include/linux/rcupreempt.h 2007-08-24 11:24:59.000000000 -0700
> @@ -42,6 +42,26 @@
> #include <linux/cpumask.h>
> #include <linux/seqlock.h>
>
> +#ifdef CONFIG_PREEMPT_RCU_BOOST
> +/*
> + * Task state with respect to being RCU-boosted. This state is changed
> + * by the task itself in response to the following three events:
^^^
> + * 1. Preemption (or block on lock) while in RCU read-side critical section.

I am wondering, can a task block on a lock while in an RCU read-side
critical section?
> + * 2. Outermost rcu_read_unlock() for blocked RCU read-side critical section.
> + *

Event #3. is missing?


The state change from RCU_BOOST_BLOCKED to RCU_BOOSTED is not done by
the task itself, but by the rcu_boost_task. No?

> + * The RCU-boost task also updates the state when boosting priority.
> + */
> +enum rcu_boost_state {
> + RCU_BOOST_IDLE = 0, /* Not yet blocked if in RCU read-side. */
> + RCU_BOOST_BLOCKED = 1, /* Blocked from RCU read-side. */
> + RCU_BOOSTED = 2, /* Boosting complete. */
> + RCU_BOOST_INVALID = 3, /* For bogus state sightings. */
> +};
> +
> +#define N_RCU_BOOST_STATE (RCU_BOOST_INVALID + 1)
> +
> +#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
> +
> #define call_rcu_bh(head, rcu) call_rcu(head, rcu)
> #define rcu_bh_qsctr_inc(cpu) do { } while (0)
> #define __rcu_read_lock_bh() { rcu_read_lock(); local_bh_disable(); }
> diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/include/linux/sched.h linux-2.6.22-F-boostrcu/include/linux/sched.h
> --- linux-2.6.22-E-hotplug/include/linux/sched.h 2007-08-24 11:00:39.000000000 -0700
> +++ linux-2.6.22-F-boostrcu/include/linux/sched.h 2007-08-24 17:07:01.000000000 -0700
> @@ -546,6 +546,22 @@ struct signal_struct {
> #define is_rt_policy(p) ((p) != SCHED_NORMAL && (p) != SCHED_BATCH)
> #define has_rt_policy(p) unlikely(is_rt_policy((p)->policy))
>
> +#ifdef CONFIG_PREEMPT_RCU_BOOST
> +#define set_rcu_prio(p, prio) /* cpp to avoid #include hell */ \
> + do { \
> + (p)->rcu_prio = (prio); \
> + } while (0)
> +#define get_rcu_prio(p) (p)->rcu_prio /* cpp to avoid #include hell */
> +#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
> +static inline void set_rcu_prio(struct task_struct *p, int prio)
> +{
> +}
> +static inline int get_rcu_prio(struct task_struct *p)
> +{
> + return MAX_PRIO;
> +}
> +#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST */
> +
> /*
> * Some day this will be a full-fledged user tracking system..
> */
> @@ -834,6 +850,9 @@ struct task_struct {
> #endif
> int load_weight; /* for niceness load balancing purposes */
> int prio, static_prio, normal_prio;
> +#ifdef CONFIG_PREEMPT_RCU_BOOST
> + int rcu_prio;
> +#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
> struct list_head run_list;
> struct prio_array *array;
>
> @@ -858,6 +877,11 @@ struct task_struct {
> #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
> struct sched_info sched_info;
> #endif
> +#ifdef CONFIG_PREEMPT_RCU_BOOST
> + struct rcu_boost_dat *rcub_rbdp;
> + enum rcu_boost_state rcub_state;
> + struct list_head rcub_entry;
> +#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
>
> struct list_head tasks;
> /*
> diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/init/main.c linux-2.6.22-F-boostrcu/init/main.c
> --- linux-2.6.22-E-hotplug/init/main.c 2007-07-08 16:32:17.000000000 -0700
> +++ linux-2.6.22-F-boostrcu/init/main.c 2007-08-24 11:24:59.000000000 -0700
> @@ -722,6 +722,7 @@ static void __init do_basic_setup(void)
> driver_init();
> init_irq_proc();
> do_initcalls();
> + init_rcu_boost_late();
> }
>
> static void __init do_pre_smp_initcalls(void)
> diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/kernel/fork.c linux-2.6.22-F-boostrcu/kernel/fork.c
> --- linux-2.6.22-E-hotplug/kernel/fork.c 2007-08-24 11:00:39.000000000 -0700
> +++ linux-2.6.22-F-boostrcu/kernel/fork.c 2007-08-24 11:24:59.000000000 -0700
> @@ -1036,6 +1036,12 @@ static struct task_struct *copy_process(
> p->rcu_read_lock_nesting = 0;
> p->rcu_flipctr_idx = 0;
> #endif /* #ifdef CONFIG_PREEMPT_RCU */
> +#ifdef CONFIG_PREEMPT_RCU_BOOST
> + p->rcu_prio = MAX_PRIO;
> + p->rcub_rbdp = NULL;
> + p->rcub_state = RCU_BOOST_IDLE;
> + INIT_LIST_HEAD(&p->rcub_entry);
> +#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
> p->vfork_done = NULL;
> spin_lock_init(&p->alloc_lock);
>
> diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/kernel/Kconfig.preempt linux-2.6.22-F-boostrcu/kernel/Kconfig.preempt
> --- linux-2.6.22-E-hotplug/kernel/Kconfig.preempt 2007-08-24 11:00:39.000000000 -0700
> +++ linux-2.6.22-F-boostrcu/kernel/Kconfig.preempt 2007-08-24 11:24:59.000000000 -0700
> @@ -91,13 +91,13 @@ config PREEMPT_RCU
>
> endchoice
>
> -config RCU_TRACE
> - bool "Enable tracing for RCU - currently stats in debugfs"
> - select DEBUG_FS
> - default y
> +config PREEMPT_RCU_BOOST
> + bool "Enable priority boosting of RCU read-side critical sections"
> + depends on PREEMPT_RCU
> + default n
> help
> - This option provides tracing in RCU which presents stats
> - in debugfs for debugging RCU implementation.
> + This option permits priority boosting of RCU read-side critical
> + sections that have been preempted in order to prevent indefinite
> + delay of grace periods in face of runaway non-realtime processes.
>
> - Say Y here if you want to enable RCU tracing
> Say N if you are unsure.
> diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/kernel/rcupreempt.c linux-2.6.22-F-boostrcu/kernel/rcupreempt.c
> --- linux-2.6.22-E-hotplug/kernel/rcupreempt.c 2007-08-24 11:20:32.000000000 -0700
> +++ linux-2.6.22-F-boostrcu/kernel/rcupreempt.c 2007-08-31 14:06:49.000000000 -0700
> @@ -51,6 +51,15 @@
> #include <linux/byteorder/swabb.h>
> #include <linux/cpumask.h>
> #include <linux/rcupreempt_trace.h>
> +#include <linux/kthread.h>
> +
> +/*
> + * Macro that prevents the compiler from reordering accesses, but does
> + * absolutely -nothing- to prevent CPUs from reordering. This is used
> + * only to mediate communication between mainline code and hardware
> + * interrupt and NMI handlers.
> + */
> +#define ORDERED_WRT_IRQ(x) (*(volatile typeof(x) *)&(x))
>
> /*
> * PREEMPT_RCU data structures.
> @@ -82,6 +91,531 @@ static struct rcu_ctrlblk rcu_ctrlblk =
> };
> static DEFINE_PER_CPU(int [2], rcu_flipctr) = { 0, 0 };
>
> +#ifndef CONFIG_PREEMPT_RCU_BOOST
> +static inline void init_rcu_boost_early(void) { }
> +static inline void rcu_read_unlock_unboost(void) { }
> +#else /* #ifndef CONFIG_PREEMPT_RCU_BOOST */
> +
> +/* Defines possible event indices for ->rbs_stats[] (first index). */
> +
> +#define RCU_BOOST_DAT_BLOCK 0
> +#define RCU_BOOST_DAT_BOOST 1
> +#define RCU_BOOST_DAT_UNLOCK 2
> +#define N_RCU_BOOST_DAT_EVENTS 3
> +
> +/* RCU-boost per-CPU array element. */
> +
> +struct rcu_boost_dat {
> + spinlock_t rbs_lock; /* Protects state/CPU slice of structures. */
> + struct list_head rbs_toboost;
> + struct list_head rbs_boosted;
> + unsigned long rbs_blocked;
> + unsigned long rbs_boost_attempt;
> + unsigned long rbs_boost;
> + unsigned long rbs_unlock;
> + unsigned long rbs_unboosted;
> +#ifdef CONFIG_PREEMPT_RCU_BOOST_STATS
> + unsigned long rbs_stats[N_RCU_BOOST_DAT_EVENTS][N_RCU_BOOST_STATE];
> +#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST_STATS */
> +};
> +#define RCU_BOOST_ELEMENTS 4
> +
> +static int rcu_boost_idx = -1; /* invalid value for early RCU use. */
> +static DEFINE_PER_CPU(struct rcu_boost_dat, rcu_boost_dat[RCU_BOOST_ELEMENTS]);
> +static struct task_struct *rcu_boost_task;
> +
> +#ifdef CONFIG_PREEMPT_RCU_BOOST_STATS
> +
> +/*
> + * Function to increment indicated ->rbs_stats[] element.
> + */
> +static inline void rcu_boost_dat_stat(struct rcu_boost_dat *rbdp,
> + int event,
> + enum rcu_boost_state oldstate)
> +{
> + if (oldstate >= RCU_BOOST_IDLE && oldstate <= RCU_BOOSTED) {
> + rbdp->rbs_stats[event][oldstate]++;
> + } else {
> + rbdp->rbs_stats[event][RCU_BOOST_INVALID]++;
> + }
> +}
> +
> +static inline void rcu_boost_dat_stat_block(struct rcu_boost_dat *rbdp,
> + enum rcu_boost_state oldstate)
> +{
> + rcu_boost_dat_stat(rbdp, RCU_BOOST_DAT_BLOCK, oldstate);
> +}
> +
> +static inline void rcu_boost_dat_stat_boost(struct rcu_boost_dat *rbdp,
> + enum rcu_boost_state oldstate)
> +{
> + rcu_boost_dat_stat(rbdp, RCU_BOOST_DAT_BOOST, oldstate);
> +}
> +
> +static inline void rcu_boost_dat_stat_unlock(struct rcu_boost_dat *rbdp,
> + enum rcu_boost_state oldstate)
> +{
> + rcu_boost_dat_stat(rbdp, RCU_BOOST_DAT_UNLOCK, oldstate);
> +}
> +
> +/*
> + * Prefix for kprint() strings for periodic statistics messages.
> + */
> +static char *rcu_boost_state_event[] = {
> + "block: ",
> + "boost: ",
> + "unlock: ",
> +};
> +
> +/*
> + * Indicators for numbers in kprint() strings. "!" indicates a state-event
> + * pair that should not happen, while "?" indicates a state that should
> + * not happen.
> + */
> +static char *rcu_boost_state_error[] = {
> + /*ibBe*/
> + " ?", /* block */
> + "! ?", /* boost */
> + "? ?", /* unlock */
> +};
> +
> +/*
> + * Print out RCU booster task statistics at the specified interval.
> + */
> +static void rcu_boost_dat_stat_print(void)
> +{
> + /* Three decimal digits per byte plus spacing per number and line. */
> + char buf[N_RCU_BOOST_STATE * (sizeof(long) * 3 + 2) + 2];
> + int cpu;
> + int event;
> + int i;
> + static time_t lastprint; /* static implies 0 initial value. */
> + struct rcu_boost_dat *rbdp;
> + int state;
> + struct rcu_boost_dat sum;
> +
> + /* Wait a graceful interval between printk spamming. */
> + /* Note: time_after() dislikes time_t. */
> +
> + if (xtime.tv_sec - lastprint < CONFIG_PREEMPT_RCU_BOOST_STATS_INTERVAL)
> + return;
> +
> + /* Sum up the state/event-independent counters. */
> +
> + sum.rbs_blocked = 0;
> + sum.rbs_boost_attempt = 0;
> + sum.rbs_boost = 0;
> + sum.rbs_unlock = 0;
> + sum.rbs_unboosted = 0;
> + for_each_possible_cpu(cpu)
> + for (i = 0; i < RCU_BOOST_ELEMENTS; i++) {
> + rbdp = per_cpu(rcu_boost_dat, cpu);
> + sum.rbs_blocked += rbdp[i].rbs_blocked;
> + sum.rbs_boost_attempt += rbdp[i].rbs_boost_attempt;
> + sum.rbs_boost += rbdp[i].rbs_boost;
> + sum.rbs_unlock += rbdp[i].rbs_unlock;
> + sum.rbs_unboosted += rbdp[i].rbs_unboosted;
> + }
> +
> + /* Sum up the state/event-dependent counters. */
> +
> + for (event = 0; event < N_RCU_BOOST_DAT_EVENTS; event++)
> + for (state = 0; state < N_RCU_BOOST_STATE; state++) {
> + sum.rbs_stats[event][state] = 0;
> + for_each_possible_cpu(cpu) {
> + for (i = 0; i < RCU_BOOST_ELEMENTS; i++)
> + sum.rbs_stats[event][state]
> + += per_cpu(rcu_boost_dat,
> + cpu)[i].rbs_stats[event][state];
> + }
> + }
> +
> + /* Print them out! */
> +
> + printk(KERN_INFO
> + "rcu_boost_dat: idx=%d "
> + "b=%lu ul=%lu ub=%lu boost: a=%lu b=%lu\n",
> + rcu_boost_idx,
> + sum.rbs_blocked, sum.rbs_unlock, sum.rbs_unboosted,
> + sum.rbs_boost_attempt, sum.rbs_boost);
> + for (event = 0; event < N_RCU_BOOST_DAT_EVENTS; event++) {
> + i = 0;
> + for (state = 0; state < N_RCU_BOOST_STATE; state++)
> + i += sprintf(&buf[i], " %ld%c",
> + sum.rbs_stats[event][state],
> + rcu_boost_state_error[event][state]);
> + printk(KERN_INFO "rcu_boost_dat %s %s\n",
> + rcu_boost_state_event[event], buf);
> + }
> +
> + /* Go away and don't come back for awhile. */
> +
> + lastprint = xtime.tv_sec;
> +}
> +
> +#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST_STATS */
> +
> +static inline void rcu_boost_dat_stat_block(struct rcu_boost_dat *rbdp,
> + enum rcu_boost_state oldstate)
> +{
> +}
> +static inline void rcu_boost_dat_stat_boost(struct rcu_boost_dat *rbdp,
> + enum rcu_boost_state oldstate)
> +{
> +}
> +static inline void rcu_boost_dat_stat_unlock(struct rcu_boost_dat *rbdp,
> + enum rcu_boost_state oldstate)
> +{
> +}
> +static void rcu_boost_dat_stat_print(void)
> +{
> +}
> +
> +#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST_STATS */
> +
> +/*
> + * Initialize RCU-boost state. This happens early in the boot process,
> + * when the scheduler does not yet exist. So don't try to use it.
> + */
> +static void init_rcu_boost_early(void)
> +{
> + struct rcu_boost_dat *rbdp;
> + int cpu;
> + int i;
> +
> + for_each_possible_cpu(cpu) {
> + rbdp = per_cpu(rcu_boost_dat, cpu);
> + for (i = 0; i < RCU_BOOST_ELEMENTS; i++) {
> + spin_lock_init(&rbdp[i].rbs_lock);
> + INIT_LIST_HEAD(&rbdp[i].rbs_toboost);
> + INIT_LIST_HEAD(&rbdp[i].rbs_boosted);
> + rbdp[i].rbs_blocked = 0;
> + rbdp[i].rbs_boost_attempt = 0;
> + rbdp[i].rbs_boost = 0;
> + rbdp[i].rbs_unlock = 0;
> + rbdp[i].rbs_unboosted = 0;
> +#ifdef CONFIG_PREEMPT_RCU_BOOST_STATS
> + {
> + int j, k;
> +
> + for (j = 0; j < N_RCU_BOOST_DAT_EVENTS; j++)
> + for (k = 0; k < N_RCU_BOOST_STATE; k++)
> + rbdp[i].rbs_stats[j][k] = 0;
> + }
> +#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST_STATS */
> + }
> + smp_wmb(); /* Make sure readers see above initialization. */
> + rcu_boost_idx = 0; /* Allow readers to access data. */
> + }
> +}
> +
> +/*
> + * Return the list on which the calling task should add itself, or
> + * NULL if too early during initialization.
> + */
> +static inline struct rcu_boost_dat *rcu_rbd_new(void)
> +{
> + int cpu = smp_processor_id(); /* locks used, so preemption OK. */
> + int idx = ORDERED_WRT_IRQ(rcu_boost_idx);
> +
> + if (unlikely(idx < 0))
> + return NULL;
> + return &per_cpu(rcu_boost_dat, cpu)[idx];
> +}
> +
> +/*
> + * Return the list from which to boost target tasks.
> + * May only be invoked by the booster task, so guaranteed to
> + * already be initialized. Use rcu_boost_dat element least recently
> + * the destination for task blocking in RCU read-side critical sections.
> + */
> +static inline struct rcu_boost_dat *rcu_rbd_boosting(int cpu)
> +{
> + int idx = (rcu_boost_idx + 1) & (RCU_BOOST_ELEMENTS - 1);
> +
> + return &per_cpu(rcu_boost_dat, cpu)[idx];
> +}
> +
> +#define PREEMPT_RCU_BOOSTER_PRIO 49 /* Match curr_irq_prio manually. */
> + /* Administrators can always adjust */
> + /* via the /proc interface. */
> +
> +/*
> + * Boost the specified task from an RCU viewpoint.
> + * Boost the target task to a priority just a bit less-favored than
> + * that of the RCU-boost task, but boost to a realtime priority even
> + * if the RCU-boost task is running at a non-realtime priority.
> + * We check the priority of the RCU-boost task each time we boost
> + * in case the sysadm manually changes the priority.
> + */
> +static void rcu_boost_prio(struct task_struct *taskp)
> +{
> + unsigned long flags;
> + int rcuprio;
> +
> + spin_lock_irqsave(&current->pi_lock, flags);
> + rcuprio = rt_mutex_getprio(current) + 1;
> + if (rcuprio >= MAX_USER_RT_PRIO)
> + rcuprio = MAX_USER_RT_PRIO - 1;
> + spin_unlock_irqrestore(&current->pi_lock, flags);
> + spin_lock_irqsave(&taskp->pi_lock, flags);
> + if (taskp->rcu_prio != rcuprio) {
> + taskp->rcu_prio = rcuprio;
> + if (taskp->rcu_prio < taskp->prio)
> + rt_mutex_setprio(taskp, taskp->rcu_prio);
> + }
> + spin_unlock_irqrestore(&taskp->pi_lock, flags);
> +}
> +
> +/*
> + * Unboost the specified task from an RCU viewpoint.
> + */
> +static void rcu_unboost_prio(struct task_struct *taskp)
> +{
> + int nprio;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&taskp->pi_lock, flags);
> + taskp->rcu_prio = MAX_PRIO;
> + nprio = rt_mutex_getprio(taskp);
> + if (nprio > taskp->prio)
> + rt_mutex_setprio(taskp, nprio);
> + spin_unlock_irqrestore(&taskp->pi_lock, flags);
> +}
> +
> +/*
> + * Boost all of the RCU-reader tasks on the specified list.
> + */
> +static void rcu_boost_one_reader_list(struct rcu_boost_dat *rbdp)
> +{
> + LIST_HEAD(list);
> + unsigned long flags;
> + struct task_struct *taskp;
> +
> + /*
> + * Splice both lists onto a local list. We will still
> + * need to hold the lock when manipulating the local list
> + * because tasks can remove themselves at any time.
> + * The reason for splicing the rbs_boosted list is that
> + * our priority may have changed, so reboosting may be
> + * required.
> + */
> +
> + spin_lock_irqsave(&rbdp->rbs_lock, flags);
> + list_splice_init(&rbdp->rbs_toboost, &list);
> + list_splice_init(&rbdp->rbs_boosted, &list);
> + while (!list_empty(&list)) {
> +
> + /*
> + * Pause for a bit before boosting each task.
> + * @@@FIXME: reduce/eliminate pausing in case of OOM.
> + */
> +
> + spin_unlock_irqrestore(&rbdp->rbs_lock, flags);
> + schedule_timeout_uninterruptible(1);
> + spin_lock_irqsave(&rbdp->rbs_lock, flags);
> +
> + /*
> + * All tasks might have removed themselves while
> + * we were waiting. Recheck list emptiness.
> + */
> +
> + if (list_empty(&list))
> + break;
> +
> + /* Remove first task in local list, count the attempt. */
> +
> + taskp = list_entry(list.next, typeof(*taskp), rcub_entry);
> + list_del_init(&taskp->rcub_entry);
> + rbdp->rbs_boost_attempt++;
> +
> + /* Ignore tasks in unexpected states. */
> +
> + if (taskp->rcub_state == RCU_BOOST_IDLE) {
> + list_add_tail(&taskp->rcub_entry, &rbdp->rbs_toboost);
> + rcu_boost_dat_stat_boost(rbdp, taskp->rcub_state);
> + continue;
> + }
> +
> + /* Boost the task's priority. */
> +
> + rcu_boost_prio(taskp);
> + rbdp->rbs_boost++;
> + rcu_boost_dat_stat_boost(rbdp, taskp->rcub_state);
> + taskp->rcub_state = RCU_BOOSTED;
> + list_add_tail(&taskp->rcub_entry, &rbdp->rbs_boosted);
> + }
> + spin_unlock_irqrestore(&rbdp->rbs_lock, flags);
> +}
> +
> +/*
> + * Priority-boost tasks stuck in RCU read-side critical sections as
> + * needed (presumably rarely).
> + */
> +static int rcu_booster(void *arg)
> +{
> + int cpu;
> + struct sched_param sp = { .sched_priority = PREEMPT_RCU_BOOSTER_PRIO, };
> +
> + sched_setscheduler(current, SCHED_RR, &sp);
> + current->flags |= PF_NOFREEZE;
> +
> + do {
> +
> + /* Advance the lists of tasks. */
> +
> + rcu_boost_idx = (rcu_boost_idx + 1) % RCU_BOOST_ELEMENTS;
> + for_each_possible_cpu(cpu) {
> +
> + /*
> + * Boost all sufficiently aged readers.
> + * Readers must first be preempted or block
> + * on a mutex in an RCU read-side critical section,
> + * then remain in that critical section for
> + * RCU_BOOST_ELEMENTS-1 time intervals.
> + * So most of the time we should end up doing
> + * nothing.
> + */
> +
> + rcu_boost_one_reader_list(rcu_rbd_boosting(cpu));
> +
> + /*
> + * Large SMP systems may need to sleep sometimes
> + * in this loop. Or have multiple RCU-boost tasks.
> + */
> + }
> +
> + /*
> + * Sleep to allow any unstalled RCU read-side critical
> + * sections to age out of the list. @@@ FIXME: reduce,
> + * adjust, or eliminate in case of OOM.
> + */
> +
> + schedule_timeout_uninterruptible(HZ);
> +
> + /* Print stats if enough time has passed. */
> +
> + rcu_boost_dat_stat_print();
> +
> + } while (!kthread_should_stop());
> +
> + return 0;
> +}
> +
> +/*
> + * Perform the portions of RCU-boost initialization that require the
> + * scheduler to be up and running.
> + */
> +void init_rcu_boost_late(void)
> +{
> +
> + /* Spawn RCU-boost task. */
> +
> + printk(KERN_INFO "Starting RCU priority booster\n");
> + rcu_boost_task = kthread_run(rcu_booster, NULL, "RCU Prio Booster");
> + if (IS_ERR(rcu_boost_task))
> + panic("Unable to create RCU Priority Booster, errno %ld\n",
> + -PTR_ERR(rcu_boost_task));
> +}
> +
> +/*
> + * Update task's RCU-boost state to reflect blocking in RCU read-side
> + * critical section, so that the RCU-boost task can find it in case it
> + * later needs its priority boosted.
> + */
> +void __rcu_preempt_boost(void)
> +{
> + struct rcu_boost_dat *rbdp;
> + unsigned long flags;
> +
> + /* Identify list to place task on for possible later boosting. */
> +
> + local_irq_save(flags);
> + rbdp = rcu_rbd_new();
> + if (rbdp == NULL) {
> + local_irq_restore(flags);
> + printk(KERN_INFO
> + "Preempted RCU read-side critical section too early.\n");
> + return;
> + }
> + spin_lock(&rbdp->rbs_lock);
> + rbdp->rbs_blocked++;
> +
> + /*
> + * Update state. We hold the lock and aren't yet on the list,
> + * so the booster cannot mess with us yet.
> + */
> +
> + rcu_boost_dat_stat_block(rbdp, current->rcub_state);
> + if (current->rcub_state != RCU_BOOST_IDLE) {
> +
> + /*
> + * We have been here before, so just update stats.
> + * It may seem strange to do all this work just to
> + * accumulate statistics, but this is such a
> + * low-probability code path that we shouldn't care.
> + * If it becomes a problem, it can be fixed.
> + */
> +
> + spin_unlock_irqrestore(&rbdp->rbs_lock, flags);
> + return;
> + }
> + current->rcub_state = RCU_BOOST_BLOCKED;
> +
> + /* Now add ourselves to the list so that the booster can find us. */
> +
> + list_add_tail(&current->rcub_entry, &rbdp->rbs_toboost);
> + current->rcub_rbdp = rbdp;
> + spin_unlock_irqrestore(&rbdp->rbs_lock, flags);
> +}
> +
> +/*
> + * Do the list-removal and priority-unboosting "heavy lifting" when
> + * required.
> + */
> +static void __rcu_read_unlock_unboost(void)
> +{
> + unsigned long flags;
> + struct rcu_boost_dat *rbdp;
> +
> + /* Identify the list structure and acquire the corresponding lock. */
> +
> + rbdp = current->rcub_rbdp;
> + spin_lock_irqsave(&rbdp->rbs_lock, flags);
> +
> + /* Remove task from the list it was on. */
> +
> + list_del_init(&current->rcub_entry);
> + rbdp->rbs_unlock++;
> + current->rcub_rbdp = NULL;
> +
> + /* Record stats, unboost if needed, and update state. */
> +
> + rcu_boost_dat_stat_unlock(rbdp, current->rcub_state);
> + if (current->rcub_state == RCU_BOOSTED) {
> + rcu_unboost_prio(current);
> + rbdp->rbs_unboosted++;
> + }
> + current->rcub_state = RCU_BOOST_IDLE;
> + spin_unlock_irqrestore(&rbdp->rbs_lock, flags);
> +}
> +
> +/*
> + * Do any state changes and unboosting needed for rcu_read_unlock().
> + * Pass any complex work on to __rcu_read_unlock_unboost().
> + * The vast majority of the time, no work will be needed, as preemption
> + * and blocking within RCU read-side critical sections is comparatively
> + * rare.
> + */
> +static inline void rcu_read_unlock_unboost(void)
> +{
> +
> + if (unlikely(current->rcub_state != RCU_BOOST_IDLE))
> + __rcu_read_unlock_unboost();
> +}
> +
> +#endif /* #else #ifndef CONFIG_PREEMPT_RCU_BOOST */
> +
> /*
> * States for rcu_try_flip() and friends.
> */
> @@ -128,14 +662,6 @@ static DEFINE_PER_CPU(enum rcu_mb_flag_v
> static cpumask_t rcu_cpu_online_map = CPU_MASK_NONE;
>
> /*
> - * Macro that prevents the compiler from reordering accesses, but does
> - * absolutely -nothing- to prevent CPUs from reordering. This is used
> - * only to mediate communication between mainline code and hardware
> - * interrupt and NMI handlers.
> - */
> -#define ORDERED_WRT_IRQ(x) (*(volatile typeof(x) *)&(x))
> -
> -/*
> * RCU_DATA_ME: find the current CPU's rcu_data structure.
> * RCU_DATA_CPU: find the specified CPU's rcu_data structure.
> */
> @@ -194,7 +720,7 @@ void __rcu_read_lock(void)
> me->rcu_read_lock_nesting = nesting + 1;
>
> } else {
> - unsigned long oldirq;
> + unsigned long flags;
>
> /*
> * Disable local interrupts to prevent the grace-period
> @@ -203,7 +729,7 @@ void __rcu_read_lock(void)
> * contain rcu_read_lock().
> */
>
> - local_irq_save(oldirq);
> + local_irq_save(flags);
>
> /*
> * Outermost nesting of rcu_read_lock(), so increment
> @@ -233,7 +759,7 @@ void __rcu_read_lock(void)
> */
>
> ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx;
> - local_irq_restore(oldirq);
> + local_irq_restore(flags);
> }
> }
> EXPORT_SYMBOL_GPL(__rcu_read_lock);
> @@ -255,7 +781,7 @@ void __rcu_read_unlock(void)
> me->rcu_read_lock_nesting = nesting - 1;
>
> } else {
> - unsigned long oldirq;
> + unsigned long flags;
>
> /*
> * Disable local interrupts to prevent the grace-period
> @@ -264,7 +790,7 @@ void __rcu_read_unlock(void)
> * contain rcu_read_lock() and rcu_read_unlock().
> */
>
> - local_irq_save(oldirq);
> + local_irq_save(flags);
>
> /*
> * Outermost nesting of rcu_read_unlock(), so we must
> @@ -305,7 +831,10 @@ void __rcu_read_unlock(void)
> */
>
> ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])--;
> - local_irq_restore(oldirq);
> +
> + rcu_read_unlock_unboost();
> +
> + local_irq_restore(flags);
> }
> }
> EXPORT_SYMBOL_GPL(__rcu_read_unlock);
> @@ -504,10 +1033,10 @@ rcu_try_flip_waitmb(void)
> */
> static void rcu_try_flip(void)
> {
> - unsigned long oldirq;
> + unsigned long flags;
>
> RCU_TRACE_ME(rcupreempt_trace_try_flip_1);
> - if (unlikely(!spin_trylock_irqsave(&rcu_ctrlblk.fliplock, oldirq))) {
> + if (unlikely(!spin_trylock_irqsave(&rcu_ctrlblk.fliplock, flags))) {
> RCU_TRACE_ME(rcupreempt_trace_try_flip_e1);
> return;
> }
> @@ -534,7 +1063,7 @@ static void rcu_try_flip(void)
> if (rcu_try_flip_waitmb())
> rcu_try_flip_state = rcu_try_flip_idle_state;
> }
> - spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
> + spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, flags);
> }
>
> /*
> @@ -553,16 +1082,16 @@ static void rcu_check_mb(int cpu)
>
> void rcu_check_callbacks_rt(int cpu, int user)
> {
> - unsigned long oldirq;
> + unsigned long flags;
> struct rcu_data *rdp = RCU_DATA_CPU(cpu);
>
> rcu_check_mb(cpu);
> if (rcu_ctrlblk.completed == rdp->completed)
> rcu_try_flip();
> - spin_lock_irqsave(&rdp->lock, oldirq);
> + spin_lock_irqsave(&rdp->lock, flags);
> RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp);
> __rcu_advance_callbacks(rdp);
> - spin_unlock_irqrestore(&rdp->lock, oldirq);
> + spin_unlock_irqrestore(&rdp->lock, flags);
> }
>
> /*
> @@ -571,18 +1100,19 @@ void rcu_check_callbacks_rt(int cpu, int
> */
> void rcu_advance_callbacks_rt(int cpu, int user)
> {
> - unsigned long oldirq;
> + unsigned long flags;
> struct rcu_data *rdp = RCU_DATA_CPU(cpu);
>
> if (rcu_ctrlblk.completed == rdp->completed) {
> rcu_try_flip();
> if (rcu_ctrlblk.completed == rdp->completed)
> return;
> + rcu_read_unlock_unboost();
> }
> - spin_lock_irqsave(&rdp->lock, oldirq);
> + spin_lock_irqsave(&rdp->lock, flags);
> RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp);
> __rcu_advance_callbacks(rdp);
> - spin_unlock_irqrestore(&rdp->lock, oldirq);
> + spin_unlock_irqrestore(&rdp->lock, flags);
> }
>
> #ifdef CONFIG_HOTPLUG_CPU
> @@ -601,24 +1131,24 @@ void rcu_offline_cpu_rt(int cpu)
> {
> int i;
> struct rcu_head *list = NULL;
> - unsigned long oldirq;
> + unsigned long flags;
> struct rcu_data *rdp = RCU_DATA_CPU(cpu);
> struct rcu_head **tail = &list;
>
> /* Remove all callbacks from the newly dead CPU, retaining order. */
>
> - spin_lock_irqsave(&rdp->lock, oldirq);
> + spin_lock_irqsave(&rdp->lock, flags);
> rcu_offline_cpu_rt_enqueue(rdp->donelist, rdp->donetail, list, tail);
> for (i = GP_STAGES - 1; i >= 0; i--)
> rcu_offline_cpu_rt_enqueue(rdp->waitlist[i], rdp->waittail[i],
> list, tail);
> rcu_offline_cpu_rt_enqueue(rdp->nextlist, rdp->nexttail, list, tail);
> - spin_unlock_irqrestore(&rdp->lock, oldirq);
> + spin_unlock_irqrestore(&rdp->lock, flags);
> rdp->waitlistcount = 0;
>
> /* Disengage the newly dead CPU from grace-period computation. */
>
> - spin_lock_irqsave(&rcu_ctrlblk.fliplock, oldirq);
> + spin_lock_irqsave(&rcu_ctrlblk.fliplock, flags);
> rcu_check_mb(cpu);
> if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped) {
> smp_mb(); /* Subsequent counter accesses must see new value */
> @@ -627,7 +1157,7 @@ void rcu_offline_cpu_rt(int cpu)
> /* seen -after- acknowledgement. */
> }
> cpu_clear(cpu, rcu_cpu_online_map);
> - spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
> + spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, flags);
>
> /*
> * Place the removed callbacks on the current CPU's queue.
> @@ -640,20 +1170,20 @@ void rcu_offline_cpu_rt(int cpu)
> */
>
> rdp = RCU_DATA_ME();
> - spin_lock_irqsave(&rdp->lock, oldirq);
> + spin_lock_irqsave(&rdp->lock, flags);
> *rdp->nexttail = list;
> if (list)
> rdp->nexttail = tail;
> - spin_unlock_irqrestore(&rdp->lock, oldirq);
> + spin_unlock_irqrestore(&rdp->lock, flags);
> }
>
> void __devinit rcu_online_cpu_rt(int cpu)
> {
> - unsigned long oldirq;
> + unsigned long flags;
>
> - spin_lock_irqsave(&rcu_ctrlblk.fliplock, oldirq);
> + spin_lock_irqsave(&rcu_ctrlblk.fliplock, flags);
> cpu_set(cpu, rcu_cpu_online_map);
> - spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
> + spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, flags);
> }
>
> #else /* #ifdef CONFIG_HOTPLUG_CPU */
> @@ -695,12 +1225,12 @@ void rcu_process_callbacks_rt(struct sof
> void fastcall call_rcu_preempt(struct rcu_head *head,
> void (*func)(struct rcu_head *rcu))
> {
> - unsigned long oldirq;
> + unsigned long flags;
> struct rcu_data *rdp;
>
> head->func = func;
> head->next = NULL;
> - local_irq_save(oldirq);
> + local_irq_save(flags);
> rdp = RCU_DATA_ME();
> spin_lock(&rdp->lock);
> __rcu_advance_callbacks(rdp);
> @@ -708,7 +1238,7 @@ void fastcall call_rcu_preempt(struct rc
> rdp->nexttail = &head->next;
> RCU_TRACE_RDP(rcupreempt_trace_next_add, rdp);
> spin_unlock(&rdp->lock);
> - local_irq_restore(oldirq);
> + local_irq_restore(flags);
> }
> EXPORT_SYMBOL_GPL(call_rcu_preempt);
>
> @@ -757,6 +1287,11 @@ int rcu_pending_rt(int cpu)
> return 0;
> }
>
> +/*
> + * Initialize RCU. This is called very early in boot, so is restricted
> + * to very simple operations. Don't even think about messing with anything
> + * that involves the scheduler, as it doesn't exist yet.
> + */
> void __init rcu_init_rt(void)
> {
> int cpu;
> @@ -778,6 +1313,7 @@ void __init rcu_init_rt(void)
> rdp->donelist = NULL;
> rdp->donetail = &rdp->donelist;
> }
> + init_rcu_boost_early();
> }
>
> /*
> diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/kernel/rtmutex.c linux-2.6.22-F-boostrcu/kernel/rtmutex.c
> --- linux-2.6.22-E-hotplug/kernel/rtmutex.c 2007-07-08 16:32:17.000000000 -0700
> +++ linux-2.6.22-F-boostrcu/kernel/rtmutex.c 2007-08-24 11:24:59.000000000 -0700
> @@ -111,11 +111,12 @@ static inline void mark_rt_mutex_waiters
> */
> int rt_mutex_getprio(struct task_struct *task)
> {
> + int prio = min(task->normal_prio, get_rcu_prio(task));
> +
> if (likely(!task_has_pi_waiters(task)))
> - return task->normal_prio;
> + return prio;
>
> - return min(task_top_pi_waiter(task)->pi_list_entry.prio,
> - task->normal_prio);
> + return min(task_top_pi_waiter(task)->pi_list_entry.prio, prio);
> }
>
> /*
> diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/kernel/sched.c linux-2.6.22-F-boostrcu/kernel/sched.c
> --- linux-2.6.22-E-hotplug/kernel/sched.c 2007-07-08 16:32:17.000000000 -0700
> +++ linux-2.6.22-F-boostrcu/kernel/sched.c 2007-08-24 11:24:59.000000000 -0700
> @@ -1702,6 +1702,7 @@ void fastcall sched_fork(struct task_str
> * Make sure we do not leak PI boosting priority to the child:
> */
> p->prio = current->normal_prio;
> + set_rcu_prio(p, MAX_PRIO);
>
> INIT_LIST_HEAD(&p->run_list);
> p->array = NULL;
> @@ -1784,6 +1785,7 @@ void fastcall wake_up_new_task(struct ta
> else {
> p->prio = current->prio;
> p->normal_prio = current->normal_prio;
> + set_rcu_prio(p, MAX_PRIO);
> list_add_tail(&p->run_list, &current->run_list);
> p->array = current->array;
> p->array->nr_active++;
> @@ -3590,6 +3592,8 @@ asmlinkage void __sched schedule(void)
> }
> profile_hit(SCHED_PROFILING, __builtin_return_address(0));
>
> + rcu_preempt_boost();
> +
> need_resched:
> preempt_disable();
> prev = current;
> @@ -5060,6 +5064,7 @@ void __cpuinit init_idle(struct task_str
> idle->sleep_avg = 0;
> idle->array = NULL;
> idle->prio = idle->normal_prio = MAX_PRIO;
> + set_rcu_prio(idle, MAX_PRIO);
> idle->state = TASK_RUNNING;
> idle->cpus_allowed = cpumask_of_cpu(cpu);
> set_task_cpu(idle, cpu);
> diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/lib/Kconfig.debug linux-2.6.22-F-boostrcu/lib/Kconfig.debug
> --- linux-2.6.22-E-hotplug/lib/Kconfig.debug 2007-07-08 16:32:17.000000000 -0700
> +++ linux-2.6.22-F-boostrcu/lib/Kconfig.debug 2007-08-24 11:24:59.000000000 -0700
> @@ -391,6 +391,40 @@ config RCU_TORTURE_TEST
> Say M if you want the RCU torture tests to build as a module.
> Say N if you are unsure.
>
> +config RCU_TRACE
> + bool "Enable tracing for RCU - currently stats in debugfs"
> + select DEBUG_FS
> + depends on DEBUG_KERNEL
> + default y
> + help
> + This option provides tracing in RCU which presents stats
> + in debugfs for debugging RCU implementation.
> +
> + Say Y here if you want to enable RCU tracing
> + Say N if you are unsure.
> +
> +config PREEMPT_RCU_BOOST_STATS
> + bool "Enable RCU priority-boosting statistic printing"
> + depends on PREEMPT_RCU_BOOST
> + depends on DEBUG_KERNEL
> + default n
> + help
> + This option enables debug printk()s of RCU boost statistics,
> + which are normally only used to debug RCU priority boost
> + implementations.
> +
> + Say N if you are unsure.
> +
> +config PREEMPT_RCU_BOOST_STATS_INTERVAL
> + int "RCU priority-boosting statistic printing interval (seconds)"
> + depends on PREEMPT_RCU_BOOST_STATS
> + default 100
> + range 10 86400
> + help
> + This option controls the timing of debug printk()s of RCU boost
> + statistics, which are normally only used to debug RCU priority
> + boost implementations.
> +
> config LKDTM
> tristate "Linux Kernel Dump Test Tool Module"
> depends on DEBUG_KERNEL

--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2007-09-28 23:05:45

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC 6/9] RCU priority boosting for preemptible RCU



--
On Fri, 28 Sep 2007, Gautham R Shenoy wrote:

> >
> > +#ifdef CONFIG_PREEMPT_RCU_BOOST
> > +/*
> > + * Task state with respect to being RCU-boosted. This state is changed
> > + * by the task itself in response to the following three events:
> ^^^
> > + * 1. Preemption (or block on lock) while in RCU read-side critical section.
>
> I am wondering, can a task block on a lock while in RCU read-side
> critical section?

I think this may be specific to the -rt patch. In the -rt patch,
spin_locks turn into mutexes, and therefore can block a read-side critical
section.
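
For concreteness, a reader of the kind being discussed might look roughly
like the sketch below (illustrative only -- the struct, pointer, and lock
names here are made up, not taken from the patch). Under mainline,
spin_lock() keeps the reader nonpreemptible, but under -rt spinlock_t is a
sleeping rtmutex, so the enclosing RCU read-side critical section can block:

	#include <linux/rcupdate.h>
	#include <linux/spinlock.h>

	struct foo {
		int a;
	};

	static struct foo *gp;			/* hypothetical RCU-protected pointer */
	static DEFINE_SPINLOCK(my_lock);	/* spinlock_t: a sleeping rtmutex under -rt */

	static int example_reader(void)
	{
		struct foo *p;
		int val = -1;

		rcu_read_lock();
		p = rcu_dereference(gp);
		spin_lock(&my_lock);	/* may sleep under -rt, blocking the */
					/* enclosing read-side critical section */
		if (p)
			val = p->a;
		spin_unlock(&my_lock);
		rcu_read_unlock();
		return val;
	}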


> > + * 2. Outermost rcu_read_unlock() for blocked RCU read-side critical section.
> > + *
>
> Event #3. is missing?

I guess Paul needs to answer that one ;-)

-- Steve

2007-09-30 03:11:48

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 6/9] RCU priority boosting for preemptible RCU

On Fri, Sep 28, 2007 at 07:05:14PM -0400, Steven Rostedt wrote:
>
>
> --
> On Fri, 28 Sep 2007, Gautham R Shenoy wrote:
>
> > >
> > > +#ifdef CONFIG_PREEMPT_RCU_BOOST
> > > +/*
> > > + * Task state with respect to being RCU-boosted. This state is changed
> > > + * by the task itself in response to the following three events:
> > ^^^
> > > + * 1. Preemption (or block on lock) while in RCU read-side critical section.
> >
> > I am wondering, can a task block on a lock while in RCU read-side
> > critical section?
>
> I think this may be specific to the -rt patch. In the -rt patch,
> spin_locks turn into mutexes, and therefore can block a read-side critical
> section.

Yep! I do need to fix the comment.

> > > + * 2. Outermost rcu_read_unlock() for blocked RCU read-side critical section.
> > > + *
> >
> > Event #3. is missing?
>
> I guess Paul needs to answer that one ;-)

An older version had three, the new one has two, and I forgot to s/three/two/.

Thanx, Paul

2007-09-30 16:27:12

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On 09/28, Paul E. McKenney wrote:
>
> On Fri, Sep 28, 2007 at 06:47:14PM +0400, Oleg Nesterov wrote:
> > Ah, I was confused by the comment,
> >
> > smp_mb(); /* Don't call for memory barriers before we see zero. */
> > ^^^^^^^^^^^^^^^^^^
> > So, in fact, we need this barrier to make sure that _other_ CPUs see these
> > changes in order, thanks. Of course, _we_ already saw zero.
>
> Fair point!
>
> Perhaps: "Ensure that all CPUs see their rcu_mb_flag -after- the
> rcu_flipctrs sum to zero" or some such?
>
> > But in that particular case this doesn't matter, rcu_try_flip_waitzero()
> > is the only function which reads the "non-local" per_cpu(rcu_flipctr), so
> > it doesn't really need the barrier? (besides, it is always called under
> > fliplock).
>
> The final rcu_read_unlock() that zeroed the sum was -not- under fliplock,
> so we cannot necessarily rely on locking to trivialize all of this.

Yes, but still I think this mb() is not necessary. Because we don't need
the "if we saw rcu_mb_flag we must see sum(lastidx)==0" property. When another
CPU calls rcu_try_flip_waitzero(), it will use another lastidx. OK, minor issue,
please forget.

> > OK, the last (I promise :) off-topic question. When CPU 0 and 1 share a
> > store buffer, the situation is simple, we can replace "CPU 0 stores" with
> > "CPU 1 stores". But what if CPU 0 is equally "far" from CPUs 1 and 2?
> >
> > Suppose that CPU 1 does
> >
> > wmb();
> > B = 0
> >
> > Can we assume that CPU 2 doing
> >
> > if (B == 0) {
> > rmb();
> >
> > must see all invalidations from CPU 0 which were seen by CPU 1 before wmb() ?
>
> Yes. CPU 2 saw something following CPU 1's wmb(), so any of CPU 2's
> reads following its rmb() must therefore see all of CPU 1's stores
> preceding the wmb().

Ah, but I asked the different question. We must see CPU 1's stores by
definition, but what about CPU 0's stores (which could be seen by CPU 1)?

Let's take a "real life" example,

A = B = X = 0;
P = Q = &A;

CPU_0 CPU_1 CPU_2

P = &B; *P = 1; if (X) {
wmb(); rmb();
X = 1; BUG_ON(*P != 1 && *Q != 1);
}

So, is it possible that CPU_1 sees P == &B, but CPU_2 sees P == &A ?

> The other approach would be to simply have a separate thread for this
> purpose. Batching would amortize the overhead (a single trip around the
> CPUs could satisfy an arbitrarily large number of synchronize_sched()
> requests).

Yes, this way we don't need to uglify migration_thread(). OTOH, we need
another kthread ;)

Oleg.

2007-09-30 16:34:36

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH RFC 5/9] RCU: CPU hotplug support for preemptible RCU

On 09/10, Paul E. McKenney wrote:
>
> --- linux-2.6.22-d-schedclassic/kernel/rcupreempt.c 2007-08-22 15:45:28.000000000 -0700
> +++ linux-2.6.22-e-hotplugcpu/kernel/rcupreempt.c 2007-08-22 15:56:22.000000000 -0700
> @@ -125,6 +125,8 @@ enum rcu_mb_flag_values {
> };
> static DEFINE_PER_CPU(enum rcu_mb_flag_values, rcu_mb_flag) = rcu_mb_done;
>
> +static cpumask_t rcu_cpu_online_map = CPU_MASK_NONE;

I'd suggest appending "__read_mostly"

> +void rcu_offline_cpu_rt(int cpu)
> +{
> [...snip...]
> + spin_lock_irqsave(&rcu_ctrlblk.fliplock, oldirq);
> + rcu_check_mb(cpu);
> + if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped) {
> + smp_mb(); /* Subsequent counter accesses must see new value */
> + per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;
> + smp_mb(); /* Subsequent RCU read-side critical sections */
> + /* seen -after- acknowledgement. */

Imho, all these barriers are unneeded and confusing, we can't do them on behalf
of a dead CPU anyway. Can't we just do

per_cpu(rcu_mb_flag, cpu) = rcu_mb_done;
per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;
?

Why can't we also do

__get_cpu_var(rcu_flipctr)[0] += per_cpu(rcu_flipctr, cpu)[0];
per_cpu(rcu_flipctr, cpu)[0] = 0;
__get_cpu_var(rcu_flipctr)[1] += per_cpu(rcu_flipctr, cpu)[1];
per_cpu(rcu_flipctr, cpu)[1] = 0;

? This way rcu_try_flip_waitzero() can also use rcu_cpu_online_map. This cpu
is dead, nobody can modify per_cpu(rcu_flipctr, cpu). And we can't confuse
rcu_try_flip_waitzero(), we are holding rcu_ctrlblk.fliplock.

> +void __devinit rcu_online_cpu_rt(int cpu)
> +{
> + unsigned long oldirq;
> +
> + spin_lock_irqsave(&rcu_ctrlblk.fliplock, oldirq);
> + cpu_set(cpu, rcu_cpu_online_map);

What if _cpu_up() fails? I think rcu_cpu_notify(CPU_UP_CANCELED) should call
rcu_offline_cpu_rt() too.

Oleg.

2007-09-30 23:02:22

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Sun, 30 Sep 2007, Oleg Nesterov wrote:

> Ah, but I asked the different question. We must see CPU 1's stores by
> definition, but what about CPU 0's stores (which could be seen by CPU 1)?
>
> Let's take a "real life" example,
>
> A = B = X = 0;
> P = Q = &A;
>
> CPU_0 CPU_1 CPU_2
>
> P = &B; *P = 1; if (X) {
> wmb(); rmb();
> X = 1; BUG_ON(*P != 1 && *Q != 1);
> }
>
> So, is it possible that CPU_1 sees P == &B, but CPU_2 sees P == &A ?

That can't be. CPU_2 sees X=1, that happened after (or same time at most -
from a cache inv. POV) to *P=1, that must have happened after P=&B (in
order for *P to assign B). So P=&B happened, from a pure time POV, before
the rmb(), and the rmb() should guarantee that CPU_2 sees P=&B too.



- Davide


2007-10-01 17:06:42

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Sun, Sep 30, 2007 at 04:02:09PM -0700, Davide Libenzi wrote:
> On Sun, 30 Sep 2007, Oleg Nesterov wrote:
>
> > Ah, but I asked the different question. We must see CPU 1's stores by
> > definition, but what about CPU 0's stores (which could be seen by CPU 1)?
> >
> > Let's take a "real life" example,
> >
> > A = B = X = 0;
> > P = Q = &A;
> >
> > CPU_0 CPU_1 CPU_2
> >
> > P = &B; *P = 1; if (X) {
> > wmb(); rmb();
> > X = 1; BUG_ON(*P != 1 && *Q != 1);
> > }
> >
> > So, is it possible that CPU_1 sees P == &B, but CPU_2 sees P == &A ?
>
> That can't be. CPU_2 sees X=1, that happened after (or same time at most -
> from a cache inv. POV) to *P=1, that must have happened after P=&B (in
> order for *P to assign B). So P=&B happened, from a pure time POV, before
> the rmb(), and the rmb() should guarantee that CPU_2 sees P=&B too.

Actually, CPU designers have to go quite a ways out of their way to
prevent this BUG_ON from happening. One way that it would happen
naturally would be if the cache line containing P were owned by CPU 2,
and if CPUs 0 and 1 shared a store buffer that they both snooped. So,
here is what could happen given careless or sadistic CPU designers:

o CPU 0 stores &B to P, but misses the cache, so puts the
result in the store buffer. This means that only CPUs 0 and 1
can see it.

o CPU 1 fetches P, and sees &B, so stores a 1 to B. Again,
this value for P is visible only to CPUs 0 and 1.

o CPU 1 executes a wmb(), which forces CPU 1's stores to happen
in order. But it does nothing about CPU 0's stores, nor about CPU
1's loads, for that matter (and the only reason that POWER ends
up working the way you would like is because wmb() turns into
"sync" rather than the "eieio" instruction that would have been
used for smp_wmb() -- which is maybe what Oleg was thinking of,
but happened to abbreviate. If my analysis is buggy, Anton and
Paulus will no doubt correct me...)

o CPU 1 stores to X.

o CPU 2 loads X, and sees that the value is 1.

o CPU 2 does an rmb(), which orders its loads, but does nothing
about anyone else's loads or stores.

o CPU 2 fetches P from its cached copy, which still points to A,
which is still zero. So the BUG_ON fires.

o Some time later, CPU 0 gets the cache line containing P from
CPU 2, and updates it from the value in the store buffer, but
too late...

Unfortunately, cache-coherence protocols don't care much about pure
time... It is possible to make a 16-CPU machine believe that a single
variable has more than ten different values -at- -the- -same- -time-.
This is easy to do -- have all the CPUs store different values to the
same variable at the same time, then reload, collecting timestamps
between each pair of operations. On a large SMP, the values will sit
in the store buffers for many hundreds of nanoseconds, perhaps even
several microseconds, while the cache line containing the variable
being stored to shuttles around among the CPUs. ;-)
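
To make the structure of the example concrete, here is one way it could be
written down as a standalone userspace sketch, with C11 fences standing in
for the kernel's wmb()/rmb(). This is purely illustrative: it mirrors
Oleg's litmus test, is not expected to actually trigger the reordering on
ordinary hardware, and deliberately keeps the plain (racy) accesses to A
and B from the original rather than making them race-free:

	/* Build with: gcc -O2 -pthread litmus.c */
	#include <pthread.h>
	#include <stdatomic.h>
	#include <assert.h>
	#include <stdio.h>

	static int A, B;		/* A = B = 0 (static initialization) */
	static _Atomic int X;		/* X = 0 */
	static int *_Atomic P;
	static int *_Atomic Q;

	static void *cpu0(void *arg)	/* P = &B; */
	{
		atomic_store_explicit(&P, &B, memory_order_relaxed);
		return NULL;
	}

	static void *cpu1(void *arg)	/* *P = 1; wmb(); X = 1; */
	{
		*atomic_load_explicit(&P, memory_order_relaxed) = 1;
		atomic_thread_fence(memory_order_release);	/* stands in for wmb() */
		atomic_store_explicit(&X, 1, memory_order_relaxed);
		return NULL;
	}

	static void *cpu2(void *arg)	/* if (X) { rmb(); BUG_ON(...); } */
	{
		if (atomic_load_explicit(&X, memory_order_relaxed)) {
			atomic_thread_fence(memory_order_acquire);	/* stands in for rmb() */
			int *p = atomic_load_explicit(&P, memory_order_relaxed);
			int *q = atomic_load_explicit(&Q, memory_order_relaxed);
			assert(!(*p != 1 && *q != 1));	/* the BUG_ON under discussion */
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t[3];
		int i;

		atomic_store(&P, &A);	/* P = Q = &A; */
		atomic_store(&Q, &A);

		pthread_create(&t[0], NULL, cpu0, NULL);
		pthread_create(&t[1], NULL, cpu1, NULL);
		pthread_create(&t[2], NULL, cpu2, NULL);
		for (i = 0; i < 3; i++)
			pthread_join(t[i], NULL);
		printf("the assertion did not fire on this run\n");
		return 0;
	}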

Thanx, Paul

2007-10-01 17:06:54

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 5/9] RCU: CPU hotplug support for preemptible RCU

On Sun, Sep 30, 2007 at 08:38:49PM +0400, Oleg Nesterov wrote:
> On 09/10, Paul E. McKenney wrote:
> >
> > --- linux-2.6.22-d-schedclassic/kernel/rcupreempt.c 2007-08-22 15:45:28.000000000 -0700
> > +++ linux-2.6.22-e-hotplugcpu/kernel/rcupreempt.c 2007-08-22 15:56:22.000000000 -0700
> > @@ -125,6 +125,8 @@ enum rcu_mb_flag_values {
> > };
> > static DEFINE_PER_CPU(enum rcu_mb_flag_values, rcu_mb_flag) = rcu_mb_done;
> >
> > +static cpumask_t rcu_cpu_online_map = CPU_MASK_NONE;
>
> I'd suggest appending "__read_mostly"

Makes sense!
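
That is, presumably just:

	static cpumask_t rcu_cpu_online_map __read_mostly = CPU_MASK_NONE;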

> > +void rcu_offline_cpu_rt(int cpu)
> > +{
> > [...snip...]
> > + spin_lock_irqsave(&rcu_ctrlblk.fliplock, oldirq);
> > + rcu_check_mb(cpu);
> > + if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped) {
> > + smp_mb(); /* Subsequent counter accesses must see new value */
> > + per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;
> > + smp_mb(); /* Subsequent RCU read-side critical sections */
> > + /* seen -after- acknowledgement. */
>
> Imho, all these barriers are unneeded and confusing, we can't do them on behalf
> of a dead CPU anyway. Can't we just do
>
> per_cpu(rcu_mb_flag, cpu) = rcu_mb_done;
> per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;
> ?

You are likely correct, but this is a slow path, extremely hard to
stress test, and I am freakin' paranoid about this sort of thing.

> Why can't we also do
>
> __get_cpu_var(rcu_flipctr)[0] += per_cpu(rcu_flipctr, cpu)[0];
> per_cpu(rcu_flipctr, cpu)[0] = 0;
> __get_cpu_var(rcu_flipctr)[1] += per_cpu(rcu_flipctr, cpu)[1];
> per_cpu(rcu_flipctr, cpu)[1] = 0;
>
> ? This way rcu_try_flip_waitzero() can also use rcu_cpu_online_map. This cpu
> is dead, nobody can modify per_cpu(rcu_flipctr, cpu). And we can't confuse
> rcu_try_flip_waitzero(), we are holding rcu_ctrlblk.fliplock.

Very good point!!! This would reduce latencies on systems where
the number of possible CPUs greatly exceeds that of the number of
online CPUs, so seems quite worthwhile.

> > +void __devinit rcu_online_cpu_rt(int cpu)
> > +{
> > + unsigned long oldirq;
> > +
> > + spin_lock_irqsave(&rcu_ctrlblk.fliplock, oldirq);
> > + cpu_set(cpu, rcu_cpu_online_map);
>
> What if _cpu_up() fails? I think rcu_cpu_notify(CPU_UP_CANCELED) should call
> rcu_offline_cpu_rt() too.

Good catch, will fix!!!
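
A rough sketch of the sort of fix being discussed (illustrative only: the
notifier cases below are assumptions, in particular that onlining happens
at CPU_UP_PREPARE so that a failed bring-up has something to undo):

	/* In kernel/rcupreempt.c; needs <linux/cpu.h> and <linux/notifier.h>. */
	static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
					    unsigned long action, void *hcpu)
	{
		long cpu = (long)hcpu;

		switch (action) {
		case CPU_UP_PREPARE:
			rcu_online_cpu_rt(cpu);
			break;
		case CPU_UP_CANCELED:	/* _cpu_up() failed: undo the onlining */
		case CPU_DEAD:
			rcu_offline_cpu_rt(cpu);
			break;
		default:
			break;
		}
		return NOTIFY_OK;
	}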

Thanx, Paul

2007-10-01 17:07:15

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Sun, Sep 30, 2007 at 08:31:02PM +0400, Oleg Nesterov wrote:
> On 09/28, Paul E. McKenney wrote:
> >
> > On Fri, Sep 28, 2007 at 06:47:14PM +0400, Oleg Nesterov wrote:
> > > Ah, I was confused by the comment,
> > >
> > > smp_mb(); /* Don't call for memory barriers before we see zero. */
> > > ^^^^^^^^^^^^^^^^^^
> > > So, in fact, we need this barrier to make sure that _other_ CPUs see these
> > > changes in order, thanks. Of course, _we_ already saw zero.
> >
> > Fair point!
> >
> > Perhaps: "Ensure that all CPUs see their rcu_mb_flag -after- the
> > rcu_flipctrs sum to zero" or some such?
> >
> > > But in that particular case this doesn't matter, rcu_try_flip_waitzero()
> > > is the only function which reads the "non-local" per_cpu(rcu_flipctr), so
> > > it doesn't really need the barrier? (besides, it is always called under
> > > fliplock).
> >
> > The final rcu_read_unlock() that zeroed the sum was -not- under fliplock,
> > so we cannot necessarily rely on locking to trivialize all of this.
>
> Yes, but still I think this mb() is not necessary. Because we don't need
> the "if we saw rcu_mb_flag we must see sum(lastidx)==0" property. When another
> CPU calls rcu_try_flip_waitzero(), it will use another lastidx. OK, minor issue,
> please forget.

Will do! ;-)

> > > OK, the last (I promise :) off-topic question. When CPU 0 and 1 share a
> > > store buffer, the situation is simple, we can replace "CPU 0 stores" with
> > > "CPU 1 stores". But what if CPU 0 is equally "far" from CPUs 1 and 2?
> > >
> > > Suppose that CPU 1 does
> > >
> > > wmb();
> > > B = 0
> > >
> > > Can we assume that CPU 2 doing
> > >
> > > if (B == 0) {
> > > rmb();
> > >
> > > must see all invalidations from CPU 0 which were seen by CPU 1 before wmb() ?
> >
> > Yes. CPU 2 saw something following CPU 1's wmb(), so any of CPU 2's
> > reads following its rmb() must therefore see all of CPU 1's stores
> > preceding the wmb().
>
> Ah, but I asked the different question. We must see CPU 1's stores by
> definition, but what about CPU 0's stores (which could be seen by CPU 1)?
>
> Let's take a "real life" example,
>
> A = B = X = 0;
> P = Q = &A;
>
> CPU_0 CPU_1 CPU_2
>
> P = &B; *P = 1; if (X) {
> wmb(); rmb();
> X = 1; BUG_ON(*P != 1 && *Q != 1);
> }
>
> So, is it possible that CPU_1 sees P == &B, but CPU_2 sees P == &A ?

It depends. ;-)

o Itanium: because both wmb() and rmb() map to the "mf"
instruction, and because "mf" instructions map to a
single global order, the BUG_ON cannot happen. (But
I could easily be mistaken -- I cannot call myself an
Itanium memory-ordering expert.) See:

ftp://download.intel.com/design/Itanium/Downloads/25142901.pdf

for the official story.

o POWER: because wmb() maps to the "sync" instruction,
cumulativity applies, so that any instruction provably
following "X = 1" will see "P = &B" if the "*P = 1"
statement saw it. So the BUG_ON cannot happen.

o i386: memory ordering respects transitive visibility,
which seems to be similar to POWER's cumulativity
(http://developer.intel.com/products/processor/manuals/318147.pdf),
so the BUG_ON cannot happen.

o x86_64: same as i386.

o s390: the basic memory-ordering model is tight enough that the
BUG_ON cannot happen. (If I am confused about this, the s390
guys will not be shy about correcting me!)

o ARM: beats the heck out of me.

> > The other approach would be to simply have a separate thread for this
> > purpose. Batching would amortize the overhead (a single trip around the
> > CPUs could satisfy an arbitrarily large number of synchronize_sched()
> > requests).
>
> Yes, this way we don't need to uglify migration_thread(). OTOH, we need
> another kthread ;)

True enough!!!

Thanx, Paul

2007-10-01 18:44:41

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Sun, 30 Sep 2007, Paul E. McKenney wrote:

> On Sun, Sep 30, 2007 at 04:02:09PM -0700, Davide Libenzi wrote:
> > On Sun, 30 Sep 2007, Oleg Nesterov wrote:
> >
> > > Ah, but I asked the different question. We must see CPU 1's stores by
> > > definition, but what about CPU 0's stores (which could be seen by CPU 1)?
> > >
> > > Let's take a "real life" example,
> > >
> > > A = B = X = 0;
> > > P = Q = &A;
> > >
> > > CPU_0 CPU_1 CPU_2
> > >
> > > P = &B; *P = 1; if (X) {
> > > wmb(); rmb();
> > > X = 1; BUG_ON(*P != 1 && *Q != 1);
> > > }
> > >
> > > So, is it possible that CPU_1 sees P == &B, but CPU_2 sees P == &A ?
> >
> > That can't be. CPU_2 sees X=1, that happened after (or same time at most -
> > from a cache inv. POV) to *P=1, that must have happened after P=&B (in
> > order for *P to assign B). So P=&B happened, from a pure time POV, before
> > the rmb(), and the rmb() should guarantee that CPU_2 sees P=&B too.
>
> Actually, CPU designers have to go quite a ways out of their way to
> prevent this BUG_ON from happening. One way that it would happen
> naturally would be if the cache line containing P were owned by CPU 2,
> and if CPUs 0 and 1 shared a store buffer that they both snooped. So,
> here is what could happen given careless or sadistic CPU designers:

Ohh, I misinterpreted that rmb(), sorry. Somehow I took it for granted
that it was a cross-CPU sync point (ala read_barrier_depends). If that's a
local CPU load ordering only, things are different, clearly. But ...



>
> o CPU 0 stores &B to P, but misses the cache, so puts the
> result in the store buffer. This means that only CPUs 0 and 1
> can see it.
>
> o CPU 1 fetches P, and sees &B, so stores a 1 to B. Again,
> this value for P is visible only to CPUs 0 and 1.
>
> o CPU 1 executes a wmb(), which forces CPU 1's stores to happen
> in order. But it does nothing about CPU 0's stores, nor about CPU
> 1's loads, for that matter (and the only reason that POWER ends
> up working the way you would like is because wmb() turns into
> "sync" rather than the "eieio" instruction that would have been
> used for smp_wmb() -- which is maybe what Oleg was thinking of,
> but happened to abbreviate. If my analysis is buggy, Anton and
> Paulus will no doubt correct me...)

If a store buffer is shared between CPU_0 and CPU_1, it is very likely
that a sync done on CPU_1 is going to sync even CPU_0 stores that are
held in the buffer at the time of CPU_1's sync.



- Davide


2007-10-01 19:21:50

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Mon, Oct 01, 2007 at 11:44:25AM -0700, Davide Libenzi wrote:
> On Sun, 30 Sep 2007, Paul E. McKenney wrote:
>
> > On Sun, Sep 30, 2007 at 04:02:09PM -0700, Davide Libenzi wrote:
> > > On Sun, 30 Sep 2007, Oleg Nesterov wrote:
> > >
> > > > Ah, but I asked the different question. We must see CPU 1's stores by
> > > > definition, but what about CPU 0's stores (which could be seen by CPU 1)?
> > > >
> > > > Let's take a "real life" example,
> > > >
> > > > A = B = X = 0;
> > > > P = Q = &A;
> > > >
> > > > CPU_0 CPU_1 CPU_2
> > > >
> > > > P = &B; *P = 1; if (X) {
> > > > wmb(); rmb();
> > > > X = 1; BUG_ON(*P != 1 && *Q != 1);
> > > > }
> > > >
> > > > So, is it possible that CPU_1 sees P == &B, but CPU_2 sees P == &A ?
> > >
> > > That can't be. CPU_2 sees X=1, that happened after (or same time at most -
> > > from a cache inv. POV) to *P=1, that must have happened after P=&B (in
> > > order for *P to assign B). So P=&B happened, from a pure time POV, before
> > > the rmb(), and the rmb() should guarantee that CPU_2 sees P=&B too.
> >
> > Actually, CPU designers have to go quite a ways out of their way to
> > prevent this BUG_ON from happening. One way that it would happen
> > naturally would be if the cache line containing P were owned by CPU 2,
> > and if CPUs 0 and 1 shared a store buffer that they both snooped. So,
> > here is what could happen given careless or sadistic CPU designers:
>
> Ohh, I misinterpreted that rmb(), sorry. Somehow I took it for granted
> that it was a cross-CPU sync point (ala read_barrier_depends). If that's a
> local CPU load ordering only, things are different, clearly. But ...
>
> > o CPU 0 stores &B to P, but misses the cache, so puts the
> > result in the store buffer. This means that only CPUs 0 and 1
> > can see it.
> >
> > o CPU 1 fetches P, and sees &B, so stores a 1 to B. Again,
> > this value for P is visible only to CPUs 0 and 1.
> >
> > o CPU 1 executes a wmb(), which forces CPU 1's stores to happen
> > in order. But it does nothing about CPU 0's stores, nor about CPU
> > 1's loads, for that matter (and the only reason that POWER ends
> > up working the way you would like is because wmb() turns into
> > "sync" rather than the "eieio" instruction that would have been
> > used for smp_wmb() -- which is maybe what Oleg was thinking of,
> > but happened to abbreviate. If my analysis is buggy, Anton and
> > Paulus will no doubt correct me...)
>
> If a store buffer is shared between CPU_0 and CPU_1, it is very likely
> that a sync done on CPU_1 is going to sync even CPU_0 stores that are
> held in the buffer at the time of CPU_1's sync.

That would indeed be one approach that CPU designers could take to
avoid being careless or sadistic. ;-)

Another approach would be to make CPU 1 refrain from snooping CPU 0's
entries in the shared store queue.

Thanx, Paul

2007-10-01 22:09:29

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Mon, 1 Oct 2007, Paul E. McKenney wrote:

> That would indeed be one approach that CPU designers could take to
> avoid being careless or sadistic. ;-)

That'd be the easier (maybe the only) approach too for them, from a silicon
complexity POV. Distinguishing between different CPUs' stores once inside a
shared store buffer would require tagging them in some way. That'd defeat
most of the pros of having a shared store buffer ;)



- Davide


2007-10-01 22:25:04

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On Mon, Oct 01, 2007 at 03:09:16PM -0700, Davide Libenzi wrote:
> On Mon, 1 Oct 2007, Paul E. McKenney wrote:
>
> > That would indeed be one approach that CPU designers could take to
> > avoid being careless or sadistic. ;-)
>
> That'd be the easier (maybe the only) approach too for them, from a silicon
> complexity POV. Distinguishing between different CPUs' stores once inside a
> shared store buffer would require tagging them in some way. That'd defeat
> most of the pros of having a shared store buffer ;)

Tagging requires but one bit per entry. Depends on the workload -- if
lots of barriers, bursty stores and little sharing, tagging might win.
If lots of sharing, then your suggested approach might win.

Thanx, Paul

2007-10-02 17:58:11

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH RFC 3/9] RCU: Preemptible RCU

On 09/30, Paul E. McKenney wrote:
>
> On Sun, Sep 30, 2007 at 04:02:09PM -0700, Davide Libenzi wrote:
> > On Sun, 30 Sep 2007, Oleg Nesterov wrote:
> >
> > > Ah, but I asked the different question. We must see CPU 1's stores by
> > > definition, but what about CPU 0's stores (which could be seen by CPU 1)?
> > >
> > > Let's take a "real life" example,
> > >
> > > A = B = X = 0;
> > > P = Q = &A;
> > >
> > > CPU_0 CPU_1 CPU_2
> > >
> > > P = &B; *P = 1; if (X) {
> > > wmb(); rmb();
> > > X = 1; BUG_ON(*P != 1 && *Q != 1);
> > > }
> > >
> > > So, is it possible that CPU_1 sees P == &B, but CPU_2 sees P == &A ?
> >
> > That can't be. CPU_2 sees X=1, that happened after (or same time at most -
> > from a cache inv. POV) to *P=1, that must have happened after P=&B (in
> > order for *P to assign B). So P=&B happened, from a pure time POV, before
> > the rmb(), and the rmb() should guarantee that CPU_2 sees P=&B too.
>
> Actually, CPU designers have to go quite a ways out of their way to
> prevent this BUG_ON from happening. One way that it would happen
> naturally would be if the cache line containing P were owned by CPU 2,
> and if CPUs 0 and 1 shared a store buffer that they both snooped. So,
> here is what could happen given careless or sadistic CPU designers:
>
> o CPU 0 stores &B to P, but misses the cache, so puts the
> result in the store buffer. This means that only CPUs 0 and 1
> can see it.
>
> o CPU 1 fetches P, and sees &B, so stores a 1 to B. Again,
> this value for P is visible only to CPUs 0 and 1.
>
> o CPU 1 executes a wmb(), which forces CPU 1's stores to happen
> in order. But it does nothing about CPU 0's stores, nor about CPU
> 1's loads, for that matter (and the only reason that POWER ends
> up working the way you would like is because wmb() turns into
> "sync" rather than the "eieio" instruction that would have been
> used for smp_wmb() -- which is maybe what Oleg was thinking of,
> but happened to abbreviate. If my analysis is buggy, Anton and
> Paulus will no doubt correct me...)
>
> o CPU 1 stores to X.
>
> o CPU 2 loads X, and sees that the value is 1.
>
> o CPU 2 does an rmb(), which orders its loads, but does nothing
> about anyone else's loads or stores.
>
> o CPU 2 fetches P from its cached copy, which still points to A,
> which is still zero. So the BUG_ON fires.
>
> o Some time later, CPU 0 gets the cache line containing P from
> CPU 2, and updates it from the value in the store buffer, but
> too late...
>
> Unfortunately, cache-coherence protocols don't care much about pure
> time... It is possible to make a 16-CPU machine believe that a single
> variable has more than ten different values -at- -the- -same- -time-.

Davide, Paul, thank you very much! I've been wondering about this for a
long time; now I know the answer. Great.

Oleg.

by Gautham R Shenoy

Subject: Re: [PATCH RFC 6/9] RCU priority boosting for preemptible RCU

On Mon, Sep 10, 2007 at 11:39:01AM -0700, Paul E. McKenney wrote:

[snip]

> +
> +/*
> + * Return the list from which to boost target tasks.
> + * May only be invoked by the booster task, so guaranteed to
> + * already be initialized. Use rcu_boost_dat element least recently
> + * the destination for task blocking in RCU read-side critical sections.
> + */
> +static inline struct rcu_boost_dat *rcu_rbd_boosting(int cpu)
> +{
> + int idx = (rcu_boost_idx + 1) & (RCU_BOOST_ELEMENTS - 1);

Why is this masking required? When we increment
the rcu_boost_idx in rcu_booster, we do perform a modulo operation
to ensure that it wraps around RCU_BOOST_ELEMENTS.


> +
> + return &per_cpu(rcu_boost_dat, cpu)[idx];
> +}
> +
> +#define PREEMPT_RCU_BOOSTER_PRIO 49 /* Match curr_irq_prio manually. */
> + /* Administrators can always adjust */
> + /* via the /proc interface. */
> +
> +/*
> + * Boost the specified task from an RCU viewpoint.
> + * Boost the target task to a priority just a bit less-favored than
> + * that of the RCU-boost task, but boost to a realtime priority even
> + * if the RCU-boost task is running at a non-realtime priority.
> + * We check the priority of the RCU-boost task each time we boost
> + * in case the sysadm manually changes the priority.
> + */
> +static void rcu_boost_prio(struct task_struct *taskp)
> +{
> + unsigned long flags;
> + int rcuprio;
> +
> + spin_lock_irqsave(&current->pi_lock, flags);
> + rcuprio = rt_mutex_getprio(current) + 1;
> + if (rcuprio >= MAX_USER_RT_PRIO)
> + rcuprio = MAX_USER_RT_PRIO - 1;
> + spin_unlock_irqrestore(&current->pi_lock, flags);
> + spin_lock_irqsave(&taskp->pi_lock, flags);
> + if (taskp->rcu_prio != rcuprio) {
> + taskp->rcu_prio = rcuprio;
> + if (taskp->rcu_prio < taskp->prio)
> + rt_mutex_setprio(taskp, taskp->rcu_prio);
> + }
> + spin_unlock_irqrestore(&taskp->pi_lock, flags);
> +}
> +
> +/*
> + * Unboost the specified task from an RCU viewpoint.
> + */
> +static void rcu_unboost_prio(struct task_struct *taskp)
> +{
> + int nprio;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&taskp->pi_lock, flags);
> + taskp->rcu_prio = MAX_PRIO;
> + nprio = rt_mutex_getprio(taskp);
> + if (nprio > taskp->prio)
> + rt_mutex_setprio(taskp, nprio);
> + spin_unlock_irqrestore(&taskp->pi_lock, flags);
> +}
> +
> +/*
> + * Boost all of the RCU-reader tasks on the specified list.
> + */
> +static void rcu_boost_one_reader_list(struct rcu_boost_dat *rbdp)
> +{
> + LIST_HEAD(list);
> + unsigned long flags;
> + struct task_struct *taskp;
> +
> + /*
> + * Splice both lists onto a local list. We will still
> + * need to hold the lock when manipulating the local list
> + * because tasks can remove themselves at any time.
> + * The reason for splicing the rbs_boosted list is that
> + * our priority may have changed, so reboosting may be
> + * required.
> + */
> +
> + spin_lock_irqsave(&rbdp->rbs_lock, flags);
> + list_splice_init(&rbdp->rbs_toboost, &list);
> + list_splice_init(&rbdp->rbs_boosted, &list);
> + while (!list_empty(&list)) {
> +
> + /*
> + * Pause for a bit before boosting each task.
> + * @@@FIXME: reduce/eliminate pausing in case of OOM.
> + */
> +
> + spin_unlock_irqrestore(&rbdp->rbs_lock, flags);
> + schedule_timeout_uninterruptible(1);
> + spin_lock_irqsave(&rbdp->rbs_lock, flags);
> +
> + /*
> + * All tasks might have removed themselves while
> + * we were waiting. Recheck list emptiness.
> + */
> +
> + if (list_empty(&list))
> + break;
> +
> + /* Remove first task in local list, count the attempt. */
> +
> + taskp = list_entry(list.next, typeof(*taskp), rcub_entry);
> + list_del_init(&taskp->rcub_entry);
> + rbdp->rbs_boost_attempt++;
> +
> + /* Ignore tasks in unexpected states. */
> +
> + if (taskp->rcub_state == RCU_BOOST_IDLE) {
> + list_add_tail(&taskp->rcub_entry, &rbdp->rbs_toboost);
> + rcu_boost_dat_stat_boost(rbdp, taskp->rcub_state);
> + continue;
> + }
> +
> + /* Boost the task's priority. */
> +
> + rcu_boost_prio(taskp);
> + rbdp->rbs_boost++;
> + rcu_boost_dat_stat_boost(rbdp, taskp->rcub_state);
> + taskp->rcub_state = RCU_BOOSTED;
> + list_add_tail(&taskp->rcub_entry, &rbdp->rbs_boosted);
> + }
> + spin_unlock_irqrestore(&rbdp->rbs_lock, flags);
> +}
> +
> +/*
> + * Priority-boost tasks stuck in RCU read-side critical sections as
> + * needed (presumably rarely).
> + */
> +static int rcu_booster(void *arg)
> +{
> + int cpu;
> + struct sched_param sp = { .sched_priority = PREEMPT_RCU_BOOSTER_PRIO, };
> +
> + sched_setscheduler(current, SCHED_RR, &sp);
> + current->flags |= PF_NOFREEZE;
> +
> + do {
> +
> + /* Advance the lists of tasks. */
> +
> + rcu_boost_idx = (rcu_boost_idx + 1) % RCU_BOOST_ELEMENTS;
> + for_each_possible_cpu(cpu) {
> +
> + /*
> + * Boost all sufficiently aged readers.
> + * Readers must first be preempted or block
> + * on a mutex in an RCU read-side critical section,
> + * then remain in that critical section for
> + * RCU_BOOST_ELEMENTS-1 time intervals.
> + * So most of the time we should end up doing
> + * nothing.
> + */
> +
> + rcu_boost_one_reader_list(rcu_rbd_boosting(cpu));
> +
> + /*
> + * Large SMP systems may need to sleep sometimes
> + * in this loop. Or have multiple RCU-boost tasks.
> + */
> + }
> +
> + /*
> + * Sleep to allow any unstalled RCU read-side critical
> + * sections to age out of the list. @@@ FIXME: reduce,
> + * adjust, or eliminate in case of OOM.
> + */
> +
> + schedule_timeout_uninterruptible(HZ);
> +
> + /* Print stats if enough time has passed. */
> +
> + rcu_boost_dat_stat_print();
> +
> + } while (!kthread_should_stop());
> +
> + return 0;
> +}
> +
> +/*
> + * Perform the portions of RCU-boost initialization that require the
> + * scheduler to be up and running.
> + */
> +void init_rcu_boost_late(void)
> +{
> +
> + /* Spawn RCU-boost task. */
> +
> + printk(KERN_INFO "Starting RCU priority booster\n");
> + rcu_boost_task = kthread_run(rcu_booster, NULL, "RCU Prio Booster");
> + if (IS_ERR(rcu_boost_task))
> + panic("Unable to create RCU Priority Booster, errno %ld\n",
> + -PTR_ERR(rcu_boost_task));
> +}
> +
> +/*
> + * Update task's RCU-boost state to reflect blocking in RCU read-side
> + * critical section, so that the RCU-boost task can find it in case it
> + * later needs its priority boosted.
> + */
> +void __rcu_preempt_boost(void)
> +{
> + struct rcu_boost_dat *rbdp;
> + unsigned long flags;
> +
> + /* Identify list to place task on for possible later boosting. */
> +
> + local_irq_save(flags);
> + rbdp = rcu_rbd_new();
> + if (rbdp == NULL) {
> + local_irq_restore(flags);
> + printk(KERN_INFO
> + "Preempted RCU read-side critical section too early.\n");
> + return;
> + }
> + spin_lock(&rbdp->rbs_lock);
> + rbdp->rbs_blocked++;
> +
> + /*
> + * Update state. We hold the lock and aren't yet on the list,
> + * so the booster cannot mess with us yet.
> + */
> +
> + rcu_boost_dat_stat_block(rbdp, current->rcub_state);
> + if (current->rcub_state != RCU_BOOST_IDLE) {
> +
> + /*
> + * We have been here before, so just update stats.
> + * It may seem strange to do all this work just to
> + * accumulate statistics, but this is such a
> + * low-probability code path that we shouldn't care.
> + * If it becomes a problem, it can be fixed.
> + */
> +
> + spin_unlock_irqrestore(&rbdp->rbs_lock, flags);
> + return;
> + }
> + current->rcub_state = RCU_BOOST_BLOCKED;
> +
> + /* Now add ourselves to the list so that the booster can find us. */
> +
> + list_add_tail(&current->rcub_entry, &rbdp->rbs_toboost);
> + current->rcub_rbdp = rbdp;
> + spin_unlock_irqrestore(&rbdp->rbs_lock, flags);
> +}
> +
> +/*
> + * Do the list-removal and priority-unboosting "heavy lifting" when
> + * required.
> + */
> +static void __rcu_read_unlock_unboost(void)
> +{
> + unsigned long flags;
> + struct rcu_boost_dat *rbdp;
> +
> + /* Identify the list structure and acquire the corresponding lock. */
> +
> + rbdp = current->rcub_rbdp;
> + spin_lock_irqsave(&rbdp->rbs_lock, flags);
> +
> + /* Remove task from the list it was on. */
> +
> + list_del_init(&current->rcub_entry);
> + rbdp->rbs_unlock++;
> + current->rcub_rbdp = NULL;
> +
> + /* Record stats, unboost if needed, and update state. */
> +
> + rcu_boost_dat_stat_unlock(rbdp, current->rcub_state);
> + if (current->rcub_state == RCU_BOOSTED) {
> + rcu_unboost_prio(current);
> + rbdp->rbs_unboosted++;
> + }
> + current->rcub_state = RCU_BOOST_IDLE;
> + spin_unlock_irqrestore(&rbdp->rbs_lock, flags);
> +}
> +
> +/*
> + * Do any state changes and unboosting needed for rcu_read_unlock().
> + * Pass any complex work on to __rcu_read_unlock_unboost().
> + * The vast majority of the time, no work will be needed, as preemption
> + * and blocking within RCU read-side critical sections is comparatively
> + * rare.
> + */
> +static inline void rcu_read_unlock_unboost(void)
> +{
> +
> + if (unlikely(current->rcub_state != RCU_BOOST_IDLE))
> + __rcu_read_unlock_unboost();
> +}
> +
> +#endif /* #else #ifndef CONFIG_PREEMPT_RCU_BOOST */
> +
> /*
> * States for rcu_try_flip() and friends.
> */
> @@ -128,14 +662,6 @@ static DEFINE_PER_CPU(enum rcu_mb_flag_v
> static cpumask_t rcu_cpu_online_map = CPU_MASK_NONE;
>
> /*
> - * Macro that prevents the compiler from reordering accesses, but does
> - * absolutely -nothing- to prevent CPUs from reordering. This is used
> - * only to mediate communication between mainline code and hardware
> - * interrupt and NMI handlers.
> - */
> -#define ORDERED_WRT_IRQ(x) (*(volatile typeof(x) *)&(x))
> -
> -/*
> * RCU_DATA_ME: find the current CPU's rcu_data structure.
> * RCU_DATA_CPU: find the specified CPU's rcu_data structure.
> */
> @@ -194,7 +720,7 @@ void __rcu_read_lock(void)
> me->rcu_read_lock_nesting = nesting + 1;
>
> } else {
> - unsigned long oldirq;
> + unsigned long flags;
>
> /*
> * Disable local interrupts to prevent the grace-period
> @@ -203,7 +729,7 @@ void __rcu_read_lock(void)
> * contain rcu_read_lock().
> */
>
> - local_irq_save(oldirq);
> + local_irq_save(flags);
>
> /*
> * Outermost nesting of rcu_read_lock(), so increment
> @@ -233,7 +759,7 @@ void __rcu_read_lock(void)
> */
>
> ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx;
> - local_irq_restore(oldirq);
> + local_irq_restore(flags);
> }
> }
> EXPORT_SYMBOL_GPL(__rcu_read_lock);
> @@ -255,7 +781,7 @@ void __rcu_read_unlock(void)
> me->rcu_read_lock_nesting = nesting - 1;
>
> } else {
> - unsigned long oldirq;
> + unsigned long flags;
>
> /*
> * Disable local interrupts to prevent the grace-period
> @@ -264,7 +790,7 @@ void __rcu_read_unlock(void)
> * contain rcu_read_lock() and rcu_read_unlock().
> */
>
> - local_irq_save(oldirq);
> + local_irq_save(flags);
>
> /*
> * Outermost nesting of rcu_read_unlock(), so we must
> @@ -305,7 +831,10 @@ void __rcu_read_unlock(void)
> */
>
> ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])--;
> - local_irq_restore(oldirq);
> +
> + rcu_read_unlock_unboost();
> +
> + local_irq_restore(flags);
> }
> }
> EXPORT_SYMBOL_GPL(__rcu_read_unlock);
> @@ -504,10 +1033,10 @@ rcu_try_flip_waitmb(void)
> */
> static void rcu_try_flip(void)
> {
> - unsigned long oldirq;
> + unsigned long flags;
>
> RCU_TRACE_ME(rcupreempt_trace_try_flip_1);
> - if (unlikely(!spin_trylock_irqsave(&rcu_ctrlblk.fliplock, oldirq))) {
> + if (unlikely(!spin_trylock_irqsave(&rcu_ctrlblk.fliplock, flags))) {
> RCU_TRACE_ME(rcupreempt_trace_try_flip_e1);
> return;
> }
> @@ -534,7 +1063,7 @@ static void rcu_try_flip(void)
> if (rcu_try_flip_waitmb())
> rcu_try_flip_state = rcu_try_flip_idle_state;
> }
> - spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
> + spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, flags);
> }
>
> /*
> @@ -553,16 +1082,16 @@ static void rcu_check_mb(int cpu)
>
> void rcu_check_callbacks_rt(int cpu, int user)
> {
> - unsigned long oldirq;
> + unsigned long flags;
> struct rcu_data *rdp = RCU_DATA_CPU(cpu);
>
> rcu_check_mb(cpu);
> if (rcu_ctrlblk.completed == rdp->completed)
> rcu_try_flip();
> - spin_lock_irqsave(&rdp->lock, oldirq);
> + spin_lock_irqsave(&rdp->lock, flags);
> RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp);
> __rcu_advance_callbacks(rdp);
> - spin_unlock_irqrestore(&rdp->lock, oldirq);
> + spin_unlock_irqrestore(&rdp->lock, flags);
> }
>
> /*
> @@ -571,18 +1100,19 @@ void rcu_check_callbacks_rt(int cpu, int
> */
> void rcu_advance_callbacks_rt(int cpu, int user)
> {
> - unsigned long oldirq;
> + unsigned long flags;
> struct rcu_data *rdp = RCU_DATA_CPU(cpu);
>
> if (rcu_ctrlblk.completed == rdp->completed) {
> rcu_try_flip();
> if (rcu_ctrlblk.completed == rdp->completed)
> return;
> + rcu_read_unlock_unboost();
> }
> - spin_lock_irqsave(&rdp->lock, oldirq);
> + spin_lock_irqsave(&rdp->lock, flags);
> RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp);
> __rcu_advance_callbacks(rdp);
> - spin_unlock_irqrestore(&rdp->lock, oldirq);
> + spin_unlock_irqrestore(&rdp->lock, flags);
> }
>
> #ifdef CONFIG_HOTPLUG_CPU
> @@ -601,24 +1131,24 @@ void rcu_offline_cpu_rt(int cpu)
> {
> int i;
> struct rcu_head *list = NULL;
> - unsigned long oldirq;
> + unsigned long flags;
> struct rcu_data *rdp = RCU_DATA_CPU(cpu);
> struct rcu_head **tail = &list;
>
> /* Remove all callbacks from the newly dead CPU, retaining order. */
>
> - spin_lock_irqsave(&rdp->lock, oldirq);
> + spin_lock_irqsave(&rdp->lock, flags);
> rcu_offline_cpu_rt_enqueue(rdp->donelist, rdp->donetail, list, tail);
> for (i = GP_STAGES - 1; i >= 0; i--)
> rcu_offline_cpu_rt_enqueue(rdp->waitlist[i], rdp->waittail[i],
> list, tail);
> rcu_offline_cpu_rt_enqueue(rdp->nextlist, rdp->nexttail, list, tail);
> - spin_unlock_irqrestore(&rdp->lock, oldirq);
> + spin_unlock_irqrestore(&rdp->lock, flags);
> rdp->waitlistcount = 0;
>
> /* Disengage the newly dead CPU from grace-period computation. */
>
> - spin_lock_irqsave(&rcu_ctrlblk.fliplock, oldirq);
> + spin_lock_irqsave(&rcu_ctrlblk.fliplock, flags);
> rcu_check_mb(cpu);
> if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped) {
> smp_mb(); /* Subsequent counter accesses must see new value */
> @@ -627,7 +1157,7 @@ void rcu_offline_cpu_rt(int cpu)
> /* seen -after- acknowledgement. */
> }
> cpu_clear(cpu, rcu_cpu_online_map);
> - spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
> + spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, flags);
>
> /*
> * Place the removed callbacks on the current CPU's queue.
> @@ -640,20 +1170,20 @@ void rcu_offline_cpu_rt(int cpu)
> */
>
> rdp = RCU_DATA_ME();
> - spin_lock_irqsave(&rdp->lock, oldirq);
> + spin_lock_irqsave(&rdp->lock, flags);
> *rdp->nexttail = list;
> if (list)
> rdp->nexttail = tail;
> - spin_unlock_irqrestore(&rdp->lock, oldirq);
> + spin_unlock_irqrestore(&rdp->lock, flags);
> }
>
> void __devinit rcu_online_cpu_rt(int cpu)
> {
> - unsigned long oldirq;
> + unsigned long flags;
>
> - spin_lock_irqsave(&rcu_ctrlblk.fliplock, oldirq);
> + spin_lock_irqsave(&rcu_ctrlblk.fliplock, flags);
> cpu_set(cpu, rcu_cpu_online_map);
> - spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq);
> + spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, flags);
> }
>
> #else /* #ifdef CONFIG_HOTPLUG_CPU */
> @@ -695,12 +1225,12 @@ void rcu_process_callbacks_rt(struct sof
> void fastcall call_rcu_preempt(struct rcu_head *head,
> void (*func)(struct rcu_head *rcu))
> {
> - unsigned long oldirq;
> + unsigned long flags;
> struct rcu_data *rdp;
>
> head->func = func;
> head->next = NULL;
> - local_irq_save(oldirq);
> + local_irq_save(flags);
> rdp = RCU_DATA_ME();
> spin_lock(&rdp->lock);
> __rcu_advance_callbacks(rdp);
> @@ -708,7 +1238,7 @@ void fastcall call_rcu_preempt(struct rc
> rdp->nexttail = &head->next;
> RCU_TRACE_RDP(rcupreempt_trace_next_add, rdp);
> spin_unlock(&rdp->lock);
> - local_irq_restore(oldirq);
> + local_irq_restore(flags);
> }
> EXPORT_SYMBOL_GPL(call_rcu_preempt);
>
> @@ -757,6 +1287,11 @@ int rcu_pending_rt(int cpu)
> return 0;
> }
>
> +/*
> + * Initialize RCU. This is called very early in boot, so is restricted
> + * to very simple operations. Don't even think about messing with anything
> + * that involves the scheduler, as it doesn't exist yet.
> + */
> void __init rcu_init_rt(void)
> {
> int cpu;
> @@ -778,6 +1313,7 @@ void __init rcu_init_rt(void)
> rdp->donelist = NULL;
> rdp->donetail = &rdp->donelist;
> }
> + init_rcu_boost_early();
> }
>
> /*
> diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/kernel/rtmutex.c linux-2.6.22-F-boostrcu/kernel/rtmutex.c
> --- linux-2.6.22-E-hotplug/kernel/rtmutex.c 2007-07-08 16:32:17.000000000 -0700
> +++ linux-2.6.22-F-boostrcu/kernel/rtmutex.c 2007-08-24 11:24:59.000000000 -0700
> @@ -111,11 +111,12 @@ static inline void mark_rt_mutex_waiters
> */
> int rt_mutex_getprio(struct task_struct *task)
> {
> + int prio = min(task->normal_prio, get_rcu_prio(task));
> +
> if (likely(!task_has_pi_waiters(task)))
> - return task->normal_prio;
> + return prio;
>
> - return min(task_top_pi_waiter(task)->pi_list_entry.prio,
> - task->normal_prio);
> + return min(task_top_pi_waiter(task)->pi_list_entry.prio, prio);
> }
>
> /*
> diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/kernel/sched.c linux-2.6.22-F-boostrcu/kernel/sched.c
> --- linux-2.6.22-E-hotplug/kernel/sched.c 2007-07-08 16:32:17.000000000 -0700
> +++ linux-2.6.22-F-boostrcu/kernel/sched.c 2007-08-24 11:24:59.000000000 -0700
> @@ -1702,6 +1702,7 @@ void fastcall sched_fork(struct task_str
> * Make sure we do not leak PI boosting priority to the child:
> */
> p->prio = current->normal_prio;
> + set_rcu_prio(p, MAX_PRIO);
>
> INIT_LIST_HEAD(&p->run_list);
> p->array = NULL;
> @@ -1784,6 +1785,7 @@ void fastcall wake_up_new_task(struct ta
> else {
> p->prio = current->prio;
> p->normal_prio = current->normal_prio;
> + set_rcu_prio(p, MAX_PRIO);
> list_add_tail(&p->run_list, &current->run_list);
> p->array = current->array;
> p->array->nr_active++;
> @@ -3590,6 +3592,8 @@ asmlinkage void __sched schedule(void)
> }
> profile_hit(SCHED_PROFILING, __builtin_return_address(0));
>
> + rcu_preempt_boost();
> +
> need_resched:
> preempt_disable();
> prev = current;
> @@ -5060,6 +5064,7 @@ void __cpuinit init_idle(struct task_str
> idle->sleep_avg = 0;
> idle->array = NULL;
> idle->prio = idle->normal_prio = MAX_PRIO;
> + set_rcu_prio(idle, MAX_PRIO);
> idle->state = TASK_RUNNING;
> idle->cpus_allowed = cpumask_of_cpu(cpu);
> set_task_cpu(idle, cpu);
> diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/lib/Kconfig.debug linux-2.6.22-F-boostrcu/lib/Kconfig.debug
> --- linux-2.6.22-E-hotplug/lib/Kconfig.debug 2007-07-08 16:32:17.000000000 -0700
> +++ linux-2.6.22-F-boostrcu/lib/Kconfig.debug 2007-08-24 11:24:59.000000000 -0700
> @@ -391,6 +391,40 @@ config RCU_TORTURE_TEST
> Say M if you want the RCU torture tests to build as a module.
> Say N if you are unsure.
>
> +config RCU_TRACE
> + bool "Enable tracing for RCU - currently stats in debugfs"
> + select DEBUG_FS
> + depends on DEBUG_KERNEL
> + default y
> + help
> + This option provides tracing in RCU which presents stats
> + in debugfs for debugging RCU implementation.
> +
> + Say Y here if you want to enable RCU tracing
> + Say N if you are unsure.
> +
> +config PREEMPT_RCU_BOOST_STATS
> + bool "Enable RCU priority-boosting statistic printing"
> + depends on PREEMPT_RCU_BOOST
> + depends on DEBUG_KERNEL
> + default n
> + help
> + This option enables debug printk()s of RCU boost statistics,
> + which are normally only used to debug RCU priority boost
> + implementations.
> +
> + Say N if you are unsure.
> +
> +config PREEMPT_RCU_BOOST_STATS_INTERVAL
> + int "RCU priority-boosting statistic printing interval (seconds)"
> + depends on PREEMPT_RCU_BOOST_STATS
> + default 100
> + range 10 86400
> + help
> + This option controls the timing of debug printk()s of RCU boost
> + statistics, which are normally only used to debug RCU priority
> + boost implementations.
> +
> config LKDTM
> tristate "Linux Kernel Dump Test Tool Module"
> depends on DEBUG_KERNEL

--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2007-10-05 12:24:41

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC 6/9] RCU priority boosting for preemptible RCU




On Fri, 5 Oct 2007, Gautham R Shenoy wrote:

> On Mon, Sep 10, 2007 at 11:39:01AM -0700, Paul E. McKenney wrote:
>
> [snip]
>
> > +
> > +/*
> > + * Return the list from which to boost target tasks.
> > + * May only be invoked by the booster task, so guaranteed to
> > + * already be initialized. Use rcu_boost_dat element least recently
> > + * the destination for task blocking in RCU read-side critical sections.
> > + */
> > +static inline struct rcu_boost_dat *rcu_rbd_boosting(int cpu)
> > +{
> > + int idx = (rcu_boost_idx + 1) & (RCU_BOOST_ELEMENTS - 1);
>
> Why is this masking required? When we increment
> the rcu_boost_idx in rcu_booster, we do perform a modulo operation
> to ensure that it wraps around RCU_BOOST_ELEMENTS.

Because we are not masking rcu_boost_idx, we are masking
(rcu_boost_idx + 1), which may exceed the bounds of
RCU_BOOST_ELEMENTS.
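
Concretely, assuming RCU_BOOST_ELEMENTS is 4 (a power of two, which is what
makes the mask work):

	#define RCU_BOOST_ELEMENTS 4	/* assumed value; must be a power of two */

	int rcu_boost_idx = 3;		/* currently at the last element */
	int idx = (rcu_boost_idx + 1) & (RCU_BOOST_ELEMENTS - 1);
					/* (3 + 1) & 3 == 0: wraps back to the start */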

-- Steve

by Gautham R Shenoy

Subject: Re: [PATCH RFC 6/9] RCU priority boosting for preemptible RCU

On Fri, Oct 05, 2007 at 08:24:21AM -0400, Steven Rostedt wrote:
>
>
>
> On Fri, 5 Oct 2007, Gautham R Shenoy wrote:
>
> > On Mon, Sep 10, 2007 at 11:39:01AM -0700, Paul E. McKenney wrote:
> >
> > [snip]
> >
> > > +
> > > +/*
> > > + * Return the list from which to boost target tasks.
> > > + * May only be invoked by the booster task, so guaranteed to
> > > + * already be initialized. Use rcu_boost_dat element least recently
> > > + * the destination for task blocking in RCU read-side critical sections.
> > > + */
> > > +static inline struct rcu_boost_dat *rcu_rbd_boosting(int cpu)
> > > +{
> > > + int idx = (rcu_boost_idx + 1) & (RCU_BOOST_ELEMENTS - 1);
> >
> > Why is this masking required? When we increment
> > the rcu_boost_idx in rcu_booster, we do perform a modulo operation
> > to ensure that it wraps around RCU_BOOST_ELEMENTS.
>
> Because we are not masking rcu_boost_idx, we are masking
> (rcu_boost_idx + 1), which may exceed the bounds of
> RCU_BOOST_ELEMENTS.

Thanks!

But I'm still trying to understand why the (increment + masking)
is required at all.

The thread (producer) that requires boosting is added to the element
with index rcu_boost_idx.

The booster thread (consumer) increments rcu_boost_idx to
(rcu_boost_idx + 1) % RCU_BOOST_ELEMENTS before it fetches the least
recently used rcu_boost_dat element and boosts the eligible tasks queued
in that element.

So, can't we just return per_cpu(rcu_boost_dat, cpu)[rcu_boost_idx] from
rcu_rbd_boosting(cpu)? Isn't that already the least recently used
element?

>
> -- Steve

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2007-10-05 14:07:43

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC 6/9] RCU priority boosting for preemptible RCU

On Fri, Oct 05, 2007 at 06:51:14PM +0530, Gautham R Shenoy wrote:
> On Fri, Oct 05, 2007 at 08:24:21AM -0400, Steven Rostedt wrote:
> > On Fri, 5 Oct 2007, Gautham R Shenoy wrote:
> > > On Mon, Sep 10, 2007 at 11:39:01AM -0700, Paul E. McKenney wrote:
> > >
> > > [snip]
> > >
> > > > +
> > > > +/*
> > > > + * Return the list from which to boost target tasks.
> > > > + * May only be invoked by the booster task, so guaranteed to
> > > > + * already be initialized. Use rcu_boost_dat element least recently
> > > > + * the destination for task blocking in RCU read-side critical sections.
> > > > + */
> > > > +static inline struct rcu_boost_dat *rcu_rbd_boosting(int cpu)
> > > > +{
> > > > + int idx = (rcu_boost_idx + 1) & (RCU_BOOST_ELEMENTS - 1);
> > >
> > > Why is this masking required? When we increment
> > > the rcu_boost_idx in rcu_booster, we do perform a modulo operation
> > > to ensure that it wraps around RCU_BOOST_ELEMENTS.
> >
> > Because we are not masking rcu_boost_idx, we are masking
> > (rcu_boost_idx + 1), which may exceed the bounds of
> > RCU_BOOST_ELEMENTS.
>
> Thanks!
>
> But I'm still trying to understand why the (increment + masking)
> is required at all.
>
> The thread (producer) that requires boosting is added to the element
> with index rcu_boost_idx.
>
> The booster thread (consumer) increments rcu_boost_idx to
> (rcu_boost_idx + 1) % RCU_BOOST_ELEMENTS before it fetches the least
> recently used rcu_boost_dat element and boosts the eligible tasks queued
> in that element.
>
> So, can't we just return per_cpu(rcu_boost_dat, cpu)[rcu_boost_idx] from
> rcu_rbd_boosting(cpu)? Isn't that already the least recently used
> element?

Good catch -- we need to advance the index -after- boosting, so that
new sleeping tasks are not immediately dropped on the to-be-boosted
list. Will fix!

(Non-fatal -- but means that the algorithm is effectively only using
three elements of the four-element array, so does need to be fixed.)
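
One plausible shape for such a fix, purely as a sketch (an assumption about
the follow-up, not the actual change): boost the element that has aged the
longest first, and only then rotate the index, so tasks blocking from that
point on land on the just-drained element and age through the whole ring
before the booster reaches them again:

	do {
		int oldest = (rcu_boost_idx + 1) % RCU_BOOST_ELEMENTS;

		/* Boost the per-CPU lists that have aged the longest... */
		for_each_possible_cpu(cpu)
			rcu_boost_one_reader_list(&per_cpu(rcu_boost_dat, cpu)[oldest]);

		/* ...and only now advance the index at which newly
		   blocked tasks are queued. */
		rcu_boost_idx = (rcu_boost_idx + 1) % RCU_BOOST_ELEMENTS;

		schedule_timeout_uninterruptible(HZ);
	} while (!kthread_should_stop());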

Thanx, Paul