Subject: Re: [PATCH, RFC, tip/core/rcu] scalable classic RCU implementation
From: Josh Triplett
To: paulmck@linux.vnet.ibm.com
Cc: linux-kernel@vger.kernel.org, cl@linux-foundation.org, mingo@elte.hu,
	akpm@linux-foundation.org, manfred@colorfullife.com,
	dipankar@in.ibm.com, schamp@sgi.com, niv@us.ibm.com,
	dvhltc@us.ibm.com, ego@in.ibm.com, laijs@cn.fujitsu.com,
	rostedt@goodmis.org
Date: Fri, 22 Aug 2008 16:29:32 -0700
Message-Id: <1219447772.5197.98.camel@josh-work.beaverton.ibm.com>
In-Reply-To: <20080821234318.GA1754@linux.vnet.ibm.com>

On Thu, 2008-08-21 at 16:43 -0700, Paul E. McKenney wrote:
> Hello!
>
> Experimental, not for inclusion.
>
> Attached is a patch to Classic RCU that applies a hierarchy, greatly
> reducing the contention on the top-level lock for large machines.
> This passes mild rcutorture testing on x86 and ppc64, but is most
> definitely not ready for inclusion.  It is OK for experimental work
> assuming sufficiently brave experimenters.  See also Manfred Spraul's
> patch at http://lkml.org/lkml/2008/8/21/336 (or his earlier work from
> 2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2).
> We will converge onto a common patch in the fullness of time, but
> are currently exploring different regions of the design space.
>
> This patch provides CONFIG_RCU_FANOUT, which controls the bushiness
> of the RCU hierarchy.  Defaults to 32 on 32-bit machines and 64 on
> 64-bit machines.  If CONFIG_NR_CPUS is less than CONFIG_RCU_FANOUT,
> there is no hierarchy.  By default, the RCU initialization code will
> adjust CONFIG_RCU_FANOUT to balance the hierarchy, so strongly NUMA
> architectures may choose to set CONFIG_RCU_FANOUT_EXACT to disable
> this balancing, allowing the hierarchy to be exactly aligned to the
> underlying hardware.  Up to two levels of hierarchy are permitted
> (in addition to the root node), allowing up to 16,384 CPUs on 32-bit
> systems and up to 262,144 CPUs on 64-bit systems.  I just know that I
> am going to regret saying this, but this seems more than sufficient
> for the foreseeable future.  (Some architectures might wish to set
> CONFIG_RCU_FANOUT=4, which would limit such architectures to 64 CPUs.
> If this becomes a real problem, additional levels can be added, but I
> doubt that it will make a significant difference on real hardware.)
>
> In the common case, a given CPU will manipulate its private rcu_data
> structure and the rcu_node structure that it shares with its immediate
> neighbors.  This can reduce both lock and memory contention by multiple
> orders of magnitude, which should eliminate the need for the strange
> manipulations that are reported to be required when running Linux on
> very large systems.
>
> Some shortcomings:
>
> o	The interface to dynticks is clumsy at best.  Improvements
>	on the way.
>
> o	CPU onlining and offlining is probably broken.  Will be tested.
>
> o	The check-CPU-stalls code is busted.  Will be fixed.
>
> o	There are probably hangs, rcutorture failures, &c.
>
> o	There is not yet a human-readable design document.  Will be fixed.
>
> o	The largest machine I can get my hands on at the moment only
>	has 8 CPUs, which really doesn't stress this algorithm much.
>
> If you want to use this against a Linus kernel, the following will work:
>
> Start with 2.6.27-rc3.
>
> Apply http://www.rdrop.com/users/paulmck/patches/paulmck-rcu.2008.08.20a.patch
> which catches you up to a recent linux-2.6-tip tip/core/rcu commit.
>
> Apply http://www.rdrop.com/users/paulmck/patches/2.6.27-rc3-hierRCU-6.patch
> which gets you the current hierarchical RCU implementation.
>
> Thoughts?

Looks great.  A few comments below.

> Signed-off-by: Paul E. McKenney
> ---
>
>  include/linux/rcuclassic.h |  154 ++++-
>  kernel/Kconfig.preempt     |   31 +
>  kernel/Makefile            |    3
>  kernel/rcuclassic.c        | 1095 +++++++++++++++++++++++++++++----------
>  kernel/rcuclassic_trace.c  |  219 +++++++++
>  5 files changed, 1076 insertions(+), 426 deletions(-)
>
> diff --git a/include/linux/rcuclassic.h b/include/linux/rcuclassic.h
> index 1658995..97e646a 100644
> --- a/include/linux/rcuclassic.h
> +++ b/include/linux/rcuclassic.h
> @@ -18,6 +18,7 @@
>  * Copyright IBM Corporation, 2001

", 2008", here and elsewhere?

>  * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
> @@ -26,8 +27,10 @@
>  * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
>  *
>  * For detailed explanation of Read-Copy Update mechanism see -
> - * Documentation/RCU
> - *
> + * Documentation/RCU
> + * http://lwn.net/Articles/262464/ (What is RCU, Fundamentally?)
> + * http://lwn.net/Articles/263130/ (What is RCU's Usage?)
> + * http://lwn.net/Articles/264090/ (What is RCU's API? + references)
>  */

Why put these references here rather than in Documentation/RCU?  It
seems easier to keep documentation up to date in one place.  If you
think these represent a good "getting started" set of documents, how
about a Documentation/RCU/ReadTheseFirst with links to them, or how
about linking to them from whatisRCU.txt?
> #ifndef __LINUX_RCUCLASSIC_H
> @@ -40,69 +43,136 @@
>  #include
>  #include
>
> +/*
> + * Define the shape of the rcu_node hierarchy based on NR_CPUS and
> + * CONFIG_RCU_FANOUT.
> + */
>
> -/* Global control variables for rcupdate callback mechanism. */
> -struct rcu_ctrlblk {
> -	long	cur;		/* Current batch number. */
> -	long	completed;	/* Number of the last completed batch */
> -	long	pending;	/* Number of the last pending batch */
> -#ifdef CONFIG_DEBUG_RCU_STALL
> -	unsigned long gp_check;	/* Time grace period should end, in seconds. */
> -#endif /* #ifdef CONFIG_DEBUG_RCU_STALL */
> -
> -	int	signaled;
> +#define MAX_RCU_LEVELS 3
> +#if NR_CPUS <= CONFIG_RCU_FANOUT
> +#define NUM_RCU_LEVELS 1
> +#define NUM_RCU_LEVEL_1 1
> +#define NUM_RCU_LEVEL_2 NR_CPUS
> +#define NUM_RCU_LEVEL_3 0
> +#define NUM_RCU_LEVEL_4 0
> +#define NUM_RCU_NODES NUM_RCU_LEVEL_1
> +#elif NR_CPUS <= CONFIG_RCU_FANOUT * CONFIG_RCU_FANOUT
> +#define NUM_RCU_LEVELS 2
> +#define NUM_RCU_LEVEL_1 1
> +#define NUM_RCU_LEVEL_2 \
> +	(((NR_CPUS) + (CONFIG_RCU_FANOUT) - 1) / (CONFIG_RCU_FANOUT))
> +#define NUM_RCU_LEVEL_3 NR_CPUS
> +#define NUM_RCU_LEVEL_4 0
> +#define NUM_RCU_NODES \
> +	((NUM_RCU_LEVEL_1) + (NUM_RCU_LEVEL_2))
> +#elif NR_CPUS <= CONFIG_RCU_FANOUT * CONFIG_RCU_FANOUT * CONFIG_RCU_FANOUT
> +#define NUM_RCU_LEVELS 3
> +#define RCU_FANOUT_SQ ((CONFIG_RCU_FANOUT) * (CONFIG_RCU_FANOUT))
> +#define NUM_RCU_LEVEL_1 1
> +#define NUM_RCU_LEVEL_2 \
> +	(((NR_CPUS) + (RCU_FANOUT_SQ) - 1) / (RCU_FANOUT_SQ))
> +#define NUM_RCU_LEVEL_3 \
> +	((NR_CPUS) + (CONFIG_RCU_FANOUT) - 1) / (CONFIG_RCU_FANOUT)
> +#define NUM_RCU_LEVEL_4 NR_CPUS
> +#define NUM_RCU_NODES \
> +	((NUM_RCU_LEVEL_1) + \
> +	 (NUM_RCU_LEVEL_2) + \
> +	 (NUM_RCU_LEVEL_3))
> +#else
> +#error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
> +#endif

This should get replaced by the revised version you followed up with.
> -	spinlock_t	lock	____cacheline_internodealigned_in_smp;
> -	cpumask_t	cpumask; /* CPUs that need to switch in order */
> -				 /*  for current batch to proceed. */
> +/*
> + * Definition for node within the RCU grace-period-detection hierarchy.
> + */
> +struct rcu_node {
> +	spinlock_t lock;
> +	unsigned long qsmask;	/* CPUs or groups that need to switch in */
> +				/*  order for current grace period to proceed.*/
> +	unsigned long qsmaskinit;
> +				/* Per-GP initialization for qsmask. */
> +	int	grplo;		/* lowest-numbered CPU or group here. */
> +	int	grphi;		/* highest-numbered CPU or group here. */
> +	char	grpnum;		/* CPU/group number for next level up. */
> +	char	level;		/* root is at level 0. */

These four fields should use sized types, and preferably unsigned
types.

> +	struct rcu_node *parent;
>  } ____cacheline_internodealigned_in_smp;
>
> -/* Is batch a before batch b ? */
> -static inline int rcu_batch_before(long a, long b)
> -{
> -	return (a - b) < 0;
> -}
> +/*
> + * RCU global state, including node hierarchy.  This hierarchy is
> + * represented in "heap" form in a dense array.  The root (first level)
> + * of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
> + * level in ->node[1] through ->node[m] (->node[1] referenced by ->level[1]),
> + * and the third level in ->node[m+1] and following (->node[m+1] referenced
> + * by ->level[2]).  The number of levels is determined by the number of
> + * CPUs and by CONFIG_RCU_FANOUT.  Small systems will have a "hierarchy"
> + * consisting of a single rcu_node.
> + */
> +struct rcu_state {
> +	struct rcu_node node[NUM_RCU_NODES];	/* Hierarchy. */
> +	struct rcu_node *level[NUM_RCU_LEVELS];	/* Hierarchy levels. */
> +	int levelcnt[MAX_RCU_LEVELS + 1];	/* # nodes in each level. */
> +	int levelspread[NUM_RCU_LEVELS];	/* kids/node in each level. */

These two should use sized types.

> +
> +	/* The following fields are guarded by the root rcu_node's lock. */
> +
> +	char	signaled ____cacheline_internodealigned_in_smp;
> +						/* sent GP-kick IPIs? */

u8 or bool, depending on semantics.  If just a simple flag, how about
bool?

> +	long	gpnum;			/* Current gp number. */
> +	long	completed;		/* # of last completed gp. */
> +	spinlock_t onofflock;		/* exclude on/offline and */
> +					/*  starting new GP. */
> +};
>
> -/* Is batch a after batch b ? */
> -static inline int rcu_batch_after(long a, long b)
> -{
> -	return (a - b) > 0;
> -}
> +#define RCU_DONE_TAIL		0	/* Also RCU_WAIT head. */
> +#define RCU_WAIT_TAIL		1	/* Also RCU_NEXT_READY head. */
> +#define RCU_NEXT_READY_TAIL	2	/* Also RCU_NEXT head. */
> +#define RCU_NEXT_TAIL		3
> +#define RCU_NEXT_SIZE		4
>
>  /* Per-CPU data for Read-Copy UPdate. */
>  struct rcu_data {
> -	/* 1) quiescent state handling : */
> -	long		quiescbatch;	/* Batch # for grace period */
> -	int		passed_quiesc;	/* User-mode/idle loop etc. */
> -	int		qs_pending;	/* core waits for quiesc state */
> +	/* 1) quiescent-state and grace-period handling : */
> +	long		completed;	/* Track rsp->completed gp number */
> +					/*  in order to detect GP end. */
> +	long		gpnum;		/* Highest gp number that this CPU */
> +					/*  is aware of having started. */
> +	int		passed_quiesc;	/* User-mode/idle loop etc. */
> +	int		qs_pending;	/* Core waits for quiesc state. */

Looks like several whitespace changes occurred here; several of these
lines didn't actually change except in whitespace.  The same comment
about sized types applies here, but these fields didn't actually change
in this patch.

> +	struct rcu_node *mynode;	/* This CPU's leaf of hierarchy */
>
>  	/* 2) batch handling */
>  	/*
> -	 * if nxtlist is not NULL, then:
> -	 * batch:
> -	 *	The batch # for the last entry of nxtlist
> -	 * [*nxttail[1], NULL = *nxttail[2]):
> -	 *	Entries that batch # <= batch
> +	 * If nxtlist is not NULL, it is partitioned as follows.
> +	 * Any of the partitions might be empty, in which case the
> +	 * pointer to that partition will be equal to the pointer for
> +	 * the following partition.  When the list is empty, all of
> +	 * the nxttail elements point to nxtlist, which is NULL.
> +	 *
> +	 * [*nxttail[2], NULL = *nxttail[3]):
> +	 *	Entries that might have arrived after current GP ended
> +	 * [*nxttail[1], *nxttail[2]):
> +	 *	Entries known to have arrived before current GP ended
>  	 * [*nxttail[0], *nxttail[1]):
> -	 *	Entries that batch # <= batch - 1
> +	 *	Entries that batch # <= ->completed - 1: waiting for current GP
>  	 * [nxtlist, *nxttail[0]):
> -	 *	Entries that batch # <= batch - 2
> +	 *	Entries that batch # <= ->completed
>  	 *	The grace period for these entries has completed, and
>  	 *	the other grace-period-completed entries may be moved
>  	 *	here temporarily in rcu_process_callbacks().
>  	 */
> -	long		batch;
>  	struct rcu_head *nxtlist;
> -	struct rcu_head **nxttail[3];
> -	long		qlen;		/* # of queued callbacks */
> -	struct rcu_head *donelist;
> -	struct rcu_head **donetail;
> -	long		blimit;		/* Upper limit on a processed batch */
> +	struct rcu_head **nxttail[RCU_NEXT_SIZE];
> +	long		qlen;		/* # of queued callbacks */
> +	long		blimit;		/* Upper limit on a processed batch */

Some whitespace changes again here; several of these lines didn't
change except in whitespace.

>  	int cpu;
>  	struct rcu_head barrier;
>  };
>
> +extern struct rcu_state rcu_state;
>  DECLARE_PER_CPU(struct rcu_data, rcu_data);
> +
> +extern struct rcu_state rcu_bh_state;
>  DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);

Why extern and in the header?  I don't see anything else using them.
> /*
> diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
> index 9fdba03..43062bf 100644
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -68,7 +68,6 @@ config PREEMPT_RCU
>
>  config RCU_TRACE
>  	bool "Enable tracing for RCU - currently stats in debugfs"
> -	depends on PREEMPT_RCU

Might want to document in the commit message that you have tracing
information through RCU_TRACE, and that it applies to non-preemptible
RCU as well now.

>  	select DEBUG_FS
>  	default y
>  	help
> @@ -77,3 +76,33 @@ config RCU_TRACE
>
>  	  Say Y here if you want to enable RCU tracing
>  	  Say N if you are unsure.
> +
> +config RCU_FANOUT
> +	int "Hierarchical RCU fanout value"
> +	range 2 64 if 64BIT
> +	range 2 32 if !64BIT
> +	depends on CLASSIC_RCU
> +	default 64 if 64BIT
> +	default 32 if !64BIT
> +	help
> +	  This option controls the fanout of hierarchical implementations
> +	  of RCU, allowing RCU to work efficiently on machines with
> +	  large numbers of CPUs.  This value must be at least the cube
> +	  root of NR_CPUS, which allows NR_CPUS up to 32,768 for 32-bit
> +	  systems and up to 262,144 for 64-bit systems.
> +
> +	  Select a specific number if testing RCU itself.

...or if attempting to tune for a specific NUMA system.

> +	  Take the default if unsure.
> +
> +config RCU_FANOUT_EXACT
> +	bool "Disable hierarchical RCU auto-balancing"
> +	depends on CLASSIC_RCU
> +	default n
> +	help
> +	  This option forces use of the exact RCU_FANOUT value specified,
> +	  regardless of imbalances in the hierarchy.  This can be useful
> +	  on systems with strong NUMA behavior.
> +
> +	  Without RCU_FANOUT_EXACT, the code will balance the hierarchy.

You might want to give a specific example of a NUMA machine, the
appropriate value to use on that machine, and the result with and
without RCU_FANOUT_EXACT.

> +	  Say n if unsure.
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 4e1d7df..d838fbd 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -75,6 +75,9 @@ obj-$(CONFIG_SECCOMP) += seccomp.o
>  obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
>  obj-$(CONFIG_CLASSIC_RCU) += rcuclassic.o
>  obj-$(CONFIG_PREEMPT_RCU) += rcupreempt.o
> +ifeq ($(CONFIG_CLASSIC_RCU),y)
> +obj-$(CONFIG_RCU_TRACE) += rcuclassic_trace.o
> +endif
>  ifeq ($(CONFIG_PREEMPT_RCU),y)
>  obj-$(CONFIG_RCU_TRACE) += rcupreempt_trace.o
>  endif

It might actually make sense here to do this instead:

ifeq ($(CONFIG_RCU_TRACE),y)
obj-$(CONFIG_CLASSIC_RCU) += rcuclassic_trace.o
obj-$(CONFIG_PREEMPT_RCU) += rcupreempt_trace.o
endif

> diff --git a/kernel/rcuclassic.c b/kernel/rcuclassic.c
> index 01e761a..5584b22 100644
> --- a/kernel/rcuclassic.c
> +++ b/kernel/rcuclassic.c
> @@ -27,7 +28,10 @@
>  * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
>  *
>  * For detailed explanation of Read-Copy Update mechanism see -
> - * Documentation/RCU
> + * Documentation/RCU
> + * http://lwn.net/Articles/262464/ (What is RCU, Fundamentally?)
> + * http://lwn.net/Articles/263130/ (What is RCU's Usage?)
> + * http://lwn.net/Articles/264090/ (What is RCU's API? + references)

Same comment as before; maintaining these in a single place seems
easier.

> +struct rcu_state rcu_state = RCU_STATE_INITIALIZER(rcu_state);
>  DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
> +
> +struct rcu_state rcu_bh_state = RCU_STATE_INITIALIZER(rcu_bh_state);
>  DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };

How about making these state structures static, along with removing the
extern in the header?

> -static int blimit = 10;
> -static int qhimark = 10000;
> -static int qlowmark = 100;
> +static int blimit = 10;		/* Maximum callbacks per softirq. */
> +static int qhimark = 10000;	/* If this many pending, ignore blimit. */
> +static int qlowmark = 100;	/* Once only this many pending, use blimit. */

Indentation mismatch on the comments?

> #ifdef CONFIG_SMP
> -static void force_quiescent_state(struct rcu_data *rdp,
> -			struct rcu_ctrlblk *rcp)
> +static void force_quiescent_state(struct rcu_state *rsp)
>  {
>  	int cpu;
> -	cpumask_t cpumask;
>  	unsigned long flags;
>
>  	set_need_resched();
> -	spin_lock_irqsave(&rcp->lock, flags);
> -	if (unlikely(!rcp->signaled)) {
> -		rcp->signaled = 1;
> +	if (!spin_trylock_irqsave(&rsp->onofflock, flags))
> +		return;

This seems to make force_quiescent_state rather less forceful.

> +/*
> + * Does the current CPU require a yet-as-unscheduled grace period?
> + */
> +static inline int
> +cpu_needs_another_gp(struct rcu_state *rsp, struct rcu_data *rdp)
> +{
> +	return *rdp->nxttail[RCU_DONE_TAIL] &&
> +	       ACCESS_ONCE(rsp->completed) == ACCESS_ONCE(rsp->gpnum);
> +}

ACCESS_ONCE, like memory barriers, benefits from an accompanying
explanation.

> -#else
> +#else /* #ifdef CONFIG_HOTPLUG_CPU */
>
> -static void rcu_offline_cpu(int cpu)
> +static inline void
> +rcu_offline_cpu(int cpu)
>  {
>  }

No need to explicitly say "inline"; GCC should do the right thing here.
Same comment applies a couple of other places in your patch.

> @@ -658,14 +806,19 @@ int rcu_needs_cpu(int cpu)
>  	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
>  	struct rcu_data *rdp_bh = &per_cpu(rcu_bh_data, cpu);
>
> -	return !!rdp->nxtlist || !!rdp_bh->nxtlist || rcu_pending(cpu);
> +	return !!*rdp->nxttail[RCU_DONE_TAIL] ||
> +	       !!*rdp_bh->nxttail[RCU_DONE_TAIL] ||
> +	       rcu_pending(cpu);

!! seems unnecessary here.

> +void call_rcu_bh(struct rcu_head *head,
> +		 void (*func)(struct rcu_head *rcu))
> +{
> +	unsigned long flags;
> +
> +	head->func = func;
> +	head->next = NULL;
> +	local_irq_save(flags);
> +	__call_rcu(head, &rcu_bh_state, &__get_cpu_var(rcu_bh_data));
> +	local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_bh);

This comment applies to the original code, but:

You only call __call_rcu twice, in call_rcu and call_rcu_bh.
Both times, you set head first, then wrap the call with local_irq_save.
How about moving both into __call_rcu, making call_rcu and call_rcu_bh
one-liners?

> --- /dev/null
> +++ b/kernel/rcuclassic_trace.c
> +static struct mutex rcuclassic_trace_mutex;

static DEFINE_MUTEX(rcuclassic_trace_mutex);

Then you don't need mutex_init later in your init function.

> +static char *rcuclassic_trace_buf;
> +#define RCUPREEMPT_TRACE_BUF_SIZE 4096

Did you perhaps want PAGE_SIZE?

- Josh Triplett