Date: Thu, 26 Feb 2015 19:06:33 +0000 (UTC)
From: Mathieu Desnoyers
To: paulmck@linux.vnet.ibm.com
Cc: Peter Zijlstra, Andi Kleen, Andi Kleen, x86@kernel.org,
    linux-kernel@vger.kernel.org, oleg@redhat.com, rusty@rustcorp.com.au,
    mingo@kernel.org
Message-ID: <996087648.185312.1424977593755.JavaMail.zimbra@efficios.com>
In-Reply-To: <20150226182817.GY15405@linux.vnet.ibm.com>
References: <1424482737-958-1-git-send-email-andi@firstfloor.org>
 <20150223170436.GC5029@twins.programming.kicks-ass.net>
 <20150223174340.GD27767@tassilo.jf.intel.com>
 <20150226114309.GR21418@twins.programming.kicks-ass.net>
 <2127583772.183982.1424966563927.JavaMail.zimbra@efficios.com>
 <20150226164356.GU21418@twins.programming.kicks-ass.net>
 <20150226182817.GY15405@linux.vnet.ibm.com>
Subject: Re: [RFC][PATCH] module: Optimize __module_address() using a latched RB-tree

----- Original Message -----
> From: "Paul E. McKenney"
> To: "Peter Zijlstra"
> Cc: "Mathieu Desnoyers", "Andi Kleen", "Andi Kleen", x86@kernel.org,
>     linux-kernel@vger.kernel.org, oleg@redhat.com, rusty@rustcorp.com.au,
>     mingo@kernel.org
> Sent: Thursday, February 26, 2015 1:28:17 PM
> Subject: Re: [RFC][PATCH] module: Optimize __module_address() using a latched RB-tree
>
> On Thu, Feb 26, 2015 at 05:43:56PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 26, 2015 at 04:02:43PM +0000, Mathieu Desnoyers wrote:
> > > Perhaps you could use mod_value() below, and introduce a
> > > "mod_size()" too. This would keep the init vs core selection
> > > out of the traversal code.
> >
> > Indeed!
> >
> > > Is it customary to define static variables in the
> > > middle of a C file ?
> >
> > Dunno, it seemed like as good a place as any.
> >
> > > The rest looks good, especially for use of the latch.
> > > I'd be tempted to turn "0, 1, 2, 3" into an enum though,
> > > so we can follow in the code what each of those array
> > > entries really means.
> >
> > Agreed.
> >
> > ---
> > Subject: module: Optimize __module_address() using a latched RB-tree
> > From: Peter Zijlstra
> > Date: Thu Feb 26 10:57:34 CET 2015
> >
> > Currently __module_address() is using a linear search through all
> > modules in order to find the module corresponding to the provided
> > address. With a lot of modules this can take a lot of time.
> >
> > One of the users of this is __kernel_text_address() which is employed
> > in many stack unwinders; which in turn are used by perf-callchain and
> > ftrace (possibly from NMI context).
> >
> > So by optimizing __module_address() we optimize many stack unwinders
> > which are used by both perf and tracing in performance sensitive code.
> >
> > Cc: Oleg Nesterov
> > Cc: "Paul E. McKenney"
McKenney" > > Cc: Rusty Russell > > Cc: Steven Rostedt > > Cc: Mathieu Desnoyers > > Signed-off-by: Peter Zijlstra (Intel) > > Color me confused, both by the existing code and the modifications. > > It appears that you are using seqlock to force readers to retry when > a concurrent update occurs, but I don't see what is ensuring that the > readers see good data when racing with an insertion or a deletion. Yes, > the reader will be retried, courtesy of the seqlock, but that won't help > if the reader segfaults before it gets to the read_seqcount_retry(). Good point! This scheme works when readers have no "side effect" due to reading a data structure while it is changing. Dereferencing a pointer is indeed a side-effect which could lead to a segfault. > > Questions below. > > > --- > > include/linux/module.h | 19 +++- > > kernel/module.c | 212 > > +++++++++++++++++++++++++++++++++++++++++++++++-- > > 2 files changed, 224 insertions(+), 7 deletions(-) > > > > --- a/include/linux/module.h > > +++ b/include/linux/module.h > > @@ -17,6 +17,7 @@ > > #include > > #include > > #include > > +#include > > > > #include > > #include > > @@ -210,6 +211,11 @@ enum module_state { > > MODULE_STATE_UNFORMED, /* Still setting it up. */ > > }; > > > > +struct module_node { > > + struct module *mod; > > + struct rb_node node; > > +}; > > + > > struct module { > > enum module_state state; > > > > @@ -269,8 +275,15 @@ struct module { > > /* Startup function. */ > > int (*init)(void); > > > > - /* If this is non-NULL, vfree after init() returns */ > > - void *module_init; > > + /* > > + * If this is non-NULL, vfree after init() returns > > + * > > + * cacheline align here, such that: > > + * module_init, module_core, init_size, core_size and > > + * tree_node[0] > > + * are on the same cacheline. > > + */ > > + void *module_init ____cacheline_aligned; > > > > /* Here is the actual code + data, vfree'd on unload. */ > > void *module_core; > > @@ -281,6 +294,8 @@ struct module { > > /* The size of the executable code in each section. */ > > unsigned int init_text_size, core_text_size; > > > > + struct module_node tree_node[4]; > > + > > /* Size of RO sections of the module (text+rodata) */ > > unsigned int init_ro_size, core_ro_size; > > > > --- a/kernel/module.c > > +++ b/kernel/module.c > > @@ -102,6 +102,204 @@ > > DEFINE_MUTEX(module_mutex); > > EXPORT_SYMBOL_GPL(module_mutex); > > static LIST_HEAD(modules); > > + > > + > > +/* > > + * Use a latched RB-tree for __module_address() > > + * > > + * The latch concept is a multi-value concurrency data structure which > > allows > > + * unserialized access and guarantees at least one version is stable. > > + * > > + * It is employed here to optimize __module_address(), which needs to find > > the > > + * module (if any) associated with an address. Such questions are best > > answered > > + * using a binary search tree. > > + * > > + * Since modules use non-overlapping memory ranges we can use a regular > > RB-tree > > + * (as opposed to the interval tree). However, BSTs cannot be iterated > > while > > + * modified. > > + * > > + * To overcome this we use the latched RB-tree, it basically consists of > > two > > + * RB-trees which are modified in order, ensuring one is always > > consistent. > > + * > > + * Things are somewhat more complicated because there are two ranges per > > + * module, but other than that its pretty straight forward. 
> > + *
> > + */
> > +
> > +enum {
> > +        latch0_core = 0,
> > +        latch1_core = 1,
> > +        latch0_init = 2,
> > +        latch1_init = 3,
> > +};
> > +
> > +enum {
> > +        latch_bit = 0x01,
> > +        init_bit  = 0x02,
> > +};
> > +
> > +struct latch_tree_root {
> > +        seqcount_t      seq;
> > +        struct rb_root  tree[2];
> > +};
> > +
> > +static unsigned long mod_value(struct module *mod, int idx)
> > +{
> > +        if (idx & init_bit)
> > +                return (unsigned long)mod->module_init;
> > +        else
> > +                return (unsigned long)mod->module_core;
> > +}
> > +
> > +static unsigned long mod_size(struct module *mod, int idx)
> > +{
> > +        if (idx & init_bit)
> > +                return mod->init_size;
> > +        else
> > +                return mod->core_size;
> > +}
> > +
> > +static struct module *mod_entry(struct rb_node *n)
> > +{
> > +        struct module_node *mn = container_of(n, struct module_node, node);
> > +        return mn->mod;
> > +}
> > +
> > +static int mod_node_idx(struct module *m, struct rb_node *n)
> > +{
> > +        struct module_node *mn = container_of(n, struct module_node, node);
> > +        int idx = mn - m->tree_node;
> > +
> > +        BUG_ON(mn->mod != m);
> > +        BUG_ON((unsigned)idx > 3);
> > +
> > +        return idx;
> > +}
> > +
> > +static void __tree_insert(struct latch_tree_root *mod_tree, struct module *mod, int idx)
> > +{
> > +        struct rb_root *root = &mod_tree->tree[idx & latch_bit];
> > +        struct module_node *mn = &mod->tree_node[idx];
> > +        struct rb_node **link = &root->rb_node;
> > +        struct rb_node *parent = NULL;
> > +        unsigned long mod_val, m_val;
> > +        struct module *m;
> > +        int i;
> > +
> > +        mn->mod = mod;
> > +        mod_val = mod_value(mod, idx);
> > +
> > +        while (*link) {
> > +                parent = *link;
> > +                m = mod_entry(parent);
> > +                i = mod_node_idx(m, parent);
> > +                m_val = mod_value(m, i);
> > +
> > +                if (mod_val < m_val)
> > +                        link = &parent->rb_left;
> > +                else
> > +                        link = &parent->rb_right;
> > +        }
> > +
> > +        rb_link_node(&mn->node, parent, link);
>
> This makes the new module visible to readers, if I understand correctly.
> There needs to be a memory barrier between initialization and this call
> to rb_link_node(), otherwise, both the CPU (well, for Alpha) and the
> compiler (for everyone) can reorder, which could result in some hapless
> reader seeing pre-initialized data.
>
> Or did I miss the memory barrier?

Given that readers (which will retry eventually) can indeed try to
concurrently access the object, those would be needed.

>
> > +        rb_insert_color(&mn->node, root);
>
> This -might- be OK -- the rotations do not make new nodes visible,
> instead simply rearranging nodes that are already visible.
>
> For both rb_link_node() and rb_insert_color(), given that we are updating
> pointers while readers are traversing them, READ_ONCE() and WRITE_ONCE()
> would be good. Yes, the compiler -probably- doesn't mess you up, but it
> would be completely within its rights to do so. :-/
>
> Alpha would like either rcu_dereference() or lockless_dereference()
> instead of READ_ONCE(), of course!

Yes, probably better to err on the safe side here. We have to take into
account that a reader can observe the data structure while being modified,
but if this occurs, it will retry (but should not segfault nor loop
endlessly).
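As an illustration of the ordering being discussed, here is a minimal
sketch (not from the patch; the list layout and the node_publish() /
node_lookup() helpers are made up for illustration) that publishes a
fully initialized node with release semantics and traverses it with
rcu_dereference():

#include <linux/rcupdate.h>

struct node {
        unsigned long val;
        struct node *next;
};

/* Writer: initialize the node first, then publish it with release semantics. */
static void node_publish(struct node **head, struct node *n, unsigned long val)
{
        n->val = val;
        n->next = *head;
        /* Orders the stores above before the node becomes reachable. */
        rcu_assign_pointer(*head, n);
}

/*
 * Reader: must run inside an RCU read-side critical section.
 * rcu_dereference() pairs with rcu_assign_pointer() above, so even on
 * Alpha the node's fields are seen initialized once the node is reachable.
 */
static struct node *node_lookup(struct node **head, unsigned long val)
{
        struct node *n;

        for (n = rcu_dereference(*head); n; n = rcu_dereference(n->next)) {
                if (n->val == val)
                        return n;
        }
        return NULL;
}

Applied to the RB-tree this would mean publishing the node in
rb_link_node() with a release store and loading the child pointers in
__tree_find() with rcu_dereference() (or lockless_dereference()).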
> > +}
> > +
> > +static void __tree_remove(struct latch_tree_root *mod_tree, struct module *mod, int idx)
> > +{
> > +        struct rb_root *root = &mod_tree->tree[idx & latch_bit];
> > +        struct module_node *mn = &mod->tree_node[idx];
> > +
> > +        rb_erase(&mn->node, root);
>
> This is just removing and not freeing, so OK from a read-side viewpoint.
> WRITE_ONCE() would be good.

I'd be worried about the eventual free after this function. There is a
need to use RCU here to delay reclaim.

>
> > +}
> > +
> > +/*
> > + * struct module is arranged such that:
> > + *
> > + *   module_init, module_core, init_size, core_size,
> > + *   init_text_size, core_text_size and tree_node[0]
> > + *
> > + * are on the same cacheline, therefore if the below iteration is
> > + * on latch0 and all module init has completed, we'll only hit
> > + * tree_node[0] and every intermediate level will hit only a single
> > + * cacheline.
> > + *
> > + * Furthermore, by ensuring init_text_size and core_text_size are
> > + * also in this same cacheline we've made sure is_module_text_address()
> > + * will also not require additional lines.
> > + */
> > +static struct module *__tree_find(struct rb_root *r, unsigned long addr)
> > +{
> > +        struct rb_node *n = r->rb_node;
> > +        unsigned long m_val, m_size;
> > +
> > +        while (n) {
> > +                struct module *m = mod_entry(n);
> > +                int idx = mod_node_idx(m, n);
> > +
> > +                m_val = mod_value(m, idx);
> > +                m_size = mod_size(m, idx);
> > +
> > +                if (addr < m_val)
> > +                        n = n->rb_left;
>
> We need a rcu_dereference() or lockless_dereference() here, I think.
>
> > +                else if (addr >= m_val + m_size)
> > +                        n = n->rb_right;
>
> And here.
>
> > +                else
> > +                        return m;
> > +        }
> > +
> > +        return NULL;
>
> I did finally find the RCU read-side critical section, supplied by
> is_module_text_address()'s preempt_disable() and preempt_enable().
> The matching synchronize_sched() is supplied by do_init_module() and
> load_module(). I expected a synchronize_sched() in free_module() as well,
> but I only see a synchronize_rcu(). Is the required synchronize_sched()
> on this code path hiding somewhere?
>
> Or did I miss an rcu_read_lock()/rcu_read_unlock() pair somewhere?

I did not see any, nor any documentation that points to those.

We might want to add documentation to raw_write_seqcount_latch() spelling
out the requirement that concurrent readers must not trigger side effects.
It's not as simple as the timekeeper use case, because we have pointers
here.

Good catch!
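For reference, the generic usage such documentation would describe might
look like the sketch below (the latch_struct layout and the scalar payload
are hypothetical, not the module code): readers can race with the writer
and observe the copy currently being modified, so everything they do
before read_seqcount_retry() must be free of side effects.

#include <linux/seqlock.h>

struct data_struct {
        unsigned long val;              /* stand-in payload; an rb_root in the patch */
};

struct latch_struct {
        seqcount_t seq;
        struct data_struct data[2];     /* two copies, at least one always stable */
};

/* Writer: serialized by the caller (e.g. module_mutex in the patch). */
static void latch_modify(struct latch_struct *latch, unsigned long val)
{
        raw_write_seqcount_latch(&latch->seq);
        latch->data[0].val = val;       /* readers now use data[1] */
        raw_write_seqcount_latch(&latch->seq);
        latch->data[1].val = val;       /* readers now use the updated data[0] */
}

/*
 * Reader: may observe the copy being modified; it must not fault or
 * loop forever before the retry check below rejects the unstable read.
 */
static unsigned long latch_query(struct latch_struct *latch)
{
        unsigned long val;
        unsigned int seq;

        do {
                seq = raw_read_seqcount(&latch->seq);
                val = latch->data[seq & 0x01].val;
        } while (read_seqcount_retry(&latch->seq, seq));

        return val;
}

With a scalar payload, as in timekeeping, an unstable read is simply
discarded; with pointers, as in the module tree, the traversal itself has
to be made safe against half-updated nodes, which is what the discussion
above is about.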
Thanks,

Mathieu

>
>                                                         Thanx, Paul
>
> > +}
> > +
> > +static struct latch_tree_root mod_tree;
> > +
> > +static void mod_tree_insert(struct module *mod)
> > +{
> > +        raw_write_seqcount_latch(&mod_tree.seq);
> > +        __tree_insert(&mod_tree, mod, latch0_core);
> > +        if (mod->init_size)
> > +                __tree_insert(&mod_tree, mod, latch0_init);
> > +        raw_write_seqcount_latch(&mod_tree.seq);
> > +        __tree_insert(&mod_tree, mod, latch1_core);
> > +        if (mod->init_size)
> > +                __tree_insert(&mod_tree, mod, latch1_init);
> > +}
> > +
> > +static void mod_tree_remove_init(struct module *mod)
> > +{
> > +        raw_write_seqcount_latch(&mod_tree.seq);
> > +        if (mod->init_size)
> > +                __tree_remove(&mod_tree, mod, latch0_init);
> > +        raw_write_seqcount_latch(&mod_tree.seq);
> > +        if (mod->init_size)
> > +                __tree_remove(&mod_tree, mod, latch1_init);
> > +}
> > +
> > +static void mod_tree_remove(struct module *mod)
> > +{
> > +        raw_write_seqcount_latch(&mod_tree.seq);
> > +        __tree_remove(&mod_tree, mod, latch0_core);
> > +        if (mod->init_size)
> > +                __tree_remove(&mod_tree, mod, latch0_init);
> > +        raw_write_seqcount_latch(&mod_tree.seq);
> > +        __tree_remove(&mod_tree, mod, latch1_core);
> > +        if (mod->init_size)
> > +                __tree_remove(&mod_tree, mod, latch1_init);
> > +}
> > +
> > +static struct module *mod_tree_find(unsigned long addr)
> > +{
> > +        struct module *m;
> > +        unsigned int seq;
> > +
> > +        do {
> > +                seq = raw_read_seqcount(&mod_tree.seq);
> > +                m = __tree_find(&mod_tree.tree[seq & latch_bit], addr);
> > +        } while (read_seqcount_retry(&mod_tree.seq, seq));
> > +
> > +        return m;
> > +}
> > +
> >  #ifdef CONFIG_KGDB_KDB
> >  struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
> >  #endif /* CONFIG_KGDB_KDB */
> > @@ -1854,6 +2052,7 @@ static void free_module(struct module *m
> >          mutex_lock(&module_mutex);
> >          /* Unlink carefully: kallsyms could be walking list. */
> >          list_del_rcu(&mod->list);
> > +        mod_tree_remove(mod);
> >          /* Remove this module from bug list, this uses list_del_rcu */
> >          module_bug_cleanup(mod);
> >          /* Wait for RCU synchronizing before releasing mod->list and buglist. */
> > @@ -3098,6 +3297,7 @@ static noinline int do_init_module(struc
> >          mod->symtab = mod->core_symtab;
> >          mod->strtab = mod->core_strtab;
> >  #endif
> > +        mod_tree_remove_init(mod);
> >          unset_module_init_ro_nx(mod);
> >          module_arch_freeing_init(mod);
> >          mod->module_init = NULL;
> > @@ -3168,6 +3368,7 @@ static int add_unformed_module(struct mo
> >                  goto out;
> >          }
> >          list_add_rcu(&mod->list, &modules);
> > +        mod_tree_insert(mod);
> >          err = 0;
> >
> >  out:
> > @@ -3367,6 +3568,7 @@ static int load_module(struct load_info
> >          mutex_lock(&module_mutex);
> >          /* Unlink carefully: kallsyms could be walking list. */
> >          list_del_rcu(&mod->list);
> > +        mod_tree_remove(mod);
> >          wake_up_all(&module_wq);
> >          /* Wait for RCU synchronizing before releasing mod->list. */
> >          synchronize_rcu();
> > @@ -3810,13 +4012,13 @@ struct module *__module_address(unsigned
> >          if (addr < module_addr_min || addr > module_addr_max)
> >                  return NULL;
> >
> > -        list_for_each_entry_rcu(mod, &modules, list) {
> > +        mod = mod_tree_find(addr);
> > +        if (mod) {
> > +                BUG_ON(!within_module(addr, mod));
> >                  if (mod->state == MODULE_STATE_UNFORMED)
> > -                        continue;
> > -                if (within_module(addr, mod))
> > -                        return mod;
> > +                        mod = NULL;
> >          }
> > -        return NULL;
> > +        return mod;
> >  }
> >  EXPORT_SYMBOL_GPL(__module_address);
> >
> >
> >

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com