Date: Wed, 18 Nov 2009 20:57:56 -0500
From: Mathieu Desnoyers
To: "Paul E. McKenney"
Cc: Jason Baron, mingo@elte.hu, mhiramat@redhat.com,
	linux-kernel@vger.kernel.org, hpa@zytor.com, tglx@linutronix.de,
	rostedt@goodmis.org, andi@firstfloor.org, roland@redhat.com,
	rth@redhat.com
Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine
Message-ID: <20091119015756.GA10668@Krystal>
References: <37e397b27509c378f93b9a30f1158791d1b99be7.1258580048.git.jbaron@redhat.com>
	<20091119002826.GB28962@Krystal>
	<20091119005829.GF6683@linux.vnet.ibm.com>
In-Reply-To: <20091119005829.GF6683@linux.vnet.ibm.com>

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Nov 18, 2009 at 07:28:26PM -0500, Mathieu Desnoyers wrote:
> > * Jason Baron (jbaron@redhat.com) wrote:
> > > Add text_poke_fixup(), which takes a fixup address to which a
> > > processor jumps if it hits the address being modified while the
> > > modification is in progress. text_poke_fixup() does the following
> > > steps for this purpose:
> > >
> > >  1. Set up an int3 handler for the fixup.
> > >  2. Put a breakpoint (int3) on the first byte of the region being
> > >     modified, and synchronize code on all CPUs.
> > >  3. Modify the other bytes of the region, and synchronize code on
> > >     all CPUs.
> > >  4. Modify the first byte of the region, and synchronize code on
> > >     all CPUs.
> > >  5. Clear the int3 handler.
> > >
> >
> > Hi Masami,
> >
> > I like the approach and the API is clean. I have interspersed
> > comments below.
> >
> > Ingo: I raise a question about text_mutex vs on_each_cpu hangs I
> > experienced recently in the message below. Might be worth having a
> > look; I suspect this might have caused the hangs Paul McKenney had
> > with his past TREE RCU callback migration. I think he did take a
> > mutex in the cpu hotplug callbacks and might have used IPIs within
> > that same lock.
>
> Hello, Mathieu,
>
> By "mutex", do you mean "mutex_lock()"?  If so, I don't do that from
> within the CPU hotplug notifiers.  I do spin_lock_irqsave().
>
> People have been known to invoke synchronize_rcu() from CPU hotplug
> notifiers, however -- and this does block.  Is that what you are
> getting at?

Hi Paul,

What I had in mind is more like an N-way deadlock involving the cpu
hotplug mutex, on_each_cpu, and possibly stop_machine interaction.
However, I did not push the lockup analysis far enough to know for
sure, but I think it's good to let others know. I am not sure if
blocking primitives could be affected by this.
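To make it concrete, the interleaving I suspect looks like this (a
sketch of my reconstruction only -- I have not confirmed this is the
actual lockup):

	/*
	 * CPU A (code patching)          CPU B (cpu hotplug)
	 *
	 * mutex_lock(&text_mutex);       cpu hotplug lock held
	 *                                mutex_lock(&text_mutex); <- blocks on A
	 * on_each_cpu(..., NULL, 1);
	 *   -> waits for an IPI ack from a CPU that hotplug is
	 *      holding off (e.g. within stop_machine()), so A never
	 *      releases text_mutex: deadlock.
	 */

The temporary fix I mention further down amounts to pinning the CPU set
before taking text_mutex, e.g.:

	get_online_cpus();
	mutex_lock(&text_mutex);
	on_each_cpu(__local_sync_core, NULL, 1);  /* or any IPI-based sync */
	mutex_unlock(&text_mutex);
	put_online_cpus();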
>
> I do invoke smp_send_reschedule() with irqs disabled, which did arouse
> my suspicions of late.  But this seems to bypass the
> smp_call_function() code that is most definitely illegal to invoke
> with irqs disabled, so no smoking gun.  All that aside, if invoking
> smp_send_reschedule() with irqs disabled is in any way a bad idea,
> please let me know so I can rearrange the code appropriately.
>
> RCU is currently running reasonably well with the set of patches I
> have submitted.  The kernel is now withstanding a level of punishment
> that would have reduced 2.6.28 (for example) to a steaming pile of
> bits, with or without the help of rcutorture.  And I am now hitting
> the occasional non-RCU bug, so I am starting to feel like RCU is
> approaching stability.  Approaching, but not there yet -- a few
> suspicious areas remain.  Time to crank the testing up another notch
> or two.  ;-)

Nice :) I've been going through the same "stabilization" phase in the
past weeks for LTTng. It's good to see things stabilize even under
heavy cpu hotplug stress-tests.

Thanks,

Mathieu

> 							Thanx, Paul
>
> > > Thus, if some other processor executes the address being modified
> > > during steps 2 to 4, it jumps to the fixup code.
> > >
> > > This still has many limitations for modifying multiple
> > > instructions at once. However, it is enough for patching a 5-byte
> > > nop with a jump, because:
> > >  - The replaced instruction is just one instruction, which is
> > >    executed atomically.
> > >  - The replacing instruction is a jump, so we can set the fixup
> > >    address to wherever the jump goes.
> > >
> > > Signed-off-by: Masami Hiramatsu
> > > ---
> > >  arch/x86/include/asm/alternative.h |   12 ++++
> > >  arch/x86/include/asm/kprobes.h     |    1 +
> > >  arch/x86/kernel/alternative.c      |  120 ++++++++++++++++++++++++++++++++++++
> > >  kernel/kprobes.c                   |    2 +-
> > >  4 files changed, 134 insertions(+), 1 deletions(-)
> > >
> >
> > [snip snip]
> >
> > > index de7353c..af47f12 100644
> > > --- a/arch/x86/kernel/alternative.c
> > > +++ b/arch/x86/kernel/alternative.c
> > > @@ -4,6 +4,7 @@
> > >  #include
> > >  #include
> > >  #include
> > > +#include
> > >  #include
> > >  #include
> > >  #include
> > > @@ -552,3 +553,122 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
> > >  	local_irq_restore(flags);
> > >  	return addr;
> > >  }
> > > +
> > > +/*
> > > + * On the Pentium series, unsynchronized cross-modifying code
> > > + * operations can cause unexpected instruction execution results.
> > > + * So after the code is modified, we should synchronize it on
> > > + * each processor.
> > > + */
> > > +static void __kprobes __local_sync_core(void *info)
> > > +{
> > > +	sync_core();
> > > +}
> > > +
> > > +void __kprobes sync_core_all(void)
> > > +{
> > > +	on_each_cpu(__local_sync_core, NULL, 1);
> >
> > OK, so you rely on the fact that on_each_cpu has memory barriers and
> > executes the remote "__local_sync_core" with the appropriate memory
> > barriers underneath, am I correct?
> >
> > > +}
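For what it's worth, if the ordering implied by on_each_cpu() is ever
in doubt, the callback could make the barriers explicit. A sketch only
(this is not in the patch, and the smp_mb() pairing is my assumption):

	static void __local_sync_core(void *info)
	{
		smp_mb();	/* order against the text_poke() writes */
		sync_core();	/* serializing instruction (cpuid) discards
				 * prefetched instructions on this CPU */
		smp_mb();
	}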
> > > +
> > > +/* Safely cross-modifying code with a fixup address */
> > > +static void *patch_fixup_from;
> > > +static void *patch_fixup_addr;
> > > +
> > > +static int __kprobes patch_exceptions_notify(struct notifier_block *self,
> > > +					     unsigned long val, void *data)
> > > +{
> > > +	struct die_args *args = data;
> > > +	struct pt_regs *regs = args->regs;
> > > +
> > > +	if (likely(!patch_fixup_from))
> > > +		return NOTIFY_DONE;
> > > +
> > > +	if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
> > > +	    (unsigned long)patch_fixup_from != regs->ip)
> > > +		return NOTIFY_DONE;
> > > +
> > > +	args->regs->ip = (unsigned long)patch_fixup_addr;
> > > +	return NOTIFY_STOP;
> > > +}
> > > +
> > > +/**
> > > + * text_poke_fixup() -- cross-modifying kernel text with fixup address.
> > > + * @addr:	Address to be modified.
> > > + * @opcode:	New instruction.
> > > + * @len:	Length of the modified bytes.
> > > + * @fixup:	Fixup address.
> > > + *
> > > + * Note: You must back up the replaced instructions before calling
> > > + * this if you need to recover them.
> > > + * Note: Must be called under text_mutex.
> > > + */
> > > +void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
> > > +				void *fixup)
> > > +{
> > > +	static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
> > > +	static const int int3_size = sizeof(int3_insn);
> > > +
> > > +	/* Replacing 1 byte can be done atomically. */
> >
> > I'm sure it can be done atomically, but can it be done safely, though?
> >
> > (disclaimer: we're still waiting for the official answer on this
> > statement): Assuming instruction trace cache effects of SMP cross-code
> > modification, and that only int3 would be safe to patch, then even
> > changing 1 single byte could only be done by going through an
> > intermediate int3 and synchronizing all other cores.
> >
> > > +	if (unlikely(len <= 1))
> > > +		return text_poke(addr, opcode, len);
> > > +
> > > +	/* Preparing */
> > > +	patch_fixup_addr = fixup;
> > > +	wmb();
> >
> > hrm, missing comment?
> >
> > > +	patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
> > > +
> > > +	/* Cap by an int3 */
> > > +	text_poke(addr, &int3_insn, int3_size);
> > > +	sync_core_all();
> > > +
> > > +	/* Replace tail bytes */
> > > +	text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
> > > +		  len - int3_size);
> > > +	sync_core_all();
> > > +
> > > +	/* Replace int3 with head byte */
> > > +	text_poke(addr, opcode, int3_size);
> > > +	sync_core_all();
> > > +
> > > +	/* Cleanup */
> > > +	patch_fixup_from = NULL;
> > > +	wmb();
> >
> > missing comment here too.
> >
> > > +	return addr;
> >
> > Little quiz question:
> >
> > When patch_fixup_from is set to NULL, what ensures that the int3
> > handlers have completed their execution?
> >
> > I think it's probably OK, because the int3 is an interrupt gate,
> > which therefore disables interrupts as soon as it runs, and executes
> > the notifier while irqs are off. When we run sync_core_all() after
> > replacing the int3 by the new 1st byte, we only return when all other
> > cores have executed an interrupt, which implies that all int3
> > handlers previously running should have ended. Is that right? It
> > looks to me as if this 3rd sync_core_all() is only needed because of
> > that. Adding a comment would probably be good.
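Concretely, the kind of comments I have in mind for the tail of
text_poke_fixup() would look like this (the reasoning spelled out in
the comments is my reading of the code, to be confirmed):

	/* Replace int3 with head byte */
	text_poke(addr, opcode, int3_size);
	/*
	 * Returns only after every CPU has taken the IPI. The IPI can
	 * only be serviced with irqs enabled, and the int3 notifier
	 * runs with irqs disabled, so any in-flight int3 handler has
	 * returned by the time this completes. Only then is it safe
	 * to clear patch_fixup_from below.
	 */
	sync_core_all();

	/* Cleanup */
	patch_fixup_from = NULL;
	wmb();	/* order the NULL store before any later reuse of
		 * patch_fixup_addr by the next patching operation */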
> >
> > Another thing: I've recently noticed that the following locking
> > seems to hang the system when stress-testing concurrently with cpu
> > hotplug/hotunplug:
> >
> >   mutex_lock(&text_mutex);
> >     on_each_cpu(something, NULL, 1);
> >
> > The hang seems to be caused by the fact that alternative.c has,
> > within cpu hotplug (cpu hotplug lock held):
> >
> >   mutex_lock(&text_mutex);
> >
> > It might also be caused by the interaction with the stop_machine()
> > performed within the cpu hotplug lock. I did not find the root cause
> > of the problem, but this probably calls for lockdep improvements.
> >
> > In any case, given you are dealing with the exact same locking
> > scenario here, I would recommend the following stress-test:
> >
> >   in a loop, use text_poke_jump()
> >   in a loop, hotunplug all cpus, then hotplug all cpus
> >
> > I had to fix this temporarily by taking get/put_online_cpus() around
> > the text_mutex.
> >
> > [snip snip]
> >
> > > +static struct notifier_block patch_exceptions_nb = {
> > > +	.notifier_call = patch_exceptions_notify,
> > > +	.priority = 0x7fffffff /* we need to be notified first */
> > > +};
> > > +
> > > +static int __init patch_init(void)
> > > +{
> > > +	return register_die_notifier(&patch_exceptions_nb);
> > > +}
> > > +
> > > +arch_initcall(patch_init);
> > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > > index e5342a3..43a30d8 100644
> > > --- a/kernel/kprobes.c
> > > +++ b/kernel/kprobes.c
> > > @@ -898,7 +898,7 @@ EXPORT_SYMBOL_GPL(unregister_kprobes);
> > >
> > >  static struct notifier_block kprobe_exceptions_nb = {
> > >  	.notifier_call = kprobe_exceptions_notify,
> > > -	.priority = 0x7fffffff /* we need to be notified first */
> > > +	.priority = 0x7ffffff0 /* High priority, but not first. */
> >
> > It would be good to keep all these priorities in a centralized
> > header.
> >
> > Thanks,
> >
> > Mathieu
> >
> > > };
> > >
> > > unsigned long __weak arch_deref_entry_point(void *entry)
> > > --
> > > 1.6.5.1
> > >
> >
> > --
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68