Received: by 2002:a25:31c3:0:0:0:0:0 with SMTP id x186csp338904ybx; Fri, 1 Nov 2019 04:25:46 -0700 (PDT) X-Google-Smtp-Source: APXvYqxfeyLW6gcneL+SIPep45PFOyn8wW4uWQG5KdyX2jDBPe0v2GZnDMu/5B0hXCvOxeGB+Mw5 X-Received: by 2002:a17:906:27ce:: with SMTP id k14mr9109342ejc.72.1572607546757; Fri, 01 Nov 2019 04:25:46 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1572607546; cv=none; d=google.com; s=arc-20160816; b=G+0WjFRnIIO9JrUR0pOrCibRhXF4GJoxQBsFjP8RkCrDWnDoUzIEHewhS9688NmeIT 3HdxtfJ4LaNhOmvYrP7Cw9l+UeSMpp3DYy8KETPhZUB4U0lpo1HTJELI/ohhYbZuKpe+ EnmM+PJdy+j0Xcf2z7ttWBlADrCtbf4GrF+lzDw3NFcqpJBMmJ50fYUREyedtYm/FnYE Neo/PZbRiUO3H/FF7NTzVEdcO/hmWCDPBRINDT2oGIpREoLkSNPGLAfAoQJpIjDfUDqd BnViTVS7OGZKWclGXI06KiT7f7RXqXu5V7RNq19lWY7EW9PCuzUdijp2l8f9e+BwpeWg 4d6Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature; bh=BJtEhFqYG8uOifm5mmlKegogKIvfMvjboDz3ts+GSu0=; b=ks7J33+q7pSnMTdtgEDMCktbxqFxf5fQcJJsuwqki9OIrKDFK14cPcfEfC5UkG9aLn TLkApsu+ILJgcUknVqOQsnX/OQwNNp3mEPhnTa39UqvFpc+ndhKKBZwE/oanKOayHXy+ 4CnwKGNssF+MlU31qT8k9vrkA+vzu2l3vCqkRgG6gBc6aebUtBKB3DPKQjVXSMl9CXVl I1u0TnMfuu26NvHOqi+2cAdg8eDHVOf96Nw6cE+DEuVua2GyIy2N5GwmQM7m9eMpXw0g mn3/FT2I66900E3xUe3c8OdMGrMy4aT3giODwExLqnC21H8r9rHamf679oeDbcNR2+Fv Hcng== ARC-Authentication-Results: i=1; mx.google.com; dkim=fail header.i=@infradead.org header.s=merlin.20170209 header.b=y1370jn8; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id n16si5307480ejy.278.2019.11.01.04.25.23; Fri, 01 Nov 2019 04:25:46 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=fail header.i=@infradead.org header.s=merlin.20170209 header.b=y1370jn8; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728839AbfKAKFC (ORCPT + 99 others); Fri, 1 Nov 2019 06:05:02 -0400 Received: from merlin.infradead.org ([205.233.59.134]:41876 "EHLO merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726771AbfKAKFC (ORCPT ); Fri, 1 Nov 2019 06:05:02 -0400 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=merlin.20170209; h=In-Reply-To:Content-Type:MIME-Version: References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id: List-Help:List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=BJtEhFqYG8uOifm5mmlKegogKIvfMvjboDz3ts+GSu0=; b=y1370jn8j/57QtOIgRdTgTDfe LcgNagsLO1zqKSY9aO4YhH4a6JpLRmjfakggcz7l80BFf6yOVZlg/erXG2Vj083+FuKmA7JSiDEkh 37awmuwzZ23auW5Cn6lDq3ui+5BsZOW38V7Ozb8R+whhLxIu4Xh0rNSXJEboDHXcsGneHySkj9cP9 Z26xp1b8W7S7nO8SkJU1D4L8jo7RB2YJ+AsEZbRDKkztc60RFUCFJs7jx5a9PtG6a4CUFdIBGE9lq /CY0EIOi7X8Xqdk+cutkTyXRiH2G7dcFxmJffO65FJ1vMEsVZwE8nLG7bY63FgihEqNdZPgORLz0R y0wSHoRTQ==; Received: from j217100.upc-j.chello.nl ([24.132.217.100] helo=noisy.programming.kicks-ass.net) by merlin.infradead.org with esmtpsa (Exim 4.92.3 #3 (Red Hat Linux)) id 1iQTnO-0005Q7-UI; Fri, 01 Nov 2019 10:04:43 +0000 Received: from hirez.programming.kicks-ass.net (hirez.programming.kicks-ass.net [192.168.1.225]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by noisy.programming.kicks-ass.net (Postfix) with ESMTPS id 60878300334; Fri, 1 Nov 2019 11:03:38 +0100 (CET) Received: by hirez.programming.kicks-ass.net (Postfix, from userid 1000) id 59B4E26540347; Fri, 1 Nov 2019 11:04:40 +0100 (CET) Date: Fri, 1 Nov 2019 11:04:40 +0100 From: Peter Zijlstra To: Leo Yan Cc: Will Deacon , Mark Rutland , Mike Leach , Adrian Hunter , Ingo Molnar , Borislav Petkov , "H . Peter Anvin" , x86@kernel.org, Alexander Shishkin , Mathieu Poirier , Arnaldo Carvalho de Melo , Jiri Olsa , linux-kernel@vger.kernel.org Subject: Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event Message-ID: <20191101100440.GU4131@hirez.programming.kicks-ass.net> References: <20191025130000.13032-1-adrian.hunter@intel.com> <20191025130000.13032-2-adrian.hunter@intel.com> <20191030104747.GA21153@leoy-ThinkPad-X240s> <20191030124659.GQ4114@hirez.programming.kicks-ass.net> <20191030141950.GB21153@leoy-ThinkPad-X240s> <20191030162325.GT4114@hirez.programming.kicks-ass.net> <20191031073136.GC21153@leoy-ThinkPad-X240s> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20191031073136.GC21153@leoy-ThinkPad-X240s> User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 31, 2019 at 03:31:36PM +0800, Leo Yan wrote: > Before move farward, I'd like to step back to describe clearly what's > current problem on Arm64 and check one question for jump label: > > I checked the kernel code, both kprobe and ftrace both uses > stop_machine() to alter instructions, That's not currect for Aargh64, see aarch64_insn_patch_text_nosync(), which is used in both ftrace and jump_label. > since all CPUs run into stop > machine's synchronization, there have no race condition between > instructions transition and CPUs execte the altered instruction; thus > it's safe for kprobe and ftrace to use perf event PERF_TEXT_POKE_UPDATE > to notify instruction transition and can allow us to read out 'correct' > instruction for decoder. Agreed, IFF patching happens using stop_machine(), things are easy. ARM is (so far) exclusively using stop_machine() based text_poking, although the last time I spoke to Will about this, he said the _nosync stuff is possible on 32bit too, just nobody has bothered implementing it. > But for jump label, it doesn't use the stop_machine() and perf event > PERF_TEXT_POKE_UPDATE will introduce race condition as below (Let's see > the example for transition from nop to branch): > > CPU0 CPU1 > NOP instruction > `-> static_key_enable() > `-> aarch64_insn_patch_text_nosync() > `-> perf event PERF_TEXT_POKE_UPDATE > -> Execute nop > instruction > `-> aarch64_insn_write() > `-> __flush_icache_range() > > Since x86 platform have INT3 as a mediate state, it can avoid the > race condition between CPU0 (who is do transition) and other CPUs (who > is possible to execute nop/branch). Ah, you found the _nosync thing in jump_label, here's the one in ftrace: arch/arm64/kernel/ftrace.c: if (aarch64_insn_patch_text_nosync((void *)pc, new)) And yes, this is racy. > > The thing is, as I argued, the instruction state between PRE and POST is > > ambiguous. This makes it impossible to decode the branch decision > > stream. > > > > Suppose CPU0 emits the PRE event at T1 and the POST event at T5, but we > > have CPU1 covering the instruction at T3. > > > > How do you decide where CPU1 goes and what the next conditional branch > > is? > > Sorry for my not well thought. > > I agree that T3 is an uncertain state with below flow: > > CPU0 CPU1 > perf event PERF_TEXT_POKE_UPDATE_PRE -> T1 > > Int3 / NOP -> T3 > > Int3 / branch -> T3' > > perf event PERF_TEXT_POKE_UPDATE_POST -> T5 > > Except if the trace has extra info and can use old/new instructions > combination for analysis, otherwise PRE/POST pair events aren't helpful > for resolve this issue (if trace decoder can do this, then the change in > kernel will be much simpler). > > Below are two potential options we can use on Arm64 platform: > > - Change to use stop_machine() for jump label; this might introduce > performance issue if jump label is altered frequently. > > To mitigate the impaction, we can only use stop_machine() when > detect the perf events are enabled, otherwise will rollback to use > the old code path. > > - We can use breakpoint to emulate the similiar flow with x86's int3, > thus we can dismiss the race condition between one CPU alters > instruction and other CPUs run into the alternative instruction. > > @Will, @Mark, could you help review this? Appreciate any comments > and suggestions. And please let me know if you want to consolidate > related works with your side (or as you know if there have ongoing > discussion or someone works on this). Given people are building larger Aargh64 machines (I've heard about 100+ CPUs already), I'm thinking the 3rd option is the most performant. But yes, as you mention earlier, we can make this optional on the TEXT_POKE_UPDATE event being in use. I'm thinking something along the lines of: static uintptr_t nosync_addr; static u32 nosync_insn; int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn) { const u32 break = // some_breakpoint_insn; uintptr_t tp = (uintptr_t)addr; int ret; lockdep_assert_held(&text_mutex); /* A64 instructions must be word aligned */ if (tp & 0x3) return -EINVAL; if (perf_text_poke_update_enabled()) { nosync_insn = insn; smp_store_release(&nosync_addr, tp); ret = aarch64_insn_write(addr, break); if (ret == 0) __flush_icache_range(tp, tp + AARCH64_INSN_SIZE); perf_event_text_poke(....); } ret = aarch64_insn_write(addr, insn); if (ret == 0) __flush_icache_range(tp, tp + AARCH64_INSN_SIZE); return ret; } And have the 'break' handler do: aarch64_insn_break_handler(struct pt_regs *regs) { unsigned long addr = smp_load_acquire(&nosync_addr); u32 insn = nosync_insn; if (regs->ip != addr) return; // emulate @insn } I understood from Will the whole nosync scheme only works for a limited set of instructions, but you only have to implement emulation for the actual instructions used of course. (which is what we do on x86) Does this sound workable?