Received: by 2002:a25:31c3:0:0:0:0:0 with SMTP id x186csp2628309ybx; Fri, 8 Nov 2019 07:06:47 -0800 (PST) X-Google-Smtp-Source: APXvYqy8au2VazMpdjbVwCrnCiYp79FK11iZjrtgLjSg7PCw5ZyWDQwRr1Eg44IXxyla5FZzDH/I X-Received: by 2002:a17:906:80c:: with SMTP id e12mr9290743ejd.59.1573225606888; Fri, 08 Nov 2019 07:06:46 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1573225606; cv=none; d=google.com; s=arc-20160816; b=p2WI3aluo3C8Xuf4OeJzbKNiqvXxE3vPMFazGjhzzO+2ZWBxuMGfjggwMpSkH1pSmS NngBp2maFUsnJmfaQ/p4z7Qj/fgzpaMIxwADZq07ymjdQ3/n+VtQDNRIViiAW/RbuhY2 wOliW8xiJ5YwQNU8ynSWCz9nmujRQKjBZSITCBbh+kmLdzPlivVsSWwNqTJtAwHL6m+Z ah+jgZv/OhqrtfaIrzxm8iWHUt3RW7bFCpZs2a7koc22AFLV3EWptFXQkKEUGuhWA14J bOlPibLCvcHtS3erWFf2VVECMoioe/8rHQPbKs2HXqk7jSszsjeJIl0b0b+ZtF2radua KlMw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature; bh=vJp6cAXN+dUfVPQDeABvOLqnK8LTiKhkA6L1GxR0KxI=; b=zRw/mJMOQ7aGDKSrZwNf+5/THRnXwbC08k9wTw4nInmoSn5LijFkp5CQrH7bpFo4yk YbHKsjEyEq0vAHY1CrWmfjAyubZ3uTUKzDWw3hsq8E3judL9vPm/zQQTs1boIf75nVyn W+LuyaV0iu1NksBn8VwTgxtYYC/horV9nkNM7ytTUV1I65gMQwt9P189dS+REY0BJYRk bbd/b3InJCpsrMo7waNSsvuoJmCLaoxKZF2w4q4eEzRF7ZDWO0YkUE3kUZqndOzXShY3 EzQzxNEfHqgezunaErf490VMsShnHug6pYJPZUdO3xTd+zz7vSIMkJbMdi5UY1hBi9I5 iJHQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=ZSi1KLxT; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id c5si4182251edc.246.2019.11.08.07.06.22; Fri, 08 Nov 2019 07:06:46 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=ZSi1KLxT; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726227AbfKHPFm (ORCPT + 99 others); Fri, 8 Nov 2019 10:05:42 -0500 Received: from mail-yw1-f67.google.com ([209.85.161.67]:42321 "EHLO mail-yw1-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726036AbfKHPFl (ORCPT ); Fri, 8 Nov 2019 10:05:41 -0500 Received: by mail-yw1-f67.google.com with SMTP id z67so2242294ywb.9 for ; Fri, 08 Nov 2019 07:05:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=vJp6cAXN+dUfVPQDeABvOLqnK8LTiKhkA6L1GxR0KxI=; b=ZSi1KLxTHC0JsBr1m7FvI9OJJr89ObT0qA53ShtRXqFFwpHSqfnM2ipbciIaiIV5Cd W/3CE5c1OTw909XwIx5s676pywHmlsscGloBgcVMpaeW63hoOldMbxprPQXqxfaV4St4 dKq5XdfMD1gcyeDUWx+vtLrTGI4Wh9S1+aJ6/MCcUDeEeoR78BlbJyr3RykS8oxvdnmS XF4fFpmNlO86BFyskuJjbGL5Lk8ReyeXJ2z0k5RIhFz34JMH6ymA2LEGfiNEYpsrfk6N rjLSHV1O8EUsaOPDYbwRAhbJqdM67FFXLEFL/OW8EJeFBySc50dVJPxnRQB8mLCxtd+M oaSg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=vJp6cAXN+dUfVPQDeABvOLqnK8LTiKhkA6L1GxR0KxI=; b=c82F+BPF4TGM5WAsEFj6xu0U/wlPsmuz4fcKbtU5Jz3RzaX/BTNAGROCxMB21pjx+q GTrQelV47jmxWelHDvI4Fy5wlzOgIwrqD5AafneKK5y3jtON1BhneFg5NSKQeUoXnFB/ P2/67/8/V2mCmY4JmbbhT++HnwfFhIohHeMivv+7Cdunu7zoYLXUOC+hgUfh8zr+9orO oqwfoLr/jdEZV0SsQzLVRYED4y71fipjoGxudhpBpLF0d2HIVcei9EQeE6OZIYBYqS74 4vUJ7eK+rFQX/VUi83XHVn3aMDnrbNcwYPcZqwf/qERyqwSlsVmxtWoydGl26NlUft/w s9NQ== X-Gm-Message-State: APjAAAX1nhxg4wlakPiSgsAhQAGroUfM4++SrgYlDdV6LpoJjj2kTkcC wxszMEcObNlR8fBrfmoBceGVFA== X-Received: by 2002:a0d:e445:: with SMTP id n66mr7495001ywe.179.1573225540571; Fri, 08 Nov 2019 07:05:40 -0800 (PST) Received: from leoy-ThinkPad-X240s (li1038-30.members.linode.com. [45.33.96.30]) by smtp.gmail.com with ESMTPSA id u7sm6290532ywu.45.2019.11.08.07.05.33 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Fri, 08 Nov 2019 07:05:39 -0800 (PST) Date: Fri, 8 Nov 2019 23:05:30 +0800 From: Leo Yan To: Peter Zijlstra Cc: Will Deacon , Mark Rutland , Mike Leach , Adrian Hunter , Ingo Molnar , Borislav Petkov , "H . Peter Anvin" , x86@kernel.org, Alexander Shishkin , Mathieu Poirier , Arnaldo Carvalho de Melo , Jiri Olsa , linux-kernel@vger.kernel.org Subject: Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event Message-ID: <20191108150530.GA7721@leoy-ThinkPad-X240s> References: <20191025130000.13032-1-adrian.hunter@intel.com> <20191025130000.13032-2-adrian.hunter@intel.com> <20191030104747.GA21153@leoy-ThinkPad-X240s> <20191030124659.GQ4114@hirez.programming.kicks-ass.net> <20191030141950.GB21153@leoy-ThinkPad-X240s> <20191030162325.GT4114@hirez.programming.kicks-ass.net> <20191031073136.GC21153@leoy-ThinkPad-X240s> <20191101100440.GU4131@hirez.programming.kicks-ass.net> <20191104022346.GC26019@leoy-ThinkPad-X240s> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20191104022346.GC26019@leoy-ThinkPad-X240s> User-Agent: Mutt/1.9.4 (2018-02-28) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Nov 04, 2019 at 10:23:46AM +0800, Leo Yan wrote: [...] > > > But for jump label, it doesn't use the stop_machine() and perf event > > > PERF_TEXT_POKE_UPDATE will introduce race condition as below (Let's see > > > the example for transition from nop to branch): > > > > > > CPU0 CPU1 > > > NOP instruction > > > `-> static_key_enable() > > > `-> aarch64_insn_patch_text_nosync() > > > `-> perf event PERF_TEXT_POKE_UPDATE > > > -> Execute nop > > > instruction > > > `-> aarch64_insn_write() > > > `-> __flush_icache_range() > > > > > > Since x86 platform have INT3 as a mediate state, it can avoid the > > > race condition between CPU0 (who is do transition) and other CPUs (who > > > is possible to execute nop/branch). > > > > Ah, you found the _nosync thing in jump_label, here's the one in ftrace: > > > > arch/arm64/kernel/ftrace.c: if (aarch64_insn_patch_text_nosync((void *)pc, new)) > > > > And yes, this is racy. > > > > > > The thing is, as I argued, the instruction state between PRE and POST is > > > > ambiguous. This makes it impossible to decode the branch decision > > > > stream. > > > > > > > > Suppose CPU0 emits the PRE event at T1 and the POST event at T5, but we > > > > have CPU1 covering the instruction at T3. > > > > > > > > How do you decide where CPU1 goes and what the next conditional branch > > > > is? > > > > > > Sorry for my not well thought. > > > > > > I agree that T3 is an uncertain state with below flow: > > > > > > CPU0 CPU1 > > > perf event PERF_TEXT_POKE_UPDATE_PRE -> T1 > > > > > > Int3 / NOP -> T3 > > > > > > Int3 / branch -> T3' > > > > > > perf event PERF_TEXT_POKE_UPDATE_POST -> T5 > > > > > > Except if the trace has extra info and can use old/new instructions > > > combination for analysis, otherwise PRE/POST pair events aren't helpful > > > for resolve this issue (if trace decoder can do this, then the change in > > > kernel will be much simpler). > > > > > > Below are two potential options we can use on Arm64 platform: > > > > > > - Change to use stop_machine() for jump label; this might introduce > > > performance issue if jump label is altered frequently. > > > > > > To mitigate the impaction, we can only use stop_machine() when > > > detect the perf events are enabled, otherwise will rollback to use > > > the old code path. > > > > > > - We can use breakpoint to emulate the similiar flow with x86's int3, > > > thus we can dismiss the race condition between one CPU alters > > > instruction and other CPUs run into the alternative instruction. > > > > > > @Will, @Mark, could you help review this? Appreciate any comments > > > and suggestions. And please let me know if you want to consolidate > > > related works with your side (or as you know if there have ongoing > > > discussion or someone works on this). > > > > Given people are building larger Aargh64 machines (I've heard about 100+ > > CPUs already), I'm thinking the 3rd option is the most performant. > > > > But yes, as you mention earlier, we can make this optional on the > > TEXT_POKE_UPDATE event being in use. > > > > I'm thinking something along the lines of: > > > > static uintptr_t nosync_addr; > > static u32 nosync_insn; > > > > int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn) > > { > > const u32 break = // some_breakpoint_insn; > > uintptr_t tp = (uintptr_t)addr; > > int ret; > > > > lockdep_assert_held(&text_mutex); > > > > /* A64 instructions must be word aligned */ > > if (tp & 0x3) > > return -EINVAL; > > > > if (perf_text_poke_update_enabled()) { > > > > nosync_insn = insn; > > smp_store_release(&nosync_addr, tp); > > > > ret = aarch64_insn_write(addr, break); > > if (ret == 0) > > __flush_icache_range(tp, tp + AARCH64_INSN_SIZE); > > > > perf_event_text_poke(....); > > } > > > > ret = aarch64_insn_write(addr, insn); > > if (ret == 0) > > __flush_icache_range(tp, tp + AARCH64_INSN_SIZE); > > > > return ret; > > } > > > > And have the 'break' handler do: > > > > aarch64_insn_break_handler(struct pt_regs *regs) > > { > > unsigned long addr = smp_load_acquire(&nosync_addr); > > u32 insn = nosync_insn; > > > > if (regs->ip != addr) > > return; > > > > // emulate @insn > > } > > > > I understood from Will the whole nosync scheme only works for a limited > > set of instructions, but you only have to implement emulation for the > > actual instructions used of course. > > > > (which is what we do on x86) > > > > Does this sound workable? I will update some status for prototype (the code has been uploaded into git server [1]) and found some issues for text poke perf event on arm64. These issues are mainly related with arch64 architecture. - The first issue is for the nosync instruction emulation. On arm64, some instructions need to be single stepped and some instructions is emulated. Especially, after I read the code for kprobe implementation on Arm64. So the main idea for prototyping is to use the almos same method with kprobe for nosync instruction. As result, there have duplicated code between kprobe and text poke handling. If this is the right way to move forward, we need to refactor the common code so that kprobe and text poke can share with each other. - The second issue is race condition between the CPU alters instructions and other CPUs hit the altered instructions (and breakpointed). Peter's suggestion uses global variables 'nosync_insn' and 'nosync_addr' to record the altered instruction. But from the testing I found for single static key, usually it will change for multiple address at once. So this might cause the issue is: CPU(a) will loop to alter instructions for different address (sometimes the opcode also is different for different address), at the meantime, CPU(b) hits an altered instruction and handle exception for the breakpoint, if CPU(a) is continuing to alter instruction for the next address, thne CPU(a) might wrongly to use the value from 'nosync_insn' and 'nosync_addr'. Simply to say, we cannot only support single nosync instruction but need to support multiple nosync instructions in the loop. - The third issue is might be nested issue. E.g. for the injected break instruction, we have no method to pass perf event for this instruction; and if we connect with the first issue, if the instruction is single stepped with slot (the slot is allocated with get_insn_slot()), we cannot to allow the perf user space to know the instructions which are executed in the slots. I am not sure if I think too complex here. But seems to me, we inject more instructions for poke text event, and if these injected instructions are executed but the userspace decoder has no way to pass the related info. Just clarify, I am fine for merging this patch set, but we might consider more what's the best way on Arm64. Welcome any public comments and suggestions; I will sync internally for how to follow up this functionality. Thanks, Leo Yan [1] https://git.linaro.org/people/leo.yan/linux-coresight.git/log/?h=arm64_perf_text_pork_prototype