Date: Thu, 21 Apr 2022 17:27:40 +0100
From: Mark Rutland
To: Steven Rostedt
Cc: Wang ShaoBo, cj.chengjian@huawei.com, huawei.libin@huawei.com,
    xiexiuqi@huawei.com, liwei391@huawei.com, linux-kernel@vger.kernel.org,
    linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
    will@kernel.org, zengshun.wu@outlook.com
Subject: Re: [RFC PATCH -next v2 3/4] arm64/ftrace: support dynamically allocated trampolines
References: <20220316100132.244849-1-bobo.shaobowang@huawei.com>
 <20220316100132.244849-4-bobo.shaobowang@huawei.com>
 <20220421100639.03c0d123@gandalf.local.home>
 <20220421114201.21228eeb@gandalf.local.home>
In-Reply-To: <20220421114201.21228eeb@gandalf.local.home>

On Thu, Apr 21, 2022 at 11:42:01AM -0400, Steven Rostedt wrote:
> On Thu, 21 Apr 2022 16:14:13 +0100
> Mark Rutland wrote:
>
> > > Let's say you have 10 ftrace_ops registered (with bpf and kprobes this
> > > can be quite common). But each of these ftrace_ops traces a function
> > > (or functions) that are not being traced by the other ftrace_ops. That
> > > is, each ftrace_ops has its own unique function(s) that it is tracing.
> > > One could be tracing schedule, the other could be tracing
> > > ksoftirqd_should_run (whatever).
> >
> > Ok, so that's when messing around with bpf or kprobes, and not generally
> > when using plain old ftrace functionality under /sys/kernel/tracing/
> > (unless that's concurrent with one of the former, as per your other
> > reply)?
>
> It's any user of the ftrace infrastructure, which includes kprobes, bpf,
> perf, function tracing, function graph tracing, and also affects
> instances.
>
> > > Without this change, because the arch does not support dynamically
> > > allocated trampolines, all these ftrace_ops will be registered to the
> > > same trampoline. That means, for every function that is traced, it
> > > will loop through all 10 of these ftrace_ops and check their hashes
> > > to see if their callback should be called or not.
> >
> > Sure; I can see how that can be quite expensive.
> >
> > What I'm trying to figure out is who this matters to and when, since the
> > implementation is going to come with a bunch of subtle/fractal
> > complexities, and likely a substantial overhead too when enabling or
> > disabling tracing of a patch-site. I'd like to understand the trade-offs
> > better.
>
> > > With dynamically allocated trampolines, each ftrace_ops will have its
> > > own trampoline, and that trampoline will be called directly if the
> > > function is only being traced by the one ftrace_ops. This is much more
> > > efficient.
> > >
> > > If a function is traced by more than one ftrace_ops, then it falls
> > > back to the loop.
> >
> > I see -- so the dynamic trampoline is just to get the ops? Or is that
> > doing additional things?
>
> It's to get both the ftrace_ops (as that's one of the parameters) as well
> as to call the callback directly. Not sure if arm is affected by spectre,
> but the "loop" function is filled with indirect function calls, whereas
> the dynamic trampolines call the callback directly.
>
> Instead of:
>
> 	bl ftrace_caller
>
> ftrace_caller:
> 	[..]
> 	bl ftrace_ops_list_func
> 	[..]
>
>
> void ftrace_ops_list_func(...)
> {
> 	__do_for_each_ftrace_ops(op, ftrace_ops_list) {
> 		if (ftrace_ops_test(op, ip)) // test the hash to see if it
> 					     // should trace this function
> 			op->func(...);
> 	}
> }
>
> It does:
>
> 	bl dynamic_tramp
>
> dynamic_tramp:
> 	[..]
> 	bl func // call the op->func directly!
>
>
> Much more efficient!
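(To make the difference concrete, here's a rough, compilable C model of
the two dispatch schemes; all names and types below are made up for
illustration, and this is not the kernel's actual code.)

struct sample_ops {
	void (*func)(unsigned long ip, struct sample_ops *op);
	unsigned long filter_ip;	/* stand-in for the real hash filter */
	struct sample_ops *next;
};

static struct sample_ops *sample_ops_list;

/* Shared-trampoline path: every traced function walks the whole list,
 * testing each ops against the call site and making an indirect call
 * for each match. */
static void list_dispatch(unsigned long ip)
{
	struct sample_ops *op;

	for (op = sample_ops_list; op; op = op->next)
		if (op->filter_ip == ip)	/* the "hash test" */
			op->func(ip, op);	/* indirect call */
}

/* Dynamic-trampoline path: the trampoline was generated for exactly one
 * ops, so it hands that ops over and calls its callback immediately; in
 * the generated trampoline this is a direct branch, not a pointer call. */
static void direct_dispatch(unsigned long ip, struct sample_ops *op)
{
	op->func(ip, op);
}

With 10 registered ops each tracing disjoint functions, the first path
does up to 10 filter tests per traced call; the second does none.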
> > There might be a middle-ground here where we patch the ftrace_ops
> > pointer into a literal pool at the patch-site, which would allow us to
> > handle this atomically, and would avoid the issues with out-of-range
> > trampolines.
>
> Have an example of what you are suggesting?

We can make the compiler place 2 NOPs before the function entry point, and
2 NOPs after it using `-fpatchable-function-entry=4,2` (the arguments are
<total>,<before>). On arm64 all instructions are 4 bytes, and we'll use
the first two NOPs as an 8-byte literal pool.

Ignoring BTI for now, the compiler generates (with some magic labels added
here for demonstration):

	__before_func:
		NOP
		NOP
	func:
		NOP
		NOP
	__remainder_of_func:
		...

At ftrace_init_nop() time we patch that to:

	__before_func:
		// treat the 2 NOPs as an 8-byte literal-pool
		.quad	<default ops pointer>	// see below
	func:
		MOV	X9, X30
		NOP
	__remainder_of_func:
		...

When enabling tracing we do:

	__before_func:
		// patch this with the relevant ops pointer
		.quad	<ops pointer>
	func:
		MOV	X9, X30
		BL	<trampoline>	// common trampoline
	__remainder_of_func:
		...

The `BL <trampoline>` clobbers X30 with __remainder_of_func, so within the
trampoline we can find the ops pointer at an offset from X30. On arm64 we
can load that directly with something like:

	LDR	<tmp>, [X30, #-(__remainder_of_func - __before_func)]

... then load the ops->func from that and invoke it (or pass it to a
helper which does):

	// Ignoring the function arguments for this demonstration
	LDR	<tmp2>, [<tmp>, #OPS_FUNC_OFFSET]
	BLR	<tmp2>

That avoids iterating over the list *without* requiring separate
trampolines, and allows us to patch the sequence without requiring
stop-the-world logic (since arm64 has strong requirements for patching
most instructions other than branches and nops).

We can initialize the ops pointer to a default ops that does the whole
__do_for_each_ftrace_ops() dance.

To handle BTI we can have two trampolines, or we can always reserve 3 NOPs
before the function so that we can have a consistent offset regardless.
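If it helps to see the lookup spelled out, here's a rough C model of the
patch-site layout and the trampoline-side load (hypothetical names, with
plain stores standing in for the real text-patching and synchronization,
so purely illustrative):

#include <stdint.h>

struct ftrace_ops;	/* opaque here */

/* One patch-site: the 8-byte literal at __before_func, followed by the
 * two patchable instruction slots at func. X30 in the trampoline points
 * at __remainder_of_func, i.e. just past this struct. */
struct patch_site {
	struct ftrace_ops *literal;	/* __before_func: ops pointer */
	uint32_t insn[2];		/* func: MOV X9, X30; NOP or BL */
};

/* Trampoline-side lookup: the literal sits at a fixed negative offset
 * from the address in X30. */
static struct ftrace_ops *site_ops(uintptr_t x30)
{
	return ((struct patch_site *)(x30 - sizeof(struct patch_site)))->literal;
}

/* Enabling tracing: publish the ops pointer before activating the
 * branch, so the trampoline can never observe a stale literal. */
static void site_enable(struct patch_site *site, struct ftrace_ops *ops,
			uint32_t bl_to_trampoline)
{
	site->literal = ops;			/* 1: install ops pointer */
	/* ...cache maintenance / context sync would go here... */
	site->insn[1] = bl_to_trampoline;	/* 2: NOP -> BL */
}

Disabling would reverse the order: restore the NOP first, then reset the
literal to the default ops.

Thanks,
Mark.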