Subject: Re: [RFC PATCH -next v2 3/4] arm64/ftrace: support dynamically allocated trampolines
To: Mark Rutland, Steven Rostedt
References:
<20220316100132.244849-1-bobo.shaobowang@huawei.com>
<20220316100132.244849-4-bobo.shaobowang@huawei.com>
<20220421100639.03c0d123@gandalf.local.home>
<20220421114201.21228eeb@gandalf.local.home>
From: "Wangshaobo (bobo)"
Message-ID: <8f36ebd2-2c56-d896-3a91-c97a5760b344@huawei.com>
Date: Thu, 5 May 2022 10:57:35 +0800
X-Mailing-List: linux-kernel@vger.kernel.org

On 2022/4/22 0:27, Mark Rutland wrote:
> On Thu, Apr 21, 2022 at 11:42:01AM -0400, Steven Rostedt wrote:
>> On Thu, 21 Apr 2022 16:14:13 +0100
>> Mark Rutland wrote:
>>
>>>> Let's say you have 10 ftrace_ops registered (with bpf and kprobes this can
>>>> be quite common). But each of these ftrace_ops traces a function (or
>>>> functions) that are not being traced by the other ftrace_ops. That is, each
>>>> ftrace_ops has its own unique function(s) that it is tracing. One could
>>>> be tracing schedule, the other could be tracing ksoftirqd_should_run
>>>> (whatever).
>>>
>>> Ok, so that's when messing around with bpf or kprobes, and not generally
>>> when using plain old ftrace functionality under /sys/kernel/tracing/
>>> (unless that's concurrent with one of the former, as per your other
>>> reply)?
>> It's any user of the ftrace infrastructure, which includes kprobes, bpf,
>> perf, function tracing, function graph tracing, and it also affects instances.
>>
>>>> Without this change, because the arch does not support dynamically
>>>> allocated trampolines, it means that all these ftrace_ops will be
>>>> registered to the same trampoline. That means, for every function that is
>>>> traced, it will loop through all 10 of these ftrace_ops and check their
>>>> hashes to see if their callback should be called or not.
>>>
>>> Sure; I can see how that can be quite expensive.
>>>
>>> What I'm trying to figure out is who this matters to and when, since the
>>> implementation is going to come with a bunch of subtle/fractal
>>> complexities, and likely a substantial overhead too when enabling or
>>> disabling tracing of a patch-site. I'd like to understand the trade-offs
>>> better.
>>>
>>>> With dynamically allocated trampolines, each ftrace_ops will have its own
>>>> trampoline, and that trampoline will be called directly if the function
>>>> is only being traced by the one ftrace_ops. This is much more efficient.
>>>>
>>>> If a function is traced by more than one ftrace_ops, then it falls back to
>>>> the loop.
>>>
>>> I see -- so the dynamic trampoline is just to get the ops? Or is that
>>> doing additional things?
>>
>> It's to get both the ftrace_ops (as that's one of the parameters) as well
>> as to call the callback directly. Not sure if arm is affected by Spectre,
>> but the "loop" function is filled with indirect function calls, whereas
>> the dynamic trampolines call the callback directly.
>>
>> Instead of:
>>
>>	bl ftrace_caller
>>
>>	ftrace_caller:
>>		[..]
>>		bl ftrace_ops_list_func
>>		[..]
>>
>>	void ftrace_ops_list_func(...)
>>	{
>>		__do_for_each_ftrace_ops(op, ftrace_ops_list) {
>>			if (ftrace_ops_test(op, ip)) // test the hash to see if it
>>			                             // should trace this function.
>>				op->func(...);
>>		}
>>	}
>>
>> It does:
>>
>>	bl dynamic_tramp
>>
>>	dynamic_tramp:
>>		[..]
>>		bl func // call the op->func directly!
>>
>> Much more efficient!
>>
>>> There might be a middle-ground here where we patch the ftrace_ops
>>> pointer into a literal pool at the patch-site, which would allow us to
>>> handle this atomically, and would avoid the issues with out-of-range
>>> trampolines.
>>
>> Have an example of what you are suggesting?
>
> We can make the compiler place 2 NOPs before the function entry point, and 2
> NOPs after it using `-fpatchable-function-entry=4,2` (the arguments are
> <total>,<before-entry>). On arm64 all instructions are 4 bytes, and we'll use
> the first two NOPs as an 8-byte literal pool.
>
> Ignoring BTI for now, the compiler generates (with some magic labels added here
> for demonstration):
>
>	__before_func:
>		NOP
>		NOP
>	func:
>		NOP
>		NOP
>	__remainder_of_func:
>		...
>
> At ftrace_init_nop() time we patch that to:
>
>	__before_func:
>		// treat the 2 NOPs as an 8-byte literal pool
>		.quad <default ops pointer> // see below
>	func:
>		MOV X9, X30
>		NOP
>	__remainder_of_func:
>		...
>
> When enabling tracing we do:
>
>	__before_func:
>		// patch this with the relevant ops pointer
>		.quad <ops pointer>
>	func:
>		MOV X9, X30
>		BL <trampoline> // common trampoline

I have a question: is this common trampoline allocated by module_alloc()? If
yes, how do we handle the long jump from the traced function to the common
trampoline when only two NOPs are added in front of func?

-- Wang ShaoBo

>	__remainder_of_func:
>		..
>
> The `BL <trampoline>` clobbers X30 with __remainder_of_func, so within
> the trampoline we can find the ops pointer at an offset from X30. On
> arm64 we can load that directly with something like:
>
>	LDR <tmp>, [X30, # -(__remainder_of_func - __before_func)]
>
> ...
> then load the ops->func from that and invoke it (or pass it to a
> helper which does):
>
>	// Ignoring the function arguments for this demonstration
>	LDR <tmp2>, [<tmp>, #OPS_FUNC_OFFSET]
>	BLR <tmp2>
>
> That avoids iterating over the list *without* requiring separate
> trampolines, and allows us to patch the sequence without requiring
> stop-the-world logic (since arm64 has strong requirements for patching
> most instructions other than branches and nops).
>
> We can initialize the ops pointer to a default ops that does the whole
> __do_for_each_ftrace_ops() dance.
>
> To handle BTI we can have two trampolines, or we can always reserve 3 NOPs
> before the function so that we can have a consistent offset regardless.
>
> Thanks,
> Mark.