From: Andy Lutomirski
Date: Fri, 16 Jun 2023 21:13:22 -0700
Subject: Re: [PATCH 07/32] mm: Bring back vmalloc_exec
To: Kent Overstreet, Kees Cook
Cc: Johannes Thumshirn, "linux-kernel@vger.kernel.org",
    "linux-fsdevel@vger.kernel.org", "linux-bcachefs@vger.kernel.org",
    Kent Overstreet, Andrew Morton, Uladzislau Rezki,
    "hch@infradead.org", "linux-mm@kvack.org",
    "linux-hardening@vger.kernel.org"
Message-ID: <1d249326-e3dd-9c9d-7b53-2fffeb39bfb4@kernel.org>

On 5/16/23 14:20, Kent Overstreet wrote:
> On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
>> For something that small, why not use the text_poke API?
>
> This looks like it's meant for patching existing kernel text, which
> isn't what I want - I'm generating new functions on the fly, one per
> btree node.

Dynamically generating code is a giant can of worms.

Kees touched on a basic security thing: a linear address mapped W+X is a
big no-no. And that's just scratching the surface -- ideally we would
have a strong protocol for generating code: the code is generated in
some extra-secure context, then it's made immutable and double-checked,
then it becomes live. (And we would offer this to userspace, some day.)
Just having a different address for the W and X aliases is pretty weak.

(When x86 modifies itself at boot or for static keys, it changes out the
page tables temporarily.)

And even beyond security, we have correctness. x86 is a fairly forgiving
architecture. If you go back in time about 20 years, modify some code
*at the same linear address at which you intend to execute it*, and jump
to it, it works. It may even work if you do it through an alias (the
manual is vague). But it's not 20 years ago, and you have multiple
cores. This does *not* work with multiple CPUs -- you need to serialize
on the CPU executing the modified code. On all but the very newest CPUs,
you need to kludge up the serialization, and that's sloooooooooooooow.
Very new CPUs have the SERIALIZE instruction, which is merely sloooooow.

(The manual is terrible. It's clear that a way to do this without
serializing must exist, because that's what happens when code is paged
in from a user program.)

And remember that x86 is the forgiving architecture. Other architectures
have their own rules that may involve all kinds of terrifying cache
management. IIRC ARM (32-bit) is really quite nasty in this regard. I've
seen some references suggesting that RISC-V has a broken design of its
cache management and this is a real mess.

x86 low level stuff on Linux gets away with it because the
implementation is conservative and very slow, but it's very rarely
invoked.

eBPF gets away with it in ways that probably no one really likes, but
also no one expects eBPF to load programs particularly quickly.

You are proposing doing this when a btree node is loaded.
You could spend 20 *thousand* cycles, on *each CPU*, the first time you
access that node, not to mention the extra branch to decide whether you
need to spend those 20k cycles. Or you could use IPIs.

Or you could just not do this. I think you should just remove all this
dynamic codegen stuff, at least for now.

>
> I'm working up a new allocator - a (very simple) slab allocator where
> you pass a buffer, and it gives you a copy of that buffer mapped
> executable, but not writeable.
>
> It looks like we'll be able to convert bpf, kprobes, and ftrace
> trampolines to it; it'll consolidate a fair amount of code (particularly
> in bpf), and they won't have to burn a full page per allocation anymore.
>
> bpf has a neat trick where it maps the same page in two different
> locations, one is the executable location and the other is the writeable
> location - I'm stealing that.
>
> external api will be:
>
> void *jit_alloc(void *buf, size_t len, gfp_t gfp);
> void jit_free(void *buf);
> void jit_update(void *buf, void *new_code, size_t len); /* update an existing allocation */

Based on the above, I regret to inform you that jit_update() will either
need to sync all cores via IPI or all cores will need to check whether a
sync is needed and do it themselves.

That IPI could be, I dunno, 500k cycles? 1M cycles? Depends on what
cores are asleep at the time. (I have some old Sandy Bridge machines
where, if you tick all the boxes wrong, you might spend tens of
milliseconds doing this due to power savings gone wrong.) Or are you
planning to implement a fancy mostly-lockless thing to track which cores
actually need the IPI so you can avoid waking up sleeping cores?

Sorry to be a party pooper.

--Andy

P.S. I have given some thought to how to make a JIT API that was
actually (somewhat) performant.
It's nontrivial, and it would involve having at least phone calls and
possibly actual meetings with people who understand the
microarchitecture of various CPUs to get all the details hammered out
and documented properly.

I don't think it would be efficient for teeny little functions like
bcachefs wants, but maybe? That would be even more complex and messy.
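
For illustration only, here is a minimal userspace model of the second
option described for jit_update(): every core checks whether a sync is
needed and does it itself. It is a sketch under stated assumptions, not
kernel code and not anything proposed in this thread. All names
(code_generation, struct cpu_state, model_jit_update, model_before_exec,
expensive_serialize) are hypothetical, each "CPU" is just a struct, and
the serializing step is a stub that only counts how often it would have
been paid for. Real code would also need the architecture-specific
serialization and cache maintenance, which this model leaves out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical: one global generation number covering all JIT code. */
static atomic_uint code_generation;

/* Hypothetical per-CPU bookkeeping; each simulated CPU owns one. */
struct cpu_state {
    unsigned int seen_generation;  /* generation this CPU last synced to */
    unsigned long sync_count;      /* number of expensive syncs performed */
};

/*
 * Stand-in for the expensive, architecture-specific serialization.
 * This stub only counts calls; it does not serialize anything.
 */
static void expensive_serialize(struct cpu_state *cpu)
{
    cpu->sync_count++;
}

/* Writer side: call after the new code bytes have been written. */
static void model_jit_update(void)
{
    atomic_fetch_add_explicit(&code_generation, 1, memory_order_release);
}

/* Reader side: call on a CPU before it jumps into JIT-generated code. */
static void model_before_exec(struct cpu_state *cpu)
{
    unsigned int gen;

    assert(cpu != NULL);
    gen = atomic_load_explicit(&code_generation, memory_order_acquire);

    /* This is the extra branch that every call pays for. */
    if (cpu->seen_generation != gen) {
        expensive_serialize(cpu);   /* paid once per CPU per update */
        cpu->seen_generation = gen;
    }
}
```

In this model, after one model_jit_update() each simulated CPU pays for
exactly one expensive_serialize() on its next model_before_exec(), and
later calls pay only for the load and the branch. Several updates that
land before a CPU's next call collapse into a single sync on that CPU.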