Received: by 2002:a05:6358:3188:b0:123:57c1:9b43 with SMTP id q8csp15720915rwd; Sun, 25 Jun 2023 23:51:52 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ6/o5JLSUGkqEI7EDVg50mBJy3/HSf761CrY19mgzKipHg3mUvuqNtDrrHY6Z9B+8fthlx1 X-Received: by 2002:aa7:cb0c:0:b0:514:8fdd:fc41 with SMTP id s12-20020aa7cb0c000000b005148fddfc41mr17716025edt.15.1687762312645; Sun, 25 Jun 2023 23:51:52 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1687762312; cv=none; d=google.com; s=arc-20160816; b=bCDb6KB74n54VbfTdK9x8Nq8TzPNBVfhtEuVdxXvpYpWXQKtD4+m6bK2Auc5g0ySs4 xvNzt5Xccp8c0QBknou21nNw60CmMRgAec5F9YAvK7QDg+xefNuQKuC9Xh/9DGDqVnbh 2gN7f2M2v5Y+LLo7lJJBxDNP+ijisYymg4ldvLzmQnNusDDTUXw7+5O4EGJiZynTTZKR ZSNOSBjyF7fziN4u4TFHLaHu54MDGoF5Pu0cDLzMxKUDLuTQGapfELx4v7okO/CyqVMG v8nAQK4foDs5WEcguiSyrMcudj+02czRrjZPZmHfMFEtoOoQ6TMXhOuWfYYNPzDsUwNT AUWw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=kbHuUC9PNSf+ywfX5+Yd/Il8Kk57XbmnX6bRfQXUlhY=; fh=vmHvluzqpo46zlyXA1hDf3Z7VxGWJgqqvwOJLZ3bY7g=; b=RZh6i+Z2GmnyjT+4ISWl0qSEZ0ajED2UUyUpyuPS1IINil4es1yG1QmSRvs2D6e+es F7jFbyAG9f1y7n7hLxsOvX7m0QFqbRLWMkLrl77mW/5FTfaIxdE+cCfc6yrWShPDiaKz j3jI1Q6J4KXRlc+hk6X9SOaWzwiZU96hUsRKdJLmNdAH2xq7WVNhfjQ+a2JE3sRB7vPU OFDYYOAyYYsPgURQJCUFhAAFTeEa7vxceA4H8XT3MCTnNoObopGj/Z6fIU4tpauruwgJ p9us6yLTVlVd8KefSo2FbOzVokdLPLjDpyzZDE3F/H9y2s8AJ2jEQ/VClUtB8YTTvAz1 6nsA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=Kqp3cvsR; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id d7-20020a50fe87000000b0051a48e2d185si2170927edt.286.2023.06.25.23.51.27; Sun, 25 Jun 2023 23:51:52 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=Kqp3cvsR; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230111AbjFZGNf (ORCPT + 99 others); Mon, 26 Jun 2023 02:13:35 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38678 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229586AbjFZGNd (ORCPT ); Mon, 26 Jun 2023 02:13:33 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 19C34EB; Sun, 25 Jun 2023 23:13:32 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 6D0DD60C8C; Mon, 26 Jun 2023 06:13:31 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id A64A6C433D9; Mon, 26 Jun 2023 06:13:30 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1687760010; bh=tIDPiX4oZCQnSjoJrRmXYm4/HW+kneyQ9COWSK2/0ZY=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=Kqp3cvsRJ3zrwxvYCz0cNhvxJ+ddOQlZbdAd6MYB7hpQfGaPjaEWx4XRgumHA3asl 2pfxaCvJsapTma27TczMyDfkLmZLMX4LxIpbXhwpJJZRXNZyd0yCBPciRDI7IG26pW ZofYK78Vy6+X23yGoPjG4iHaIJX3xZk1Cf5VyXONw4g0+lhwJyZb3QOZc9UC1abgeN JpHRigJrpulMhEXYrJrzp6iqFIELQ33QjiEoR3RKROk8Rgf0ZADCCTCGDDgSsfyqIA PxNsesDEzXoQ4yNdH5/98H+JDv2HJQTHEWa2ob8wVV0DOdBO7Eu/aKKZ0U8efnWun+ 0ZC+zZqTlmuaA== Received: by mail-lf1-f41.google.com with SMTP id 2adb3069b0e04-4f4b2bc1565so3627446e87.2; Sun, 25 Jun 2023 23:13:30 -0700 (PDT) X-Gm-Message-State: AC+VfDx3RS1tSDV4GnIXUXfaas/h7pEg6GTeJf56cXwb5ntnihtFlMH2 4eNkqPkScdfHuPfCm6ZyZQyYLXmlruTm9AAJST8= X-Received: by 2002:a05:6512:705:b0:4f9:ec5e:d624 with SMTP id b5-20020a056512070500b004f9ec5ed624mr1614382lfs.38.1687760008514; Sun, 25 Jun 2023 23:13:28 -0700 (PDT) MIME-Version: 1.0 References: <20230616085038.4121892-1-rppt@kernel.org> <20230616085038.4121892-3-rppt@kernel.org> <20230618080027.GA52412@kernel.org> <20230625161417.GK52412@kernel.org> <90161ac9-3ca0-4c72-b1c4-ab1293e55445@app.fastmail.com> <20230625174257.GL52412@kernel.org> <20230625180741.jrrtkq55c4jrqh3t@moria.home.lan> In-Reply-To: <20230625180741.jrrtkq55c4jrqh3t@moria.home.lan> From: Song Liu Date: Sun, 25 Jun 2023 23:13:15 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc() To: Kent Overstreet Cc: Mike Rapoport , Andy Lutomirski , Mark Rutland , Kees Cook , Linux Kernel Mailing List , Andrew Morton , Catalin Marinas , Christophe Leroy , "David S. Miller" , Dinh Nguyen , Heiko Carstens , Helge Deller , Huacai Chen , Luis Chamberlain , Michael Ellerman , Nadav Amit , "Naveen N. Rao" , Palmer Dabbelt , Puranjay Mohan , Rick P Edgecombe , "Russell King (Oracle)" , Steven Rostedt , Thomas Bogendoerfer , Thomas Gleixner , Will Deacon , bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-mips@vger.kernel.org, linux-mm@kvack.org, linux-modules@vger.kernel.org, linux-parisc@vger.kernel.org, linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, loongarch@lists.linux.dev, netdev@vger.kernel.org, sparclinux@vger.kernel.org, "the arch/x86 maintainers" , pjt@google.com, torvalds@linux-foundation.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-7.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_HI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Jun 25, 2023 at 11:07=E2=80=AFAM Kent Overstreet wrote: > > On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote: > > On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote: > > > > > > > > > On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote: > > > > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote: > > > >> > > > >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote: > > > >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote: > > > >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote: > > > >> >> > From: "Mike Rapoport (IBM)" > > > >> >> > > > > >> >> > module_alloc() is used everywhere as a mean to allocate memor= y for code. > > > >> >> > > > > >> >> > Beside being semantically wrong, this unnecessarily ties all = subsystems > > > >> >> > that need to allocate code, such as ftrace, kprobes and BPF t= o modules > > > >> >> > and puts the burden of code allocation to the modules code. > > > >> >> > > > > >> >> > Several architectures override module_alloc() because of vari= ous > > > >> >> > constraints where the executable memory can be located and th= is causes > > > >> >> > additional obstacles for improvements of code allocation. > > > >> >> > > > > >> >> > Start splitting code allocation from modules by introducing > > > >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_f= ree() APIs. > > > >> >> > > > > >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrap= pers for > > > >> >> > module_alloc() and execmem_free() and jit_free() are replacem= ents of > > > >> >> > module_memfree() to allow updating all call sites to use the = new APIs. > > > >> >> > > > > >> >> > The intention semantics for new allocation APIs: > > > >> >> > > > > >> >> > * execmem_text_alloc() should be used to allocate memory that= must reside > > > >> >> > close to the kernel image, like loadable kernel modules and= generated > > > >> >> > code that is restricted by relative addressing. > > > >> >> > > > > >> >> > * jit_text_alloc() should be used to allocate memory for gene= rated code > > > >> >> > when there are no restrictions for the code placement. For > > > >> >> > architectures that require that any code is within certain = distance > > > >> >> > from the kernel image, jit_text_alloc() will be essentially= aliased to > > > >> >> > execmem_text_alloc(). > > > >> >> > > > > >> >> > > > >> >> Is there anything in this series to help users do the appropria= te > > > >> >> synchronization when the actually populate the allocated memory= with > > > >> >> code? See here, for example: > > > >> > > > > >> > This series only factors out the executable allocations from mod= ules and > > > >> > puts them in a central place. > > > >> > Anything else would go on top after this lands. > > > >> > > > >> Hmm. > > > >> > > > >> On the one hand, there's nothing wrong with factoring out common c= ode. On > > > >> the other hand, this is probably the right time to at least start > > > >> thinking about synchronization, at least to the extent that it mig= ht make > > > >> us want to change this API. (I'm not at all saying that this seri= es > > > >> should require changes -- I'm just saying that this is a good time= to > > > >> think about how this should work.) > > > >> > > > >> The current APIs, *and* the proposed jit_text_alloc() API, don't a= ctually > > > >> look like the one think in the Linux ecosystem that actually > > > >> intelligently and efficiently maps new text into an address space: > > > >> mmap(). > > > >> > > > >> On x86, you can mmap() an existing file full of executable code PR= OT_EXEC > > > >> and jump to it with minimal synchronization (just the standard imp= licit > > > >> ordering in the kernel that populates the pages before setting up = the > > > >> PTEs and whatever user synchronization is needed to avoid jumping = into > > > >> the mapping before mmap() finishes). It works across CPUs, and th= e only > > > >> possible way userspace can screw it up (for a read-only mapping of > > > >> read-only text, anyway) is to jump to the mapping too early, in wh= ich > > > >> case userspace gets a page fault. Incoherence is impossible, and = no one > > > >> needs to "serialize" (in the SDM sense). > > > >> > > > >> I think the same sequence (from userspace's perspective) works on = other > > > >> architectures, too, although I think more cache management is need= ed on > > > >> the kernel's end. As far as I know, no Linux SMP architecture nee= ds an > > > >> IPI to map executable text into usermode, but I could easily be wr= ong. > > > >> (IIRC RISC-V has very developer-unfriendly icache management, but = I don't > > > >> remember the details.) > > > >> > > > >> Of course, using ptrace or any other FOLL_FORCE to modify text on = x86 is > > > >> rather fraught, and I bet many things do it wrong when userspace i= s > > > >> multithreaded. But not in production because it's mostly not used= in > > > >> production.) > > > >> > > > >> But jit_text_alloc() can't do this, because the order of operation= s > > > >> doesn't match. With jit_text_alloc(), the executable mapping show= s up > > > >> before the text is populated, so there is no atomic change from no= t-there > > > >> to populated-and-executable. Which means that there is an opportu= nity > > > >> for CPUs, speculatively or otherwise, to start filling various cac= hes > > > >> with intermediate states of the text, which means that various > > > >> architectures (even x86!) may need serialization. > > > >> > > > >> For eBPF- and module- like use cases, where JITting/code gen is qu= ite > > > >> coarse-grained, perhaps something vaguely like: > > > >> > > > >> jit_text_alloc() -> returns a handle and an executable virtual add= ress, > > > >> but does *not* map it there > > > >> jit_text_write() -> write to that handle > > > >> jit_text_map() -> map it and synchronize if needed (no sync needed= on > > > >> x86, I think) > > > >> > > > >> could be more efficient and/or safer. > > > >> > > > >> (Modules could use this too. Getting alternatives right might tak= e some > > > >> fiddling, because off the top of my head, this doesn't match how i= t works > > > >> now.) > > > >> > > > >> To make alternatives easier, this could work, maybe (haven't fully > > > >> thought it through): > > > >> > > > >> jit_text_alloc() > > > >> jit_text_map_rw_inplace() -> map at the target address, but RW, !X > > > >> > > > >> write the text and apply alternatives > > > >> > > > >> jit_text_finalize() -> change from RW to RX *and synchronize* > > > >> > > > >> jit_text_finalize() would either need to wait for RCU (possibly ex= tra > > > >> heavy weight RCU to get "serialization") or send an IPI. > > > > > > > > This essentially how modules work now. The memory is allocated RW, = written > > > > and updated with alternatives and then made ROX in the end with set= _memory > > > > APIs. > > > > > > > > The issue with not having the memory mapped X when it's written is = that we > > > > cannot use large pages to map it. One of the goals is to have execu= table > > > > memory mapped with large pages and make code allocator able to divi= de that > > > > page among several callers. > > > > > > > > So the idea was that jit_text_alloc() will have a cache of large pa= ges > > > > mapped ROX, will allocate memory from those caches and there will b= e > > > > jit_update() that uses text poking for writing to that memory. > > > > > > > > Upon allocation of a large page to increase the cache, that large p= age will > > > > be "invalidated" by filling it with breakpoint instructions (e.g in= t3 on > > > > x86) > > > > > > Is this actually valid? In between int3 and real code, there=E2=80= =99s a > > > potential torn read of real code mixed up with 0xcc. > > > > You mean while doing text poking? > > I think we've been getting distracted by text_poke(). text_poke() does > updates via a different virtual address which introduce new > synchroniation wrinkles, but it's not the main issue. > > As _think_ I understand it, the root of the issue is that speculative > execution - and that per Andy, speculative execution doesn't obey memory > barriers. > > I have _not_ dug into the details of how retpolines work and all the > spectre stuff that was going on, but - retpoline uses lfence, doesn't > it? And if speculative execution is the issue here, isn't retpoline what > we need? > > For this particular issue, I'm not sure "invalidate by filling with > illegal instructions" makes sense. For that to work, would the processor > have to execute a serialize operation and a retry on hitting an illegal > instruction - or perhaps we do in the interrupt handler? > > But if filling with illegal instructions does act as a speculation > barrier, then the issue is that a torn read could generate a legal but > incorrect instruction. What is a "torn read" here? I assume it is an instruction read that goes at the wrong instruction boundary (CISC). If this is correct, do we need to handle torn read caused by software bug, or hardware bit flip, or both? Thanks, Song