Received: by 2002:a25:8b91:0:0:0:0:0 with SMTP id j17csp503720ybl; Wed, 8 Jan 2020 00:43:08 -0800 (PST) X-Google-Smtp-Source: APXvYqwo5ifYBWEs83kEDM78DwWQAKLf1TF7Td4VZM6yqLOVHENH3r5LjwWtK2YSkWL1IxadP3j8 X-Received: by 2002:a9d:6045:: with SMTP id v5mr3183674otj.252.1578472988671; Wed, 08 Jan 2020 00:43:08 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1578472988; cv=none; d=google.com; s=arc-20160816; b=iX+UK1om3nznhvAWqj0YqgELSv33tKCiqwplykxaXR5Uj66DNNTfm+EM8+Yo/g+MXe 9NUaMCnQE1rcq/tNkIvTj0Lsl9uwqPSJMStZVLa7PhXJPv3VkOTy2+tu6VtO3L9vlw5g HA8hAlFnpnP/CmsUwfFwK6QT50N3VHuuJnMd42IQrUjt9SvKYhEQjyBN0mAaxJ8d9UIZ 8zxvWMxot5HClF/YJG55Vhk+jLywExLokhreCnAnc6xTNjd7BqcdLNbxI2g7UBD9HtBX 6YNGLFjlDgtIC1QORP/RxVJAq5jnU3MPtti5F1td+ADBI1t+39v829qTCcevfaZCQrD1 TrnQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=8isf6T9ZOG2zK/9yLqwLegw87znGma6zfPkY0jFhA9c=; b=SLO+xEJgkqpM2YgPoBYzrYZkIXYIPrjidRSUP7gf87BOViBCdmYi3L0hmrKS+8EY8m SEgSW1bTer4UKFkxazlL6oQaX7Rabv7+CTEP/Jegw6v2wI1EPCGurKNSOT5yQYVXUmE3 S3pCspVjHCA6hssOj4IRamxVcFC56m5cFVwxCvQHcATQVPdDZu9uefCnh0rsTXd42/HH XHUdCXEtcER+pl9PvElEw/e1WO8OlAtbv9zoLuM4jTXE97hlPdwOqZw/innOhE1cMvav fCM21vXcidctmEcjRSvplZZT9ACBSYj/NSW15Yv55gqmQfAx/dvcdooKCTpaF7j03cwW eMIQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=M9aBQBAF; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id u8si1431350otq.262.2020.01.08.00.42.56; Wed, 08 Jan 2020 00:43:08 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=M9aBQBAF; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727381AbgAHIl5 (ORCPT + 99 others); Wed, 8 Jan 2020 03:41:57 -0500 Received: from mail.kernel.org ([198.145.29.99]:43154 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726368AbgAHIl4 (ORCPT ); Wed, 8 Jan 2020 03:41:56 -0500 Received: from mail-wm1-f51.google.com (mail-wm1-f51.google.com [209.85.128.51]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 656742075D for ; Wed, 8 Jan 2020 08:41:55 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1578472915; bh=PXWKCUXCMzVMdtKOBwvy5/oEvTcWRUdl1Vb3OkpTR5Q=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=M9aBQBAFTIEBSLVIQXPNQIx5xRl6tnG/w3q7ncmKz18laMS58AwmCyudHt7Qg8sHU wAJInX4TicDjT7rDP7P/3lyWC/WIIRxyDuEly0J4sPtTb/jQ8fHh4KuLct3rHblfVC 00D/ugrWbEdzQ/AqiuuqnUiFnZgoHR2yuW8MSmUg= Received: by mail-wm1-f51.google.com with SMTP id p9so1529386wmc.2 for ; Wed, 08 Jan 2020 00:41:55 -0800 (PST) X-Gm-Message-State: APjAAAXbT4vL4L3IBf8hgleyKwwK74AK2RrzeXToA1Sp4CRsANx5B9C6 FpmnHLS7CjwB5KM61TmoaD3xbiETM+u/tTZbakU+nQ== X-Received: by 2002:a1c:3dc3:: with SMTP id k186mr2335219wma.95.1578472913806; Wed, 08 Jan 2020 00:41:53 -0800 (PST) MIME-Version: 1.0 References: <21bf6bb46544eab79e792980f82520f8fbdae9b5.camel@intel.com> In-Reply-To: From: Andy Lutomirski Date: Wed, 8 Jan 2020 00:41:42 -0800 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH bpf-next] bpf: Make trampolines W^X To: "Edgecombe, Rick P" , Dave Hansen , Nadav Amit , Peter Zijlstra Cc: "songliubraving@fb.com" , "linux-kernel@vger.kernel.org" , "bpf@vger.kernel.org" , "keescook@chromium.org" , "jeyu@kernel.org" , "ast@kernel.org" , "kuznet@ms2.inr.ac.ru" , "daniel@iogearbox.net" , "mjg59@google.com" , "thgarnie@chromium.org" , "kpsingh@chromium.org" , "linux-security-module@vger.kernel.org" , "x86@kernel.org" , "revest@chromium.org" , "jannh@google.com" , "namit@vmware.com" , "jackmanb@chromium.org" , "kafai@fb.com" , "yhs@fb.com" , "davem@davemloft.net" , "yoshfuji@linux-ipv6.org" , "mhalcrow@google.com" , "andriin@fb.com" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > On Jan 7, 2020, at 9:01 AM, Edgecombe, Rick P wrote: > > =EF=BB=BFCC Nadav and Jessica. > > On Mon, 2020-01-06 at 15:36 -1000, Andy Lutomirski wrote: >>> On Jan 6, 2020, at 12:25 PM, Edgecombe, Rick P >>> wrote: >>> >>> =EF=BB=BFOn Sat, 2020-01-04 at 09:49 +0900, Andy Lutomirski wrote: >>>>>>> On Jan 4, 2020, at 8:47 AM, KP Singh wrote: >>>>>> >>>>>> =EF=BB=BFFrom: KP Singh >>>>>> >>>>>> The image for the BPF trampolines is allocated with >>>>>> bpf_jit_alloc_exe_page which marks this allocated page executable. T= his >>>>>> means that the allocated memory is W and X at the same time making i= t >>>>>> susceptible to WX based attacks. >>>>>> >>>>>> Since the allocated memory is shared between two trampolines (the >>>>>> current and the next), 2 pages must be allocated to adhere to W^X an= d >>>>>> the following sequence is obeyed where trampolines are modified: >>>>> >>>>> Can we please do better rather than piling garbage on top of garbage? >>>>> >>>>>> >>>>>> - Mark memory as non executable (set_memory_nx). While module_alloc = for >>>>>> x86 allocates the memory as PAGE_KERNEL and not PAGE_KERNEL_EXEC, no= t >>>>>> all implementations of module_alloc do so >>>>> >>>>> How about fixing this instead? >>>>> >>>>>> - Mark the memory as read/write (set_memory_rw) >>>>> >>>>> Probably harmless, but see above about fixing it. >>>>> >>>>>> - Modify the trampoline >>>>> >>>>> Seems reasonable. It=E2=80=99s worth noting that this whole approach = is >>>>> suboptimal: >>>>> the =E2=80=9Cmodule=E2=80=9D allocator should really be returning a l= ist of pages to be >>>>> written (not at the final address!) with the actual executable mappin= g to >>>>> be >>>>> materialized later, but that=E2=80=99s a bigger project that you=E2= =80=99re welcome to >>>>> ignore >>>>> for now. (Concretely, it should produce a vmap address with backing = pages >>>>> but >>>>> with the vmap alias either entirely unmapped or read-only. A subseque= nt >>>>> healer >>>>> would, all at once, make the direct map pages RO or not-present and m= ake >>>>> the >>>>> vmap alias RX.) >>>>>> - Mark the memory as read-only (set_memory_ro) >>>>>> - Mark the memory as executable (set_memory_x) >>>>> >>>>> No, thanks. There=E2=80=99s very little excuse for doing two IPI flus= hes when one >>>>> would suffice. >>>>> >>>>> As far as I know, all architectures can do this with a single flush >>>>> without >>>>> races x86 certainly can. The module freeing code gets this sequence >>>>> right. >>>>> Please reuse its mechanism or, if needed, export the relevant interfa= ces. >>> >>> So if I understand this right, some trampolines have been added that ar= e >>> currently set as RWX at modification time AND left that way during runt= ime? >>> The >>> discussion on the order of set_memory_() calls in the commit message ma= de me >>> think that this was just a modification time thing at first. >> >> I=E2=80=99m not sure what the status quo is. >> >> We really ought to have a genuinely good API for allocation and initiali= zation >> of text. We can do so much better than set_memory_blahblah. >> >> FWIW, I have some ideas about making kernel flushes cheaper. It=E2=80=99= s currently >> blocked on finding some time and on tglx=E2=80=99s irqtrace work. >> > > Makes sense to me. I guess there are 6 types of text allocations now: > - These two BPF trampolines > - BPF JITs > - Modules > - Kprobes > - Ftrace > > All doing (or should be doing) pretty much the same thing. I believe Jess= ica had > said at one point that she didn't like all the other features using > module_alloc() as it was supposed to be just for real modules. Where woul= d the > API live? New header? This shouldn=E2=80=99t matter that much. Here are two strawman proposals. All of this is very rough -- the actual data structures and signatures are likely problematic for multiple reasons. --- First proposal --- struct text_allocation { void *final_addr; struct page *pages; int npages; }; int text_alloc(struct text_allocation *out, size_t size); /* now final_addr is not accessible and pages is writable. */ int text_freeze(struct text_allocation *alloc); /* now pages are not accessible and final_addr is RO. Alternatively, pages are RO and final_addr is unmapped. */ int text_finish(struct text_allocation *alloc); /* now final_addr is RX. All done. */ This gets it with just one flush and gives a chance to double-check in case of race attacks from other CPUs. Double-checking is annoying, though. --- Second proposal --- struct text_allocation { void *final_addr; /* lots of opaque stuff including an mm_struct */ /* optional: list of struct page, but this isn't obviously useful */ }; int text_alloc(struct text_allocation *out, size_t size); /* Memory is allocated. There is no way to access it at all right now. The memory is RO or not present in the direct map. */ void __user *text_activate_mapping(struct text_allocation *out); /* Now the text is RW at *user* address given by return value. Preemption is off if required by use_temporary_mm(). Real user memory cannot be accessed. */ void text_deactivate_mapping(struct text_allocation *alloc); /* Now the memory is inaccessible again. */ void text_finalize(struct text_allocation *alloc); /* Now it's RX or XO at the final address. */ Pros of second approach: - Inherently immune to cross-CPU attack. No double-check. - If we ever implement a cache of non-direct-mapped, unaliased pages, then it works with no flushes at all. We could even relax it a bit to allow non-direct-mapped pages that may have RX / XO aliases but no W aliases. - Can easily access without worrying about page boundaries. Cons: - The use of a temporary mm is annoying -- you can't copy from user memory, for example.