References:
 <20200728131050.24443-1-madvenka@linux.microsoft.com>
In-Reply-To: <20200728131050.24443-1-madvenka@linux.microsoft.com>
From: Andy Lutomirski
Date: Tue, 28 Jul 2020 10:31:59 -0700
Subject: Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
To: madvenka@linux.microsoft.com
Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
 linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML
X-Mailing-List: linux-kernel@vger.kernel.org

> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>
> From: "Madhavan T. Venkataraman"
>
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.

This is quite clever, but now I'm wondering just how much kernel help
is really needed. In your series, the trampoline is a non-executable
page. I can think of at least two alternative approaches, and I'd
like to know the pros and cons.

1. Entirely userspace: a return trampoline would be something like:

1:
        pushq %rax
        pushq %rbx
        pushq %rcx
        ...
        pushq %r15
        movq %rsp, %rdi          # pointer to saved regs
        leaq 1b(%rip), %rsi      # pointer to the trampoline itself
        callq trampoline_handler # see below

You would fill a page with a bunch of these, possibly compacted to get
more per page, and then you would remap as many copies as needed.
The 'callq trampoline_handler' part would need to be a bit clever to
make it continue to work despite this remapping. This will be *much*
faster than trampfd. How much of your use case would it cover? For
the inverse, it's not too hard to write a bit of asm to set all
registers and jump somewhere.

2. Use existing kernel functionality. Raise a signal, modify the
state, and return from the signal. This is very flexible and may not
be all that much slower than trampfd.

3. Use a syscall. Instead of having the kernel handle page faults,
have the trampoline code push the syscall nr register, load a special
new syscall nr into the syscall nr register, and do a syscall. On
x86_64, this would be:

        pushq %rax
        movq $__NR_magic_trampoline, %rax
        syscall

with some adjustment if the stack slot you're clobbering is important.

Also, will using trampfd cause issues with various unwinders? I can
easily imagine unwinders expecting code to be readable, although this
is slowly going away for other reasons.

All this being said, I think that the kernel should absolutely add a
sensible interface for JITs to use to materialize their code. This
would integrate sanely with LSMs and wouldn't require hacks like using
files, etc. A cleverly designed JIT interface could function without
serialization IPIs, and even lame architectures like x86 could
potentially avoid shootdown IPIs if the interface copied code instead
of playing virtual memory games. At its very simplest, this could be:

        void *jit_create_code(const void *source, size_t len);

and the result would be a new anonymous mapping that contains exactly
the code requested. There could also be:

        int jittfd_create(...);

that does something similar but creates a memfd. A nicer
implementation for short JIT sequences would allow appending more code
to an existing JIT region.
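
For comparison, here is a minimal userspace sketch of the semantics
that jit_create_code() would provide, assuming Linux with mmap() and
mprotect(). The name jit_create_code_emul is hypothetical, and this is
not the proposed kernel interface. It copies the code into a fresh
anonymous mapping and then drops write permission, so it still plays
exactly the "virtual memory games" that a real in-kernel interface
could avoid.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical userspace stand-in for the proposed jit_create_code().
 * Returns a new read+exec anonymous mapping that starts with a copy of
 * source[0..len), or NULL on failure.
 */
static void *jit_create_code_emul(const void *source, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || len == 0)
        return NULL;

    /* Round the length up to a whole number of pages. */
    size_t psz = (size_t)page;
    size_t map_len = (len + psz - 1) / psz * psz;

    /* Start out writable but not executable. */
    unsigned char *p = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* Pad with 0xCC (int3 on x86), then copy the requested code in. */
    memset(p, 0xCC, map_len);
    memcpy(p, source, len);

    /* Flip to read+exec; the mapping is never writable and executable
     * at the same time. */
    if (mprotect(p, map_len, PROT_READ | PROT_EXEC) != 0) {
        munmap(p, map_len);
        return NULL;
    }
    return p;
}
```

A caller would pass a buffer of machine code and cast the result to a
function pointer. Some LSM policies may refuse the mprotect() step,
which is part of why a dedicated interface looks attractive.
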
On x86, an appendable JIT region would start filled with 0xCC, and I
bet there's a way to materialize new code into a previously
0xcc-filled virtual page without any synchronization. One approach
would be to start with:

        <some code>
        0xcc
        0xcc
        ...
        0xcc

and to create a whole new page like:

        <some code>
        <some more code>
        0xcc
        ...
        0xcc

so that the only difference is that some code changed to some more
code. Then replace the PTE to swap from the old page to the new page,
and arrange to avoid freeing the old page until we're sure it's gone
from all TLBs. This may not work if <some more code> spans a page
boundary. The #BP fixup would zap the TLB and retry. Even just
directly copying code over some 0xcc bytes almost works, but there's a
nasty corner case involving instructions that span I$ fetch
boundaries. I'm not sure to what extent I$ snooping helps.

--Andy