Received: by 2002:a05:6902:102b:0:0:0:0 with SMTP id x11csp2663390ybt; Tue, 16 Jun 2020 11:40:10 -0700 (PDT) X-Google-Smtp-Source: ABdhPJymkTOGJrxDHYkLxo21+SbjmuR3vMq8mM3mQoLRNtpSiVfOxg8uuiommrfZ5pckCeGmjp3G X-Received: by 2002:a17:906:264a:: with SMTP id i10mr4049877ejc.210.1592332810100; Tue, 16 Jun 2020 11:40:10 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1592332810; cv=none; d=google.com; s=arc-20160816; b=b84ImO8X3jogTnLLA/ezE9voxilYmeq+tMDgf4rvb0RWgLQ/8pf8sBORKtPyOUS2GJ 1xWMb7ps/KWcJvPFxR+ISzc0xLgr42rJm4Z8pHdTNauVEoOYxJdd6hR1jfQOUY3rhp5r qyzNukzEyUUIjO6VdxaadzY++YYDufIx4jmA0V9JcldPbk6sYOMnzRJITI6/h1jJrq/8 ieBpIjfaLHJgIibfustCsyZfdgIZyh2dQ018fnWZGkv3oh0EWmWGok4JXoV7sOZ7+pat +7oGyFqqsefVZ0m4sKNE13nMiGn6pJNyRrECTaHN/OwyjFTYfucDmHpCbnMii0t5pDGs fQNQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=v2RjpTiuHZ3l0Pxqe/xAX5M0G+FBIdLaE2yGHwIgHUA=; b=rPwT4e1vOp3oz9jgBW7Zo8gcC4xA91Ht6dnJQ4fLdcQea9yGTp1NiUX05z12SXKhZP rhfit2vUHjejlH89x1jl+Boj5B34/fbAF9gR+pijnm5F4GMBCGW5X5Rm3drFb5DMUnEW xHPaIUg3mty9fDrk43V6ylixAP1WSJAFXrjhQ4Ct8Bdxdn0dOd3m00j+vowoP1xDMV5M DuDUY3ViMM6fLAkN5ShwrCiLHThQZwsIoLmD0rl6LygvljAZhYMOQsK59r/WG0pX6EQ8 gAqIb0JpDMh2i+Xl89q8m6zdHtlztk3MrYGKLC3G3/Hot3CogJLD/gEH4q81TIjhbppW LwRA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=Q4o9HNHE; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id d16si11419884ejp.131.2020.06.16.11.39.46; Tue, 16 Jun 2020 11:40:10 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=Q4o9HNHE; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730024AbgFPSg7 (ORCPT + 99 others); Tue, 16 Jun 2020 14:36:59 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44532 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729801AbgFPSg6 (ORCPT ); Tue, 16 Jun 2020 14:36:58 -0400 Received: from mail-lj1-x244.google.com (mail-lj1-x244.google.com [IPv6:2a00:1450:4864:20::244]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4B13CC061573 for ; Tue, 16 Jun 2020 11:36:57 -0700 (PDT) Received: by mail-lj1-x244.google.com with SMTP id x18so24760227lji.1 for ; Tue, 16 Jun 2020 11:36:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=v2RjpTiuHZ3l0Pxqe/xAX5M0G+FBIdLaE2yGHwIgHUA=; b=Q4o9HNHESg+RdwLs/JmMWK1P2LIdIm2ymfk6CbxiOVhznUpF+7/tWHhKrgkyUJd/Ja feN6mXD/nxi3vEtBOzTRmsAwv/kcl4VI44tCrtdNzkbMwGmn4ydIolFBndVx+iOssnV0 K9NcnQccZ7ZZFfE653QizoY/d/iBAYktzJ5Xrgb/PeLd8+laxtA0uTFzTL0FVrj0tO4g IDTndEZaprVDykmvISs2GPkTzFj9RFCAywYDl6NYBoLKD8xL2o0cd64hcsnbnA6mqtmu 6UlHSDXPw8Ph+YZGHkjNv51lIBS6tR4U2tCfbF3LIFWJgRZyrFwj6DdJ6JyZy9xdyNiO zC3w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=v2RjpTiuHZ3l0Pxqe/xAX5M0G+FBIdLaE2yGHwIgHUA=; b=t5KBuJAZ8F3dhrMjr6DT4aNJSrDyGOcqMpP7/7B6D+4x/5XtCF218HgH3WZDblyjBJ RpYj5fc2QZnQLZ+NuCoACsL1c+nb1uHWacWUZOijnBOwTSGR880Ci1cAondpNhHEqBUJ iE/MKK44hkv4BvMj4JK/qdJq8TYelF1TJRoRQzQBlJKIxItbEJtzEAKZ1ZY7cgawjU0A s1FCSj8cbHMBPVy3zhVF/JQu1mJX+eeDs5bZAPwecqBzaPM6GcbQ2kxGd+NizZp8azTG ZGY4MnAQU4anPTk3TL6pM8uE1Qai2lFWnDT8028QzjCbmq2vRTvenJ1YfUVpvh9NnUxR /z0A== X-Gm-Message-State: AOAM530aXwMP66bJORmVseZG0J4fqECCWcNGe4kfiNZFo5y/uxyQdwSH EoQFzDDj900Bo1lqQP0bpoBDzGVGc3UAgwbiwuUgfg== X-Received: by 2002:a2e:970c:: with SMTP id r12mr2044805lji.145.1592332615437; Tue, 16 Jun 2020 11:36:55 -0700 (PDT) MIME-Version: 1.0 References: <20200616074934.1600036-1-keescook@chromium.org> <20200616074934.1600036-5-keescook@chromium.org> <202006160757.99FD9B785@keescook> In-Reply-To: <202006160757.99FD9B785@keescook> From: Jann Horn Date: Tue, 16 Jun 2020 20:36:28 +0200 Message-ID: Subject: Re: [PATCH 4/8] seccomp: Implement constant action bitmaps To: Kees Cook Cc: kernel list , Christian Brauner , Sargun Dhillon , Tycho Andersen , "zhujianwei (C)" , Dave Hansen , Matthew Wilcox , Andy Lutomirski , Will Drewry , Shuah Khan , Matt Denton , Chris Palmer , Jeffrey Vander Stoep , Aleksa Sarai , Hehuazhen , "the arch/x86 maintainers" , Linux Containers , linux-security-module , Linux API Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jun 16, 2020 at 5:49 PM Kees Cook wrote: > On Tue, Jun 16, 2020 at 02:14:47PM +0200, Jann Horn wrote: > > Wouldn't it be simpler to use a function that can run a subset of > > seccomp cBPF and bails out on anything that indicates that a syscall's > > handling is complex or on instructions it doesn't understand? For > > syscalls that have a fixed policy, a typical seccomp filter doesn't > > even use any of the BPF_ALU ops, the scratch space, or the X register; > > it just uses something like the following set of operations, which is > > easy to emulate without much code: > > > > BPF_LD | BPF_W | BPF_ABS > > BPF_JMP | BPF_JEQ | BPF_K > > BPF_JMP | BPF_JGE | BPF_K > > BPF_JMP | BPF_JGT | BPF_K > > BPF_JMP | BPF_JA > > BPF_RET | BPF_K > > Initially, I started down this path. It needed a bit of plumbing into > BPF to better control the lifetime of the cBPF "saved original filter" > (normally used by CHECKPOINT_RESTORE uses) I don't think you need that? When a filter is added, you can compute the results of the added individual filter, and then merge the state. > and then I needed to keep > making exceptions (same list you have: ALU, X register, scratch, etc) > in the name of avoiding too much complexity in the emulator. I decided > I'd rather reuse the existing infrastructure to actually execute the > filter (no cBPF copy needed to be saved, no separate code, and full > instruction coverage). If you really think that this bit of emulation is so bad, you could also make a copy of the BPF filter in which you replace all load instructions from syscall arguments with "return NON_CONSTANT_RESULT", and then run that through the normal BPF infrastructure. > > Something like (completely untested): [...] > I didn't actually finish going down the emulator path (I stopped right > around the time I verified that libseccomp does use BPF_ALU -- though > only BPF_AND), so I didn't actually evaluate the filter contents for other > filter builders (i.e. Chrome). > > But, if BPF_ALU | BPF_AND were added to your code above, it would cover > everything libseccomp generates (which covers a lot of the seccomp > filters, e.g. systemd, docker). I just felt funny about an "incomplete" > emulator. > > Though now you've got me looking. It seems this is the core > of Chrome's BPF instruction generation: > https://github.com/chromium/chromium/blob/master/sandbox/linux/bpf_dsl/policy_compiler.cc > It also uses ALU|AND, but adds JMP|JSET. > > So... that's only 2 more instructions to cover what I think are likely > the two largest seccomp instruction generators. > > > That way, you won't need any of this complicated architecture-specific stuff. > > There are two arch-specific needs, and using a cBPF-subset emulator > just gets rid of the local TLB flush. The other part is distinguishing > the archs. Neither requirement is onerous (TLB flush usually just > needs little more than an extern, arch is already documented in the > per-arch syscall_get_arch()). But it's also somewhat layer-breaking and reliant on very specific assumptions. Normal kernel code doesn't mess around with page table magic, outside of very specific low-level things. And your method would break if the fixed-value members were not all packed together at the start of the structure. And from a hardening perspective: The more code we add that fiddles around with PTEs directly, rather than going through higher-level abstractions, the higher the chance that something gets horribly screwed up. For example, this bit from your patch looks *really* suspect: + preempt_disable(); + set_pte_at(&init_mm, vaddr, ptep, pte_mkold(*(READ_ONCE(ptep)))); + local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE); + preempt_enable(); First off, that set_pte_at() is just a memory write; I don't see why you put it inside a preempt_disable() region. But more importantly, sticking a local TLB flush inside a preempt_disable() region with nothing else in there looks really shady. How is that supposed to work? If we migrate from CPU0 to CPU1 directly before this region, and then from CPU1 back to CPU0 directly afterwards, the local TLB flush will have no effect.