From: Jann Horn
Date: Tue, 16 Jun 2020 14:14:47 +0200
Subject: Re: [PATCH 4/8] seccomp: Implement constant action bitmaps
To: Kees Cook
Cc: kernel list, Christian Brauner, Sargun Dhillon, Tycho Andersen,
    "zhujianwei (C)", Dave Hansen, Matthew Wilcox, Andy Lutomirski,
    Will Drewry, Shuah Khan, Matt Denton, Chris Palmer,
    Jeffrey Vander Stoep, Aleksa Sarai, Hehuazhen,
    "the arch/x86 maintainers", Linux Containers,
    linux-security-module, Linux API
References: <20200616074934.1600036-1-keescook@chromium.org>
    <20200616074934.1600036-5-keescook@chromium.org>
In-Reply-To: <20200616074934.1600036-5-keescook@chromium.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jun 16, 2020 at 9:49 AM Kees Cook wrote:
> One of the most common pain points with seccomp filters has been dealing
> with the overhead of processing the filters, especially for "always allow"
> or "always reject" cases. While BPF is extremely fast[1], it will always
> have overhead associated with it. Additionally, due to seccomp's design,
> filters are layered, which means processing time goes up as the number
> of filters attached goes up.
>
> In the past, efforts have been focused on making filter execution complete
> in a shorter amount of time. For example, filters were rewritten from
> using linear if/then/else syscall search to using balanced binary trees,
> or moving tests for syscalls common to the process's workload to the
> front of the filter. However, there are limits to this, especially when
> some processes are dealing with tens of filters[2], or when some
> architectures have a less efficient BPF engine[3].
>
> The most common use of seccomp, constructing syscall block/allow-lists,
> where syscalls that are always allowed or always rejected (without regard
> to any arguments), also tends to produce the most pathological runtime
> problems, in that a large number of syscall checks in the filter need
> to be performed to come to a determination.
>
> In order to optimize these cases from O(n) to O(1), seccomp can
> use bitmaps to immediately determine the desired action. A critical
> observation in the prior paragraph bears repeating: the common case for
> syscall tests do not check arguments. For any given filter, there is a
> constant mapping from the combination of architecture and syscall to the
> seccomp action result. (For kernels/architectures without CONFIG_COMPAT,
> there is a single architecture.). As such, it is possible to construct
> a mapping of arch/syscall to action, which can be updated as new filters
> are attached to a process.
>
> In order to build this mapping at filter attach time, each filter is
> executed for every syscall (under each possible architecture), and
> checked for any accesses of struct seccomp_data that are not the "arch"
> nor "nr" (syscall) members. If only "arch" and "nr" are examined, then
> there is a constant mapping for that syscall, and bitmaps can be updated
> accordingly.
> If any accesses happen outside of those struct members,
> seccomp must not bypass filter execution for that syscall, since program
> state will be used to determine filter action result.
>
> During syscall action probing, in order to determine whether other members
> of struct seccomp_data are being accessed during a filter execution,
> the struct is placed across a page boundary with the "arch" and "nr"
> members in the first page, and everything else in the second page. The
> "page accessed" flag is cleared in the second page's PTE, and the filter
> is run. If the "page accessed" flag appears as set after running the
> filter, we can determine that the filter looked beyond the "arch" and
> "nr" members, and exclude that syscall from the constant action bitmaps.
>
> For architectures to support this optimization, they must declare
> their architectures for seccomp to see (via SECCOMP_ARCH and
> SECCOMP_ARCH_COMPAT macros), and provide a way to perform efficient
> CPU-local kernel TLB flushes (via local_flush_tlb_kernel_range()),
> and then set HAVE_ARCH_SECCOMP_BITMAP in their Kconfig.

Wouldn't it be simpler to use a function that can run a subset of
seccomp cBPF and bails out on anything that indicates that a syscall's
handling is complex or on instructions it doesn't understand? For
syscalls that have a fixed policy, a typical seccomp filter doesn't
even use any of the BPF_ALU ops, the scratch space, or the X register;
it just uses something like the following set of operations, which is
easy to emulate without much code:

BPF_LD | BPF_W | BPF_ABS
BPF_JMP | BPF_JEQ | BPF_K
BPF_JMP | BPF_JGE | BPF_K
BPF_JMP | BPF_JGT | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K

Something like (completely untested):

/*
 * Try to statically determine whether @filter will always return a fixed result
 * when run for syscall @nr under architecture @arch.
 * Returns true if the result could be determined; if so, the result will be
 * stored in @action.
 */
static bool seccomp_check_syscall(struct sock_filter *filter, unsigned int arch,
                                  unsigned int nr, unsigned int *action)
{
    int pc;
    unsigned int reg_value = 0;

    for (pc = 0; 1; pc++) {
        struct sock_filter *insn = &filter[pc];
        u16 code = insn->code;
        u32 k = insn->k;

        switch (code) {
        case BPF_LD | BPF_W | BPF_ABS:
            if (k == offsetof(struct seccomp_data, nr)) {
                reg_value = nr;
            } else if (k == offsetof(struct seccomp_data, arch)) {
                reg_value = arch;
            } else {
                return false; /* can't optimize (non-constant value load) */
            }
            break;
        case BPF_RET | BPF_K:
            *action = insn->k;
            return true; /* success: reached return with constant values only */
        case BPF_JMP | BPF_JA:
            pc += insn->k;
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
        case BPF_JMP | BPF_JGE | BPF_K:
        case BPF_JMP | BPF_JGT | BPF_K:
        default:
            if (BPF_CLASS(code) == BPF_JMP && BPF_SRC(code) == BPF_K) {
                u16 op = BPF_OP(code);
                bool op_res;

                switch (op) {
                case BPF_JEQ:
                    op_res = reg_value == k;
                    break;
                case BPF_JGE:
                    op_res = reg_value >= k;
                    break;
                case BPF_JGT:
                    op_res = reg_value > k;
                    break;
                default:
                    return false; /* can't optimize (unknown insn) */
                }

                pc += op_res ? insn->jt : insn->jf;
                break;
            }
            return false; /* can't optimize (unknown insn) */
        }
    }
}

That way, you won't need any of this complicated architecture-specific stuff.
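
To illustrate how an emulator of this kind could feed the constant-action
bitmap at filter attach time, here is a rough userspace sketch. It is only a
sketch under assumptions, not the patch's implementation: the classic BPF
opcode values are copied locally so it builds without kernel headers, and
struct sketch_data, struct insn, NR_SYSCALLS_SKETCH, ARCH_SKETCH, RET_ALLOW,
RET_KILL, check_syscall(), build_allow_bitmap() and sample_filter are all
hypothetical names made up for the example. Unlike the in-kernel version, it
bounds-checks pc, because nothing has validated the filter beforehand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Classic BPF opcode fields, values as in the UAPI <linux/bpf_common.h>. */
#define BPF_CLASS(c) ((c) & 0x07)
#define BPF_OP(c)    ((c) & 0xf0)
#define BPF_SRC(c)   ((c) & 0x08)
#define BPF_LD  0x00
#define BPF_JMP 0x05
#define BPF_RET 0x06
#define BPF_W   0x00
#define BPF_ABS 0x20
#define BPF_K   0x00
#define BPF_JA  0x00
#define BPF_JEQ 0x10
#define BPF_JGT 0x20
#define BPF_JGE 0x30

#define RET_ALLOW   0x7fff0000U /* value of SECCOMP_RET_ALLOW */
#define RET_KILL    0x00000000U /* value of SECCOMP_RET_KILL_THREAD */
#define ARCH_SKETCH 0xc000003eU /* value of AUDIT_ARCH_X86_64 */
#define NR_SYSCALLS_SKETCH 64   /* hypothetical: only probe syscalls 0..63 */

/* Hypothetical local mirrors of struct seccomp_data and struct sock_filter. */
struct sketch_data { int32_t nr; uint32_t arch; uint64_t ip; uint64_t args[6]; };
struct insn { uint16_t code; uint8_t jt, jf; uint32_t k; };

/* Returns true and stores the result in *action if the filter's result for
 * (arch, nr) depends only on those two values. */
static bool check_syscall(const struct insn *f, size_t len, uint32_t arch,
                          uint32_t nr, uint32_t *action)
{
    uint32_t a = 0; /* emulated accumulator */

    assert(f != NULL && action != NULL);
    for (size_t pc = 0; pc < len; pc++) {
        uint16_t code = f[pc].code;
        uint32_t k = f[pc].k;

        if (code == (BPF_LD | BPF_W | BPF_ABS)) {
            if (k == offsetof(struct sketch_data, nr))
                a = nr;
            else if (k == offsetof(struct sketch_data, arch))
                a = arch;
            else
                return false; /* loads non-constant data */
        } else if (code == (BPF_RET | BPF_K)) {
            *action = k;
            return true;
        } else if (code == (BPF_JMP | BPF_JA)) {
            pc += k;
        } else if (BPF_CLASS(code) == BPF_JMP && BPF_SRC(code) == BPF_K) {
            bool res;

            switch (BPF_OP(code)) {
            case BPF_JEQ: res = a == k; break;
            case BPF_JGE: res = a >= k; break;
            case BPF_JGT: res = a > k; break;
            default: return false; /* unknown jump */
            }
            pc += res ? f[pc].jt : f[pc].jf;
        } else {
            return false; /* unknown insn */
        }
    }
    return false; /* ran off the end of the filter */
}

/* Sets bit nr for every probed syscall that is unconditionally allowed. */
static uint64_t build_allow_bitmap(const struct insn *f, size_t len,
                                   uint32_t arch)
{
    uint64_t bitmap = 0;

    for (uint32_t nr = 0; nr < NR_SYSCALLS_SKETCH; nr++) {
        uint32_t action;

        if (check_syscall(f, len, arch, nr, &action) && action == RET_ALLOW)
            bitmap |= (uint64_t)1 << nr;
    }
    return bitmap;
}

/* Made-up filter: allow syscall 0, allow syscall 1 only if args[0] == 1,
 * kill everything else (including foreign architectures). */
static const struct insn sample_filter[] = {
    /* 0 */ { BPF_LD | BPF_W | BPF_ABS, 0, 0, offsetof(struct sketch_data, arch) },
    /* 1 */ { BPF_JMP | BPF_JEQ | BPF_K, 1, 0, ARCH_SKETCH },
    /* 2 */ { BPF_RET | BPF_K, 0, 0, RET_KILL },
    /* 3 */ { BPF_LD | BPF_W | BPF_ABS, 0, 0, offsetof(struct sketch_data, nr) },
    /* 4 */ { BPF_JMP | BPF_JEQ | BPF_K, 3, 0, 0 },
    /* 5 */ { BPF_JMP | BPF_JEQ | BPF_K, 0, 3, 1 },
    /* 6 */ { BPF_LD | BPF_W | BPF_ABS, 0, 0, offsetof(struct sketch_data, args) },
    /* 7 */ { BPF_JMP | BPF_JEQ | BPF_K, 0, 1, 1 },
    /* 8 */ { BPF_RET | BPF_K, 0, 0, RET_ALLOW },
    /* 9 */ { BPF_RET | BPF_K, 0, 0, RET_KILL },
};
```

With this sample filter, build_allow_bitmap(sample_filter, 10, ARCH_SKETCH)
sets only bit 0: syscall 0 always returns allow, syscall 1 depends on args[0]
and so stays out of the bitmap, and every other syscall resolves to a constant
kill. A real implementation would presumably keep one bitmap per action and per
architecture, and would have to combine results across layered filters.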