From: Gabriel Krisman Bertazi
To: Andy Lutomirski
Cc: Paul Gofman, Linux-MM, LKML, kernel@collabora.com, Thomas Gleixner, Kees Cook, Will Drewry, "H . Peter Anvin", Zebediah Figura
Subject: Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
Organization: Collabora
References: <85367hkl06.fsf@collabora.com> <079539BF-F301-47BA-AEAD-AED23275FEA1@amacapital.net> <50a9e680-6be1-ff50-5c82-1bf54c7484a9@gmail.com>
Date: Mon, 01 Jun 2020 14:06:30 -0400
In-Reply-To: (Andy Lutomirski's message of "Sun, 31 May 2020 14:03:48 -0700")
Message-ID: <85y2p664pl.fsf@collabora.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Andy Lutomirski writes:

> On Sun, May 31, 2020 at 11:57 AM Andy Lutomirski wrote:
>>
>>
>> What if there was a special filter type that ran a BPF program on each
>> syscall, and the program was allowed to access user memory to make its
>> decisions, e.g. to look at some list of memory addresses. But this
>> would explicitly *not* be a security feature -- execve() would remove
>> the filter, and the filter's outcome would be one of redirecting
>> execution or allowing the syscall. If the "allow" outcome occurs,
>> then regular seccomp filters run. Obviously the exact semantics here
>> would need some care.
>
> Let me try to flesh this out a little.
>
> A task could install a syscall emulation filter (maybe using the
> seccomp() syscall, maybe using something else). There would be at
> most one such filter per process. Upon doing a syscall, the kernel
> will first do initial syscall fixups (e.g. SYSENTER/SYSCALL32 magic
> argument translation) and would then invoke the filter. The filter is
> an eBPF program (sorry Kees) and, as input, it gets access to the
> task's register state and to an indication of which type of syscall
> entry this was. This will inherently be rather architecture specific
> -- x86 choices could be int80, int80(translated), and syscall64. (We
> could expose SYSCALL32 separately, I suppose, but SYSENTER is such a
> mess that I'm not sure this would be productive.) The program can
> access user memory, and it returns one of two results: allow the
> syscall or send SIGSYS. If the program tries to access user memory
> and faults, the result is SIGSYS.
>
> (I would love to do this with cBPF, but I'm not sure how to pull this
> off. Accessing user memory is handy for making the lookup flexible
> enough to detect Windows vs Linux. It would be *really* nice to
> finally settle the unprivileged eBPF subset discussion so that we can
> figure out how to make eBPF work here.)
>
> execve() clears the filter. clone() copies the filter.
>
> Does this seem reasonable? Is the implementation complexity small
> enough? Is the eBPF thing going to be a showstopper?
>
> Using a signal instead of a bespoke thunk simplifies a lot of thorny
> details but is also enough slower that catching all syscalls might be
> a performance problem.

If we can get something close to the numbers you shared, that would be
good enough for us. Using the thunk instead of a signal looks very
interesting for performance.

That said, I'm not convinced this should live outside seccomp just
because it is not a security feature. Kees's suggestion to convert
seccomp to eBPF filters and stack them would provide similar semantics
and reuse the existing infrastructure.

Finally, as you said, I'm afraid that eBPF will be a showstopper
unless unprivileged eBPF becomes a thing. Wine cannot count on
CAP_SYS_ADMIN.

--
Gabriel Krisman Bertazi
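
For readers following the thread, the filter semantics quoted above can be modeled in a few lines of user-space C. This is only an illustrative sketch, not kernel code: every name here (`emu_verdict`, `emu_ctx`, `run_emu_filter`, `range_filter`, the `NATIVE_START`/`NATIVE_END` range) is hypothetical and does not exist in the kernel. It models at most one filter per task, a verdict of either "allow" or SIGSYS, a fault while reading user memory being treated as SIGSYS, execve() clearing the filter, and clone() copying it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; this is a user-space model only. */

enum emu_verdict { EMU_ALLOW, EMU_SIGSYS };

/* Entry types mentioned in the thread for x86. */
enum entry_type { ENTRY_INT80, ENTRY_INT80_TRANSLATED, ENTRY_SYSCALL64 };

struct emu_ctx {
    enum entry_type entry; /* how the task entered the kernel */
    uint64_t ip;           /* instruction pointer at syscall entry */
    uint64_t nr;           /* syscall number */
};

/*
 * A filter writes its verdict to *out and returns true.
 * Returning false models a fault while accessing user memory.
 */
typedef bool (*emu_filter_fn)(const struct emu_ctx *ctx, enum emu_verdict *out);

struct task {
    emu_filter_fn emu_filter; /* at most one per task; NULL if none */
};

/* Assumed address range for "native" code that may issue syscalls directly. */
#define NATIVE_START 0x1000ULL
#define NATIVE_END   0x2000ULL

/* Example filter: allow syscalls issued from the native range only. */
bool range_filter(const struct emu_ctx *ctx, enum emu_verdict *out)
{
    bool native = ctx->ip >= NATIVE_START && ctx->ip < NATIVE_END;
    *out = native ? EMU_ALLOW : EMU_SIGSYS;
    return true;
}

/* Example filter that always "faults" on its user-memory access. */
bool faulting_filter(const struct emu_ctx *ctx, enum emu_verdict *out)
{
    (void)ctx;
    (void)out;
    return false;
}

/*
 * Runs before regular seccomp filters. On EMU_ALLOW the caller would
 * go on to run the ordinary seccomp filters (not modeled here).
 */
enum emu_verdict run_emu_filter(const struct task *t, const struct emu_ctx *ctx)
{
    enum emu_verdict v;

    if (t->emu_filter == NULL)
        return EMU_ALLOW;
    if (!t->emu_filter(ctx, &v))
        return EMU_SIGSYS; /* fault on user memory => SIGSYS */
    return v;
}

/* execve() clears the filter. */
void model_execve(struct task *t)
{
    t->emu_filter = NULL;
}

/* clone() copies the filter to the child. */
void model_clone(const struct task *parent, struct task *child)
{
    child->emu_filter = parent->emu_filter;
}
```

A caller would build a `struct emu_ctx` for each syscall entry and pass it to `run_emu_filter()`. With `range_filter` installed, an `ip` inside the assumed native range yields `EMU_ALLOW` and anything else yields `EMU_SIGSYS`, which is where a Wine-like user would take over and emulate the call.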