Received: by 2002:a25:ef43:0:0:0:0:0 with SMTP id w3csp2202027ybm; Sun, 31 May 2020 12:01:52 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzJHILR4zLJkcS9/9X1CjUN0QYJT9FbbqqV2iyaO6C+Zb4lsJPnx6RNxxoLnmdbUsXSJEzL X-Received: by 2002:a17:906:7210:: with SMTP id m16mr15836859ejk.475.1590951712564; Sun, 31 May 2020 12:01:52 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1590951712; cv=none; d=google.com; s=arc-20160816; b=J3Xg5cPcsblgkh4M1FYdH5zRuGewt+RKw24h8tqv6I6X2B/Z8ZhiJxwtRlkZOcMaEt 74te3Dca3OHkCTGT9e/do02VQHlziE7iNgxh9xwdEQsMWP3gogGiDS54YEhA+NJaQgzN CGvHHs+vRQlhHrYQ3TKlnry46/75xr8k5dYX9QsYXEpo2e/TeZaUwxqQkrYF1aA8+CYn O09tR1xBnj1cM6s8shJqTVPOxwAN9VV/Z9gTJlbnkOAoBjHNjfwoxAanYNio6woXttbd 7QZQUZRAx2aBtzLNJS9dbRriPFGAsKXR6iHxrneD0RnUwc5dZZ8wbdxO+3C5dyX7iNEm nVNg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=aqy4dj5ARjaBQ5X3Ka7eZXgk1XyqUOoRwq0aHfMIJo8=; b=c62XdVAO/w8V7tK77K0aDV2NFJNOHwB6jo6wXcgyqiVAF7bq0LgXCKSChcMug00IIn uPUg1V3+blav7rZTFzTpi8ALOZb4PfkmJ66QSebn0mePBrUK8k16HHvvPr2oy72Z3LwR KYMcWjQakp7y+LALAWK6Su8BWnlMHTvw8cEmaGRPjdF/LOoGjoNLPktWCbqbLtEPrl2N JiTYrw6z5F12DKNMJuKdwuaoyOsMiJOjtlfiQSmIsrLANM6/gPA61vt1HiMwpburpqp9 DWPmHkDSL0YYYh9uJQhQIRViYSejC6vHgbpP7glSFjZiP/fBhjtus0ozrkTIe2v99hUL I7Zg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=ZujAwms3; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id cx3si9874885edb.547.2020.05.31.12.01.29; Sun, 31 May 2020 12:01:52 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=ZujAwms3; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728371AbgEaS5R (ORCPT + 99 others); Sun, 31 May 2020 14:57:17 -0400 Received: from mail.kernel.org ([198.145.29.99]:41818 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728296AbgEaS5Q (ORCPT ); Sun, 31 May 2020 14:57:16 -0400 Received: from mail-wm1-f43.google.com (mail-wm1-f43.google.com [209.85.128.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id CC586207D5 for ; Sun, 31 May 2020 18:57:14 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1590951436; bh=lWjRJ6TmC63MtaoV2GDVh+SjpgJni5oloXOqPMTiCTM=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=ZujAwms3y4kucEwEZ1nuKxm3b82kGZqYvCC+TAJLn2E48hG3tUni4luPyqb2jz4XQ mxX+61zbU4ATW45Q05e6SQPuHi4xxDXjSXtpRbjXJB3JxfbMmkI4d8O7F7dHQFnU8Y ipOtM8e0nERV1WFmuRUiTAVa2oxNPYWotcq1cJ0s= Received: by mail-wm1-f43.google.com with SMTP id r9so8708045wmh.2 for ; Sun, 31 May 2020 11:57:14 -0700 (PDT) X-Gm-Message-State: AOAM531FX+/WzX6P+ygAAe8YhhpBpJlb2rNkgC3EfHbG+q4q/0DB9QMy mAxaHkcTM/dTMf7JxFQ4bfmrxQj7EXC5HCCpsXBnKw== X-Received: by 2002:a1c:7f96:: with SMTP id a144mr18038816wmd.176.1590951433390; Sun, 31 May 2020 11:57:13 -0700 (PDT) MIME-Version: 1.0 References: <85367hkl06.fsf@collabora.com> <079539BF-F301-47BA-AEAD-AED23275FEA1@amacapital.net> <50a9e680-6be1-ff50-5c82-1bf54c7484a9@gmail.com> In-Reply-To: From: Andy Lutomirski Date: Sun, 31 May 2020 11:57:02 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas To: Paul Gofman Cc: Andy Lutomirski , Gabriel Krisman Bertazi , Linux-MM , LKML , kernel@collabora.com, Thomas Gleixner , Kees Cook , Will Drewry , "H . Peter Anvin" , Zebediah Figura Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, May 31, 2020 at 11:36 AM Paul Gofman wrote: > > On 5/31/20 21:10, Andy Lutomirski wrote: > > > > That's not what I meant. I meant that you would set the kernel up to > > redirect *all* syscalls from the thread with the sole exception of one > > syscall instruction in the thunk. This would catch Windows syscalls > > and Linux syscalls. The thunk would determine whether the original > > syscall was Linux or Windows and handle it accordingly. > > > > This may interact poorly with the DRM scheme. The redzone might need > > to be respected, or stack switching might be needed. > > Oh yeah, I see now, thanks. Sure, we could trap every syscall and have a > Seccomp-allowed trampoline for executing native ones with the existing > Seccomp implementation. But this is going to have prohibitive > performance impact. Our present use case specifics is that vast majority > of syscalls do not need to be emulated, they are native. And just a few > go from the Windows application which we need to trap and route to our > handler to let the program continue, while we do not care too much about > the overhead for those few. So the hope was that the kernel can route > that majority of Linux native syscalls inside with the minor overhead. > I've read the suggestion to use SECCOMP_RET_USER_NOTIF instead of > SECCOMP_RET_TRAP, is handling the trap this way supposed to be much > quicker than handling the sigsys from SECCOMP_RET_TRAP? More > specifically, would not SECCOMP_RET_USER_NOTIF effectively serialize all > the syscalls waiting in a single queue for processing, while > SECCOMP_RET_TRAP can be processed without exclusive locking? > > Using SECCOMP_RET_USER_NOTIF is likely to be considerably more expensive than my scheme. On a non-PTI system, my approach will add a few tens of ns to each syscall. On a PTI system, it will be worse. But using any kind of notifier for all syscalls will cause a context switch to a different user program for each syscall, and that will be much slower. I think that the implementation may well want to live in seccomp, but doing this as a seccomp filter isn't quite right. It's not a security thing -- it's an emulation thing. Seccomp is all about making inescapable sandboxes, but that's not what you're doing at all, and the fact that seccomp filters are preserved across execve() sounds like it'll be annoying for you. What if there was a special filter type that ran a BPF program on each syscall, and the program was allowed to access user memory to make its decisions, e.g. to look at some list of memory addresses. But this would explicitly *not* be a security feature -- execve() would remove the filter, and the filter's outcome would be one of redirecting execution or allowing the syscall. If the "allow" outcome occurs, then regular seccomp filters run. Obviously the exact semantics here would need some care.