From: Gabriel Krisman Bertazi
To: luto@kernel.org, tglx@linutronix.de
Cc: keescook@chromium.org, x86@kernel.org, linux-kernel@vger.kernel.org,
 linux-api@vger.kernel.org, willy@infradead.org,
 linux-kselftest@vger.kernel.org, shuah@kernel.org,
 Gabriel Krisman Bertazi, kernel@collabora.com
Subject: [PATCH v5 9/9] doc: Document Syscall User Dispatch
Date: Mon, 10 Aug 2020 19:26:36 -0400
Message-Id: <20200810232636.1415588-10-krisman@collabora.com>
In-Reply-To: <20200810232636.1415588-1-krisman@collabora.com>
References: <20200810232636.1415588-1-krisman@collabora.com>

Explain the
interface, provide some background and security notes.

Signed-off-by: Gabriel Krisman Bertazi
---
 .../admin-guide/syscall-user-dispatch.rst | 87 +++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 Documentation/admin-guide/syscall-user-dispatch.rst

diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst b/Documentation/admin-guide/syscall-user-dispatch.rst
new file mode 100644
index 000000000000..96616660fded
--- /dev/null
+++ b/Documentation/admin-guide/syscall-user-dispatch.rst
@@ -0,0 +1,87 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Syscall User Dispatch
+=====================
+
+Background
+----------
+
+Compatibility layers like Wine need a way to efficiently emulate system
+calls of only a part of their process - the part that has the
+incompatible code - while being able to execute native syscalls without
+a high performance penalty on the native part of the process. Seccomp
+falls short on this task, since it has limited support to efficiently
+filter syscalls based on memory regions, and it doesn't support removing
+filters. Therefore a new mechanism is necessary.
+
+Syscall User Dispatch brings the filtering of the syscall dispatcher
+address back to userspace. The application is in control of a flip
+switch, indicating the current personality of the process. A
+multiple-personality application can then flip the switch without
+invoking the kernel, when crossing the compatibility layer API
+boundaries, to enable/disable the syscall redirection and execute
+syscalls directly (disabled) or send them to be emulated in userspace
+through a SIGSYS.
+
+The goal of this design is to provide very quick compatibility layer
+boundary crosses, which is achieved by not executing a syscall to change
+personality every time the compatibility layer executes.
+Instead, a userspace memory region exposed to the kernel indicates the
+current personality, and the application simply modifies that variable
+to configure the mechanism.
+
+There is a relatively high cost associated with handling signals on most
+architectures, like x86, but at least for Wine, syscalls issued by
+native Windows code are currently not known to be a performance problem,
+since they are quite rare, at least for modern gaming applications.
+
+Since this mechanism is designed to capture syscalls issued by
+non-native applications, it must function on syscalls whose invocation
+ABI is completely unexpected to Linux. Syscall User Dispatch therefore
+doesn't rely on any of the syscall ABI to make the filtering. It uses
+only the syscall dispatcher address and the userspace selector.
+
+Interface
+---------
+
+A process can set up this mechanism on supported kernels
+(CONFIG_SYSCALL_USER_DISPATCH) by executing the following prctl:
+
+  prctl(PR_SET_SYSCALL_USER_DISPATCH, , , , [selector])
+
+The second argument is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF,
+to enable or disable the mechanism globally for that thread. When
+PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
+
+The third and fourth arguments delimit a closed memory region interval from
+which syscalls are always executed directly, regardless of the userspace
+selector. This provides a fast path for the C library, which includes
+the most common syscall dispatchers in the native code applications, and
+also provides a way for the signal handler to return without triggering
+a nested SIGSYS on (rt_)sigreturn. Users of this interface should make
+sure that at least the signal trampoline code is included in this
+region. In addition, for syscalls that implement the trampoline code on
+the vDSO, that trampoline is never intercepted.
+
+[selector] is a pointer to a char-sized region in the process memory
+that provides a quick way to enable or disable syscall redirection
+thread-wide, without the need to invoke the kernel directly. The selector
+can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF. Any other
+value should terminate the program with a SIGSYS.
+
+Security Notes
+--------------
+
+Syscall User Dispatch provides functionality for compatibility layers to
+quickly capture system calls issued by a non-native part of the
+application, while not impacting the Linux native regions of the
+process. It is not a mechanism for sandboxing system calls, and it
+should not be seen as a security mechanism, since it is trivial for a
+malicious application to subvert the mechanism by jumping to an allowed
+dispatcher region prior to executing the syscall, or to discover the
+address and modify the selector value. If the use case requires any
+kind of security sandboxing, Seccomp should be used instead.
+
+Any fork or exec of the existing process resets the mechanism to
+PR_SYS_DISPATCH_OFF.
-- 
2.28.0
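
The dispatch rules described in the Interface section can be summarised as a small decision function. What follows is a minimal userspace model of those rules as this document states them, not kernel code and not the patch's implementation. Every identifier in it (the sud_ prefix, the struct layout, the enum values) is hypothetical and chosen for illustration; the numeric values are placeholders and are not claimed to match the kernel's PR_SYS_DISPATCH_* definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for this sketch; these are NOT the kernel's definitions. */
enum sud_selector {
    SUD_SELECTOR_OFF = 0,   /* placeholder standing in for PR_SYS_DISPATCH_OFF */
    SUD_SELECTOR_ON  = 1    /* placeholder standing in for PR_SYS_DISPATCH_ON */
};

enum sud_action {
    SUD_EXECUTE,    /* run the syscall directly */
    SUD_SIGSYS,     /* deliver SIGSYS so userspace can emulate the syscall */
    SUD_KILL        /* invalid selector value: terminate with SIGSYS */
};

/* Per-thread state as configured through the prctl (assumed layout). */
struct sud_config {
    int enabled;            /* nonzero if the mechanism was switched on */
    uintptr_t allow_start;  /* first address of the always-allowed region */
    uintptr_t allow_end;    /* last address of the region (closed interval) */
};

/*
 * Decide what happens to a syscall issued from syscall_addr while the
 * userspace selector byte holds the value `selector`.
 */
static enum sud_action sud_decide(const struct sud_config *cfg,
                                  unsigned char selector,
                                  uintptr_t syscall_addr)
{
    assert(cfg->allow_start <= cfg->allow_end);

    /* Mechanism disabled for this thread: nothing is intercepted. */
    if (!cfg->enabled)
        return SUD_EXECUTE;

    /* Calls from the allowed region bypass the selector entirely. */
    if (syscall_addr >= cfg->allow_start && syscall_addr <= cfg->allow_end)
        return SUD_EXECUTE;

    /* Outside the region, the selector byte decides. */
    switch (selector) {
    case SUD_SELECTOR_OFF:
        return SUD_EXECUTE;
    case SUD_SELECTOR_ON:
        return SUD_SIGSYS;
    default:
        return SUD_KILL;
    }
}
```

In this model, a compatibility layer would pass the address range covering the C library and the signal trampoline as the allowed region, so that returning from the SIGSYS handler is itself never redirected. The region check comes before the selector check because the text says calls from the region execute directly regardless of the selector.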