From: Gabriel Krisman Bertazi
To: libc-alpha@sourceware.org
Cc: Florian Weimer, linux-kernel@vger.kernel.org
Subject: Kernel prctl feature for syscall interception and emulation
Date: Wed, 18 Nov 2020 13:57:26 -0500
Message-ID: <873616v6g9.fsf@collabora.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

I'm proposing a kernel patch for a feature I'm calling Syscall User Dispatch (SUD). It is a mechanism to efficiently redirect system calls of only part of a binary back to userspace to be emulated by a compatibility layer. The patchset is close to being accepted, but Florian suggested the feature might pose some constraints on glibc, and requested I raise the discussion here.

The problem I am trying to solve is that modern Windows games running over Wine are issuing Windows system calls directly from the Windows code, without going through the "WinAPI", which doesn't give Wine a chance to emulate the library calls and implement the behavior. As a result, Windows syscalls reach the Linux kernel, and the kernel has no context to differentiate them from native syscalls coming from the Wine side, since it cannot trust the ABI, not even the syscall numbers, to be something sane. Historically, Windows applications were very respectful of the WinAPI, not bypassing it, but we are seeing modern applications like games doing it more often for reasons, I believe, of DRM.

It is worth mentioning that, by design, Wine and the Windows application run in the same process space, so we really cannot just filter specific threads or the entire application. We need some kind of filter executed on each system call.

Now, the obvious way to solve this problem would be cBPF filtering memory regions, through Seccomp. The main problem with that approach is the performance of executing a large cBPF filter. The goal is to run games, and we observed the Seccomp filter become a bottleneck, since we have many, many memory areas that need to be checked by cBPF. In addition, seccomp, as a security mechanism, doesn't support some filter update operations, like removing them. Other approaches were explored, like making a new mode out of seccomp, but the kernel community preferred to make it a separate, self-contained mechanism. Other solutions, like (live) patching the Windows application, are out of the question, as they would trip DRM and anti-cheat protection mechanisms.

The SUD interface I proposed to the kernel community is self-contained and exposed as a prctl option. It lets userspace define a switch variable per-thread that, when set, will raise a SIGSYS for any syscall attempted.
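
As a rough sketch of what this could look like from userspace: the helper names sud_enable, sud_block and sud_allow are hypothetical, and the fallback constant values are assumed to match the uapi header proposed in the patchset. The region arguments are the always-allowed range the interface also accepts. Treat this as an illustration rather than as code taken from Wine or from the patches.

```c
#include <assert.h>
#include <sys/prctl.h>

/* Fallback values, assumed to match the uapi header proposed in the
 * patchset, for systems whose installed headers predate the feature. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif

/* Per-thread selector byte that the kernel reads on syscall entry. */
__thread volatile char sud_selector = PR_SYS_DISPATCH_OFF;

/* Hypothetical helper: enable SUD for the calling thread.  Syscalls
 * issued from the region starting at offset and spanning length bytes
 * are always allowed.  Returns 0 on success, or -1 with errno set (for
 * example on kernels built without the feature). */
int sud_enable(unsigned long offset, unsigned long length)
{
    /* The region must not wrap around the address space. */
    assert(length == 0 || offset + length > offset);
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
                 offset, length, &sud_selector);
}

/* Flip the switch with a plain store, without entering the kernel. */
void sud_block(void) { sud_selector = PR_SYS_DISPATCH_ON; }
void sud_allow(void) { sud_selector = PR_SYS_DISPATCH_OFF; }
```

A compatibility layer would call sud_enable() once per thread, then sud_block() just before jumping into foreign code and sud_allow() on the way back. A SIGSYS handler has to be installed beforehand, since the default action for SIGSYS terminates the process.
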
The idea is that Wine can just flip this switch efficiently before delivering control to the Windows portions of the binary, and flip it back off when it needs to execute native syscalls. It is important for us that the switch flip doesn't require a syscall, for performance reasons. The interface also lets userspace define a "dispatcher region" from where any syscalls are always executed, regardless of the selector variable. This is important for the return of the SIGSYS directly to a Windows segment, where we need to execute the signal return trampoline with the selector blocked. Ideally, Wine would simply define this dispatcher region as the entire libc code segment, and just use the selector to safeguard against Linux libraries issuing syscalls by themselves (they exist).

I think my questions to libc are: what are the constraints, if any, that libc would face with this new interface? I expected this to be completely invisible to libc. In addition, are there any problems you foresee with the current interface?

Finally, I don't think it makes sense to bother you immediately with the kernel implementation patches, but if you want to see them, they are archived at the link below. I can also share them directly on this ML if you request it.

https://lkml.org/lkml/2020/11/17/2347

Nevertheless, I think it is useful to share the final patch, which has the in-tree documentation for the interface and which I inlined in this message.

Thanks.

-- >8 --
Subject: [PATCH v7 7/7] docs: Document Syscall User Dispatch

Explain the interface, provide some background and security notes.

Signed-off-by: Gabriel Krisman Bertazi
Reviewed-by: Kees Cook
---
 .../admin-guide/syscall-user-dispatch.rst | 87 +++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 Documentation/admin-guide/syscall-user-dispatch.rst

diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst b/Documentation/admin-guide/syscall-user-dispatch.rst
new file mode 100644
index 000000000000..e2fb36926f97
--- /dev/null
+++ b/Documentation/admin-guide/syscall-user-dispatch.rst
@@ -0,0 +1,87 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Syscall User Dispatch
+=====================
+
+Background
+----------
+
+Compatibility layers like Wine need a way to efficiently emulate system
+calls of only a part of their process - the part that has the
+incompatible code - while being able to execute native syscalls without
+a high performance penalty on the native part of the process. Seccomp
+falls short on this task, since it has limited support to efficiently
+filter syscalls based on memory regions, and it doesn't support removing
+filters. Therefore a new mechanism is necessary.
+
+Syscall User Dispatch brings the filtering of the syscall dispatcher
+address back to userspace. The application is in control of a flip
+switch, indicating the current personality of the process. A
+multiple-personality application can then flip the switch without
+invoking the kernel, when crossing the compatibility layer API
+boundaries, to enable/disable the syscall redirection and execute
+syscalls directly (disabled) or send them to be emulated in userspace
+through a SIGSYS.
+
+The goal of this design is to provide very quick compatibility layer
+boundary crosses, which is achieved by not executing a syscall to change
+personality every time the compatibility layer executes. Instead, a
+userspace memory region exposed to the kernel indicates the current
+personality, and the application simply modifies that variable to
+configure the mechanism.
+
+There is a relatively high cost associated with handling signals on most
+architectures, like x86, but at least for Wine, syscalls issued by
+native Windows code are currently not known to be a performance problem,
+since they are quite rare, at least for modern gaming applications.
+
+Since this mechanism is designed to capture syscalls issued by
+non-native applications, it must function on syscalls whose invocation
+ABI is completely unexpected to Linux. Syscall User Dispatch, therefore,
+doesn't rely on any of the syscall ABI to make the filtering. It uses
+only the syscall dispatcher address and the userspace key.
+
+Interface
+---------
+
+A process can set up this mechanism on supported kernels
+(CONFIG_SYSCALL_USER_DISPATCH) by executing the following prctl:
+
+  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
+
+<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
+disable the mechanism globally for that thread. When
+PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
+
+<offset> and <offset+length> delimit a closed memory region interval
+from which syscalls are always executed directly, regardless of the
+userspace selector. This provides a fast path for the C library, which
+includes the most common syscall dispatchers in the native code
+applications, and also provides a way for the signal handler to return
+without triggering a nested SIGSYS on (rt_)sigreturn. Users of this
+interface should make sure that at least the signal trampoline code is
+included in this region. In addition, for syscalls that implement the
+trampoline code on the vDSO, that trampoline is never intercepted.
+
+[selector] is a pointer to a char-sized region in the process memory
+region, that provides a quick way to enable/disable syscall redirection
+thread-wide, without the need to invoke the kernel directly. selector
+can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF. Any other
+value should terminate the program with a SIGSYS.
+
+Security Notes
+--------------
+
+Syscall User Dispatch provides functionality for compatibility layers to
+quickly capture system calls issued by a non-native part of the
+application, while not impacting the Linux native regions of the
+process. It is not a mechanism for sandboxing system calls, and it
+should not be seen as a security mechanism, since it is trivial for a
+malicious application to subvert the mechanism by jumping to an allowed
+dispatcher region prior to executing the syscall, or to discover the
+address and modify the selector value. If the use case requires any
+kind of security sandboxing, Seccomp should be used instead.
+
+Any fork or exec of the existing process resets the mechanism to
+PR_SYS_DISPATCH_OFF.
--
2.29.2

--
Gabriel Krisman Bertazi
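
For readers who prefer code to prose, the rules in the Interface section can be modelled in plain C roughly as follows. This is only an illustrative userspace model, not the kernel implementation: the names sud_config, sud_verdict and sud_decide are hypothetical, the selector values are assumed to be 0 and 1, the vDSO sigreturn exception is left out, and treating the region as starting at offset and spanning length bytes (upper bound excluded) is my assumption, since the text calls the interval closed and the exact boundary handling should be checked against the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Selector values; assumed to equal PR_SYS_DISPATCH_OFF/ON. */
#define SEL_OFF 0
#define SEL_ON  1

enum sud_verdict {
    SUD_ALLOW,     /* syscall runs natively */
    SUD_DISPATCH,  /* SIGSYS delivered so userspace can emulate the call */
    SUD_KILL       /* invalid selector value: fatal SIGSYS */
};

struct sud_config {
    int       enabled;  /* nonzero once PR_SYS_DISPATCH_ON was requested */
    uintptr_t offset;   /* start of the always-allowed region */
    uintptr_t length;   /* size of that region in bytes */
};

/* Decide what happens to a syscall issued at address ip while the
 * thread's selector byte holds the given value. */
enum sud_verdict sud_decide(const struct sud_config *cfg,
                            unsigned char selector, uintptr_t ip)
{
    assert(cfg != NULL);

    if (!cfg->enabled)
        return SUD_ALLOW;

    /* Unsigned subtraction wraps when ip < offset, so one comparison
     * checks both ends of the region. */
    if (ip - cfg->offset < cfg->length)
        return SUD_ALLOW;

    if (selector == SEL_OFF)
        return SUD_ALLOW;
    if (selector == SEL_ON)
        return SUD_DISPATCH;
    return SUD_KILL;
}
```

Note that the decision depends only on the address of the syscall instruction and on the selector byte, never on the syscall number or arguments, which is what lets the mechanism cope with a foreign syscall ABI.
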