Date: Thu, 19 Nov 2020 10:13:18 -0500
From: Rich Felker
To: Gabriel Krisman Bertazi
Cc: libc-alpha@sourceware.org, Florian Weimer, linux-kernel@vger.kernel.org
Subject: Re: Kernel prctl feature for syscall interception and emulation
Message-ID: <20201119151317.GF534@brightrain.aerifal.cx>
In-Reply-To: <873616v6g9.fsf@collabora.com>

On Wed, Nov 18, 2020 at 01:57:26PM -0500, Gabriel Krisman Bertazi via Libc-alpha wrote:
> Hi,
>
> I'm proposing a kernel patch for a feature I'm calling Syscall User
> Dispatch (SUD). It is a mechanism to efficiently redirect system calls
> of only part of a binary back to userspace to be emulated by a
> compatibility layer. The patchset is close to being accepted, but
> Florian suggested the feature might pose some constraints on glibc, and
> requested I raise the discussion here.
>
> The problem I am trying to solve is that modern Windows games running
> over Wine are issuing Windows system calls directly from the Windows
> code, without going through the "WinAPI", which doesn't give Wine a
> chance to emulate the library calls and implement the behavior. As a
> result, Windows syscalls reach the Linux kernel, and the kernel has
> no context to differentiate them from native syscalls coming from the
> Wine side, since it cannot trust the ABI, not even syscall numbers to be
> something sane. Historically, Windows applications were very respectful
> of the WinAPI, not bypassing it, but we are seeing modern applications
> like games doing it more often for reasons, I believe, of DRM.
>
> It is worth mentioning that, by design, Wine and the Windows application
> run on the same process space, so we really cannot just filter specific
> threads or the entire application. We need some kind of filter executed
> on each system call.
>
> Now, the obvious way to solve this problem would be cBPF filtering
> memory regions, through Seccomp. The main problem with that approach is
> the performance of executing a large cBPF filter. The goal is to run
> games, and we observed the Seccomp filter become a bottleneck, since we
> have many, many memory areas that need to be checked by cBPF. In
> addition, seccomp, as a security mechanism, doesn't support some filter
> update operations, like removing them. Other approaches were
> explored, like making a new mode out of seccomp, but the kernel
> community preferred to make it a separate, self-contained mechanism.
> Other solutions, like (live) patching the Windows application are out
> of the question, as they would trip DRM and anti-cheat protection
> mechanisms.
>
> The SUD interface I proposed to the kernel community is self-contained
> and exposed as a prctl option. It lets userspace define a switch
> variable per-thread that, when set, will raise a SIGSYS for any syscall
> attempted.
> The idea is that Wine can just flip this switch efficiently
> before delivering control to the Windows portions of the binary, and
> flip it back off when it needs to execute native syscalls. It is
> important for us that the switch flip doesn't require a syscall, for
> performance reasons. The interface also lets userspace define a
> "dispatcher region" from where any syscalls are always executed,
> regardless of the selector variable. This is important for the return
> of the SIGSYS directly to a Windows segment, where we need to execute
> the signal return trampoline with the selector blocked. Ideally, Wine
> would simply define this dispatcher region as the entire libc code
> segment, and just use the selector to safeguard against Linux libraries
> issuing syscalls by themselves (they exist).
>
> I think my questions to libc are: what are the constraints, if any, that
> libc would face with this new interface? I expected this to be
> completely invisible to libc. In addition, are there any problems you
> foresee with the current interface?
>
> Finally, I don't think it makes sense to bother you immediately with
> the kernel implementation patches, but if you want to see them,
> they are archived in the link below. I can also share them directly on
> this ML if you request it.
>
> https://lkml.org/lkml/2020/11/17/2347
>
> Nevertheless, I think it is useful to share the final patch, that has
> the in-tree documentation for the interface, which I inlined in this
> message.

SIGSYS (or signal handling in general) is not the right way to do
this. It has all the same problems that came up in seccomp filtering
with SIGSYS, and which were solved by user_notif mode (running the
interception in a separate thread rather than an async context
interrupting the syscall). In fact I wouldn't be surprised if what you
want can already be done with reasonable efficiency using seccomp
user_notif.

Intercepting by default and excepting only the libc code segment is
also bogus, and will break stuff, including the vdso syscall mechanism
on i386 and any code outside libc that makes its own syscalls from asm.
If you need to tag regions to control interception, it should be
tagging the emulated Windows guest code, which is bounded and which you
have full control over, rather than the host code, which is unbounded
and includes any libraries that get linked indirectly by Wine.

But I'm skeptical that any new kernel-side logic for tagging is
needed. Seccomp already lets you filter on instruction pointer, so you
can install filters that will trigger user_notif just for guest code,
then execute the emulation in the watcher thread and skip the actual
syscall in the watched thread.

Rich
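
A minimal sketch of the instruction-pointer filter described above,
in C. It is not code from this thread: build_guest_filter,
GUEST_FILTER_LEN and the addresses used with it are hypothetical. It
assumes a little-endian 64-bit target (such as x86-64) and a single
guest region that does not cross a 4 GiB boundary, so the upper 32 bits
of the instruction pointer can be matched with one equality test. A
real filter would also check seccomp_data.arch first, which is omitted
here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Fallback for older UAPI headers; the value matches linux/seccomp.h. */
#ifndef SECCOMP_RET_USER_NOTIF
#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
#endif

#define GUEST_FILTER_LEN 7   /* hypothetical name: instructions emitted */

/*
 * Hypothetical helper: fill f[0..GUEST_FILTER_LEN-1] with a cBPF program
 * that returns SECCOMP_RET_USER_NOTIF when the syscall's instruction
 * pointer lies in [start, end) and SECCOMP_RET_ALLOW otherwise.
 * Returns the instruction count, or -1 if the range is empty or crosses
 * a 4 GiB boundary (a case this sketch does not handle).
 */
static int build_guest_filter(struct sock_filter *f,
                              uint64_t start, uint64_t end)
{
    /* Assumption: little-endian, so the low word comes first. */
    const uint32_t ip_lo = offsetof(struct seccomp_data, instruction_pointer);
    const uint32_t ip_hi = ip_lo + 4;

    assert(f != NULL);
    if (start >= end || (start >> 32) != ((end - 1) >> 32))
        return -1;

    const struct sock_filter prog[GUEST_FILTER_LEN] = {
        /* 0: A = upper 32 bits of the instruction pointer */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ip_hi),
        /* 1: different 4 GiB window -> jump to 6 (allow) */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (uint32_t)(start >> 32), 0, 4),
        /* 2: A = lower 32 bits of the instruction pointer */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ip_lo),
        /* 3: below the region -> jump to 6 (allow) */
        BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, (uint32_t)start, 0, 2),
        /* 4: above the last byte of the region -> jump to 6 (allow) */
        BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, (uint32_t)(end - 1), 1, 0),
        /* 5: inside guest code: hand the syscall to the watcher */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
        /* 6: host code: let the syscall through */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    for (int i = 0; i < GUEST_FILTER_LEN; i++)
        f[i] = prog[i];
    return GUEST_FILTER_LEN;
}
```

The resulting program would be wrapped in a struct sock_fprog and
installed with seccomp(2) using SECCOMP_SET_MODE_FILTER and
SECCOMP_FILTER_FLAG_NEW_LISTENER; a watcher thread would then service
the returned descriptor with the SECCOMP_IOCTL_NOTIF_RECV and
SECCOMP_IOCTL_NOTIF_SEND ioctls. Each additional guest region adds more
comparisons to the filter, which is the cost the original proposal was
concerned about.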