Date: Thu, 19 Nov 2020 12:39:42 -0500
From: Rich Felker
To: Gabriel Krisman Bertazi
Cc: libc-alpha@sourceware.org, Florian Weimer, linux-kernel@vger.kernel.org, Paul Gofman
Subject: Re: Kernel prctl feature for syscall interception and emulation
Message-ID: <20201119173938.GJ534@brightrain.aerifal.cx>
In-Reply-To: <87eekpmeux.fsf@collabora.com>

On Thu, Nov 19, 2020 at 12:32:54PM -0500, Gabriel Krisman Bertazi wrote:
> Rich Felker writes:
>
> > On Thu, Nov 19, 2020 at 11:15:46AM -0500, Gabriel Krisman Bertazi wrote:
> >> Rich Felker writes:
> >>
> >> > On Wed, Nov 18, 2020 at 01:57:26PM -0500, Gabriel Krisman Bertazi via Libc-alpha wrote:
> >>
> >> [...]
> >>
> >> >
> >> > SIGSYS (or signal handling in general) is not the right way to do
> >> > this.
> >> > It has all the same problems that came up in seccomp filtering
> >> > with SIGSYS, and which were solved by user_notif mode (running the
> >> > interception in a separate thread rather than an async context
> >> > interrupting the syscall). In fact I wouldn't be surprised if what you
> >> > want can already be done with reasonable efficiency using seccomp
> >> > user_notif.
> >>
> >> Hi Rich,
> >>
> >> User_notif was raised in the kernel discussion and we had experimented
> >> with it, but the latency of user_notif is even worse than what we can do
> >> right now with other seccomp actions.
> >
> > Is there a compelling argument that the latency matters here? What
> > syscalls are Windows binaries making like this? Is there a reason you
> > can't do something like intercepting the syscall with seccomp the
> > first time it happens, then rewriting the code not to use a direct
> > syscall on future invocations?
>
> We can't do any code rewriting without tripping DRM protections and
> anti-cheating mechanisms.

I think you could if you maintained separate versions of the code for
read vs exec access, a la some old-school hardening tricks, but maybe
that's not compatible with Windows code (or with 64-bit mode?).
Actually it's rather impressive that a DRM/anti-cheat mess works on
Wine at all.

> I should correct myself here.  While it is true that user_notif is
> slower than other seccomp actions, this is not a problem in itself.  The
> frequency of syscalls that need to be emulated is much smaller than
> regular syscalls, and the performance problem actually appears due to
> the filtering.  I should investigate user_notif more, but I don't oppose
> SUD doing user_notif instead of SIGSYS.  I will raise that with Wine
> developers and the kernel community.

Thanks! Avoiding repetition of the SIGSYS pitfall would be a good
thing.

> >> Regarding SIGSYS, the x86 maintainer suggested redirecting the syscall
> >> return to a userspace thunk, but the understanding among Wine developers
> >> is that SIGSYS is enough for their emulation needs.
> >
> > It might work for Wine needs, if Wine can guarantee it will never be
> > running code with signals blocked and some other constraints, but then
> > you end up with a mechanism that's designed just for Wine and that
> > will have gratuitous reasons it's not usable elsewhere. That does not
> > seem appropriate for inclusion in the kernel.
> >
> >> > The default-intercept and excepting libc code segment is also bogus,
> >> > and will break stuff, including the vdso syscall mechanism on i386 and
> >> > any code outside libc that makes its own syscalls from asm. If you need
> >> > to tag regions to control interception, it should be tagging the
> >> > emulated Windows guest code, which is bounded and you have full control
> >> > over, rather than the host code, which is unbounded and includes any
> >> > libraries that get linked indirectly by Wine.
> >>
> >> The vdso trampoline, for the architectures that have it, is solved by
> >> the kernel implementation, which makes sure that region is allowed.
> >
> > I guess that works but it's ugly and assumes particular policy goals
> > matching Wine's rather than being a general mechanism.
> >
> >> The Linux code is not bounded, but the dispatcher region's main goal is
> >> to support trampolines outside of the vdso case.  The correct userspace
> >> implementation requires flipping the selector on any Windows/Linux code
> >> boundary cross, exactly because other libraries can issue syscalls
> >> directly.  The fact that libc is not the only one issuing syscalls is
> >> the exact reason we need something more complex than a few seccomp
> >> filters.
> >
> > I don't think this is correct. Rather than listing all the host
> > library code ranges to allow, you just list all the guest Windows code
> > ranges to intercept.
> > Wine knows them by virtue of being the loader for
> > them. This all seems really easy to do with seccomp with a very small
> > filter.
>
> The Windows code is not completely loaded at initialization time.  It
> also has dynamic libraries loaded later.  Yes, Wine knows the memory
> regions, but there is no guarantee there is a small number of segments
> or that the full picture is known at any given moment.

Yes, I didn't mean it was known statically at init time (although
maybe it can be; see below), just that all the code doing the loading
is under Wine's control (vs having the system dynamic linker doing
stuff it can't reliably see, which is the case with host libraries).

> >> > But I'm skeptical that doing any new kernel-side logic for tagging is
> >> > needed. Seccomp already lets you filter on instruction pointer so you
> >> > can install filters that will trigger user_notif just for guest code,
> >> > then let you execute the emulation in the watcher thread and skip the
> >> > actual syscall in the watched thread.
> >>
> >> As I mentioned, we can check IP in seccomp and write filters.  But this
> >> has two problems:
> >>
> >> 1) Performance.  seccomp filters use cBPF which means 32bit comparisons,
> >> no maps and a very limited instruction set.  We need to generate
> >> boundary checks for each memory segment.  The filter becomes very large
> >> very quickly and becomes an observable bottleneck.
> >
> > This sounds like you're doing something wrong. Range checking is O(log
> > n) and n cannot be large enough to make log n significant. If you do
> > it with a linear search rather than binary then of course it's slow.
>
> And SUD is O(1).  The filtering overhead is the big point here. [...]

OK, but for practical purposes O(log n) == O(1).

> >> 2) Seccomp filters cannot be removed.  And we'd need to update them
> >> frequently.
> >
> > What are the updating requirements?
>
> As far as I understand (I'm not a Wine developer), they need to remove
> and modify filters.
> Given seccomp is a security feature, it would be a
> hard sell to support these operations.  We discussed this on the kernel
> list.
>
> > I'm not sure if Windows code is properly PIC or not, but if it is,
> > then you just do your own address assignment in a single huge range
> > (first allocated with PROT_NONE, then MAP_FIXED over top of it) so
> > that a single static range check suffices.
>
> I'm Cc'ing some Wine developers who can assist with this point.

Great!

Rich
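
To make the range-checking argument above concrete, here is a minimal
sketch of the two lookups being compared. It is not Wine's or the
kernel's code: GuestRanges and in_single_reservation are hypothetical
names, and a real seccomp filter would have to express the same
comparisons in cBPF against the instruction pointer rather than in
Python. The first lookup is a binary search over sorted guest code
ranges, O(log n); the second is the single static check that becomes
possible if all guest code is placed inside one reserved region.

```python
from bisect import bisect_right


class GuestRanges:
    """Sorted, non-overlapping [start, end) ranges of guest code.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        self._starts = []  # range start addresses, kept sorted
        self._ends = []    # matching exclusive end addresses

    def add(self, start, end):
        """Register a newly loaded guest code range."""
        if start >= end:
            raise ValueError("empty or inverted range")
        i = bisect_right(self._starts, start)
        # Reject overlap with the neighbouring ranges on either side.
        if i > 0 and self._ends[i - 1] > start:
            raise ValueError("overlaps previous range")
        if i < len(self._starts) and self._starts[i] < end:
            raise ValueError("overlaps next range")
        self._starts.insert(i, start)
        self._ends.insert(i, end)

    def contains(self, ip):
        """Binary search: is the instruction pointer inside guest code?"""
        i = bisect_right(self._starts, ip) - 1
        return i >= 0 and ip < self._ends[i]


def in_single_reservation(ip, base, size):
    """Single static check, assuming all guest code lives in one region."""
    return base <= ip < base + size
```

With one reservation the check no longer depends on how many modules
are loaded, so nothing about the check needs to change when a new
module is mapped inside the reserved region.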