From: Gabriel Krisman Bertazi
To: Rich Felker
Cc: libc-alpha@sourceware.org, Florian Weimer, linux-kernel@vger.kernel.org
Subject: Re: Kernel prctl feature for syscall interception and emulation
Organization: Collabora
Date: Thu, 19 Nov 2020 11:15:46 -0500
Message-ID: <87h7pltj9p.fsf@collabora.com>
In-Reply-To: <20201119151317.GF534@brightrain.aerifal.cx>
References: <873616v6g9.fsf@collabora.com> <20201119151317.GF534@brightrain.aerifal.cx>

Rich Felker writes:

> On Wed, Nov 18, 2020 at 01:57:26PM -0500, Gabriel Krisman Bertazi via Libc-alpha wrote:
[...]
>
> SIGSYS (or signal handling in general) is not the right way to do
> this.
> It has all the same problems that came up in seccomp filtering
> with SIGSYS, and which were solved by user_notif mode (running the
> interception in a separate thread rather than an async context
> interrupting the syscall. In fact I wouldn't be surprised if what you
> want can already be done with reasonable efficiency using seccomp
> user_notif.

Hi Rich,

The user_notif approach was raised in the kernel discussion and we
experimented with it, but its latency is even worse than what we can
get right now with other seccomp actions.

Regarding SIGSYS, the x86 maintainer suggested redirecting the syscall
return to a userspace thunk, but the understanding among Wine
developers is that SIGSYS is enough for their emulation needs.

> The default-intercept and excepting libc code segment is also bogus,
> and will break stuff, including vdso syscall mechanism on i386 and any
> code outside libc that makes its own syscalls from asm. If you need to
> tag regions to control interception, it should be tagging the emulated
> Windows guest code, which is bounded and you have full control over,
> rather than the host code, which is unbounded and includes any
> libraries that get linked indirectly by Wine.

The vdso trampoline, for the architectures that have it, is handled by
the kernel implementation, which makes sure that region is allowed. The
Linux code is not bounded, but the main goal of the dispatcher region
is to support trampolines outside of the vdso case. A correct userspace
implementation requires flipping the selector on every Windows/Linux
code boundary crossing, exactly because other libraries can issue
syscalls directly. The fact that libc is not the only one issuing
syscalls is the exact reason we need something more complex than a few
seccomp filters.

Flipping the selector on every boundary crossing is fine for
performance, since we don't go into the kernel.
But if we can avoid checking it from kernelspace, that's an
optimization, which is what I meant by the dispatcher region allowing
more parts of the glibc code. It is not strictly necessary for
correctness. I still don't think anything is broken here.

> But I'm skeptical that doing any new kernel-side logic for tagging is
> needed. Seccomp already lets you filter on instruction pointer so you
> can install filters that will trigger user_notif just for guest code,
> then let you execute the emulation in the watcher thread and skip the
> actual syscall in the watched thread.

As I mentioned, we can check the IP in seccomp and write filters. But
this has two problems:

1) Performance. seccomp filters use cBPF, which means 32-bit
comparisons, no maps, and a very limited instruction set. We need to
generate boundary checks for each memory segment. The filter becomes
very large very quickly and becomes an observable bottleneck.

2) Seccomp filters cannot be removed, and we'd need to update them
frequently.

-- 
Gabriel Krisman Bertazi
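
For readers following the thread, the per-syscall decision discussed
above (an always-allowed dispatcher region, plus a selector that
userspace flips when crossing between Windows and Linux code) can be
modeled roughly as below. This is only a userspace sketch of the logic
as described in this message, not the kernel implementation: the names
dispatch_region, SELECTOR_ALLOW, SELECTOR_BLOCK, in_region and
should_intercept, and the numeric selector values, are all assumptions
made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical selector values; names and numbers are illustrative only. */
enum selector { SELECTOR_ALLOW = 0, SELECTOR_BLOCK = 1 };

/* Hypothetical description of the always-allowed dispatcher region. */
struct dispatch_region {
    uintptr_t start; /* first address of the region */
    uintptr_t len;   /* size of the region in bytes */
};

/* True when ip lies in [start, start + len); written to avoid overflow. */
static bool in_region(const struct dispatch_region *r, uintptr_t ip)
{
    return ip >= r->start && ip - r->start < r->len;
}

/*
 * Model of the check made on each syscall entry.
 * Returns true when the syscall should be handed to the emulation layer
 * (for example by raising SIGSYS), false when it should run natively.
 */
static bool should_intercept(const struct dispatch_region *r, uintptr_t ip,
                             const volatile unsigned char *selector)
{
    assert(r != NULL && selector != NULL);

    /* Syscalls issued from the dispatcher region always pass through,
     * so the selector does not even need to be read for them. */
    if (in_region(r, ip))
        return false;

    /* Everywhere else, the byte that userspace flips at each
     * Windows/Linux boundary crossing decides the outcome. */
    return *selector == SELECTOR_BLOCK;
}
```

In this model, flipping the selector is a plain store to a byte in user
memory, which is why doing it on every boundary crossing costs no
kernel entry; the region check is the optimization that lets some
callers skip the selector lookup altogether.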