Date: Wed, 14 Nov 2018 10:54:51 +0000
From: Dave Martin
To: Andy Lutomirski
Cc: Daniel Colascione, Florian Weimer, "Michael Kerrisk (man-pages)", linux-kernel, Joel Fernandes, Linux API, Willy Tarreau, Vlastimil Babka, Carlos O'Donell, "libc-alpha@sourceware.org"
Subject: Re: Official Linux system wrapper library?
Message-ID: <20181114105449.GK3505@e103592.cambridge.arm.com>
References: <877ehjx447.fsf@oldenburg.str.redhat.com> <875zx2vhpd.fsf@oldenburg.str.redhat.com> <20181113193859.GJ3505@e103592.cambridge.arm.com> <69B07026-5E8B-47FC-9313-E51E899FAFB0@amacapital.net>
In-Reply-To: <69B07026-5E8B-47FC-9313-E51E899FAFB0@amacapital.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Nov 13, 2018 at 12:58:39PM -0800, Andy Lutomirski wrote:
> > On Nov 13, 2018, at 11:39 AM, Dave Martin wrote:
> >
> > On Mon, Nov 12, 2018 at 05:19:14AM -0800, Daniel Colascione wrote:
> >
> > [...]
> >
> >> We can learn something from how Windows does things. On that system,
> >> what we think of as "libc" is actually two parts. (More, actually, but
> >> I'm simplifying.) At the lowest level, you have the semi-documented
> >> ntdll.dll, which contains raw system call wrappers and arcane
> >> kernel-userland glue. On top of ntdll live the "real" libc
> >> (msvcrt.dll, kernel32.dll, etc.) that provide conventional
> >> application-level glue. The tight integration between ntdll.dll and
> >> the kernel allows Windows to do very impressive things. (For example,
> >> on x86_64, Windows has no 32-bit ABI as far as the kernel is
> >> concerned! You can still run 32-bit programs though, and that works
> >> via ntdll.dll essentially shimming every system call and switching the
> >> processor between long and compatibility mode as needed.) Normally,
> >> you'd use the higher-level capabilities, but if you need something in
> >> ntdll (e.g., if you're Cygwin) nothing stops your calling into the
> >> lower-level system facilities directly. ntdll is tightly bound to the
> >> kernel; the higher-level libc, not so.
> >>
> >> We should adopt a similar approach.
> >> Shipping a lower-level
> >> "liblinux.so" tightly bound to the kernel would not only let the
> >> kernel bypass glibc's "editorial discretion" in exposing new
> >> facilities to userspace, but would also allow for tighter user-kernel
> >> integration than one can achieve with a simplistic syscall(2)-style
> >> escape hatch. (For example, for a long time now, I've wanted to go
> >> beyond POSIX and improve the system's signal handling API, and this
> >> improvement requires userspace cooperation.) The vdso is probably too
> >> small and simplistic to serve in this role; I'd want a real library.
> >
> > Can you expand on your reasoning here?
> >
> > Playing devil's advocate:
> >
> > If the library is just exposing the syscall interface, I don't see
> > why it _couldn't_ fit into the vdso (or something vdso-like).
> >
> > If a separate library, I'd be concerned that it would accumulate
> > value-add bloat over time, and the kernel ABI may start to creep since
> > most software wouldn't invoke the kernel directly any more. Even if
> > it's maintained in the kernel tree, its existence as an apparently
> > standalone component may encourage forking, leading to a potential
> > compatibility mess.
> >
> > The vdso approach would mean we can guarantee that the library is
> > available and up to date at runtime, and may make it easier to keep
> > what's in it down to sane essentials.
>
> Hmm. Putting on my vDSO hat:
>
> The vDSO could provide all kinds of nifty things. Better exception
> handling comes to mind. But it has two major limitations that severely
> restrict what it can do:
>
> - It can’t allocate memory. We probably want to keep it that way.
>
> - It can’t use TLS. Solving this without genuinely awful ABI issues
> may be extremely hard. We *could* require callers to pass a thread
> pointer in, I suppose.
>
> Also, if we make the vDSO stateful, CRIU is going to have a blast. We
> might need to expose explicit save and restore abilities.
>
> As a straw man use case, it would be neat if DSOs (or the loader,
> maybe) could register a list of exception fixups per DSO. The kernel
> could consult these lists before delivering a signal. ISTM it wouldn’t
> be so crazy if the vDSO handled registration, although it could use
> syscalls as well. If the vDSO did it, it would need somewhere to put
> the lists.

Fair points, though this is rather what I meant by "sane essentials".

Because there are strict limits on what can be done in the vDSO, it may
be more bloat-resistant and more conservatively maintained. This might
provide a way to move some dumb compatibility kludge code, which
receives little ongoing maintenance, outside the privilege wall; today
it has to sit in the kernel proper.

In theory we could opt to advertise new syscalls only via vDSO entry
points, and not maintain __NR_xxx values for them (which may or may not
upset ptrace users).

Anyway, I digress...

Cheers
---Dave
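
For concreteness, the two discovery paths contrasted above look roughly
like this from userspace today. This is only a sketch, assuming Linux
with glibc (or another libc that provides getauxval() and syscall());
the helper names raw_getpid() and vdso_probe() are made up for
illustration and are not part of any existing API.

```c
#define _GNU_SOURCE     /* for the syscall() declaration in <unistd.h> */
#include <elf.h>        /* ELFMAG, SELFMAG */
#include <string.h>     /* memcmp */
#include <sys/auxv.h>   /* getauxval, AT_SYSINFO_EHDR */
#include <sys/syscall.h>/* SYS_getpid (derived from __NR_getpid) */
#include <unistd.h>     /* syscall, getpid */

/*
 * Path 1: the syscall(2)-style escape hatch. The caller names the
 * syscall by its number, so the number itself is part of the ABI.
 * Hypothetical helper, for illustration only.
 */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/*
 * Path 2: vDSO discovery. The kernel advertises the vDSO mapping via
 * the auxiliary vector; no syscall number is involved.
 * Hypothetical helper, for illustration only.
 * Returns 1 if AT_SYSINFO_EHDR points at an ELF header, 0 if it points
 * at something else, and -1 if no vDSO address was supplied.
 */
int vdso_probe(void)
{
    unsigned long base = getauxval(AT_SYSINFO_EHDR);

    if (base == 0)
        return -1;

    return memcmp((const void *)base, ELFMAG, SELFMAG) == 0 ? 1 : 0;
}
```

A real caller would go on to parse the vDSO's ELF dynamic symbol table
to find entry points by name (for example __vdso_clock_gettime on
x86-64), which is the lookup a vDSO-only syscall would have to rely on
in place of an __NR_xxx value.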