Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp4408241imu; Mon, 12 Nov 2018 10:29:35 -0800 (PST) X-Google-Smtp-Source: AJdET5fzga63I1C/vi4WKdjhB+5x2I6uU9BbXIU+xPi/yLujZjrUDEJJiGFhpaMFRu8wqJsfFKmw X-Received: by 2002:a65:530c:: with SMTP id m12mr1722190pgq.224.1542047375436; Mon, 12 Nov 2018 10:29:35 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1542047375; cv=none; d=google.com; s=arc-20160816; b=x711xKAQpHoRcM6PKHerhyXosbqr5XpT0lnkOCn75Orwm/EM29szgP6gADcmrq8UYI qpZBGMcq2Kf6vLmDDJRW4w5M6yGBIc28G2L4IJQ7gCxPvMswKAU1e9IAMiEg6NGAZ9dt SBZ+33DRWh+MwQf96q27xllmpOVFdbRWWqHMwHDEVNOnqwzVt0cPzMzcmk+weuMyUHLs wK4oesLivllE0F0Hn76IbX0G4pmq5qg6UwpFsKW0DvpAwFTg7hmUzewrOu48PE7qKVj2 o37KSXNZKaykpCPAQB0LRt1Z+NCeFVV9BEC36Oa6ts/TbDTdG+4AOPlSq5IP/xrrPKHG XZHA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :references:in-reply-to:mime-version:dkim-signature; bh=11/wTjcHLuYFAncO8Jd3+kuLG5NsXlYTTFxAWm6bKIs=; b=JgZrsL8hdCq7GD6Xz8+iYZcGebV6EuFhRLvFQm/HpTsBV9+9rSg01BouLnjFBkK06P DT8U2LbCdH9pMilMwZb6sYm77WEyFsGAr1OCyhJIoNwk/3RILwCf9/XJQ0wijPmCYXA5 /sbRlyEr6LHPToipXD5LzJm+8RmaCHxGH9hMUVS678oF4QKl2/k01k3OTWnmfb1zpwZ2 4pIdx3vAa+iMCyZ75vY/M7vmGLf9lMxB6Jfi0Xq/SCYq2bMm/HqQC06DFSBglkHSFItz FzBcS7DUoQ4kHI0HMOs6bOocIB1qCzuFpSuBznVxb3SLh0pKe3j8Z3vkm7MtM7rxGB4E lRBA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=rAV39Scw; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id l6si8805496pgg.592.2018.11.12.10.29.20; Mon, 12 Nov 2018 10:29:35 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=rAV39Scw; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730348AbeKMEXY (ORCPT + 99 others); Mon, 12 Nov 2018 23:23:24 -0500 Received: from mail-vk1-f196.google.com ([209.85.221.196]:45539 "EHLO mail-vk1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729936AbeKMEXY (ORCPT ); Mon, 12 Nov 2018 23:23:24 -0500 Received: by mail-vk1-f196.google.com with SMTP id n126so2175775vke.12 for ; Mon, 12 Nov 2018 10:28:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=11/wTjcHLuYFAncO8Jd3+kuLG5NsXlYTTFxAWm6bKIs=; b=rAV39ScwxJZ3gG+D5YaUeP2LjivVxjQSiaZqEkYFfZOD63bm1YBzOuNl4YjP4H+bp3 rVkzSwIxCD28VhElVukXHUZo6Wp6iHtAnxx7vjhzUY7Q3cnl7MYGJXl3aQV30iHoEHI4 mrqdmZx2kwnQhei9iKUjZwF2QScNSs6FQoe7JZko4TEreemXMblmXJeglCVaooJa5dcf 5pnHN2OKhaMZIZjBwHyyMZb0dSJnCopzOf+xpNjrxluOAp+rwnHjBfz7aQ+zlV94uBkZ 7ImULbmRFwxOE6QizGBN+LFyXyUUkse/0u6oX1j79TRhsASvAQONAzf/G5cz+1Y88YEq XluQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=11/wTjcHLuYFAncO8Jd3+kuLG5NsXlYTTFxAWm6bKIs=; b=V/xVLwHwnX6oECIxCLCcFFYjGDHM7LPypoGHspoLm3/cbQPgR1CvNcpAnleofgFKtj 4QVA9Wx6TDrPtILMNbgnY1AucKi+s42Rgi2dRX3Giwj0cnnFMpEbT8xuB6am8NcJ+eD7 iUXaCvk2VwhUIwRpghkRv6snoFfLMVjGFlzK90p6o39K6tEidOnQnylDgQsR6edAPPLp SChu0hGJaUOSr7V2eXKVIzjMIgTB0+YYFxQvUdo/m4aWASPXHwxk97+74xsPY5QVvoB0 RSRjSTPqJr8+l4gByCDEAwxzOz6ISH55ssRZi8NeeCd/6xTO0xLh7E/iA3yctDswigPx jnEA== X-Gm-Message-State: AGRZ1gKRTAZxRvlZfu/RJCqydjlGtSuDFvXLX0c37K+HsK7chk+kuda9 bihuYig5WU3kAII7362TpvvpuNj9ZJi+/bEKLX2q6A== X-Received: by 2002:a1f:cd3:: with SMTP id 202mr873433vkm.50.1542047337767; Mon, 12 Nov 2018 10:28:57 -0800 (PST) MIME-Version: 1.0 Received: by 2002:a67:f48d:0:0:0:0:0 with HTTP; Mon, 12 Nov 2018 10:28:56 -0800 (PST) In-Reply-To: References: <877ehjx447.fsf@oldenburg.str.redhat.com> <875zx2vhpd.fsf@oldenburg.str.redhat.com> From: Daniel Colascione Date: Mon, 12 Nov 2018 10:28:56 -0800 Message-ID: Subject: Re: Official Linux system wrapper library? To: Zack Weinberg Cc: Florian Weimer , "Michael Kerrisk (man-pages)" , Linux Kernel Mailing List , Joel Fernandes , Linux API , Willy Tarreau , Vlastimil Babka , "Carlos O'Donell" , GNU C Library Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Nov 12, 2018 at 9:24 AM, Zack Weinberg wrote: > Daniel Colascione wrote: >> >> If the kernel provides a system call, libc should provide a C wrapper >> >> for it, even if in the opinion of the libc maintainers, that system >> >> call is flawed. > > I would like to state general support for this principle; in fact, I > seriously considered preparing patches that made exactly this change, > about a year ago, posting them, and calling for objections. Then > $dayjob ate all my hacking time (and is still doing so, alas). > > Nonetheless I do think there are exceptions, such as those that are > completely obsolete (bdflush, socketcall) and those that cannot be > used without stomping on glibc's own data structures (set_robust_list > is the only one of these I know about off the top of my head, but > there may well be others). If people want to stomp over glibc's data structures, let them. Maybe a particular program, for whatever reason, wants to avoid glibc mutexes entirely and do its own synchronization. It should be possible to cleanly separate the users on a per-thread basis. Besides, adhering to the principle that all system functionality is provided is worth it even if (in the case of bdflush) there's not a compelling use right now. Consider bdflush: in kernel debugging, hijacking "useless" system calls and setting breakpoints on them or temporarily wiring them to custom functionality is sometimes useful, and there's no particular reason to *prevent* a program from calling one of these routines, especially since there's little cost to providing a wrapper and noticeable value in completeness itself. > Daniel Colascione wrote: >> We can learn something from how Windows does things. On that system, >> what we think of as "libc" is actually two parts. (More, actually, but >> I'm simplifying.) At the lowest level, you have the semi-documented >> ntdll.dll, which contains raw system call wrappers and arcane >> kernel-userland glue. On top of ntdll live the "real" libc >> (msvcrt.dll, kernel32.dll, etc.) that provide conventional >> application-level glue. > > This is an appealing idea at first sight; there are several other > constituencies for it besides frustrated kernel hackers, such as > alternative system programming languages (Rust, Go) that want to > minimize dependencies on legacy "C library" functionality. If we > could find a clean way to do it, I would support it. > > The trouble is that "raw system call wrappers and arcane > kernel-userland glue" turns out to be a lot more code, with a lot more > tentacles in both directions, than you might think. If you compare > the sizes of the text sections of `ntdll.dll` and `libc.so.6` you will > notice that the former is _bigger_. The reason for this, as far as I > can determine (without any access to Microsoft's internal > documentation or source code ;-) is that ntdll.dll contains the > dynamic linker-equivalent, a basic memory allocator, the stack > unwinder, and a good chunk of the core thread library. (It also has > stuff in it that's needed by programs that run early during boot and > can't use kernel32.dll, but that's not our problem.) I don't think > this is an accident or an engineering compromise. It is necessary for > the dynamic loader to understand threads, and the thread library to > understand shared library semantics. Sure, but I'm not proposing talking about including threads or dynamic library loading in the minimal kernel glue library we're discussing. That ntdll includes this functionality (and a thread pool, and various other gunk) works for Windows, but it's not a necessary consequence of our adopting a layering model that the lowest of *our* layers include what the lowest layer on Windows includes. As I mentioned above, there's room for a "minimal" kernel interface library that actually touches relatively little of glibc's concerns. > A hypothetical equivalent liblinuxabi.so.1 would > have to do the same. It depends on what you put into the library. Basic system call wrappers and potential future userspace glue. The ABI I'm proposing doesn't have to look like POSIX --- for example, it can indicate error returns via a separate out parameter. (This approach is cleaner anyway.) As for pthread cancelation? All there's required is to mark a range of PC values as "after cancel check, before syscall instruction". The Linux ABI library could export a function that libc could use, passing in a program counter value, to determine whether PC (extracted from ucontext_t in a signal handler) were immediately before a cancellation check. What about off_t differences? Again, it doesn't matter. From the *kernel's* point of view, there's one width of offset parameter per system call per architecture. The library I'm proposing would expose this parameter literally. If a higher-level libc wants to use a preprocessor switch to conditionally support different offset widths, that's fine, but there's no reason that a more literal kernel interface library would have to do that. > And that means you wouldn't get as much > decoupling from the C and POSIX standards -- both of which specify at > least part of those semantics -- as you want, and we would still be > having these arguments. For example, it would be every bit as > troublesome for liblinuxabi.so.1 to export set_robust_list as it would > be for libc.so.6 to do that. Why? Such an exported function would cause no trouble until called, and there are legitimate reasons for calling such a function. Not everyone, as mentioned, wants to write a program that relies on libc. > You might be able to get out of most of the tangle by putting the > dynamic loader in a separate process I don't think that's a workable approach. The creation of a separate process is a very observable side effect, and it seems unexpected that something as simple as cat(1) would have this side effect. If anything, parts of the dynamic linker should move into the *kernel* to support things like applying relocations to clean pages, but that's a separate discussion. > and that's _also_ an appealing > idea for several other reasons, but it would still need to understand > some of the thread-related data structures within the processes it > manipulated, so I don't think it would help enough to be worth it (in > a complete greenfields design where I get to ignore POSIX and rewrite > the kernel API from scratch, now, that might be a different story). > > On a larger note, the fundamental complaint here is a project process > / communication complaint. We haven't been communicating enough with > the kernel team, fair criticism. We can do better. But the > communication has to go both ways. When, for instance, we tell you > that membarrier needs to have its semantics nailed down in terms of > the C++17 memory model, that actually needs to happen I think you can think of membarrier as upgrading signal fences to thread fences. > And, because this is a process / communication problem, you cannot > expect there to be a purely technical fix. Your position appears, > from where I'm sitting, to be something like "if we split glibc into > two pieces, then you and us will never have to talk to each other > again" which, I'm sorry, I can't see that working out in the long run. > >> (For example, for a long time now, I've wanted to go >> beyond POSIX and improve the system's signal handling API, and this >> improvement requires userspace cooperation.) > > This is also an appealing notion, but the first step should be to > eliminate all of the remaining uses for asynchronous signals: for > instance, give us process handles already! Once a program only ever > needs to call sigaction() to deal with > SIGSEGV/SIGBUS/SIGILL/SIGFPE/SIGTRAP, then we can think about > inventing a better replacement for that scenario. I too want process handles. (See my other patches.) But that's besides the point. This stance in the paragraph I've quoted is another example of glibc's misplaced idealism. As I've elaborated elsewhere, people use signals for many purposes today. The current signals API is extremely difficult to use correctly in a process in which multiple unrelated components want to take advantage of signal-handling functionality. Users deserve a cleaner, modern, and safe API. It's not productive withhold improvements to the signal API and gate them on unrelated features like process handles merely because, in the personal judgement of the glibc maintainers, developers should use signals for fewer things. This attitude is an unwarranted imposition on the entire ecosystem. It should be possible to innovate in this area without these blockers, one way or another.