Received: by 2002:a25:824b:0:0:0:0:0 with SMTP id d11csp987636ybn; Wed, 2 Oct 2019 09:10:32 -0700 (PDT) X-Google-Smtp-Source: APXvYqyr5U3NYS1sa1T+P5HBUP99SkT5sxsCmAzyQeAIDh4QicCp+dKvmGOYCy43VGB2CPanmwrW X-Received: by 2002:a05:6402:2d8:: with SMTP id b24mr4659745edx.222.1570032632572; Wed, 02 Oct 2019 09:10:32 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1570032632; cv=none; d=google.com; s=arc-20160816; b=zvYShoeLV/uYqOio9FX0t/QbJzoYBjWaJ/GU6Hj2Sw517MuL0buh7dd/BggiblggIt Z7JQW7puJwTp9PAWZaZkI4uNXZ+u5rnmfsLWa+GRd750nYzZIoam8brbvxYjQ3lKo6na mhSCHG4qXiU0r90D+Qc3x83oQD9M0DmJNU7tLESeudulWrEy0O7/3wIfpjqnqJkdtDkB /mAWhwPBYPfxMq+rbrdjXdi9nuLnKXJt0FTlIc3HsqQfrYwHs4YoN635hayyiieUkVbA xUlOfZp6maODuPDaGvXSWdt1Jn1AmJzZZUW827vAuVS5aE1O4xml91Kucg2bYN+723as Hdiw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from; bh=Khh2tk0WumF8GLA70YIxdvs+gISMGm3JM+NJes5aDvQ=; b=oONq+hg02hRcFpRNqlskbAr7YrFQGcNzI1ByK3pC8i3V/yf99b07lwJETZ6loUeVfd pjqCWSp887kyQOuYeZ88Y/33FutrmggvaUUbAzAEzy8IhYXNwGgzNv4Vl2ll82+jAPW+ bMS/7IZd7CBB65anlPqYUrKnvwqIF+WPpYOidu3v3QNx4V1igNz+8npGO/kfadX7kzdr o+XkVtvWuPhdCXMep58wWTtAaekl/urYgUZLOu9plBTxkpLbrXPE27b6LfVcCqaW0Rl6 aVVjQWfaSMk3+plFFn/DqBXhJI2fyoqYrnFijEka7x0CA+Mo1PVLD2AMjni5eeXz7TvZ o2iA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id x8si13130558edd.282.2019.10.02.09.10.03; Wed, 02 Oct 2019 09:10:32 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727289AbfJBPOr (ORCPT + 99 others); Wed, 2 Oct 2019 11:14:47 -0400 Received: from youngberry.canonical.com ([91.189.89.112]:49931 "EHLO youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725893AbfJBPOr (ORCPT ); Wed, 2 Oct 2019 11:14:47 -0400 Received: from [213.220.153.21] (helo=localhost.localdomain) by youngberry.canonical.com with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.86_2) (envelope-from ) id 1iFgKy-00035s-2J; Wed, 02 Oct 2019 15:14:44 +0000 From: Christian Brauner To: linux-kernel@vger.kernel.org Cc: libc-alpha@sourceware.org, Jonathan Corbet , Christian Brauner , Federico Vaga , Alessia Mantegazza , Guillaume Dore , linux-doc@vger.kernel.org, Christian Brauner , Aleksa Sarai Subject: [PATCH] Documentation: update about adding syscalls Date: Wed, 2 Oct 2019 17:14:37 +0200 Message-Id: <20191002151437.5367-1-christian.brauner@ubuntu.com> X-Mailer: git-send-email 2.23.0 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add additional information on how to ensure that syscalls with structure arguments are extensible and add a section about naming conventions to follow when adding revised versions of already existing syscalls. Co-Developed-by: Aleksa Sarai Signed-off-by: Aleksa Sarai Signed-off-by: Christian Brauner --- Documentation/process/adding-syscalls.rst | 82 +++++++++++++++++++---- 1 file changed, 70 insertions(+), 12 deletions(-) diff --git a/Documentation/process/adding-syscalls.rst b/Documentation/process/adding-syscalls.rst index 1c3a840d06b9..93e0221fbb9a 100644 --- a/Documentation/process/adding-syscalls.rst +++ b/Documentation/process/adding-syscalls.rst @@ -79,7 +79,7 @@ flags, and reject the system call (with ``EINVAL``) if it does:: For more sophisticated system calls that involve a larger number of arguments, it's preferred to encapsulate the majority of the arguments into a structure that is passed in by pointer. Such a structure can cope with future extension -by including a size argument in the structure:: +by either including a size argument in the structure:: struct xyzzy_params { u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) */ @@ -87,20 +87,56 @@ by including a size argument in the structure:: u64 param_2; u64 param_3; }; + int sys_xyzzy(struct xyzzy_params __user *uarg); + /* in case of -E2BIG, p->size is set to the in-kernel size and thus all + extensions after that offset are unsupported. */ -As long as any subsequently added field, say ``param_4``, is designed so that a -zero value gives the previous behaviour, then this allows both directions of -version mismatch: +or by including a separate argument that specifies the size:: - - To cope with a later userspace program calling an older kernel, the kernel - code should check that any memory beyond the size of the structure that it - expects is zero (effectively checking that ``param_4 == 0``). - - To cope with an older userspace program calling a newer kernel, the kernel - code can zero-extend a smaller instance of the structure (effectively - setting ``param_4 = 0``). + struct xyzzy_params { + u32 param_1; + u64 param_2; + u64 param_3; + }; + /* userspace sets @usize = sizeof(struct xyzzy_params) */ + int sys_xyzzy(struct xyzzy_params __user *uarg, size_t usize); + /* in case of -E2BIG, userspace has to attempt smaller @usize values + to figure out which extensions are unsupported. */ + +Which model you choose to use is down to personal taste. However, please only +pick one (for a counter-example, see :manpage:`sched_getattr(2)`). + +Then, any extensions can be implemented by appending fields to the structure. +However, all new fields must be designed such that their zero value results in +identical behaviour to the pre-extension syscall. This allows for compatibility +between different-vintage kernels and userspace, no matter which is newer: + + - If the userspace is newer, then the older kernel can check whether the + fields it doesn't understand are zero-valued. If they are, then it can + safely ignore them (since any future extensions will be backwards-compatible + as described above). If they aren't, then the kernel doesn't support the + feature and can give an error (``-E2BIG`` is traditional). + + - If the userspace is older, then the kernel can treat all extensions that + userspace is unaware of as having their zero-value (and effectively + zero-extend the userspace structure when copying it for in-kernel usage). -See :manpage:`perf_event_open(2)` and the ``perf_copy_attr()`` function (in -``kernel/events/core.c``) for an example of this approach. + - If they're the same version, just use the structure as-is. + +As with the simpler flag-only syscalls, you must always check that any unknown +values for flag-like parameters in the passed structure are zeroed. + +It is also critical to ensure your syscall handles larger-sized arguments from +the outset, otherwise userspace will have to do additional (fairly pointless) +fallbacks for some old kernels. An example where this mistake was made is +:manpage:`rt_sigprocmask(2)` (where any unknown-sized arguments are +unconditionally rejected). + +To help implement this correctly, there is a helper function +``copy_struct_from_user()`` which handles the compatibility requirements for +you. For examples using this helper, see :manpage:`perf_event_open(2)` (which +uses the embedded-size model) and :manpage:`clone3(2)` (which uses the +separate-argument model). Designing the API: Other Considerations @@ -173,6 +209,28 @@ registers. (This concern does not apply if the arguments are part of a structure that's passed in by pointer.) +Designing the API: Revisions of syscalls +----------------------------------------------- + +Syscalls that were not designed to be extensible or syscalls that use a flag +argument for extensions running out of bits (e.g. :manpage:`clone(2)`) +sometimes need to be replaced. + +If the revised syscall provides a superset (or a reasonably large subset, such +as when a feature that turned out to be a design mistake is dropped) of the +features of the old syscall it is common practice to give it the same name with +a number appended. Examples for this include ``dup2``/``dup3``, +``epoll_create``/``epoll_create1`` and others. + +For some syscalls the appended number indicates the number of arguments +(``accept``/``accept4``) for others the number of the revision +(``clone``/``clone3``, ``epoll_create``/``epoll_create1``). Recent discussions +around new syscalls (such as ``clone3``) have pointed to a consensus around +naming syscalls revisions by appending the number of the revision to the +syscall name. That means, if you were to post a revised version of the +``openat`` syscall it should be named ``openat2``. + + Proposing the API ----------------- -- 2.23.0