Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754421AbbG3LR0 (ORCPT ); Thu, 30 Jul 2015 07:17:26 -0400 Received: from mail-vk0-f54.google.com ([209.85.213.54]:35605 "EHLO mail-vk0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753819AbbG3LRX (ORCPT ); Thu, 30 Jul 2015 07:17:23 -0400 X-Greylist: delayed 364 seconds by postgrey-1.27 at vger.kernel.org; Thu, 30 Jul 2015 07:17:23 EDT MIME-Version: 1.0 In-Reply-To: <20150730083831.GA22182@gmail.com> References: <1438242731-27756-1-git-send-email-drysdale@google.com> <1438242731-27756-2-git-send-email-drysdale@google.com> <20150730083831.GA22182@gmail.com> From: David Drysdale Date: Thu, 30 Jul 2015 12:10:58 +0100 Message-ID: Subject: Re: [PATCHv2 1/1] Documentation: describe how to add a system call To: Ingo Molnar Cc: Linux API , Michael Kerrisk , Andrew Morton , Arnd Bergmann , Shuah Khan , Jonathan Corbet , Eric B Munson , Randy Dunlap , Andrea Arcangeli , Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , Oleg Nesterov , Linus Torvalds , Greg Kroah-Hartman , Andy Lutomirski , Al Viro , Rusty Russell , Peter Zijlstra , Vivek Goyal , Alexei Starovoitov , David Herrmann , "Theodore Ts'o" , Kees Cook , Milosz Tanski , Fam Zheng , Josh Triplett , Mathieu Desnoyers , linux-doc@vger.kernel.org, "linux-kernel@vger.kernel.org" , Peter Zijlstra Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6764 Lines: 156 On Thu, Jul 30, 2015 at 9:38 AM, Ingo Molnar wrote: > > * David Drysdale wrote: > >> +Designing the API >> +----------------- >> + >> +A new system call forms part of the API of the kernel, and has to be supported >> +indefinitely. As such, it's a very good idea to explicitly discuss the >> +interface on the kernel mailing list, and to plan for future extensions of the >> +interface. In particular: >> + >> + **Include a flags argument for every new system call** > > Sorry, but I think that's bad avice, because even a 'flags' field is inflexible > and stupid in many cases - it fosters an 'ioctl' kind of design. > >> +The syscall table is littered with historical examples where this wasn't done, >> +together with the corresponding follow-up system calls (eventfd/eventfd2, >> +dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2), so >> +learn from the history of the kernel and include a flags argument from the >> +start. > > The syscall table is also littered with system calls that have an argument space > considerably larger than what 6 parameters can express, where various 'flags' are > used to bring in different parts of new APIs, in a rather messy way. > > The right approach IMHO is to think about how extensible a system call is expected > to be, and to plan accordingly. > > If you are anywhere close to 6 parameters, you should not introduce 'flags' but > you should _reduce_ the number of parameters to a clean essential of 2 or 3 > parameters and should shuffle parameters out to a separate 'parameters/attributes' > structure that is passed in by pointer: > > SYSCALL_DEFINE2(syscall, int, fd, struct params __user *, params); > > And it's the design of 'struct params' that determines future flexibility of the > interface. A very flexible approach is to not use flags but a 'size' argument: > > struct params { > u32 size; > u32 param_1; > u64 param_2; > u64 param_3; > }; > > Where 'size' is set by user-space to the size of 'struct params' known to it at > build time: > > params->size = sizeof(*params); > > In the normal case the kernel will get param->size == sizeof(*params) as known to > the kernel. > > When the system call is extended in the future on the kernel side, with 'u64 > param_4', then the structure expands from an old size of 24 to a new size of 32 > bytes. The following scenarios might occur: > > - the common case: new user-space calls the new kernel code, ->size is 32 on both > sides. > > - old binaries might call the kernel with params->size == 24, in which case the > kernel sets the new fields to 0. The new feature should be written > accordingly, so that a value of 0 means the old behavior. > > - new binaries might run on old kernels, with params->size == 32. In this case > the old kernel will check that all the new fields it does not know about are > set to 0 - if they are nonzero (if the new feature is used) it returns with > -ENOSYS or -EINVAL. > > With this approach we have both backwards and forwards binary compatibility: new > binaries will run on old kernels just fine, even if they have ->size set to 32, as > long as they make use of the features. > > This design simplifies application design considerably: as new code can mostly > forget about old ABIs, there's no multiple versions to be taken care of, there's > just a single 'struct param' known to both sides, and there's no version skew. > > We are using such a design in perf_event_open(), see perf_copy_attr() in > kernel/events/core.c. And yes, ironically that system call still has a historic > 'flags' argument, but it's not used anymore for extension: we've made over 30 > extensions to the ABI in the last 3 years, which would have been impossible with a > 'flags' approach. > > Thanks, > > Ingo Fair point, there are other ways to ensure extendability that just the simple flags approach -- and as you say, for more sophisticated interfaces flags might hinder more than they help. How about the attached text as an attempt to cover both bases? Thanks, David ------------------ Designing the API: Planning for Extension ----------------------------------------- A new system call forms part of the API of the kernel, and has to be supported indefinitely. As such, it's a very good idea to explicitly discuss the interface on the kernel mailing list, and it's important to plan for future extensions of the interface. (The syscall table is littered with historical examples where this wasn't done, together with the corresponding follow-up system calls -- eventfd/eventfd2, dup2/dup3, inotify_init/inotify_init1, pipe/pipe2, renameat/renameat2 -- so learn from the history of the kernel and plan for extensions from the start.) For simpler system calls that only take a couple of arguments, the preferred way to allow for future extensibility is to include a flags argument to the system call. To make sure that userspace programs can safely use flags between kernel versions, check whether the flags value holds any unknown flags, and reject the sycall (with EINVAL) if it does: if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3)) return -EINVAL; (If no flags values are used yet, check that the flags argument is zero.) For more sophisticated system calls that involve a larger number of arguments, it's preferred to encapsulate the majority of the arguments into a structure that is passed in by pointer. Such a structure can cope with future extension by including a size argument in the structure: struct xyzzy_params { u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) */ u32 param_1; u64 param_2; u64 param_3; }; As long as any subsequently added field, say param_4, is designed so that a zero value gives the previous behaviour, then this allows both directions of version mismatch: - To cope with a later userspace program calling an older kernel, the kernel code should check that any memory beyond the size of the structure that it expects is zero (effectively checking that param_4 == 0). - To cope with an older userspace program calling a newer kernel, the kernel code can zero-extend a smaller instance of the structure (effectively setting param_4 = 0). See perf_event_open(2) and the perf_copy_attr() function (in kernel/events/core.c) for an example of this approach. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/